Understanding z/VM CPU Utilization
(Last revised: 2011-03-23, BKW)
In its most general sense, the phrase "CPU utilization" refers to the in-use fraction of a computing system's capacity to run instructions.
In a stacked environment where virtual servers run on z/VM, which in turn runs in a logical partition, which in turn runs on a physical System z computer, the phrase "CPU utilization" can often lead to more confusion than illumination.
In this article we want to discuss the several places where the z/VM stacked environment reports CPU utilization, and what exactly each one of those reported numbers means, and how to interpret the various reports. We will begin with how to discuss the utilization of an entire System z CEC. Following, we will refine to how to talk about the utilization of a single partition, and then about the guests running within a partition. We'll conclude with a brief summary and then offer a reference table that captures all of what we've explained in this article.
A System z computer is outfitted with a number of physical CPUs, organized into types: standard CPs, IFLs, zAAPs, zIIPs, and ICFs. The typical System z machine comes with one or more engines of each type.
When we discuss the utilization of the physical machine, we usually talk about the utilizations of the engine types separately. For example, we discuss the utilization of the standard CPs, and of the IFLs, and so on.
Also when we discuss these utilizations, we usually use percentage expressions. Such percentages always use the standard that "100%" means "one entirely consumed physical engine". So, for example, if we say the IFLs on a machine are "685% utilized", it means that 6.85 physical engines' worth of IFL capacity are being consumed.
This standard about the meaning of "100%" holds true for almost all of the numbers we will describe in this article. We will clearly caution the reader otherwise when we are discussing an exception case.
Logical PUs as Consumers
The System z PR/SM hypervisor makes possible the dividing of a physical System z computer into disjoint computing zones called logical partitions. These partitions are equipped with logical PUs, which PR/SM dispatches on physical PUs so as to run the partitions' workloads.
Recognizing this partitioning scheme, and realizing that PR/SM itself will also consume some CPU for its own ends, we can break down the consumption of System z physical CPU time into three very specific buckets:
The PR/SM hypervisor keeps counters that measure these three kinds of CPU time. It accounts the first two kinds of time on a per-logical-PU basis. It accounts the third on a per-physical-PU basis.
Periodically z/VM asks PR/SM for the values of all of these counters. z/VM then dumps those counters into a binary data stream called "CP Monitor data". The IBM-supplied utility program called MONWRITE can journal this binary data stream to disk or tape for analysis later; we call such recorded data "MONWRITE data". A reduction and reporting program such as z/VM Performance Toolkit — known herein as "Perfkit" — can report on what the MONWRITE data reveals.
For those of you interested in details, PR/SM's CPU time accumulators appear in these CP Monitor records:
For those of you with z/OS backgrounds, CP Monitor records are analogous to what z/OS calls "SMF records". The records contain binary data meant for consumption by a reduction program of some kind. On z/OS, that reduction program is called RMF. On z/VM, the reduction program is called z/VM Performance Toolkit, or herein, "Perfkit".
A dedicated partition is one for which the PR/SM hypervisor has set aside specific physical engines for said partition's exclusive use. For example, if a partition were four-way dedicated with standard CPs, PR/SM would set aside four physical standard CPs and run nothing on them except said partition's four logical standard CPs. This dedication function is great for the anointed partition, but physical CPU cycles the partition doesn't consume are lost to the ages.
Because of this exclusivity property, when PR/SM reports logical PUs' consumption of cycles, it reports "100% busy" for every dedicated logical PU. If we want to know how busy a dedicated partition really is, we have to look at CPU utilization data collected by the operating system running in the dedicated partition. As far as PR/SM is concerned, those logical PUs are 100% busy, and PR/SM reports each one as consuming a whole physical PU.
For any given partition, all of its logical PUs are either dedicated or shared. For example, there is no notion of a z/VM-mode partition where the logical standard CPs are dedicated but the logical zIIPs are shared.
Unfortunately, the System z community's conventional use of the adjectives "dedicated" and "shared" is kind of upside-down. In truth, it's the System z's physical engines themselves that are either "dedicated" to a specific partition or "shared" among multiple partitions. (Think about it: when four people celebrate a birthday by together eating a cake, it's the cake that's said to be "shared", not the people.) Even so, System z fans always use these two particular adjectives to describe partitions and logical PUs. A "shared partition" has its logical PUs time-sliced onto physical PUs. A "dedicated partition" gets its own physical PUs. Do not let this upside-down use of terminology sidetrack you.
Perfkit's Reporting of Partitions' Utilization
The PR/SM counters break out consumption by logical PU, so it makes sense that we can tabulate consumption by logical PU, or by partition. A couple different Perfkit reports give us the lenses we need. Here is a description of the reports and some discussion of their strengths and weaknesses.
FCX126 LPAR is the most complete, most detailed, most voluminous, and perhaps most useful of this family of reports. If you like granularity and details, this report is for you. For each logical PU of each partition, the FCX126 LPAR report displays two columns derived from PR/SM's CPU accumulators:
(We will talk about the other FCX126 columns, namely "%VMld", "%Logld", and "%Susp", later.)
One of the strengths of the FCX126 LPAR report is that it shows us separate utilization values for every logical PU of every partition on the whole CEC. In fact, the interim version of the report, FCX126 INTERIM LPAR, gives us said breakout on a time-interval by time-interval basis. Because of this granularity, it is very easy to use FCX126 LPAR to see a runaway, overburdened, underused, or stalled partition, or even to see such a logical PU, no matter its partition.
Keep in mind the FCX126 LPAR report's
A weakness of the FCX126 LPAR report is that it can be somewhat voluminous and consequently overwhelming to digest. Also, rarely do we need to see the logical PUs' utilizations individually.
Here is a excerpt of the FCX126 LPAR report, so you'll recognize it in your own Perfkit reports. This excerpt is edited somewhat, for size, relevance, and appearance on an HTML page.
FCX202 LPARLOG is another report that comments on partitions' CPU utilization. This report is a time-indexed, interval-by-interval summary of the CPU time PR/SM charged to partitions, rolled up by partition instead of broken out by logical PU.
An important trait of FCX202 LPARLOG is that for each partition, LPARLOG adds up the utilizations for the partitions' logical PUs and then divides, to compute the utilization for the average logical PU in the partition. This trait can be either a strength or a weakness:
FCX202 LPARLOG reveals on its right-hand side whether the partition is mixed-engine: the "Type" column will say MIX. For a homogeneous partition, the "Type" column will report the engine type: IFL, CP, or whatever.
(Once again, we will discuss "%VMld", "%Logld", and "%Susp" later.)
Here is a excerpt of FCX202 LPARLOG, again, so you'll recognize it in your own Perfkit reports.
That's pretty much it for the Perfkit reports that show us what PR/SM had to say about CPU utilization.
z/VM's Own Accounting of CPU Utilization
As the z/VM Control Program runs, it keeps track of how it spends its time on each of its logical PUs. z/VM accrues each logical PU's time into four buckets:
D0 R2 MRSYTPRP contains z/VM's four time accumulators. At each CP Monitor interval, z/VM cuts a herd of these records, one record for each logical PU in the partition.
Perfkit's Reporting of z/VM's Own Accounting of CPU Time
Several different Perfkit reports comment on z/VM's own CPU time accountings. Here's a description of the reports, again, with their strengths and weaknesses.
FCX126 LPAR again is the most granular, most detailed member of this family of reports. Notice that for the partition that collected the MONWRITE data, the LPAR report includes, for every logical PU in the partition, columns labelled "%VMld" (say "percent-VM-load") and "%Logld" (say "percent-logical-load"). These two percentage columns are derived from the four z/VM buckets we described above, calculated as follows:
Returning to the FCX126 LPAR excerpt above, we notice that
Also notice in the excerpt that
Again, FCX126 LPAR's granularity and detail are both its strength and its weakness. The strength is that for each logical PU, the report shows all of the percentages derivable from either the PR/SM-maintained counters or the z/VM-maintained counters. The corresponding weakness is the bulkiness of the report.
FCX144 PROCLOG also reports on the logical PUs' utilizations, again, using these four z/VM time buckets, but combined a little differently.
Each of these ranges up to 100%, again, "100%" meaning "one entire physical engine's worth".
A strength of FCX144 PROCLOG is that it reports on only the collecting partition. This makes it somewhat smaller than FCX126 LPAR. A weakness of PROCLOG is that it doesn't present PR/SM's counters alongside z/VM's counters, so some of the utilization picture is missing.
Here is a small excerpt from FCX144 PROCLOG.
FCX239 PROCSUM contains one column, "Pct Busy", that reports on the average CPU-busy across all logical PUs in the partition, time-interval by time-interval, again, using z/VM's first three buckets. Again, "100%" would mean "one entire physical engine's worth". The averaging PROCSUM does can be handy, but like FCX202 LPARLOG, it's misleading in mixed-engine environments.
FCX225 SYSSUMLG reports the same average CPU-busy value as FCX239 PROCSUM, also time-interval by time-interval, and with the same strengths and weaknesses. A nice property of SYSSUMLG is that it also includes, interval by interval, a wide assortment of system performance metrics drawn from CPU consumption, I/O, paging, and other interesting behaviors. The variety present in FCX225 SYSSUMLG makes it a good first stop for examining a system's basic performance properties.
is similar to FCX144 PROCLOG in that it reports on each
logical PU separately. For each logical PU, we see
(format 1) presents the four z/VM buckets a little
differently. The thing to remember here is that for FCX101 REDISP, the
over all of the logical PUs in the partition.
So, the "CPU" column is the grand total of the first three z/VM buckets,
across all logical PUs, over the time interval of the report, expressed
as a percent; again, "100" would mean one physical engine's worth. The
The Notion of Suspend Time
By now you have undoubtedly noticed that FCX126 LPAR and FCX202 LPARLOG contain a column called "%Susp". This column, called "suspend time", deserves special explanation, because it is widely misunderstood.
We can begin to understand
Now, here's the thing about those four D0 R2 counters. At each interval, the counters might not add up to 100%. In other words, time appears to be missing. Why? Here's why. z/VM measures those four values using a System z facility called the "CPU timer". Here's the thing about a logical PU's CPU timer: it doesn't advance when the logical PU is not dispatched. So, the logical PU's undispatched time isn't recorded in any of those four z/VM buckets.
So, we arrive at the definition of
To understand the significance of
The misunderstanding of
Finally, a historical note. VM performance reporting products of ages past used to call our "suspend time" by another term, "involuntary wait", and used column headings such as "%IW" to denote it. Some readers might find this old term illuminates their understanding of what's occuring. For that reason alone, we mention it here.
The Whole Picture
At this point we've explained everything about all five interesting columns on FCX126 LPAR and FCX202 LPARLOG:
We've also explained the PROCLOG, PROCSUM, SYSSUMLG, CPU, and REDISP views of CPU consumption, and how Perfkit calculates them from z/VM's D0 R2 timers.
This pretty much wraps up the discussion of how Perfkit expresses CPU utilization, for partitions and for whole CECs.
Remember, with the exception of
The Notion of T/V Ratio
When people discuss the notion of "z/VM overhead", one of the ways they do it is to talk about a computed health metric called "system total-to-virtual ratio", or "system T/V ratio". This number is the ratio of total CPU time used to virtual time used, as z/VM sees things. Another way to think of this is by returning to those four z/VM D0 R2 time buckets we mentioned earlier. System T/V ratio is just the total run cycles z/VM accounts, divided by guest run cycles z/VM accounts, or formulaically, (b1+b2+b3) / b1.
When system T/V is 1.0, it means all of the z/VM-accounted time was spent running guests. This is the perfect world: no cycles spent running the Control Program.
When system T/V is 2.0, it means for every second spent actually running guests, the z/VM Control Program spent one second running itself. Such a high T/V would usually be a sign of distress and would be a call to do some diagnosis and tuning.
System T/V ratio is very dependent on workload characteristics. Typical values for system T/V are in the 1.0 to 1.2 range.
The FCX225 SYSSUMLG and FCX239 PROCSUM reports both have a "T/V" column, calculated exactly as described here.
Guest CPU Utilization
To round out this article, we should talk about where Perfkit reports on guest CPU utilization, and what the reported numbers mean.
FCX112 USER is very useful in understanding guest CPU utilization. Four of its columns are particularly interesting. The columns apply to the time interval of the report, as notated in the report's upper-left corner. The columns are:
Here is a small excerpt from FCX112 USER.
One nice property of the FCX112 USER
FCX162 USERLOG is another place where we can find guest CPU utilization numbers. The first few columns match up to the FCX112 USER report and have the same definitions. But unlike FCX112 USER, FCX162 USERLOG tabulates one user's use of CPU over time. Here's an example.
Be aware that no Perfkit report breaks down guest CPU time by virtual CPU. The fields are there in CP Monitor but Perfkit does not report on them. However, if you are running Perfkit on a live system, one of the USER drill-down screens does report on each virtual PU separately. Perfkit creates this drill-down screen by mining timer values right out of CP control blocks, instead of by looking at CP Monitor records.
The counters that feed FCX112 USER and FCX162 USERLOG come out in various records in CP Monitor domain 4, otherwise known as the User Domain.
CPU Utilizations from Places Other Than Perfkit
So far we have devoted our entire article to discussing CPU utilization in the context of MONWRITE data and Perfkit reports.
The outputs of certain CP commands also offer us some insight about CPU consumption on the running system. These command outputs are useful for informal, ad-hoc assessments of CPU consumption at the moment. For capacity planning, benchmarking, routine health assessments, or system monitoring, always use CP Monitor data and Perfkit or equivalent.
The command CP INDICATE LOAD shows percent utilizations by logical PU. The values are not load-right-now, nor are they average-load-since-IPL. They are smoothed averages computed from datums collected at intervals over the recent past, with very recent samples having more weight than less recent samples. It is based on processor time used and voluntary wait time. Therefore, it is skewed when running in an LPAR when the partition is using shared processors. When there is contention, the value reported can be higher than actual processor time used because involuntary wait is not included. Inflated utilizations here can also occur from capped partitions.
The command CP QUERY TIME shows total CPU time and virtual CPU time used by the issuing guest, since guest logon.
The commands CP INDICATE USER and CP INDICATE USER EXPANDED show total CPU time and virtual CPU time used by the named guest, again, since guest logon.
In this article we've described that both PR/SM and z/VM keep accumulators that track CPU consumption. All of these accumulators come out in CP Monitor. Perfkit reports on all of them.
PR/SM's accumulators describe time used by logical PUs, time logical PUs induce in PR/SM, and time PR/SM uses for itself. Utilization percentages calculated from these three accumulators come out in the FCX126 LPAR and FCX202 LPARLOG reports.
z/VM's accumulators describe, for each logical PU, time used by guests, time induced in the Control Program, time spent by the Control Program for its own purposes, and time spent waiting. Utilization percentages calculated from these accumulators find their way into many Perfkit reports, notable examples being FCX126 LPAR and FCX225 SYSSUMLG.
Guests' individual CPU utilization numbers come out in a few places, notably FCX112 USER and FCX162 USERLOG. These two reports comment on two phenomena: the individual guest's own consumption, and the consumption the guest's actions induced in the Control Program.
Almost all the time, all of these percentages are of a whole physical
processor, and "100%" means "a whole physical engine's worth". The
exceptions to this are FCX126 LPAR
T/V ratio is a useful metric of overhead. FCX225 SYSSUMLG and FCX239 PROCSUM both report system T/V ratio. Individual users' T/V ratios appear on FCX112 USER and FCX162 USERLOG.
This table summarizes the various notions of CPU utilization, on what basis they are accounted, the CP Monitor records that contain the counters, and the Perfkit report fields that relate to them.