Understanding z/VM CPU Utilization

(Last revised: 2016-11-28, BKW)

In its most general sense, the phrase "CPU utilization" refers to the in-use fraction of a computing system's capacity to run instructions.

In a stacked environment where virtual servers run on z/VM, which in turn runs in a logical partition, which in turn runs on a physical System z computer, the phrase "CPU utilization" can often lead to more confusion than illumination.

In this article we discuss the several places where the z/VM stacked environment reports CPU utilization, what each reported number means, and how to interpret the various reports. We begin with the utilization of an entire System z CEC. Next we narrow the focus to the utilization of a single partition, and then to the guests running within a partition. We conclude with a brief summary and a reference table that captures what we've explained in this article.

Conventions

A System z computer is outfitted with a number of physical CPUs, organized into types: standard CPs, IFLs, zAAPs, zIIPs, and ICFs. The typical System z machine comes with one or more engines of each type.

When we discuss the utilization of the physical machine, we usually talk about the utilizations of the engine types separately. For example, we discuss the utilization of the standard CPs, and of the IFLs, and so on.

Also when we discuss these utilizations, we usually use percentage expressions. Such percentages always use the standard that "100%" means "one entirely consumed physical engine". So, for example, if we say the IFLs on a machine are "685% utilized", it means that 6.85 physical engines' worth of IFL capacity are being consumed.
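As a quick illustration of this convention, here is a minimal Python sketch that converts a utilization percentage into engines' worth of capacity. The function name is ours, chosen only for illustration.

```python
def engines_consumed(percent_utilized):
    """Convert a utilization percentage into physical engines' worth,
    using the convention that 100% means one entirely consumed engine."""
    return percent_utilized / 100.0


# The example from the text: IFLs that are "685% utilized".
print(engines_consumed(685))
```

The same arithmetic applies to any of the percentages in this article that follow the "100% means one physical engine" standard.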

This standard about the meaning of "100%" holds true for almost all of the numbers we will describe in this article. We will clearly caution the reader otherwise when we are discussing an exception case.

Logical PUs as Consumers

The System z PR/SM hypervisor makes possible the dividing of a physical System z computer into disjoint computing zones called logical partitions. These partitions are equipped with logical PUs, which PR/SM dispatches on physical PUs so as to run the partitions' workloads.

Recognizing this partitioning scheme, and realizing that PR/SM itself will also consume some CPU for its own ends, we can break down the consumption of System z physical CPU time into three very specific buckets:

Cycles consumed by partitions' logical PUs, running their own instructions.
Cycles consumed by the PR/SM hypervisor, running its own instructions, but running them in direct support of the deliberate action of some specific logical PU, and consequently accounting the consumed cycles to said logical PU as overhead the logical PU caused or induced.
Cycles consumed by the PR/SM hypervisor, running its own instructions, but doing work not directly caused by, and therefore not chargeable to, any given logical PU.

The PR/SM hypervisor keeps counters that measure these three kinds of CPU time. It accounts the first two kinds of time on a per-logical-PU basis. It accounts the third on a per-physical-PU basis.

Periodically z/VM asks PR/SM for the values of all of these counters. z/VM then dumps those counters into a binary data stream called "CP Monitor data". The IBM-supplied utility program called MONWRITE can journal this binary data stream to disk or tape for analysis later; we call such recorded data "MONWRITE data". A reduction and reporting program such as z/VM Performance Toolkit — known herein as "Perfkit" — can report on what the MONWRITE data reveals.

For those of you interested in details, PR/SM's CPU time accumulators appear in these CP Monitor records:

D0 R16 MRSYTCUP reports PR/SM's CPU accumulators for exactly one partition. The record first contains a small prologue that describes the partition to which the record applies. The record continues with a sequence of clauses, one clause per logical PU defined for the partition. Each clause identifies the logical PU and conceptually contains two accumulators: one that tells how much CPU time the logical PU has used for itself, and a second that tells how much CPU time the logical PU has induced in PR/SM. A typical CP Monitor data stream would contain a herd of p D0 R16 records every monitor interval, one such record for each of the p partitions defined on the physical machine.
In D0 R16, LCUCLPTM is the accumulator for the CPU time the logical CPU has used for itself. To calculate induced time in PR/SM, sometimes also called LPAR management time, form the difference LCUCACTM - LCUCLPTM.
D0 R17 MRSYTCUM reports the nonchargeable PR/SM CPU time for each physical PU on the machine. The record first contains a small prologue describing the physical machine. The record continues with a sequence of clauses, one clause per physical PU. Each clause identifies the physical PU and contains one accumulator that tells how much CPU time PR/SM itself used on the physical PU for nonchargeable overhead work. A typical CP Monitor data stream would contain exactly one D0 R17 record per monitor interval.

For those of you with z/OS backgrounds, CP Monitor records are analogous to what z/OS calls "SMF records". The records contain binary data meant for consumption by a reduction program of some kind. On z/OS, that reduction program is called RMF. On z/VM, the reduction program is called z/VM Performance Toolkit, or herein, "Perfkit".

Dedicated Partitions

A dedicated partition is one for which the PR/SM hypervisor has set aside specific physical engines for said partition's exclusive use. For example, if a partition were four-way dedicated with standard CPs, PR/SM would set aside four physical standard CPs and run nothing on them except said partition's four logical standard CPs. This dedication function is great for the anointed partition, but physical CPU cycles the partition doesn't consume are lost to the ages.

Because of this exclusivity property, when PR/SM reports logical PUs' consumption of cycles, it reports "100% busy" for every dedicated logical PU: as far as PR/SM is concerned, each one consumes a whole physical PU. If we want to know how busy a dedicated partition really is, we have to look at CPU utilization data collected by the operating system running in the dedicated partition.

For any given partition, all of its logical PUs are either dedicated or shared. For example, there is no notion of a z/VM-mode partition where the logical standard CPs are dedicated but the logical zIIPs are shared.

Unfortunately, the System z community's conventional use of the adjectives "dedicated" and "shared" is kind of upside-down. In truth, it's the System z's physical engines themselves that are either "dedicated" to a specific partition or "shared" among multiple partitions. (Think about it: when four people celebrate a birthday by together eating a cake, it's the cake that's said to be "shared", not the people.) Even so, System z fans always use these two particular adjectives to describe partitions and logical PUs. A "shared partition" has its logical PUs time-sliced onto physical PUs. A "dedicated partition" gets its own physical PUs. Do not let this upside-down use of terminology sidetrack you.

Perfkit's Reporting of Partitions' Utilization

The PR/SM counters break out consumption by logical PU, so it makes sense that we can tabulate consumption by logical PU, or by partition. A couple of different Perfkit reports give us the lenses we need. Here is a description of the reports and some discussion of their strengths and weaknesses.

FCX126 LPAR is the most complete, most detailed, most voluminous, and perhaps most useful of this family of reports. If you like granularity and details, this report is for you. For each logical PU of each partition, the FCX126 LPAR report displays two columns derived from PR/SM's CPU accumulators:

The "%Busy" column is the total activity PR/SM charged to the logical PU. In other words, %Busy accounts for both the logical PU's own work and the PR/SM overhead the logical PU induced.
The "%Ovhd" column accounts for only the PR/SM overhead the logical PU induced.

(We will talk about the other FCX126 columns, namely "%VMld", "%Logld", and "%Susp", later.)

The %Busy and %Ovhd percentages are both out of 100%, where "100%" again means "one physical engine's worth". A value of 100% in %Busy would mean the logical PU was either (a) part of a dedicated partition, or (b) completely busy either running its own work or inducing overhead in PR/SM.
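To make the relationship between the D0 R16 accumulators and these two columns concrete, here is a minimal Python sketch. It assumes the LCUCACTM and LCUCLPTM samples have already been converted to seconds, and that a percentage is simply the accumulator's growth over the interval divided by elapsed time. The function and parameter names are ours, and Perfkit's actual calculation may differ in detail.

```python
def lpar_busy_and_ovhd(actm_start, actm_end, lptm_start, lptm_end, elapsed_seconds):
    """Return (%Busy, %Ovhd) for one logical PU over one interval.

    actm_*: two samples of LCUCACTM (own time plus induced PR/SM time)
    lptm_*: two samples of LCUCLPTM (time the logical PU used for itself)
    Assumption: all samples are expressed in seconds.
    """
    total_charged = actm_end - actm_start   # own work plus induced overhead
    own_work = lptm_end - lptm_start        # own work only
    pct_busy = 100.0 * total_charged / elapsed_seconds
    pct_ovhd = 100.0 * (total_charged - own_work) / elapsed_seconds
    return pct_busy, pct_ovhd


# Hypothetical 60-second interval: 15.0 s charged in total, 14.4 s of it own work.
busy, ovhd = lpar_busy_and_ovhd(100.0, 115.0, 90.0, 104.4, 60.0)
print(round(busy, 1), round(ovhd, 1))
```

The sample values are invented for illustration; only the field names come from the record descriptions above.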

One of the strengths of the FCX126 LPAR report is that it shows us separate utilization values for every logical PU of every partition on the whole CEC. In fact, the interim version of the report, FCX126 INTERIM LPAR, gives us said breakout on a time-interval by time-interval basis. Because of this granularity, it is very easy to use FCX126 LPAR to see a runaway, overburdened, underused, or stalled partition, or even to see such a logical PU, no matter its partition.

Keep in mind the FCX126 LPAR report's %Busy and %Ovhd values are the average utilization values over the time interval described by the report. The described time interval is annotated in the upper left hand corner of the report. For interval-by-interval studies, use FCX126 INTERIM LPAR.

A weakness of the FCX126 LPAR report is that it can be somewhat voluminous and consequently overwhelming to digest. Also, rarely do we need to see the logical PUs' utilizations individually.

Here is an excerpt of the FCX126 LPAR report, so you'll recognize it in your own Perfkit reports. This excerpt is edited somewhat, for size, relevance, and appearance on an HTML page.

1FCX126  Run 2010/02/02 15:02:13         LPAR
                                         Logical Partition Activity
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of nnnnnnnn Run
__________________________________________________________________________________________

LPAR Data, Collected in Partition xxxxxx

Processor type and model    : 2097-700
Nr. of configured partitions: 7
Nr. of physical processors  : 30
Dispatch interval (msec)    : dynamic

Partition Nr. Upid #Proc Weight Wait-C Cap %Load CPU %Busy %Ovhd %Susp %VMld %Logld Type
ccccccc     3   01    20     99     NO  NO  14.9   0  24.4    .7    .8  23.5   23.7 IFL
                             99         NO         1  24.5    .6    .7  23.7   23.9 IFL
                             99         NO         2  24.0    .6    .7  23.2   23.4 IFL
                             99         NO         3  23.8    .6    .7  23.0   23.2 IFL
                             99         NO         4  23.5    .6    .7  22.8   22.9 IFL
                             99         NO         5  23.4    .6    .7  22.7   22.8 IFL
                             99         NO         6  23.3    .6    .7  22.5   22.7 IFL
                             99         NO         7  23.1    .6    .7  22.4   22.5 IFL
                             99         NO         8  22.9    .6    .7  22.2   22.3 IFL
                             99         NO         9  23.0    .6    .7  22.3   22.4 IFL
                             99         NO        10  22.9    .6    .7  22.2   22.3 IFL
                             99         NO        11  22.9    .6    .7  22.2   22.3 IFL
                             99         NO        12  23.5    .6    .7  22.8   22.9 IFL
                             99         NO        13  19.0    .7    .8  18.1   18.3 IFL
                             99         NO        14  19.6    .7    .8  18.8   18.9 IFL
                             99         NO        15  20.0    .7    .8  19.2   19.3 IFL
                             99         NO        16  20.5    .7    .8  19.6   19.8 IFL
                             99         NO        17  20.9    .7    .8  20.0   20.2 IFL
                             99         NO        18  21.3    .7    .8  20.4   20.6 IFL
                             99         NO        19  21.3    .7    .8  20.4   20.6 IFL

FCX202 LPARLOG is another report that comments on partitions' CPU utilization. This report is a time-indexed, interval-by-interval summary of the CPU time PR/SM charged to partitions, rolled up by partition instead of broken out by logical PU.

An important trait of FCX202 LPARLOG is that for each partition, LPARLOG adds up the utilizations of the partition's logical PUs and then divides by the number of logical PUs, to compute the utilization for the average logical PU in the partition. This trait can be either a strength or a weakness:

If the partition is homogeneous, the displayed average %Busy and %Ovhd values are usually pretty indicative of the behavior of the individual logical PUs in the partition. Another way to say this is that generally speaking, operating systems running in homogeneous partitions tend to spread their work fairly evenly.
However, if the partition is heterogeneous (sometimes also called "mixed-engine"), beware that the FCX202 %Busy and %Ovhd columns DO still report averages, but the individual logical PUs' utilizations might vary widely from one PU type to another. For this reason, for mixed-engine partitions, it's probably better to use FCX126 LPAR to study them.

FCX202 LPARLOG reveals on its right-hand side whether the partition is mixed-engine: the "Type" column will say MIX. For a homogeneous partition, the "Type" column will report the engine type: IFL, CP, or whatever.

(Once again, we will discuss "%VMld", "%Logld", and "%Susp" later.)

Here is an excerpt of FCX202 LPARLOG, again, so you'll recognize it in your own Perfkit reports.

1FCX202  Run 2010/02/02 15:02:13         LPARLOG
                                         Logical Partition Activity Log
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of nnnnnnnn Run
_________________________________________________________________________________________________

Interval <Partition->                                    <- Load per Log. Processor -->
End Time Name     Nr. Upid #Proc Weight Wait-C Cap %Load %Busy %Ovhd %Susp %VMld %Logld Type
>>Mean>> aaaaaaa    1   ..     0      0     NO  NO   ...   ...   ...   ...   ...    ... ..
>>Mean>> bbbbbbb    2   ..     0      0     NO  NO   ...   ...   ...   ...   ...    ... ..
>>Mean>> ccccccc    3   01    20     99     NO  NO   ...  22.4    .6    .7  21.6   21.8 IFL
>>Mean>> ddddddd    4   02    20     99     NO  NO   ...    .4    .0   ...   ...    ... IFL
>>Mean>> eeeeee     5   05     5     50     NO  NO   ...  11.8    .3   ...   ...    ... IFL
>>Mean>> ffffff     6   06     3     15     NO  NO   ...   1.9    .1   ...   ...    ... IFL
>>Mean>> ggggggg    7   07     2     10     NO  NO   ...    .4    .1   ...   ...    ... IFL
>>Mean>> Total     ..   ..    30    273     ..  ..    .6  10.4    .3   ...   ...    ... ..
14:27:05 aaaaaaa    1   ..     0      0     NO  NO   ...   ...   ...   ...   ...    ... ..
14:27:05 bbbbbbb    2   ..     0      0     NO  NO   ...   ...   ...   ...   ...    ... ..
14:27:05 ccccccc    3   01    20     99     NO  NO   ...  21.5    .6    .6  20.8   20.9 IFL
14:27:05 ddddddd    4   02    20     99     NO  NO   ...    .4    .0   ...   ...    ... IFL
14:27:05 eeeeee     5   05     5     50     NO  NO   ...  11.3    .3   ...   ...    ... IFL
14:27:05 ffffff     6   06     3     15     NO  NO   ...   1.4    .1   ...   ...    ... IFL
14:27:05 ggggggg    7   07     2     10     NO  NO   ...    .4    .1   ...   ...    ... IFL
14:27:05 Total     ..   ..    30    273     ..  ..    .6  10.0    .3   ...   ...    ... ..
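The averaging LPARLOG performs can be cross-checked against the two excerpts. The Python sketch below averages the twenty %Busy values FCX126 LPAR shows for partition ccccccc and compares the result with the >>Mean>> %Busy that FCX202 LPARLOG shows for the same partition. The second half uses a hypothetical mixed-engine partition, with values we invented, to show how an average can hide very different per-type behavior.

```python
# %Busy per logical PU for partition ccccccc, copied from the FCX126 LPAR excerpt.
busy_by_logical_pu = [
    24.4, 24.5, 24.0, 23.8, 23.5, 23.4, 23.3, 23.1, 22.9, 23.0,
    22.9, 22.9, 23.5, 19.0, 19.6, 20.0, 20.5, 20.9, 21.3, 21.3,
]


def average_per_logical_pu(percentages):
    """Average utilization of the logical PUs, as LPARLOG reports it."""
    return sum(percentages) / len(percentages)


lparlog_busy = round(average_per_logical_pu(busy_by_logical_pu), 1)
print(lparlog_busy)  # 22.4, matching the LPARLOG >>Mean>> row for ccccccc

# Hypothetical mixed-engine partition (invented values): busy CPs, idle zIIPs.
mixed = {"CP": [90.0, 90.0], "zIIP": [10.0, 10.0]}
all_values = [value for values in mixed.values() for value in values]
mixed_average = average_per_logical_pu(all_values)
print(mixed_average)  # 50.0, which describes neither engine type well
```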

That's pretty much it for the Perfkit reports that show us what PR/SM had to say about CPU utilization.

z/VM's Own Accounting of CPU Utilization

As the z/VM Control Program runs, it keeps track of how it spends its time on each of its logical PUs. z/VM accrues each logical PU's time into these conceptual buckets:

Bucket 1: time the logical PU spends running guests. z/VM fans call this "guest time", "virtual time", or "emulation time". All three terms mean the same thing.
Bucket 2: time the logical PU spends running the Control Program, doing overhead work in support of a specific action taken by a specific guest. Some folks call this "CP time".
Bucket 3: time the logical PU spends running the Control Program, doing overhead or system management functions not attributable to the direct actions of, and therefore not chargeable to, any guest. Some folks call this "system time".
Bucket 4: time the logical PU spends with a wait PSW loaded. Almost everybody calls this "wait time".
Bucket 5: time the logical PU spends parked. Almost everybody calls this "park time".

D0 R2 MRSYTPRP contains z/VM's time accumulators. The accumulators in D0 R2 map out like this (all start with SYTPRP_):

Bucket 1, emulation time, is PFXPRBTM
Bucket 2, induced CP overhead, is PFXUTIME - PFXPRBTM
Bucket 3, unchargeable CP overhead, is PFXTMSYS
Bucket 4, wait time, is PFXTOTWT
Bucket 5, park time, is PFXPRKWT
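The mapping above can be sketched in a few lines of Python. The sketch assumes the five D0 R2 fields have already been extracted for one logical PU and expressed as seconds accrued during one monitor interval; the function name, dictionary keys, and sample values are ours.

```python
def zvm_buckets(prbtm, utime, tmsys, totwt, prkwt):
    """Map D0 R2 accumulator deltas (SYTPRP_ prefix omitted) to the five buckets.

    Assumption: every argument is seconds accrued during one monitor interval.
    """
    return {
        "emulation": prbtm,           # bucket 1: PFXPRBTM
        "cp_induced": utime - prbtm,  # bucket 2: PFXUTIME - PFXPRBTM
        "system": tmsys,              # bucket 3: PFXTMSYS
        "wait": totwt,                # bucket 4: PFXTOTWT
        "park": prkwt,                # bucket 5: PFXPRKWT
    }


# Hypothetical deltas for one logical PU over a 60-second interval.
buckets = zvm_buckets(prbtm=12.0, utime=12.6, tmsys=0.9, totwt=46.0, prkwt=0.0)
for name, seconds in buckets.items():
    print(f"{name:10s} {seconds:6.2f} s")
```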

At each CP Monitor interval, z/VM cuts a herd of D0 R2 records, one record for each logical PU in the partition.

Perfkit's Reporting of z/VM's Own Accounting of CPU Time

Several different Perfkit reports comment on z/VM's own CPU time accountings. Here's a description of the reports, again, with their strengths and weaknesses.

FCX126 LPAR again is the most granular, most detailed member of this family of reports. Notice that for the partition that collected the MONWRITE data, the LPAR report includes, for every logical PU in the partition, columns labelled "%VMld" (say "percent-VM-load") and "%Logld" (say "percent-logical-load"). These two percentage columns are derived from the z/VM time buckets we described above, calculated as follows:

%VMld is the percent utilization of a physical engine z/VM believes it achieved on the logical PU. A value of 100% would mean that z/VM was able to use that logical PU to consume one physical engine's worth of power. Another way to think of this is that %VMld is just the sum of the first three z/VM buckets named above: guests, plus induced Control Program, plus nonchargeable Control Program, divided by elapsed time and then multiplied by 100.
%Logld is more complicated. %Logld expresses the percent of z/VM-accounted time that is not z/VM-accounted wait time. In other words, the formula is 100 * (b1+b2+b3) / (b1+b2+b3+b4). %Logld is not an expression of consumption of a physical PU. Rather, %Logld merely expresses the fraction of the z/VM-accounted time that is not z/VM-accounted wait time.
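The two formulas can be written out as a small Python sketch. Here b1 through b4 are the first four z/VM buckets for one logical PU, in seconds, over one interval; the function names and sample values are ours, and returning 0.0 when nothing was accounted is our own choice for the sketch.

```python
def pct_vmld(b1, b2, b3, elapsed_seconds):
    """%VMld: run time (guest + induced CP + nonchargeable CP) over elapsed time."""
    return 100.0 * (b1 + b2 + b3) / elapsed_seconds


def pct_logld(b1, b2, b3, b4):
    """%Logld: share of z/VM-accounted time that is not z/VM-accounted wait time."""
    accounted = b1 + b2 + b3 + b4
    if accounted == 0:
        return 0.0  # our choice for the sketch: nothing accounted, nothing to report
    return 100.0 * (b1 + b2 + b3) / accounted


# Hypothetical buckets for a 60-second interval.
print(round(pct_vmld(12.0, 0.6, 0.9, 60.0), 1))
print(round(pct_logld(12.0, 0.6, 0.9, 40.0), 1))
```

Notice that elapsed time appears only in the %VMld formula. That is why %VMld is a fraction of a physical engine and %Logld is not.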

Returning to the FCX126 LPAR excerpt above, we notice that %VMld and %Logld are filled in for every logical PU. It's important to note that this particular excerpt is only the little piece of FCX126 LPAR that applies to the partition that collected the MONWRITE data. For the other partitions on the CEC, FCX126 LPAR of course reports no values in these columns.

Also notice in the excerpt that %VMld is very close to the difference (%Busy - %Ovhd). Recalling the definitions of %Busy and %Ovhd, this makes perfect sense. Any discrepancy is explained by the idea that two different entities — z/VM and PR/SM — are accounting the very same phenomenon.

Again, FCX126 LPAR's granularity and detail are both its strength and its weakness. The strength is that for each logical PU, the report shows all of the percentages derivable from either the PR/SM-maintained counters or the z/VM-maintained counters. The corresponding weakness is the bulkiness of the report.

FCX144 PROCLOG also reports on the logical PUs' utilizations, again, using the z/VM time buckets, but combined a little differently.

The "Emul" time is time spent running guests, aka bucket 1 above.
The "User" time is the sum of bucket 1 and bucket 2, in other words, time either used directly by guests or directly chargeable to them as Control Program overhead they caused.
"Syst" time is bucket 3: nonchargeable Control Program overhead.
"Total" time is the sum of "User" and "Syst".

Each of these ranges up to 100%, again, "100%" meaning "one entire physical engine's worth".
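Expressed in terms of the z/VM buckets, the four PROCLOG columns might be computed as in the following Python sketch. The bucket values are hypothetical, in seconds, and the function name is ours.

```python
def proclog_columns(b1, b2, b3, elapsed_seconds):
    """Derive the PROCLOG percentages for one logical PU from buckets 1 to 3."""
    emul = 100.0 * b1 / elapsed_seconds         # guest time only
    user = 100.0 * (b1 + b2) / elapsed_seconds  # guest time plus chargeable CP time
    syst = 100.0 * b3 / elapsed_seconds         # nonchargeable CP time
    return {"Total": user + syst, "User": user, "Syst": syst, "Emul": emul}


# Hypothetical buckets for a 60-second interval.
columns = proclog_columns(b1=12.0, b2=0.6, b3=0.9, elapsed_seconds=60.0)
for name, value in columns.items():
    print(f"{name:5s} {value:5.1f}")
```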

A strength of FCX144 PROCLOG is that it reports on only the collecting partition. This makes it somewhat smaller than FCX126 LPAR. A weakness of PROCLOG is that it doesn't present PR/SM's counters alongside z/VM's counters, so some of the utilization picture is missing.

Here is a small excerpt from FCX144 PROCLOG.

1FCX144  Run 2010/02/02 15:02:13         PROCLOG
                                         Processor Activity, by Time
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of PM01664 Run
____________________________________________________________________

                  <--- Percent Busy ---->
           C
Interval   P
End Time   U Type Total  User  Syst  Emul
>>Mean>>   0 IFL   23.5  21.7   1.8  20.9
>>Mean>>   1 IFL   23.7  22.2   1.5  21.4
>>Mean>>   2 IFL   23.2  21.8   1.5  21.0
>>Mean>>   3 IFL   23.0  21.5   1.5  20.8
>>Mean>>   4 IFL   22.8  21.3   1.4  20.6
>>Mean>>   5 IFL   22.7  21.2   1.4  20.5
>>Mean>>   6 IFL   22.5  21.1   1.5  20.3
>>Mean>>   7 IFL   22.4  21.0   1.4  20.2
>>Mean>>   8 IFL   22.2  20.8   1.4  20.1
>>Mean>>   9 IFL   22.3  20.8   1.5  20.1
>>Mean>>  10 IFL   22.2  20.8   1.4  20.0
>>Mean>>  11 IFL   22.2  20.8   1.4  20.1
>>Mean>>  12 IFL   22.8  21.3   1.5  20.6
>>Mean>>  13 IFL   18.1  16.5   1.6  15.8
>>Mean>>  14 IFL   18.8  17.1   1.7  16.4
>>Mean>>  15 IFL   19.2  17.6   1.6  16.8
>>Mean>>  16 IFL   19.6  18.0   1.7  17.2
>>Mean>>  17 IFL   20.0  18.3   1.7  17.5
>>Mean>>  18 IFL   20.4  18.7   1.8  17.8
>>Mean>>  19 IFL   20.4  18.6   1.9  17.7

FCX304 PRCLOG strongly resembles FCX144 PROCLOG. Unlike PROCLOG, PRCLOG reports park time.

FCX239 PROCSUM contains one column, "Pct Busy", that reports on the average CPU-busy across all logical PUs in the partition, time-interval by time-interval, again, using z/VM's first three buckets. Again, "100%" would mean "one entire physical engine's worth". The averaging PROCSUM does can be handy, but like FCX202 LPARLOG, it's misleading in mixed-engine environments.

FCX225 SYSSUMLG reports the same average CPU-busy value as FCX239 PROCSUM, also time-interval by time-interval, and with the same strengths and weaknesses. A nice property of SYSSUMLG is that it also includes, interval by interval, a wide assortment of system performance metrics drawn from CPU consumption, I/O, paging, and other interesting behaviors. The variety present in FCX225 SYSSUMLG makes it a good first stop for examining a system's basic performance properties.

FCX100 CPU is similar to FCX144 PROCLOG in that it reports on each logical PU separately. For each logical PU, we see %EMU (bucket 1), %CP (bucket 2), %SYS (bucket 3), %WT (bucket 4), %CPU (sum of %EMU, %CP, and %SYS), and %LOGLD (same as "%Logld" on FCX126 LPAR). The FCX100 CPU values are averages over the time interval of the report, notated in the report's upper-left corner.

FCX101 REDISP (format 1) presents the z/VM time buckets a little differently. The thing to remember here is that for FCX101 REDISP, the columns are summations over all of the logical PUs in the partition. So, the "CPU" column is the grand total of the first three z/VM buckets, across all logical PUs, over the time interval of the report, expressed as a percent; again, "100" would mean one physical engine's worth. The %EM column matches up to bucket 1, %CP to bucket 2, %SY to bucket 3, and %WT to bucket 4.

The Notion of Suspend Time

By now you have undoubtedly noticed that FCX126 LPAR and FCX202 LPARLOG contain a column called "%Susp". This column, called "suspend time", deserves special explanation, because it is widely misunderstood.

We can begin to understand %Susp by taking another look at those z/VM time buckets described above and reported in D0 R2. Again, those buckets record, for each logical PU, the amount of time the logical PU spends in the states z/VM itself can see: the three distinct kinds of running, and time spent with a wait PSW loaded, and time spent parked.

Now, here's the thing about those D0 R2 counters. At each interval, the counters might not add up to 100%. In other words, time appears to be missing. Why? Here's why. z/VM measures those values using a System z facility called the "CPU timer". Here's the thing about a logical PU's CPU timer: it doesn't advance when the logical PU is not dispatched. So, the logical PU's undispatched time isn't recorded in any of those z/VM time buckets.

So, we arrive at the definition of %Susp, or "suspend time". %Susp is just 100% minus what the D0 R2 z/VM time buckets account for. That's it. No CP Monitor counter directly reports %Susp. Perfkit just calculates %Susp, by starting with 100% and subtracting out what the z/VM time buckets account for.
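As a sketch of that subtraction, assuming all five buckets are expressed in seconds for one logical PU over one interval, the calculation might look like this in Python. The function name and sample values are ours, and Perfkit's exact arithmetic may differ.

```python
def pct_susp(b1, b2, b3, b4, b5, elapsed_seconds):
    """%Susp: the part of the interval that none of the z/VM buckets accounts for."""
    accounted = b1 + b2 + b3 + b4 + b5
    return 100.0 * (elapsed_seconds - accounted) / elapsed_seconds


# Hypothetical interval of 60 seconds in which the buckets account for 59.5 seconds.
print(round(pct_susp(12.0, 0.6, 0.9, 46.0, 0.0, 60.0), 1))
```

In this invented example, half a second of the interval is unaccounted for, so suspend time is a little under one percent.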

To understand the significance of %Susp, we need to examine the reasons why PR/SM would decline to dispatch a logical PU. What might those reasons be?

The logical PU was ready to run, but PR/SM couldn't find a place to run it, because all physical PUs of the matching type were busy. In other words, you might need a bigger CEC.
The logical PU did something that induced PR/SM overhead, and PR/SM responded, running its own instructions and charging them to the logical PU. In other words, %Ovhd is accruing.
The logical PU invoked some PR/SM function that resulted in the logical PU becoming temporarily undispatchable. For example, the logical PU issued a Diag x'44' to give up its PR/SM time slice. The z/VM spin lock manager does this sometimes. When this is happening in excess, you are going to see large values in FCX265 LOCKLOG.
The logical PU is running in a capped partition, and guess what, the partition has used its entitlement. Sometime soon the logical PU will get to run, but right now PR/SM has applied the brakes. When this is happening, you will see in FCX126 LPAR that the partition is capped, and via some quick manual arithmetic, you'll see that the partition has used its entitlement.

The misunderstanding of %Susp usually happens because someone forgets there are so many reasons why it can accrue. The biggest misconception is that %Susp should be able to accrue only if the corresponding physical PU pool is entirely utilized. This is patently false. The most common cause of %Susp is that the logical PU itself does something which makes it temporarily undispatchable, such as issuing a Diag x'44' or x'9C'. When a logical PU uses these diagnoses to excess, %Susp will start to show up. FCX239 PROCSUM reports on CP's use of Diag x'9C'. Modern z/VM systems almost never issue Diag x'44' to PR/SM.

Finally, a historical note. VM performance reporting products of ages past used to call our "suspend time" by another term, "involuntary wait", and used column headings such as "%IW" to denote it. Some readers might find this old term illuminates their understanding of what's occurring. For that reason alone, we mention it here.

The Whole Picture

At this point we've explained everything about all five interesting columns on FCX126 LPAR and FCX202 LPARLOG:

%Busy and %Ovhd, which come from PR/SM's timers;
%VMld and %Logld, which come from z/VM's timers;
%Susp, which is a value synthesized by Perfkit.

We've also explained the PROCLOG, PROCSUM, SYSSUMLG, CPU, and REDISP views of CPU consumption, and how Perfkit calculates them from z/VM's D0 R2 timers.

This pretty much wraps up the discussion of how Perfkit expresses CPU utilization, for partitions and for whole CECs.

Remember, with the exception of %Logld and %LOGLD, all of the columns we've discussed have the standard that "100%" means "an entire physical engine's worth".

The Notion of T/V Ratio

When people discuss the notion of "z/VM overhead", one of the ways they do it is to talk about a computed health metric called "system total-to-virtual ratio", or "system T/V ratio". This number is the ratio of total CPU time used to virtual time used, as z/VM CP sees things. Another way to think of this is by returning to those z/VM D0 R2 time buckets we mentioned earlier. System T/V ratio is just the total run cycles z/VM accounts, divided by guest run cycles z/VM accounts, or formulaically, (b1+b2+b3) / b1.

When system T/V is 1.0, it means all of the z/VM-accounted time was spent running guests. This is the perfect world: no cycles spent running the Control Program.

When system T/V is 2.0, it means for every second spent actually running guests, the z/VM Control Program spent one second running itself. Such a high T/V would usually be a sign of distress and would be a call to do some diagnosis and tuning.

System T/V ratio is very dependent on workload characteristics. Typical values for system T/V are in the 1.0 to 1.2 range.
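For readers who like to see the arithmetic, here is a minimal Python sketch of the system T/V formula, using the bucket names b1, b2, and b3 from earlier. The sample values are hypothetical and the function name is ours.

```python
def system_tv_ratio(b1, b2, b3):
    """System T/V: total z/VM-accounted run time divided by guest run time."""
    if b1 <= 0:
        raise ValueError("T/V is undefined when no guest time was accounted")
    return (b1 + b2 + b3) / b1


print(system_tv_ratio(10.0, 0.0, 0.0))  # every cycle went to guests
print(system_tv_ratio(10.0, 6.0, 4.0))  # one second of CP per second of guest
```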

The FCX225 SYSSUMLG and FCX239 PROCSUM reports both have a "T/V" column, calculated exactly as described here.

Guest CPU Utilization

To round out this article, we should talk about where Perfkit reports on guest CPU utilization, and what the reported numbers mean.

FCX112 USER is very useful in understanding guest CPU utilization. Four of its columns are particularly interesting. The columns apply to the time interval of the report, as notated in the report's upper-left corner. The columns are:

TCPU: the total CPU-seconds either consumed directly by the guest's virtual PUs or charged to the guest's virtual PUs by the Control Program as billable overhead;
VCPU: the total CPU-seconds consumed directly by the guest's virtual PUs;
%CPU: TCPU divided by the time interval of the report. Here again, "100%" means "one physical engine's worth";
T/V: user T/V ratio, aka TCPU divided by VCPU.

Here is a small excerpt from FCX112 USER.

1FCX112  Run 2010/02/02 15:02:13         USER
                                         General User Resource Utilization
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of nnnnnnnn Run
__________________________________________________________________________

         . . _____ . .
         <----- CPU Load ----->
                <-Seconds->   T/V
Userid    %CPU   TCPU   VCPU Ratio
>>Mean>>  13.4  473.1  455.1  1.04
xxxxxxxx  91.4   3237   3025  1.07
xxxxxxxx  67.2   2380   2235  1.06
xxxxxxxx  47.7   1687   1640  1.03
xxxxxxxx  39.2   1388   1374  1.01
xxxxxxxx  39.1   1383   1355  1.02
xxxxxxxx  30.0   1063   1051  1.01
xxxxxxxx  16.6  586.0  562.3  1.04
xxxxxxxx  14.9  527.0  509.5  1.03
xxxxxxxx  12.5  440.6  433.1  1.02
xxxxxxxx  10.9  384.1  379.2  1.01
xxxxxxxx  9.81  347.4  327.8  1.06
xxxxxxxx  6.61  234.1  231.8  1.01
xxxxxxxx  6.12  216.8  215.4  1.01
xxxxxxxx  4.69  166.0  163.3  1.02
xxxxxxxx  2.68  94.87  92.35  1.03
xxxxxxxx  1.58  56.08  55.83  1.00
xxxxxxxx   .04  1.249  1.212  1.03
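The %CPU and T/V columns can be reproduced from TCPU, VCPU, and the report interval. The Python sketch below does so for the busiest guest in the excerpt, using the 3540-second interval from the report header. Because the report shows TCPU and VCPU already rounded, recomputed values can differ slightly in the last digit for other rows. The function name is ours.

```python
def user_cpu_metrics(tcpu_seconds, vcpu_seconds, interval_seconds):
    """Return (%CPU, user T/V ratio) for one guest over the report interval."""
    pct_cpu = 100.0 * tcpu_seconds / interval_seconds
    tv_ratio = tcpu_seconds / vcpu_seconds
    return pct_cpu, tv_ratio


# First guest row of the FCX112 USER excerpt: TCPU 3237, VCPU 3025, 3540 seconds.
pct_cpu, tv_ratio = user_cpu_metrics(3237, 3025, 3540)
print(round(pct_cpu, 1), round(tv_ratio, 2))  # 91.4 1.07, as the report shows
```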

One nice property of the FCX112 USER %CPU column is that one can use it in conjunction with the FCX226 UCONF user configuration report to determine whether a guest has too many virtual CPUs. If a guest's FCX112 USER %CPU value is, say, 125% but FCX226 UCONF reports the guest is a virtual six-way, one might want to reconsider whether the guest might operate better as a virtual two-way. If the guest is Linux, another option might be to consider activating the cpuplugd daemon. This daemon, which runs as a service process in Linux, turns off some of the guest's virtual CPUs when guest computing load seems too low for the virtual configuration. Spreading a guest's workload across too many virtual CPUs is a common configuration error which usually results in the guest inducing too much Control Program overhead.
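The comparison between %CPU and the virtual configuration reduces to simple arithmetic, sketched below in Python. The function is our own illustration, not a Perfkit facility. It gives only the smallest number of virtual CPUs that could carry the observed average load, so treat the result as a starting point for investigation rather than a sizing recommendation.

```python
import math


def minimum_virtual_cpus(pct_cpu):
    """Smallest virtual CPU count whose capacity covers the observed %CPU,
    using the convention that 100% is one physical engine's worth."""
    return max(1, math.ceil(pct_cpu / 100.0))


# The example from the text: a virtual six-way using 125%.
defined_virtual_cpus = 6
needed = minimum_virtual_cpus(125.0)
print(f"defined {defined_virtual_cpus}, load fits in {needed}")
```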

FCX162 USERLOG is another place where we can find guest CPU utilization numbers. The first few columns match up to the FCX112 USER report and have the same definitions. But unlike FCX112 USER, FCX162 USERLOG tabulates one user's use of CPU over time. Here's an example.

1FCX162  Run 2010/02/02 15:02:13         USERLOG xxxxxxxx
                                         User Resource Consumption Log
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of nnnnnnnn Run
_____________________________________________________________________

Resource Usage Log for User xxxxxxxx

         <----- CPU Load ----->
Interval        <-Seconds->   T/V
End Time  %CPU   TCPU   VCPU Ratio
>>Mean>>  91.4  54.86  51.27  1.07
14:27:05  67.2  40.31  38.95  1.03
14:28:05  69.1  41.47  40.06  1.04
14:29:05  72.1  43.27  41.87  1.03
14:30:05  74.3  44.56  43.14  1.03
14:31:05  69.4  41.61  40.21  1.03
14:32:05  68.1  40.87  39.50  1.03
14:33:05  64.9  38.96  37.66  1.03
14:34:05  66.5  39.92  38.65  1.03
14:35:05  69.6  41.78  40.46  1.03
14:36:05  70.5  42.31  40.93  1.03
14:37:05  67.4  40.44  39.09  1.03
14:38:05  65.4  39.24  37.91  1.04
14:39:05  68.3  40.97  39.63  1.03

Be aware that no Perfkit report breaks down guest CPU time by virtual CPU. The fields are there in CP Monitor but Perfkit does not report on them. However, if you are running Perfkit on a live system, one of the USER drill-down screens does report on each virtual PU separately. Perfkit creates this drill-down screen by mining timer values right out of CP control blocks, instead of by looking at CP Monitor records.

The counters that feed FCX112 USER and FCX162 USERLOG come out in various records in CP Monitor domain 4, otherwise known as the User Domain.

CPU Utilizations from Places Other Than Perfkit

So far we have devoted our entire article to discussing CPU utilization in the context of MONWRITE data and Perfkit reports.

The outputs of certain CP commands also offer us some insight about CPU consumption on the running system. These command outputs are useful for informal, ad-hoc assessments of CPU consumption at the moment. For capacity planning, benchmarking, routine health assessments, or system monitoring, always use CP Monitor data and Perfkit or equivalent.

The command CP INDICATE LOAD shows percent utilizations by logical PU. The values are not load-right-now, nor are they average-load-since-IPL. They are smoothed averages computed from data collected at intervals over the recent past, with very recent samples weighted more heavily than older ones. Each value is based on processor time used and voluntary wait time, so it is skewed when z/VM runs in a partition that uses shared processors. When there is contention for physical processors, the reported value can be higher than the processor time actually used, because involuntary wait is not included. Capped partitions can also inflate these utilizations.
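To see why leaving out involuntary wait inflates the number, consider the following simplified Python model. It is only an illustration of the skew: we assume the reported value is processor time used divided by the sum of time used and voluntary wait, we ignore the smoothing entirely, and we do not claim this is CP's actual algorithm. All names and sample values are ours.

```python
def utilization_views(used, voluntary_wait, involuntary_wait):
    """Compare a utilization that ignores involuntary wait with one that does not.

    All arguments are seconds within the same wall-clock interval.
    """
    # Simplified model of a value built only from time used and voluntary wait.
    seen_without_involuntary = 100.0 * used / (used + voluntary_wait)
    # Utilization measured against the whole interval.
    wall_clock = used + voluntary_wait + involuntary_wait
    actual = 100.0 * used / wall_clock
    return seen_without_involuntary, actual


# Hypothetical 100-second interval: 30 s used, 30 s voluntary wait, 40 s undispatched.
seen, actual = utilization_views(30.0, 30.0, 40.0)
print(seen, actual)
```

In this invented case the model reports 50% although the logical PU consumed only 30% of a physical engine.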

The command CP QUERY TIME shows total CPU time and virtual CPU time used by the issuing guest, since guest logon.

The commands CP INDICATE USER and CP INDICATE USER EXPANDED show total CPU time and virtual CPU time used by the named guest, again, since guest logon.
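
Because these commands report totals since logon, an ad-hoc utilization figure for a recent period requires two samples and a subtraction. The sketch below assumes the times have already been converted to seconds by hand. The tuple layout, function name, and sample values are hypothetical, and no attempt is made to parse real command output.

```python
# Sketch: derive interval figures from two since-logon samples.
# Each sample is (elapsed_seconds, total_cpu_seconds, virtual_cpu_seconds).

def interval_usage(first, second):
    """Difference two cumulative samples taken at different times."""
    elapsed = second[0] - first[0]
    total = second[1] - first[1]
    virtual = second[2] - first[2]
    return {
        "pct_cpu": 100.0 * total / elapsed,  # percent of one engine
        "cp_seconds": total - virtual,       # time induced in the Control Program
        "tv_ratio": total / virtual,         # total time over virtual time
    }

# Hypothetical samples taken 60 seconds apart.
earlier = (600.0, 120.0, 100.0)
later = (660.0, 150.0, 125.0)
usage = interval_usage(earlier, later)
print(usage["pct_cpu"])     # 50.0
print(usage["cp_seconds"])  # 5.0
print(usage["tv_ratio"])    # 1.2
```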

Summary

In this article we've described how both PR/SM and z/VM keep accumulators that track CPU consumption. All of these accumulators come out in CP Monitor. Perfkit reports on all of them.

PR/SM's accumulators describe time used by logical PUs, time logical PUs induce in PR/SM, and time PR/SM uses for itself. Utilization percentages calculated from these three accumulators come out in the FCX126 LPAR and FCX202 LPARLOG reports.

z/VM's accumulators describe, for each logical PU, time used by guests, time induced in the Control Program, time spent by the Control Program for its own purposes, time spent waiting, and time spent parked. Utilization percentages calculated from these accumulators find their way into many Perfkit reports, notable examples being FCX126 LPAR and FCX225 SYSSUMLG.

Guests' individual CPU utilization numbers come out in a few places, notably FCX112 USER and FCX162 USERLOG. These two reports comment on two phenomena: the individual guest's own consumption, and the consumption the guest's actions induced in the Control Program.

Almost all the time, all of these percentages are of a whole physical processor, and "100%" means "a whole physical engine's worth". The exceptions to this are FCX126 LPAR %Logld and FCX100 CPU %LOGLD, described earlier.

T/V ratio is a useful metric of overhead. FCX225 SYSSUMLG and FCX239 PROCSUM both report system T/V ratio. Individual users' T/V ratios appear on FCX112 USER and FCX162 USERLOG.

Summary Table

This table summarizes the various notions of CPU utilization, on what basis they are accounted, the CP Monitor records that contain the counters, and the Perfkit report fields that relate to them.

Things PR/SM Sees and Counts
Notion or Kind of CPU Time	Accounted per...	CP Monitor records	Relevant Perfkit Report Columns
Nonchargeable PR/SM overhead	physical PU	D0 R17 MRSYTCUM
PR/SM overhead induced by and charged to a specific logical PU	logical PU	D0 R16 MRSYTCUP	FCX126 LPAR %Ovhd; FCX202 LPARLOG %Ovhd
PR/SM's view of a logical PU's own activity	logical PU	D0 R16 MRSYTCUP	FCX126 LPAR (%Busy - %Ovhd); FCX202 LPARLOG (%Busy - %Ovhd)
PR/SM's view of a logical PU's chargeable activity, both its own and the overhead charged to it	logical PU	D0 R16 MRSYTCUP	FCX126 LPAR %Busy; FCX202 LPARLOG %Busy
PR/SM's view of how busy a pool of physical processors is altogether, by physical PU type	logical PU; physical PU	D0 R16 MRSYTCUP; D0 R17 MRSYTCUM

Things z/VM Sees and Counts, Logical PUs
Notion or Kind of CPU Time	Accounted per...	CP Monitor records	Relevant Perfkit Report Columns
z/VM's view of its use of a logical PU to run guests	logical PU	D0 R2 MRSYTPRP	FCX144 PROCLOG Emul; FCX100 CPU %EMU; FCX101 REDISP %EM
z/VM's view of its use of a logical PU for chargeable Control Program overhead	logical PU	D0 R2 MRSYTPRP	FCX144 PROCLOG (User - Emul); FCX100 CPU %CP; FCX101 REDISP %CP
z/VM's view of its use of a logical PU for nonchargeable Control Program overhead	logical PU	D0 R2 MRSYTPRP	FCX144 PROCLOG Syst; FCX100 CPU %SYS; FCX101 REDISP %SY
z/VM's view of its use of a logical PU, overall	logical PU	D0 R2 MRSYTPRP	FCX126 LPAR %VMld; FCX202 LPARLOG %VMld; FCX144 PROCLOG Total; FCX100 CPU %CPU; FCX101 REDISP CPU; FCX239 PROCSUM Pct Busy; FCX225 SYSSUMLG Pct Busy
z/VM's view of its use of wait PSWs	logical PU	D0 R2 MRSYTPRP	FCX100 CPU %WT; FCX101 REDISP %WT
Time not accounted for by any of z/VM's counters	logical PU		FCX126 LPAR %Susp; FCX202 LPARLOG %Susp
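
Two of the quantities in these tables do not appear as a single report column and must be derived by subtraction. A minimal sketch follows. The input percentages are hypothetical, and the function names are illustrative only.

```python
# Sketch: derive the two subtracted quantities named in the tables.

def lpar_own_activity(busy_pct, ovhd_pct):
    """PR/SM's view of a logical PU's own activity: %Busy - %Ovhd."""
    return busy_pct - ovhd_pct

def chargeable_cp_overhead(user_pct, emul_pct):
    """z/VM's chargeable Control Program overhead: User - Emul."""
    return user_pct - emul_pct

# Hypothetical values for one logical PU.
print(lpar_own_activity(85.0, 1.5))        # 83.5
print(chargeable_cp_overhead(70.0, 62.0))  # 8.0
```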

Things z/VM Sees and Counts, Virtual PUs
Notion or Kind of CPU Time	Accounted per...	CP Monitor records	Relevant Perfkit Report Columns
Time a guest's virtual PU spends doing its own work	virtual PU	D4 R2 MRUSELOF; D4 R3 MRUSEACT; D4 R9 MRUSEATE	FCX112 USER VCPU; FCX162 USERLOG VCPU; perhaps others
Time a guest's virtual PU induces in the Control Program	virtual PU	same	FCX112 USER (TCPU - VCPU); FCX162 USERLOG (TCPU - VCPU); perhaps others

Informal CPU Utilization Tools
Notion or Kind of CPU Time	Accounted per...	CP Monitor records	Relevant CP Commands
z/VM's view of its use of its logical PUs, smoothed and averaged over the recent past	logical PU		CP INDICATE LOAD
z/VM's view of the CPU time used or induced by a guest since it logged on	guest		CP QUERY TIME; CP INDICATE USER; CP INDICATE USER EXPANDED