CPU Utilization in an SMT World

(Last revised: 2021-01-20, BKW)

(Note: before you read this, it would be a good idea for you to read the SMT vocabulary article first. Ed.)

In a recent PMR a customer using z/VM in SMT-2 mode asked whether his machine was "fully used". He asked because he had difficulty understanding the differences among the utilization values printed on the various Perfkit reports. On this web page I am publishing the response text we wrote, in case others might find it useful.

Before I show you the response text, I need to explain a few things.

To understand the notion of "fully used" in an SMT environment, the reader first needs to understand a few concepts related to SMT and how the PR/SM dispatcher accomplishes the dispatching of an LPAR when the underlying hardware is capable of SMT. Here is a brief description.

On a z Systems CPC that is capable of SMT operation -- in other words, the z13 family or higher -- all of the following things are true:

What in the old days we used to call "physical processors" are now called physical cores. Each physical core is equipped to run two instruction streams concurrently. The hardware that lets the core run an instruction stream is called a processor. Thus we say each physical core, no matter what type (CP, IFL, zIIP, etc.), has two processors.
On the SE or HMC, when we define an LPAR and set up the LPAR's activation profile, we do some typing and mouse-clicking on a configuration tab labelled "Processors". On this tab, even though the tab is labelled "Processors", what we are really doing is telling the PR/SM hypervisor how many logical cores the LPAR should have. For example, if we specify that the LPAR should have "two CPs and three IFLs", what we are really specifying is that the LPAR should have two logical CP cores and three logical IFL cores.
In the old days we used to say that PR/SM dispatches an LPAR's logical processors on the physical processors of the machine. In an SMT world that's no longer true. In a machine capable of SMT operation, PR/SM dispatches an LPAR's logical cores on the physical cores of the machine.
When z/VM is IPLed in an LPAR, according to a setting specified in the z/VM system configuration file, the z/VM Control Program decides whether to opt in for SMT. The following things are true:
1. If z/VM does not opt in for SMT, strictly speaking the LPAR does not have logical cores. Rather it has only logical processors. But for purposes of discussing dispatch and accrual of dispatch time, we can think of those logical processors as being logical cores, each one housing one logical processor.
2. If z/VM opts in for SMT at level SMT-1, each logical core of the LPAR contains exactly one logical processor.
3. If z/VM opts in for SMT at level SMT-2, each logical core of type IFL contains two logical processors, and each logical core of any other type contains only one logical processor.
Starting with z/VM 6.4 the Control Program supports a capability called Dynamic SMT. With Dynamic SMT, z/VM IPLs in one of the SMT modes, either SMT-1 (one logical processor per logical IFL core) or SMT-2 (two logical processors per logical IFL core). A CP command gives the administrator the ability to switch the system between SMT-1 mode and SMT-2 mode without an IPL. As far as the logical IFL cores are concerned, what this switching effectively does is add or remove a logical processor to or from the logical IFL core.
Now I can give you some more information about how logical core dispatch works in PR/SM:
1. Remember, PR/SM always dispatches logical cores onto physical cores.
2. If the logical core contains only one logical processor, the sole logical processor gets dispatched on one processor of the physical core while the other processor of the physical core goes unused.
3. If the logical core contains two logical processors, the two logical processors of the logical core get dispatched on the two processors of the physical core.
From the previous point it should be evident that because PR/SM is dispatching logical cores onto physical cores, it will never be true that dispatchable work from two different LPARs will inhabit the same physical core at the same time. Another way to say this is that at any instant, all of the work happening on a given physical core is related to exactly one LPAR, namely, the LPAR to which the dispatched logical core belongs.
As you can see by now, in an SMT world we can no longer talk about "an IFL", "two CPs", or the like. Rather we must be extremely careful to specify whether we are talking about a logical core or a logical processor. What I often tell people is this: in an SMT world they can no longer use "IFL", "zIIP", etc. as nouns. I now try very hard not to use phrases like "an IFL" or "two CPs" or "three logical zIIPs". Instead I try always to say "logical core" or "logical processor" and use the type-word as an adjective: "a logical IFL core", "a logical IFL processor", or what have you, so there is no confusion or ambiguity.
During the time that an SMT-2 logical core is dispatched on a physical core, the two logical processors of the logical core can move in and out of wait state independently. For example, during that period of logical core dispatch, one of the two logical processors could run 75% busy while the other one runs 35% busy.
The concept of physical core busy time applies to the notion of a physical core being busy running logical cores. For example, if for 38% of some time span a given physical core has a logical core dispatched upon it, we say the physical core is 38% busy. FCX302 PHYSLOG reports physical core busy, rolled up by type-group: how busy are the physical CP cores, how busy are the physical IFL cores, and so on.
The concept of logical core busy time applies to the notion of the logical core moving in and out of core dispatch on a physical core. For example, if for 75% of some time span a given logical core finds itself dispatched on a physical core, we say the logical core is 75% busy. FCX126 LPAR reports logical core busy.
The concept of logical processor busy time applies to the notion of the logical processor being in non-wait-state while in dispatch on real hardware. For example, if for 63% of some time span a given logical processor finds itself in run-state in dispatch on hardware, we say the logical processor is 63% busy. This is just as it always has been, even before the arrival of SMT. FCX100 CPU, FCX144 PROCLOG, and FCX304 PRCLOG report logical processor busy.
As you can see, because two logical processors belonging to the same logical core can move in and out of wait state independently, statistics reported for logical core busy will not necessarily match up to statistics reported for logical processor busy. Logical core busy and logical processor busy are two different phenomena and so there is no reason to expect the numbers to match.
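
The rules above for how many logical processors a logical core contains can be sketched in a few lines of Python. This is only an illustration of the counting rule described in the text; the function names and the dictionary layout are hypothetical and do not correspond to any z/VM or PR/SM interface.

```python
from typing import Dict, Optional


def logical_processors_per_core(core_type: str, smt_level: Optional[int]) -> int:
    """Logical processors in one logical core.

    smt_level is None when z/VM did not opt in for SMT, else 1 or 2.
    Only logical IFL cores at SMT-2 hold two logical processors.
    """
    if smt_level == 2 and core_type.upper() == "IFL":
        return 2
    return 1


def total_logical_processors(logical_cores: Dict[str, int],
                             smt_level: Optional[int]) -> int:
    """Total logical processors for an LPAR, given its logical cores by type."""
    return sum(count * logical_processors_per_core(core_type, smt_level)
               for core_type, count in logical_cores.items())


# The activation-profile example from the text: two logical CP cores
# and three logical IFL cores.
lpar = {"CP": 2, "IFL": 3}
print(total_logical_processors(lpar, 2))     # 2*1 + 3*2 = 8
print(total_logical_processors(lpar, 1))     # 2*1 + 3*1 = 5
print(total_logical_processors(lpar, None))  # no opt-in: 5
```

So the same activation profile yields eight logical processors in SMT-2 mode and five otherwise, which is one reason counts of "processors" on different reports need not agree.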

Here's a picture that expresses the notions of logical core dispatch, logical core busy, and logical processor busy.

Figure 1. Relationship between logical core dispatch and logical processor busy.

Consider this scenario:

1. Logical core 2 contains logical processor 4 and logical processor 5
2. PR/SM dispatches logical core 2 onto a physical core like this:
   a. Start at time 1 and stop at time 4 = 3 units
   b. Start at time 6 and stop at time 8 = 2 units
3. Logical core 2 is therefore dispatched for 5 time units out of 10, so logical core 2 is 50% busy
4. Logical processor 4 runs:
   a. From time 2 to time 3 = 1 unit
   b. From time 7 to time 8 = 1 unit
   This means LCPU 4 runs 2/10 = 20% busy
5. Logical processor 5 runs:
   a. From time 1 to time 4 = 3 units
   b. From time 6 to time 7 = 1 unit
   This means LCPU 5 runs 4/10 = 40% busy

time ->  0 -- 1 -- 2 -- 3 -- 4 -- 5 -- 6 -- 7 -- 8 -- 9 - 10
Core 2        x--------------x         x---------x
LPU 4              x----x                   x----x
LPU 5         x--------------x         x----x

It should be clear from the figure that logical core busy is not the same as logical processor busy. They are two different phenomena, measured by two different observers, and reported at different spots in Perfkit.
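
The arithmetic in the Figure 1 scenario can be reproduced with a short Python sketch. The interval lists are taken directly from the scenario; the helper name busy_percent is hypothetical, and the sketch assumes the intervals for any one entity do not overlap.

```python
def busy_percent(intervals, span):
    """Percent of `span` covered by (start, stop) intervals.

    Assumes the intervals for one entity do not overlap.
    """
    return 100.0 * sum(stop - start for start, stop in intervals) / span


SPAN = 10  # time units observed in Figure 1

core2 = [(1, 4), (6, 8)]  # logical core 2 dispatched on a physical core
lpu4 = [(2, 3), (7, 8)]   # logical processor 4 in run-state
lpu5 = [(1, 4), (6, 7)]   # logical processor 5 in run-state

print(busy_percent(core2, SPAN))  # 50.0, logical core busy
print(busy_percent(lpu4, SPAN))   # 20.0, logical processor busy
print(busy_percent(lpu5, SPAN))   # 40.0, logical processor busy
```

Neither logical processor's figure equals the logical core's figure, and neither does their sum or their average, which is the point of the picture.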

One further note, and then we're done. Some people use z/OS SMF records and a corresponding reporting tool such as IBM's RMF, MXG, or MICS to observe utilization of a z/VM LPAR. Data captured in SMF records is core-centric, so the reports generated by the corresponding tools will be core-oriented. If you are comparing those reports to Perfkit reports, on the Perfkit side you will need to use core-oriented reports, such as FCX302 PHYSLOG and FCX126 LPAR.

And now, the response.

You asked whether your machine was "fully used". Well, that depends upon what you mean by "fully used". Let me explain.

In the FCX302 PHYSLOG report, those percent-busys are physical core busy from the point of view of PR/SM. If a machine has eight physical cores of type IFL, and if FCX302 PHYSLOG reports those eight physical IFL cores to be 800% busy, PR/SM has absolutely no room ever to do any additional dispatches of logical cores onto physical cores. So from PR/SM's point of view the physical cores are "fully used".

In the FCX126 LPAR report, those percent-busys are logical core dispatch busy from the point of view of PR/SM. If an LPAR's logical core shows up as 100% busy, it means the logical core was dispatched on a physical core 100% of the time. So from the point of view of PR/SM, it could not dispatch the logical core any more often. In that sense the logical core is "fully used".

Also a physical phenomenon, but not reported in the FCX126 LPAR report, is the notion of the logical core being ready to be core-dispatched. When core-dispatched plus core-ready-for-dispatch adds to 100%, the logical core is "fully used" even though it might be actually running on a physical core somewhat less than 100% of the time. In such a case FCX126 LPAR will show a logical-core-busy value of somewhat under 100%. You can think of this gap as "core suspend time", that is, time the logical core wanted to be dispatched on a physical core but PR/SM didn't dispatch it. PR/SM doesn't report this gap to us so we can't report it in Perfkit. But as you will see in a moment, we can see suspend time in another way, so no hope is lost here.

On each logical processor z/VM itself keeps track of the time it spends in these five states: running a guest (g), running in CP induced by a guest (cCP), running in CP not induced by a guest (ncCP), wait-PSW (w), and parked (p). You might think the quantity (g + cCP + ncCP + w + p) should add to 100%. Sometimes it doesn't. The missing time is called "logical processor suspend time", s. Suspend time is the time the logical processor wanted its logical core to be dispatched on a physical core but PR/SM didn't oblige. FCX304 PRCLOG reports logical processor suspend time.

FCX100 CPU, FCX144 PROCLOG, and FCX304 PRCLOG report logical-processor-busy by calculating 100*(g+cCP+ncCP)/e, where e is elapsed time. Certainly when we see a percent-busy value of 100% we can claim the logical processor is "fully used". But owing to the concept of suspend time, the logical processor might be "fully used" -- that is, have no more room to dispatch virtual CPUs -- even when logical-processor percent-busy reports at somewhat under 100%. Thus the best indicator of whether a logical processor is fully used is that w=0 and p=0. FCX304 PRCLOG directly reports g (Emul), ncCP (Syst), p (Park), and s (Susp). Further, it indirectly reports cCP, because User = g + cCP. Unfortunately, it does not report w, but we can calculate w = (100 - User - Syst - Park - Susp).
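
As a rough illustration, here is a minimal Python sketch of the derivations just described for one logical processor's row of FCX304 PRCLOG. It assumes the columns are percentages of elapsed time, so that percent-busy is simply User + Syst. The class name, field names, tolerance, and sample values are hypothetical; only the formulas come from the text above.

```python
from dataclasses import dataclass


@dataclass
class PrclogRow:
    """One logical processor's PRCLOG values, as percent of elapsed time."""
    user: float  # User = g + cCP
    emul: float  # Emul = g
    syst: float  # Syst = ncCP
    park: float  # Park = p
    susp: float  # Susp = s

    @property
    def ccp(self) -> float:
        # cCP is reported indirectly: User = g + cCP
        return self.user - self.emul

    @property
    def wait(self) -> float:
        # w = 100 - User - Syst - Park - Susp
        return 100.0 - self.user - self.syst - self.park - self.susp

    @property
    def busy(self) -> float:
        # 100*(g + cCP + ncCP)/e, with every term already a percentage
        return self.user + self.syst

    def fully_used(self, tol: float = 0.05) -> bool:
        # "Fully used" per the text: w = 0 and p = 0 (within a tolerance)
        return self.wait <= tol and self.park <= tol


# Hypothetical sample: 85% busy, yet no wait time and no parked time.
row = PrclogRow(user=80.0, emul=70.0, syst=5.0, park=0.0, susp=15.0)
print(row.busy)          # 85.0
print(row.wait)          # 0.0
print(row.fully_used())  # True
```

The sample row shows the case discussed above: percent-busy reads 85%, but the missing 15% is suspend time, so the logical processor has no room left to dispatch virtual CPUs.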

When PR/SM dispatches a logical core onto a physical core, during the time of the core-dispatch the two logical processors of the logical core can move in and out of wait-state independently. When they both run for the entire time of the core dispatch, that would be thread density 2. Some people will say that when TD=2 the logical core is "fully used". When one or more of the logical processors of the logical core loads a wait-state PSW during core dispatch, the core runs somewhat below thread density 2. In that case some would say the logical core is not "fully used".
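
To put a number on this, one can treat thread density as the average number of logical processors running while the logical core is dispatched. That working definition is an assumption made for this sketch, as is the function name; the input values come from the Figure 1 scenario earlier on this page.

```python
def thread_density(core_dispatched_units, lp_run_units):
    """Average number of logical processors running during core dispatch.

    core_dispatched_units: time the logical core was dispatched
    lp_run_units: run time of each logical processor of that core
    """
    if core_dispatched_units <= 0:
        raise ValueError("the logical core was never dispatched")
    return sum(lp_run_units) / core_dispatched_units


# Figure 1: core 2 dispatched for 5 units; LPU 4 ran 2 units, LPU 5 ran 4.
print(thread_density(5, [2, 4]))  # 1.2

# Both logical processors running for the whole dispatch gives TD = 2.
print(thread_density(5, [5, 5]))  # 2.0
```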

You asked about capacity planning.

FCX302 PHYSLOG shows us physical utilization of physical cores. If FCX302 PHYSLOG says the physical cores are full, they can't handle any more logical core dispatches.
For our LPAR, if we use FCX126 LPAR and see that our logical cores are 100% busy, they can't be dispatched onto physical cores any more frequently than they are right now.
For our LPAR, if we use FCX304 PRCLOG to calculate w:
1. If we see that w=0 and p=0, the logical processor is fully used.
2. If we see w>0, z/VM found time to load a wait-state PSW, so maybe the logical processor is not quite full.
For our LPAR, if in FCX304 PRCLOG we see p>0, that's an indicator z/VM is projecting there would be no physical power to run the logical processor; when this happens we would expect also to see high or highly fluctuating values in FCX302 PHYSLOG.
For our LPAR, if in FCX304 PRCLOG we see s>>0, our LPAR's logical cores are not making it to physical cores and that's worth investigating.
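
The checklist above could be mechanized along the following lines. This is a sketch only: the function name, the argument names, the tolerance, and especially the threshold standing in for "s>>0" are hypothetical choices, not values Perfkit or z/VM define. All inputs are percentages read or derived from the reports named in the list.

```python
def capacity_findings(phys_busy_pct, phys_cores, core_busy_pct,
                      wait, park, susp, tol=0.5, susp_threshold=5.0):
    """Return the checklist items that apply.

    phys_busy_pct : FCX302 PHYSLOG busy for one core type (800 = 8 cores full)
    phys_cores    : number of physical cores of that type
    core_busy_pct : FCX126 LPAR busy for one logical core
    wait, park, susp : w, p, s for one logical processor (FCX304 PRCLOG)
    """
    findings = []
    if phys_busy_pct >= 100.0 * phys_cores - tol:
        findings.append("physical cores full")
    if core_busy_pct >= 100.0 - tol:
        findings.append("logical core fully dispatched")
    if wait <= tol and park <= tol:
        findings.append("logical processor fully used")
    if park > tol:
        findings.append("parked time present")
    if susp >= susp_threshold:
        findings.append("suspend time worth investigating")
    return findings


# Hypothetical saturated system.
print(capacity_findings(800.0, 8, 100.0, wait=0.0, park=0.0, susp=0.0))
# Hypothetical system with headroom but noticeable suspend time.
print(capacity_findings(300.0, 8, 60.0, wait=20.0, park=0.0, susp=12.0))
```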

Finally, you might find this article helpful.

Guest CPU Utilization

As of SMT, monitor record D4 R3 MRUSEACT and friends now report three different kinds of CPU time for guests. The three kinds of CPU time are:

Raw time: When the virtual CPU spends 1 second running, raw time advances by 1 second.
MT-1 equivalent time: When the virtual CPU spends 1 second running, MT-1 equivalent time advances by the amount of CPU time CP and the hardware estimate the virtual CPU would have accrued if the virtual CPU had spent that whole 1 second running on a core by itself. The estimate tries to compensate for the interference we estimate the virtual CPU might have experienced when it sometimes occupied an SMT-2 core concurrently with some other virtual CPU. In other words, 1 second of raw time will generally yield less than 1 second of MT-1 equivalent time. MT-1 equivalent time is probably best used for billing or chargeback.
Prorated core time: When the virtual CPU spends 1 second running alone on a core, the virtual CPU accrues 1 second of prorated core time. When the virtual CPU spends 1 second running on a core alongside another virtual CPU, it accrues 1/2 second of prorated core time. Prorated core time is used for enforcing CPU pooling, now called resource pools.
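
The difference between raw time and prorated core time can be illustrated with a small Python sketch. It implements only the two rules stated above (1 second alone on a core accrues 1 second, 1 second alongside another virtual CPU accrues 1/2 second). MT-1 equivalent time is left out because it is an estimate produced by CP and the hardware and cannot be derived from these inputs. The function name and sample values are hypothetical.

```python
def guest_cpu_times(alone_seconds, shared_seconds):
    """Return (raw_time, prorated_core_time) for one virtual CPU.

    alone_seconds : time spent running alone on a core
    shared_seconds: time spent running on a core alongside another virtual CPU
    """
    raw = alone_seconds + shared_seconds
    prorated = alone_seconds + shared_seconds / 2
    return raw, prorated


# One second alone plus one second sharing a core.
print(guest_cpu_times(1.0, 1.0))  # (2.0, 1.5)
```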

All three kinds of time are reported in two components, total time and pure guest time. In other words, the notions of VMDTTIME and VMDVTIME have been extended to MT-1 equivalent time and prorated core time.

For more information, read the comments in the D4 R3 MRUSEACT monitor record.

Performance Toolkit Considerations

The classic Perfkit utilization reports such as FCX100 CPU, FCX304 PRCLOG, and FCX126 LPAR tabulate raw time, either core time or processor time, as noted above.

The Perfkit CPU pooling activity report FCX309 CPLACT reports either raw time or prorated core time, depending upon the level of z/VM that produced the data. For data from z/VM 6.4 or later it's always prorated core time.

With APAR VM66215 Perfkit offers reports that tabulate raw time, MT-1 equivalent time, and prorated core time alongside one another. There are two new reports:

FCX333 USRPRCTM, an activity report, shows the different kinds of time for the different users.
FCX334 USRTMLOG, a log report, shows the different kinds of time for a single user, as a function of time.

For more information about Perfkit and VM66215, read here.