CPU Utilization in an SMT World
(Last revised: 2020-02-05, BKW)
(Note: it would be a good idea to read the SMT vocabulary article before you read this one. Ed.)
In a recent PMR a customer using z/VM in SMT-2 mode asked whether his machine was "fully used". He asked because he had difficulty reconciling the different utilization values printed on the various Perfkit reports. On this web page I am publishing the response text we wrote, in case others might find it useful.
Before I show you the response text, I need to explain a few things.
To understand the notion of "fully used" in an SMT environment, the reader first needs to understand a few concepts related to SMT and how the PR/SM dispatcher accomplishes the dispatching of an LPAR when the underlying hardware is capable of SMT. Here is a brief description.
On a z Systems CPC that is capable of SMT operation -- in other words, a z13 or z13s -- all of the following things are true:
- What in the old days we used to call "physical CPUs" are now called physical cores. Each physical core is equipped to run two instruction streams concurrently. The hardware that lets the core run an instruction stream is called a thread. Thus we say each physical core, no matter what type (CP, IFL, zIIP, etc.), has two threads.
- On the SE or HMC, when we define an LPAR and set up the LPAR's activation profile, we do some typing and mouse-clicking on a configuration tab labelled "Processors". On this tab, even though the tab is labelled "Processors", what we are really doing is telling the PR/SM hypervisor how many logical cores the LPAR should have. For example, if we specify that the LPAR should have "two CPs and three IFLs", what we are really specifying is that the LPAR should have two logical CP cores and three logical IFL cores.
- In the old days we used to say that PR/SM dispatches an LPAR's logical CPUs on the physical CPUs of the machine. In an SMT world that's no longer true. In a machine capable of SMT operation, PR/SM dispatches an LPAR's logical cores on the physical cores of the machine.
When z/VM is IPLed in an LPAR, the z/VM Control Program decides whether to opt in for SMT, according to a setting specified in the z/VM system configuration file. The following two things are true:
- If z/VM does not opt in for SMT, or if z/VM opts in for SMT at level SMT-1, each logical core of the LPAR contains exactly one logical CPU.
- If z/VM opts in for SMT at level SMT-2, each logical core of type IFL contains two logical CPUs, and each logical core of any other type contains only one logical CPU.
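The two rules above can be sketched as a small lookup. This is only an illustration in Python; the function names and the string type labels are invented for the example and are not z/VM or PR/SM interfaces.

```python
def logical_cpus_per_core(core_type, smt_level):
    """Return how many logical CPUs one logical core contains.

    core_type: illustrative label such as "CP", "IFL", or "zIIP".
    smt_level: 0 means z/VM did not opt in for SMT; 1 or 2 is the SMT level.
    """
    if smt_level not in (0, 1, 2):
        raise ValueError("smt_level must be 0 (no opt-in), 1, or 2")
    # Only logical IFL cores gain a second logical CPU, and only at SMT-2.
    if smt_level == 2 and core_type == "IFL":
        return 2
    return 1


def logical_cpu_count(cores_by_type, smt_level):
    """Total logical CPUs for an LPAR defined with the given logical cores."""
    return sum(
        count * logical_cpus_per_core(core_type, smt_level)
        for core_type, count in cores_by_type.items()
    )


# The activation-profile example from earlier: two CPs and three IFLs.
lpar = {"CP": 2, "IFL": 3}
print(logical_cpu_count(lpar, smt_level=1))  # 5 logical CPUs
print(logical_cpu_count(lpar, smt_level=2))  # 2*1 + 3*2 = 8 logical CPUs
```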
In z/VM 6.4 the Control Program supports a capability called Dynamic SMT. With Dynamic SMT, z/VM IPLs in one of the SMT modes, either SMT-1 (one logical CPU per logical IFL core) or SMT-2 (two logical CPUs per logical IFL core). A CP command gives the administrator the ability to switch the system between SMT-1 mode and SMT-2 mode without an IPL. As far as the logical IFL cores are concerned, what this switching effectively does is add or remove a logical CPU to or from the logical IFL core.
Now I can give you some more information about how logical core dispatch works in PR/SM:
- Remember, PR/SM always dispatches logical cores onto physical cores.
- If the logical core contains only one logical CPU, the sole logical CPU gets dispatched on one thread of the physical core while the other thread of the physical core goes unused.
- If the logical core contains two logical CPUs, the two logical CPUs of the logical core get dispatched on the two threads of the physical core.
- From the previous point it should be evident that because PR/SM is dispatching logical cores onto physical cores, it will never be true that dispatchable work from two different LPARs will inhabit the same physical core at the same time. Another way to say this is that at any instant, all of the work happening on a given physical core is related to exactly one LPAR, namely, the LPAR to which the dispatched logical core belongs.
- As you can see by now, in an SMT world we can no longer talk about "an IFL", "two CPs", or the like. Rather we must be extremely careful to specify whether we are talking about a logical core or a logical CPU. What I often tell people is this: in an SMT world they can no longer use "IFL", "zIIP", etc. as nouns. I now try very hard not to use phrases like "an IFL" or "two CPs" or "three logical zIIPs". Instead I try always to say "logical core" or "logical CPU" and use the type-word as an adjective: "a logical IFL core", "a logical IFL CPU", or what have you, so there is no confusion or ambiguity.
- During the time that an SMT-2 logical core is dispatched on a physical core, the two logical CPUs of the logical core can move in and out of wait state independently. For example, during that period of logical core dispatch, one of the two logical CPUs could run 75% busy while the other one runs 35% busy.
- The concept of physical core busy time applies to the notion of a physical core being busy running logical cores. For example, if for 38% of some time span a given physical core has a logical core dispatched upon it, we say the physical core is 38% busy. FCX302 PHYSLOG reports physical core busy, rolled up by type-group: how busy are the physical CP cores, how busy are the physical IFL cores, and so on.
- The concept of logical core busy time applies to the notion of the logical core moving in and out of core dispatch on a physical core. For example, if for 75% of some time span a given logical core finds itself dispatched on a physical core, we say the logical core is 75% busy. FCX126 LPAR reports logical core busy.
- The concept of logical CPU busy time applies to the notion of the logical CPU being in non-wait-state while in dispatch on real hardware. For example, if for 63% of some time span a given logical CPU finds itself in run-state in dispatch on hardware, we say the logical CPU is 63% busy. This is just as it always has been, even before the arrival of SMT. FCX100 CPU, FCX144 PROCLOG, and FCX304 PRCLOG report logical CPU busy.
- As you can see, because two logical CPUs belonging to the same logical core can move in and out of wait state independently, statistics reported for logical core busy will not necessarily match up to statistics reported for logical CPU busy. Logical core busy and logical CPU busy are two different phenomena and so there is no reason to expect the numbers to match.
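The dispatch rules in the list above can be modeled in a few lines. The following Python sketch is a toy model, not a description of PR/SM internals; the class names, field names, and LPAR and CPU names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogicalCore:
    lpar: str                 # name of the owning LPAR (illustrative)
    logical_cpus: List[str]   # one or two logical CPU names


@dataclass
class PhysicalCore:
    # Each physical core has exactly two threads; None means the thread is unused.
    threads: List[Optional[str]] = field(default_factory=lambda: [None, None])
    owner_lpar: Optional[str] = None

    def dispatch(self, core: LogicalCore) -> None:
        """Place a whole logical core on this physical core."""
        if self.owner_lpar is not None:
            raise RuntimeError("physical core already has a logical core dispatched")
        if not 1 <= len(core.logical_cpus) <= 2:
            raise ValueError("a logical core contains one or two logical CPUs")
        self.owner_lpar = core.lpar
        # Logical CPUs fill the threads in order; a one-CPU core leaves thread 1 unused.
        for i, cpu in enumerate(core.logical_cpus):
            self.threads[i] = cpu

    def undispatch(self) -> None:
        """Remove the dispatched logical core, freeing both threads."""
        self.threads = [None, None]
        self.owner_lpar = None


pc = PhysicalCore()
pc.dispatch(LogicalCore(lpar="LPAR1", logical_cpus=["cpu00", "cpu01"]))
print(pc.owner_lpar, pc.threads)   # LPAR1 ['cpu00', 'cpu01']

pc.undispatch()
pc.dispatch(LogicalCore(lpar="LPAR2", logical_cpus=["cpu00"]))
print(pc.owner_lpar, pc.threads)   # LPAR2 ['cpu00', None]
```

Because dispatch places a whole logical core and refuses a second one, work from two different LPARs never shares a physical core in this model.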
Here's a picture that expresses the notions of logical core dispatch, logical core busy, and logical CPU busy.
Figure 1. Relationship between logical core dispatch and logical CPU busy.
It should be clear from the figure that logical core busy is not the same as logical CPU busy. They are two different phenomena, measured by two different observers, and reported at different spots in Perfkit.
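To make the distinction concrete, here is a small Python sketch that computes logical core busy and logical CPU busy from a timeline. The intervals are invented for the example and do not come from any real measurement; the helper names are likewise hypothetical.

```python
def total(intervals):
    """Sum the lengths of non-overlapping (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def percent_busy(intervals, elapsed):
    """Percent of the elapsed time covered by the intervals."""
    return 100.0 * total(intervals) / elapsed


elapsed = 100.0  # length of the observation interval, in arbitrary time units

# Times the logical core was dispatched on a physical core (as seen by PR/SM).
core_dispatched = [(0, 40), (60, 95)]          # 75 units in total

# Times each logical CPU was in non-wait state; always inside a dispatch period.
cpu0_running = [(0, 30), (60, 90)]             # 60 units
cpu1_running = [(10, 25), (70, 80)]            # 25 units

print(percent_busy(core_dispatched, elapsed))  # 75.0 -> logical core busy
print(percent_busy(cpu0_running, elapsed))     # 60.0 -> logical CPU 0 busy
print(percent_busy(cpu1_running, elapsed))     # 25.0 -> logical CPU 1 busy
```

The same stretch of time yields three different percent-busy values, one for the core and one for each of its logical CPUs.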
One further note, and then we're done. Some people use z/OS SMF records and a corresponding reporting tool such as IBM's RMF, MXG, or MICS to observe utilization of a z/VM LPAR. Data captured in SMF records is core-centric, so the reports generated by the corresponding tools will be core-oriented. If you are comparing those reports to Perfkit reports, on the Perfkit side you will need to use core-oriented reports, such as FCX302 PHYSLOG and FCX126 LPAR.
And now, the response.
You asked whether your machine was "fully used". Well, that depends upon what you mean by "fully used". Let me explain.
In the FCX302 PHYSLOG report, those percent-busys are physical core busy from the point of view of PR/SM. If a machine has eight physical cores of type IFL, and if FCX302 PHYSLOG reports those eight physical IFL cores to be 800% busy, PR/SM has absolutely no room ever to do any additional dispatches of logical cores onto physical cores. So from PR/SM's point of view the physical cores are "fully used".
In the FCX126 LPAR report, those percent-busys are logical core dispatch busy from the point of view of PR/SM. If an LPAR's logical core shows up as 100% busy, it means the logical core was dispatched on a physical core 100% of the time. So from the point of view of PR/SM, it could not dispatch the logical core any more often. In that sense the logical core is "fully used".
Also a physical phenomenon, but not reported in the FCX126 LPAR report, is the notion of the logical core being ready to be core-dispatched. When core-dispatched plus core-ready-for-dispatch adds to 100%, the logical core is "fully used" even though it might be actually running on a physical core somewhat less than 100% of the time. In such a case FCX126 LPAR will show a logical-core-busy value of somewhat under 100%. You can think of this gap as "core suspend time", that is, time the logical core wanted to be dispatched on a physical core but PR/SM didn't dispatch it. PR/SM doesn't report this gap to us so we can't report it in Perfkit. But as you will see in a moment, we can see suspend time in another way, so no hope is lost here.
On each logical CPU z/VM itself keeps track of the time it spends in these five states: running a guest (g), running in CP induced by a guest (cCP), running in CP not induced by a guest (ncCP), wait-PSW (w), and parked (p). You might think the quantity (g + cCP + ncCP + w + p) should add to 100%. Sometimes it doesn't. The missing time is called "logical CPU suspend time", s. Suspend time is the time the logical CPU wanted its logical core to be dispatched on a physical core but PR/SM didn't oblige. FCX304 PRCLOG reports logical CPU suspend time.
FCX100 CPU, FCX144 PROCLOG, and FCX304 PRCLOG report logical-CPU-busy by calculating 100*(g+cCP+ncCP)/e, where e is elapsed time. Certainly when we see a percent-busy value of 100% we can claim the logical CPU is "fully used". But owing to the concept of suspend time, the logical CPU might be "fully used" -- that is, have no more room to dispatch virtual CPUs -- even when logical-CPU percent-busy reports at somewhat under 100%. Thus the best indicator of whether a logical CPU is fully used is that w=0 and p=0. FCX304 PRCLOG directly reports g (Emul), ncCP (Syst), p (Park), and s (Susp). Further, it indirectly reports cCP, because User = g + cCP. Unfortunately, it does not report w, but we can calculate w = (100 - User - Syst - Park - Susp).
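The arithmetic in the preceding paragraph can be written out as a short Python sketch. The function names, the rounding tolerance, and the sample row are assumptions made for this example; only the formulas come from the text above.

```python
def wait_percent(user, syst, park, susp):
    """Derive wait-PSW percent (w) from the FCX304 PRCLOG columns.

    All arguments are percentages of elapsed time for one logical CPU:
    user = g + cCP, syst = ncCP, park = p, susp = s.
    """
    return 100.0 - user - syst - park - susp


def logical_cpu_busy(user, syst):
    """Percent busy, 100*(g+cCP+ncCP)/e, expressed with PRCLOG columns."""
    return user + syst


def fully_used(user, syst, park, susp, tolerance=0.05):
    """True when both w and p are effectively zero.

    tolerance is an assumption for this sketch; it absorbs rounding in
    the report's printed values.
    """
    w = wait_percent(user, syst, park, susp)
    return abs(w) <= tolerance and park <= tolerance


# Made-up sample row: 88% User, 4% Syst, 0% Park, 8% Susp.
row = dict(user=88.0, syst=4.0, park=0.0, susp=8.0)
print(logical_cpu_busy(row["user"], row["syst"]))  # 92.0
print(wait_percent(**row))                         # 0.0
print(fully_used(**row))                           # True
```

In this made-up row the logical CPU reports only 92% busy, yet it is fully used, because the remaining 8% is suspend time rather than wait time.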
When PR/SM dispatches a logical core onto a physical core, during the time of the core-dispatch the two logical CPUs of the logical core can move in and out of wait-state independently. When they both run for the entire time of the core dispatch, that would be thread density 2. Some people will say that when TD=2 the logical core is "fully used". When one or more of the logical CPUs of the logical core loads a wait-state PSW during core dispatch, the core runs somewhat below thread density 2. In that case some would say the logical core is not "fully used".
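One way to put a number on thread density is to divide the total run time of the logical CPUs by the time the logical core was dispatched. The Python sketch below assumes that definition, which may differ in detail from what any particular report computes; the function name and sample values are illustrative.

```python
def thread_density(cpu_run_times, core_dispatch_time):
    """Average number of logical CPUs running while the core is dispatched.

    cpu_run_times: non-wait time of each logical CPU during core dispatch.
    core_dispatch_time: time the logical core spent dispatched.
    """
    if core_dispatch_time <= 0:
        raise ValueError("core_dispatch_time must be positive")
    return sum(cpu_run_times) / core_dispatch_time


# Both logical CPUs run for the whole 10 seconds of core dispatch.
print(thread_density([10.0, 10.0], 10.0))  # 2.0

# One logical CPU runs 75% of the dispatch time, the other 35%.
print(thread_density([7.5, 3.5], 10.0))    # 1.1
```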
You asked about capacity planning.
- FCX302 PHYSLOG shows us physical utilization of physical cores. If FCX302 PHYSLOG says the physical cores are full, they can't handle any more logical core dispatches.
- For our LPAR, if we use FCX126 LPAR and see that our logical cores are 100% busy, they can't be dispatched onto physical cores any more frequently than they are right now.
- For our LPAR, if we use FCX304 PRCLOG to calculate w:
  - If we see that w=0 and p=0, the logical CPU is fully used.
  - If we see w>0, z/VM found time to load a wait-state PSW, so maybe the logical CPU is not quite full.
- For our LPAR, if in FCX304 PRCLOG we see p>0, that's an indicator z/VM is projecting there would be no physical power to run the logical CPU; when this happens we would expect also to see high or highly fluctuating values in FCX302 PHYSLOG.
- For our LPAR, if in FCX304 PRCLOG we see s>>0, our LPAR's logical cores are not making it to physical cores and that's worth investigating.
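The checks in the list above can be gathered into one small Python function. This is a sketch only: the function name, the wording of the findings, and the thresholds eps and susp_threshold are arbitrary illustration values, not IBM guidance.

```python
def capacity_findings(phys_busy_total, phys_cores, lcore_busy,
                      user, syst, park, susp,
                      eps=0.5, susp_threshold=5.0):
    """Return a list of findings for one logical CPU of one LPAR.

    phys_busy_total: FCX302 PHYSLOG percent busy summed over a core type
                     (for example 800.0 for eight fully busy cores).
    phys_cores:      number of physical cores of that type.
    lcore_busy:      FCX126 LPAR percent busy of the logical core.
    user, syst, park, susp: FCX304 PRCLOG percentages for the logical CPU.
    """
    findings = []
    if phys_busy_total >= 100.0 * phys_cores - eps:
        findings.append("physical cores full")
    if lcore_busy >= 100.0 - eps:
        findings.append("logical core fully dispatched")
    w = 100.0 - user - syst - park - susp   # wait time is not reported directly
    if w <= eps and park <= eps:
        findings.append("logical CPU fully used")
    elif w > eps:
        findings.append("logical CPU has wait time")
    if park > eps:
        findings.append("logical CPU parked part of the time")
    if susp >= susp_threshold:
        findings.append("high suspend time")
    return findings


# Made-up values: eight physical IFL cores at 640% total, logical core 92% busy.
print(capacity_findings(640.0, 8, 92.0, user=80.0, syst=4.0, park=0.0, susp=8.0))
# ['logical CPU has wait time', 'high suspend time']
```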
Finally, you might find this article helpful.
Guest CPU Utilization
As of SMT, monitor record D4 R3 MRUSEACT and friends now report three different kinds of CPU time for guests. The three kinds of CPU time are:
- Raw time: When the virtual CPU spends 1 second running, raw time advances by 1 second.
- MT-1 equivalent time: When the virtual CPU spends 1 second running, MT-1 equivalent time advances by the amount of CPU time CP and the hardware estimate the virtual CPU would have used to do the same work had it been running on a core by itself. The estimate tries to compensate for the interference the virtual CPU might have experienced when it sometimes occupied an MT-2 core concurrently with some other virtual CPU. In other words, 1 second of raw time will generally yield less than 1 second of MT-1 equivalent time. MT-1 equivalent time is probably best used for billing or chargeback.
- Prorated core time: When the virtual CPU spends 1 second running alone on a core, the virtual CPU accrues 1 second of prorated core time. When the virtual CPU spends 1 second running on a core alongside another virtual CPU, it accrues 1/2 second of prorated core time. Prorated core time is used for enforcing CPU pooling, now called resource pools.
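The raw and prorated definitions above reduce to simple arithmetic, sketched here in Python. The function names and sample durations are invented for the example. MT-1 equivalent time is deliberately left out, because it depends on an estimate made by CP and the hardware that cannot be reproduced from these two inputs.

```python
def raw_time(alone_seconds, shared_seconds):
    """Raw time advances one-for-one with time spent running."""
    return alone_seconds + shared_seconds


def prorated_core_time(alone_seconds, shared_seconds):
    """Prorated core time: full credit alone, half credit when sharing a core."""
    return alone_seconds + 0.5 * shared_seconds


# A virtual CPU runs 6 seconds alone on a core and 4 seconds alongside
# another virtual CPU on the same core.
print(raw_time(6.0, 4.0))            # 10.0
print(prorated_core_time(6.0, 4.0))  # 8.0
```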
All three kinds of time are reported in two components, total time and pure guest time. In other words, the notions of VMDTTIME and VMDVTIME have been extended to MT-1 equivalent time and prorated core time.
All Perfkit reports except the CPU Pooling reports report raw time.
For more information, read the comments in the D4 R3 MRUSEACT monitor record.