Using CPU Measurement Facility Host Counters
With VM64961 for z/VM 5.4 or with later z/VM releases, z/VM can now collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical PUs of the z/VM partition.
z/VM's CP Monitor facility logs out the counters in a new Monitor record, D5 R13 MRPRCMFC. The MONWRITE utility journals the monitor records to disk.
In this article we describe what the counters portray, how to reduce the counters, what the calculated metrics mean, and how to use the calculated metrics to gain insight about the behavior of the z/VM partition and its logical PUs.
What the Counters Portray
The System z CPU Measurement Facility offers means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.
For complete information about the CPU Measurement Facility, see these documents:
- The Load-Program-Parameter and CPU-Measurement Facilities, z15 level, September 2019
- The CPU-Measurement Facility Extended Counters Definition, z15 level, September 2019
- z15 CPU MF Formulas, John Burg, IBM, September 2019
How to Collect the Counters
To use the counters, one must first set up z/VM to collect them. To learn how, visit our CPU MF collection instructions page. Following those instructions yields a MONWRITE file containing the D5 R13 MRPRCMFC records.
How to Reduce the Counters
In his presentation, John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z10, z196, etc.) emits raw counters of different meaning and layout, so the calculations are specific to the machine type. The output of the calculations is a set of values useful in understanding machine behavior.
z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations prescribed by Burg.
On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:
- A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediary CMS file we call the interim file.
- A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
- Ancillary or support execs used by CPUMFINT or CPUMFLOG.
The process of reducing the counters, then, amounts to this:
- Start with a MONWRITE file that contains D5 R13 records.
- Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
- Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.
Specific invocation instructions are included in the downloadable package.
The CPUMFLOG tool uses only the basic counters and the extended counters in its calculations. The interim file also contains the problem-state counters and the crypto counters, provided the administrator enabled those counter sets for this partition on the SE. Those interested in analyzing the crypto counters or problem-state counters can do so by applying the formulas and techniques described in the Burg presentation.
Appearance of the CPUMFLOG Report
Metrics calculated from the CPU MF counters describe the performance experience of each logical PU in the partition over time. For each CP Monitor sample interval, for each logical PU, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.
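To make the "one row per interval per logical PU" idea concrete, here is a minimal Python sketch. It assumes the monitor samples carry cumulative counter values, so that an interval's contribution is the difference between successive samples for the same logical PU. The function name, the tuple layout, and the counter names are hypothetical illustrations, not the layout of the D5 R13 record or of the interim file, and this is not the CPUMFLOG code.

```python
def interval_deltas(samples):
    """Turn cumulative per-LPU samples into per-interval deltas.

    samples: iterable of (interval_end, lpu, counters) tuples, where
    counters is a dict of cumulative values (an assumption of this sketch).
    Returns one (interval_end, lpu, delta) row per interval per LPU.
    """
    previous = {}  # lpu -> cumulative counters from the prior sample
    rows = []
    for interval_end, lpu, counters in sorted(samples, key=lambda s: (s[0], s[1])):
        if lpu in previous:
            delta = {name: counters[name] - previous[lpu][name] for name in counters}
            rows.append((interval_end, lpu, delta))
        previous[lpu] = counters
    return rows


# Made-up values, two logical PUs, two samples each.
samples = [
    ("10:00:00", 0, {"cycles": 1_000, "instructions": 400}),
    ("10:00:00", 1, {"cycles": 2_000, "instructions": 500}),
    ("10:01:00", 0, {"cycles": 5_000, "instructions": 2_400}),
    ("10:01:00", 1, {"cycles": 2_600, "instructions": 800}),
]
rows = interval_deltas(samples)
for row in rows:
    print(row)
```

The first sample for each logical PU only establishes a baseline, so two samples per PU yield one report row per PU.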
The columns of the report will vary slightly according to CEC type. The various models have different cache structures and therefore warrant accordingly different sets of columns in their report outputs.
Here is an excerpt of a z15 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups.
The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.
The table below gives definitions for each of the columns in the report.
|Basic LPU Statistics|
The hh:mm:ss of the CP Monitor interval-end time, in UTC.
The first flock of rows is marked ">>Mean>>" to indicate that the rows are the mean experience of each logical PU over the whole time range recorded in the MONWRITE file.
The special row ">>MofM>>", mean of means, is the average experience of the average logical PU over the whole time range of the MONWRITE file.
The special row ">>AllP>>", all processors, merely states the sum of the LPARCPU, T1MSEC, eMIPS, and iMIPS columns, described later.
|LPU||The processor address, aka logical PU number, of the PU this row describes.|
|Typ||The type of processor: CP, IFL, etc.|
|EGHZ||Effective clock rate of the CEC, in GHz.|
|LPARCPU||Percent busy of this logical PU as portrayed by the counters.|
|PrbInst||The percent of completed instructions that were problem-state instructions.|
|PrbTime||The percent of the CPU-busy time that was spent in problem state.|
|Basic CPI Statistics|
|CPI||Cycles per instruction. The average number of clock cycles that transpire between completion of instructions.|
|EICPI||Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles instructions would take if they never, ever incurred an L1 miss. The word "infinite" comes from the wish, "If we but had infinite L1, this is how long the instructions would have taken."|
|EFCPI||Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instructions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.|
|ESCPL1M||Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right.|
|RNI||Relative nest intensity. A scalar that expresses how hard the caches and memory are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.|
|Basic TLB Statistics|
|T1MSEC||Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.|
|T1CPU||Percent of CPU-busy that is attributable to TLB misses.|
|T1CYPTM||Number of cycles a TLB miss tends to cost.|
|PTEPT1M||PTE percent of all TLB misses.|
|Memory Cache (L1, etc.) Behavior|
|L1MP||L1 miss percentage. This is the percent of instructions that incur an L1 miss.|
|LxxP||Percent of L1 misses sourced from cache level xx. On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R"). On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book ("4L"), or L4 on some other book ("4R").|
|MEMP||Percent of L1 misses sourced from memory.|
|eMIPS||Instruction completion rate, millions of instructions per elapsed second.|
|iMIPS||Instruction completion rate, millions of instructions per CPU-second.|
|L4LPOC||Percent of L1 misses sourced from local L4 off-cluster L3.|
|MEMPLC||Percent of L1 misses sourced from memory local-on-chip.|
|MEMPNC||Percent of L1 misses sourced from memory on-cluster.|
|MEMPND||Percent of L1 misses sourced from memory on-drawer.|
|MEMPFD||Percent of L1 misses sourced from memory off-drawer.|
|SIIS||Percent of I-cache writes sourced with L3 intervention.|
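The CPI and MIPS columns defined above are related to one another by simple arithmetic, which the following Python sketch illustrates. It is a reading of the column definitions, not the machine-specific Burg formulas: it assumes that finite CPI is the L1 misses per instruction multiplied by the sourcing cycles per miss, and that infinite CPI is whatever remains of CPI. The function names, parameter names, and sample values are hypothetical, and the sourcing cycles per L1 miss is taken as a given input rather than derived.

```python
def cpi_breakdown(cycles, instructions, l1_misses, sourcing_cycles_per_miss):
    """Relate CPI, L1MP, EFCPI, and EICPI for one interval on one logical PU."""
    cpi = cycles / instructions                    # cycles per instruction
    l1mp = 100.0 * l1_misses / instructions        # percent of instructions missing L1
    # Finite CPI: cycles per instruction lost to L1 misses (assumed relation).
    efcpi = (l1_misses * sourcing_cycles_per_miss) / instructions
    eicpi = cpi - efcpi                            # what CPI would be with infinite L1
    return {"CPI": cpi, "L1MP": l1mp, "EFCPI": efcpi, "EICPI": eicpi}


def mips_rates(instructions, elapsed_seconds, cpu_seconds):
    """Relate eMIPS, iMIPS, and LPARCPU for one interval on one logical PU."""
    emips = instructions / elapsed_seconds / 1e6   # per elapsed second
    imips = instructions / cpu_seconds / 1e6       # per CPU-busy second
    lparcpu = 100.0 * cpu_seconds / elapsed_seconds
    return {"eMIPS": emips, "iMIPS": imips, "LPARCPU": lparcpu}


# Made-up interval: 6e9 cycles, 2e9 instructions, 8e7 L1 misses, 25 cycles per miss.
breakdown = cpi_breakdown(6e9, 2e9, 8e7, 25.0)
rates = mips_rates(2e9, elapsed_seconds=60.0, cpu_seconds=30.0)
print(breakdown)
print(rates)
```

Two consequences fall out of the definitions: EICPI plus EFCPI always equals CPI, and eMIPS equals iMIPS scaled by the logical PU's busy fraction.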
What to Do with the Information
CPU MF counter data isn't like ordinary performance data, in that there is no z/VM or System z "knob" one can directly turn to affect the achieved values. For example, there's no "give me more L1" knob that we could turn to increase the amount of L1 on the CEC, if we felt there were something lacking about our L1 performance.
For this reason, the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask why we would bother looking at it at all when there isn't much we can do to influence it.
It turns out there are some very useful things we can do with CPU MF information, even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.
Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Report (LSPR). The L1 miss percent L1MP and the RNI value together constitute the "LSPR hint" which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. For more information on this, see IBM's LSPR page.
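As an illustration of how L1MP and RNI might combine into a hint, here is a small Python sketch. The thresholds are an assumption of this sketch: they follow the workload category table as we recall it from IBM's LSPR guidance, and they may differ by machine generation or have been revised since. Confirm them on IBM's LSPR page before relying on them. The function name is hypothetical.

```python
def lspr_hint(l1mp, rni):
    """Map L1 miss percent and relative nest intensity to a workload category.

    Thresholds are assumed, not taken from this article; verify them
    against the current LSPR documentation.
    """
    if l1mp < 3.0:
        return "AVERAGE" if rni >= 0.75 else "LOW"
    if l1mp <= 6.0:
        if rni > 1.0:
            return "HIGH"
        return "AVERAGE" if rni >= 0.6 else "LOW"
    return "HIGH" if rni >= 0.75 else "AVERAGE"


# Made-up values read from a ">>MofM>>" style summary row.
print(lspr_hint(l1mp=4.2, rni=0.8))
```

Feeding the function the mean-of-means values rather than a single interval's values gives a hint that reflects the whole measurement period.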
One thing we can do to affect cache performance is to keep in mind that all of the partitions running on the CEC compete for the CEC's cache. Anything we do to keep the partitions' peak times from overlapping will help. If our workload is scheduled so that all partitions heat up at 9 AM and cool off at 6 PM, we might consider staggering our company's work so that the partitions heat up at different times. By extension, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a placement that mixes regions, so that no CEC has all of its partitions hot at the same time.
If our CEC is hosting a mix of z/OS partitions and other partitions, we can affect cache performance by turning on z/OS HiperDispatch in the z/OS partitions. Doing this helps PR/SM and z/OS to shrink those partitions' cache influence, because z/OS HiperDispatch switches the z/OS partitions to something called vertical mode. For more information about vertical mode partitions, consult z/OS documentation.
Another thing we can do to affect cache performance is to tune the system's configurations of logical CPUs and virtual CPUs so that those two choices are right-sized for the workload. If a z/VM partition is a logical 16-way but is running only 425% busy on average with peaks at 715%, set it to be an 8-way instead of a 16-way. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
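The right-sizing arithmetic in the examples above can be sketched in a few lines of Python. The rule used here, rounding the observed peak up to a whole number of CPUs and optionally adding spare CPUs, is an assumption chosen to reproduce the figures in the text. It is not an official sizing method, and the function and parameter names are hypothetical.

```python
import math


def suggested_n_way(peak_busy_pct, headroom_cpus=0):
    """Smallest CPU count that covers the observed peak, plus optional spare CPUs."""
    return max(1, math.ceil(peak_busy_pct / 100.0) + headroom_cpus)


# The z/VM partition from the text: peaks at 715% busy.
print(suggested_n_way(715))
# The Linux guest from the text: peaks at 280% busy, with and without one spare CPU.
print(suggested_n_way(280), suggested_n_way(280, headroom_cpus=1))
```

Sizing from the peak rather than the average keeps the partition or guest from being squeezed at its busiest moments, and the headroom parameter is where an analyst's own margin of safety would go.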
Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.
Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.