Using CPU Measurement Facility Host Counters
Last updated: 2021-04-09, BKW
With VM64961 for z/VM 5.4 or with later z/VM releases, z/VM can now collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical processors of the z/VM partition.
z/VM's CP Monitor facility logs out the counters in a new monitor record, D5 R13 MRPRCMFC. The MONWRITE utility journals the monitor records to disk.
In this article we describe what the counters portray, how to reduce the counters, what the calculated metrics mean, and how to use the calculated metrics to gain insight about the behavior of the z/VM partition and its logical processors.
What the Counters Portray
The System z CPU Measurement Facility offers means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.
For complete information about the CPU Measurement Facility, see these documents:
- The Load-Program-Parameter and CPU-Measurement Facilities, z15 level, September 2019
- The CPU-Measurement Facility Extended Counters Definition, z15 level, September 2019
- z15 CPU MF Formulas, John Burg, IBM, September 2019
How to Collect the Counters
To use the counters, one must first set up the system to collect them. To learn how, visit our CPU MF collection instructions page. Following those instructions yields a MONWRITE file containing the D5 R13 MRPRCMFC records.
How to Reduce the Counters
In his presentation John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z10, z196, etc.) emits raw counters of different meaning and layout, so the calculations are specific to machine type. The output of the calculations is a set of values useful in understanding machine behavior.
z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations prescribed by Burg.
On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:
- A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediary CMS file we call the interim file.
- A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
- Ancillary or support execs used by CPUMFINT or CPUMFLOG.
The process of reducing the counters, then, amounts to this:
- Start with a MONWRITE file that contains D5 R13 records.
- Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
- Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.
Specific invocation instructions are included in the downloadable package.
The CPUMFLOG tool uses only the basic, problem, and extended counters in its calculations. The interim file also contains the crypto counters, provided the administrator enabled the crypto counter set for this partition on the SE. A separate package in our download library, D5R13CRY, reduces the CPACF crypto counters. D5R13CRY runs directly off the MONWRITE file, however, so there is no need to run CPUMFINT first to see CPACF behavior.
Appearance of The CPUMFLOG Report
Metrics calculated from the CPU MF counters describe the performance experience of each logical processor in the partition over time. For each CP Monitor sample interval, for each logical processor, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.
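The per-interval idea can be sketched in a few lines of Python. This is only an illustration, not the CPUMFLOG code: it assumes the counters arrive as cumulative values, so that each report row comes from the difference between two consecutive samples. The `Sample` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One hypothetical counter snapshot for one logical processor."""
    time: str          # interval-end time, hh:mm:ss
    cycles: int        # cumulative cycle count (assumed cumulative)
    instructions: int  # cumulative instructions completed (assumed cumulative)

def interval_rows(samples):
    """Return (interval-end time, cycles, instructions) deltas for consecutive samples."""
    return [
        (cur.time, cur.cycles - prev.cycles, cur.instructions - prev.instructions)
        for prev, cur in zip(samples, samples[1:])
    ]

samples = [
    Sample("09:00:00", 1_000_000, 400_000),
    Sample("09:01:00", 4_000_000, 1_400_000),
    Sample("09:02:00", 6_000_000, 2_400_000),
]
rows = interval_rows(samples)
for end_time, cycles, instructions in rows:
    # One row per interval; CPI is cycles divided by instructions for that interval.
    print(end_time, cycles, instructions, round(cycles / instructions, 2))
```

Three samples produce two rows, because the first sample serves only as the baseline for the first interval.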
The columns of the report will vary slightly according to CEC type. The various models have different cache structures and therefore warrant accordingly different sets of columns in their report outputs.
Here is an excerpt of a z15 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups.
The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.
The table below gives definitions for each of the columns in the report.
Basic Logical Processor Statistics
The hh:mm:ss of the CP Monitor interval-end time, in system local time.
The first flock of rows is marked ">>Mean>>" to indicate that the rows are the mean experience of each logical processor over the whole time range recorded in the MONWRITE file.
The special row ">>MofM>>", mean of means, is the average experience of the average logical processor over the whole time range of the MONWRITE file.
The special row ">>AllP>>", all processors, merely states the sums of the LPARCPU, T1MSEC, eMIPS, and iMIPS columns, described later.
|LPU||The processor address, aka logical processor number, of the logical processor this row describes.|
|Typ||The type of processor: CP, IFL, etc.|
|EGHZ||Effective clock rate of the CEC, in GHz.|
|LPARCPU||Percent busy of this logical processor as portrayed by the counters.|
|PrbInst||The percent of completed instructions that were problem-state instructions.|
|PrbTime||The percent of the CPU-busy time that was spent in problem state.|
Basic CPI Statistics
|CPI||Cycles per instruction. The average number of clock cycles that transpire between completion of instructions. This is not the same as the average number of cycles it takes for an instruction to run, from its start to its finish.|
|EICPI||Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles that would transpire between completion of instructions if no instruction ever incurred an L1 miss. The word "infinite" comes from the wish, "If we had infinite L1, this is how the machine would perform."|
|EFCPI||Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instruction completions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.|
|ESCPL1M||Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right.|
|RNI||Relative nest intensity. A scalar that expresses how hard the caches and memory are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.|
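The definitions of CPI, EICPI, EFCPI, and ESCPL1M suggest a simple relationship among the columns. The Python sketch below assumes, from those definitions alone, that the finite part equals the fraction of instructions that miss L1 times the cycles each miss costs, and that the infinite part is whatever remains. The machine-specific formulas in Burg's document may apply additional factors, so treat this as a reading aid rather than the tool's calculation. The function name and the sample values are hypothetical.

```python
def split_cpi(cpi, l1mp, escpl1m):
    """Split CPI into (EICPI, EFCPI) from report-style column values.

    cpi     -- cycles per instruction
    l1mp    -- percent of instructions that incur an L1 miss
    escpl1m -- estimated sourcing cycles per L1 miss
    """
    efcpi = (l1mp / 100.0) * escpl1m   # cycles per instruction lost to L1 misses (assumed relation)
    eicpi = cpi - efcpi                # what CPI would be with "infinite L1"
    return eicpi, efcpi

# Hypothetical values: CPI of 3.0, 2% L1 miss rate, 50 cycles per miss.
eicpi, efcpi = split_cpi(cpi=3.0, l1mp=2.0, escpl1m=50.0)
print(f"EICPI={eicpi:.2f} EFCPI={efcpi:.2f}")
```

With these sample values, one of the three cycles per instruction is attributable to L1 misses.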
Basic TLB Statistics
|T1MSEC||Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.|
|T1CPU||Percent of CPU-busy that is attributable to TLB misses.|
|T1CYPTM||Number of cycles a TLB miss tends to cost.|
|PTEPT1M||PTE percent of all TLB misses.|
Memory Cache (L1, etc.) Behavior
|L1MP||L1 miss percentage. This is the percent of instructions that incur an L1 miss.|
|LxxP||Percent of L1 misses sourced from cache level xx. On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R"). On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book ("4L"), or L4 on some other book ("4R").|
|MEMP||Percent of L1 misses sourced from memory.|
|eMIPS||Instruction completion rate, millions of instructions per elapsed second.|
|iMIPS||Instruction completion rate, millions of instructions per CPU-second.|
|SIIS||Percent of I-cache writes achieved with L3 intervention|
|ICWL3PMI||I-cache writes achieved with L3 intervention per million instructions completed|
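The difference between eMIPS and iMIPS is only the denominator: elapsed time versus CPU-busy time. The Python sketch below illustrates this, assuming CPU-seconds can be approximated as elapsed seconds times LPARCPU/100. The function name and sample values are hypothetical.

```python
def mips(instructions, elapsed_seconds, lparcpu_percent):
    """Return (eMIPS, iMIPS) for one logical processor over one interval."""
    # Assumption: CPU-busy seconds = elapsed seconds scaled by percent busy.
    cpu_seconds = elapsed_seconds * lparcpu_percent / 100.0
    emips = instructions / elapsed_seconds / 1e6
    # An idle processor completed no instructions, so report 0 rather than divide by zero.
    imips = instructions / cpu_seconds / 1e6 if cpu_seconds > 0 else 0.0
    return emips, imips

# Hypothetical interval: 6 billion instructions in 60 seconds at 25% busy.
emips, imips = mips(instructions=6_000_000_000, elapsed_seconds=60, lparcpu_percent=25.0)
print(f"eMIPS={emips:.1f} iMIPS={imips:.1f}")
```

A lightly loaded processor shows a low eMIPS but can still show a high iMIPS, because iMIPS describes how fast instructions complete while the processor is actually running.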
LSPR Workload Hint
|LSPR||Low, Avg, or High, per this LSPR article|
|L4LPOC||Percent of L1 misses sourced from local L4 off-cluster L3|
|MEMPLC||Percent of L1 misses sourced from memory local-on-chip|
|MEMPNC||Percent of L1 misses sourced from memory on-cluster|
|MEMPND||Percent of L1 misses sourced from memory on-drawer|
|MEMPFD||Percent of L1 misses sourced from memory off-drawer|
Deflate Behavior (DFLTCC, z15 and later only)
|DF_DPMI||DFLTCC instructions completed per million total instructions completed|
|DF_D012PMI||DFLTCC CC=0,1,2 instructions completed per million total instructions completed|
|DF_012PD||DFLTCCs completed with CC=0,1,2 per DFLTCC completed|
|DF_CPD||CPI of DFLTCC instructions|
|DF_CSOA||Of DFLTCC CPI, cycles spent obtaining access to DFLTCC processor|
|DF_POBC||Percent of CPU-busy cycles spent busy doing DFLTCC|
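Several of these columns are plain ratios. As a minimal sketch, assuming we already have the interval's count of DFLTCC instructions completed, the count that completed with CC=0,1,2, and the total instructions completed, the first three columns could be computed as follows. The function and parameter names are hypothetical, not taken from the tool.

```python
def per_million(event_count, instructions):
    """Events per million instructions completed."""
    return 1e6 * event_count / instructions

def dfltcc_ratios(dfltcc_total, dfltcc_cc012, instructions):
    """Return (DF_DPMI, DF_D012PMI, DF_012PD) from hypothetical interval counts."""
    df_dpmi = per_million(dfltcc_total, instructions)
    df_d012pmi = per_million(dfltcc_cc012, instructions)
    # Fraction of DFLTCCs that completed with CC=0,1,2; zero if none ran.
    df_012pd = dfltcc_cc012 / dfltcc_total if dfltcc_total else 0.0
    return df_dpmi, df_d012pmi, df_012pd

ratios = dfltcc_ratios(dfltcc_total=500, dfltcc_cc012=400, instructions=10_000_000)
print(ratios)
```

The same per-million normalization applies to the ICWL3PMI column described earlier.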
What To Do With The Information
The CPU MF counter data isn't like ordinary performance data in that there is no z/VM or System z "knob" one can directly turn to affect the achieved values. For example, there's no "give me more L1" knob that we could turn to increase the amount of L1 on the CEC if we felt there were something lacking about our L1 performance.
For this reason the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask, "If there isn't much we can do to influence it, why bother even looking at it?"
It turns out there are some very useful things we can do with CPU MF information even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.
Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Report (LSPR). The L1 miss percent L1MP and the RNI value together constitute the "LSPR hint" which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. Later versions of our CPUMF tool even print a column that states which workload hint applies. For more information on this, see IBM's LSPR page.
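To make the idea of the hint concrete, here is a small Python sketch that maps an L1MP and RNI pair to Low, Avg, or High. The cut points in it are assumptions: they are the values commonly cited from IBM's LSPR workload-hint table, not values stated in this article, and they should be verified against IBM's current LSPR page before being relied upon. The function name is hypothetical.

```python
def lspr_hint(l1mp, rni):
    """Classify a workload as "Low", "Avg", or "High" from L1MP (percent) and RNI.

    The thresholds below are assumed, not taken from this article;
    check them against IBM's LSPR documentation.
    """
    if l1mp < 3.0:
        return "Avg" if rni >= 0.75 else "Low"
    if l1mp <= 6.0:
        if rni > 1.0:
            return "High"
        return "Avg" if rni >= 0.6 else "Low"
    return "High" if rni >= 0.75 else "Avg"

for l1mp, rni in [(2.0, 0.5), (4.5, 0.8), (7.0, 0.9)]:
    print(l1mp, rni, lspr_hint(l1mp, rni))
```

In practice one would feed this the whole-run means, such as the ">>MofM>>" row, rather than a single interval.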
One thing we can do to affect cache performance is to remember that all of the partitions running on the CEC compete for the CEC's cache. Anything that keeps the partitions' peak times from overlapping will help. If our workload is scheduled so that all partitions heat up at 9 AM and cool off at 6 PM, we might consider staggering our company's work so that the partitions heat up at different times. By extension, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a less time-oriented placement, so that no CEC has all of its partitions hot at the same time.
Another thing we can do is run with HiperDispatch enabled in all the partitions that support it. For z/OS this means turning on the HiperDispatch feature. For z/VM this means running with CP SET SRM POLARIZATION VERTICAL. Doing this helps PR/SM shrink those partitions' cache influence. For more information, consult the operating system documentation.
Another thing we can do to affect cache performance is to tune the system's configurations of logical processors and virtual processors so that those two choices are right-sized for the workload. If a z/VM partition has 16 logical processors but is running only 425% busy on average with peaks at 715%, set it to have 8 logical processors instead. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
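The arithmetic behind these suggestions is simply peak utilization rounded up to whole processors, optionally with a spare. The Python sketch below illustrates it; the function name and the `spare` parameter are hypothetical, and real sizing decisions should also weigh growth and workload variability.

```python
import math

def suggested_processors(peak_busy_percent, spare=0):
    """Smallest processor count that covers the observed peak, plus optional spares."""
    return max(1, math.ceil(peak_busy_percent / 100.0) + spare)

print(suggested_processors(715))           # partition peaking at 715% busy
print(suggested_processors(280))           # guest peaking at 280% busy
print(suggested_processors(280, spare=1))  # same guest with one spare virtual CPU
```

For the examples in the text, this gives 8 logical processors for the partition and a virtual 3-way, or a 4-way with one spare, for the Linux guest.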
On versions of z/VM that have VM66063 or later, running with CP SET SRM UNPARKING MEDIUM can help to control cache effects. The medium unparking model leaves unneeded VL cores parked even though PR/SM might have the capacity to run them. For more information, see our unparking article.
Much is made of the phenomenon called "store into instruction stream" (SIIS) and its potential to affect performance. I-cache stores that cause the L3 to intervene can dramatically decrease performance, so we want to know whether they are happening, and if so, whether they are happening enough to worry about. Later versions of the CPUMF tool print a column called "ICWL3PMI" which tabulates I-cache stores with L3 intervention per million instructions completed. For more information, read our SIIS article.
Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.
Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.