Using CPU Measurement Facility Host Counters

With VM64961 applied to z/VM 5.4, or with any later z/VM release, z/VM can collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical PUs of the z/VM partition.

z/VM's CP Monitor facility logs the counters in a new Monitor record, D5 R13 MRPRCMFC. The MONWRITE utility then journals the monitor records to disk.

In this article we describe what the counters portray, how to reduce the counters, what the calculated metrics mean, and how to use the calculated metrics to gain insight about the behavior of the z/VM partition and its logical PUs.


What the Counters Portray

The System z CPU Measurement Facility provides a means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.


Reference Materials

For complete information about the CPU Measurement Facility, see these documents:


How to Collect the Counters

To make use of the counters, one must first set up to collect them. To learn how, visit our CPU MF collection instructions page. Following the instructions yields a MONWRITE file containing the D5 R13 MRPRCMFC records.


How to Reduce the Counters

In his presentation, John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z10, z196, and so on) emits raw counters with a different meaning and layout, so the calculations are specific to the machine type. The output of the calculations is a set of values useful in understanding machine behavior.
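
To give a feel for the flavor of those calculations, here is a minimal sketch in Python. The counter names, the interval bookkeeping, and the function itself are hypothetical stand-ins, not the machine-specific formulas; for those, see the Burg presentation.

  # Illustrative only: derive a few familiar metrics from hypothetical raw
  # counter values for one logical PU over one interval. The real counter
  # names, layouts, and formulas are machine-specific.
  def derive_metrics(cycles, instructions, l1i_misses, l1d_misses,
                     elapsed_seconds, busy_seconds):
      cpi = cycles / instructions                              # cycles per instruction
      l1mp = 100.0 * (l1i_misses + l1d_misses) / instructions  # L1 miss percent
      emips = instructions / (elapsed_seconds * 1e6)           # MIPS per elapsed second
      imips = instructions / (busy_seconds * 1e6)              # MIPS per CPU-busy second
      return {"CPI": cpi, "L1MP": l1mp, "eMIPS": emips, "iMIPS": imips}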

z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations prescribed by Burg.

On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:

  • A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediate CMS file we call the interim file.
  • A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
  • Ancillary or support execs used by CPUMFINT or CPUMFLOG.

The process of reducing the counters, then, amounts to this:

  1. Start with a MONWRITE file that contains D5 R13 records.
  2. Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
  3. Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.

Specific invocation instructions are included in the downloadable package.

The CPUMFLOG tool uses only the basic counters and the extended counters in its calculations. The interim file also contains the problem-state counters and the crypto counters, provided the administrator enabled those counter sets for the partition on the SE. Those interested in analyzing the crypto counters or problem-state counters can do so by applying the formulas and techniques described in the Burg presentation.


Appearance of The CPUMFLOG Report

Metrics calculated from the CPU MF counters describe the performance experience of each logical PU in the partition over time. For each CP Monitor sample interval, for each logical PU, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.

The columns of the report vary slightly according to CEC type. The various models have different cache structures and therefore warrant different sets of columns in their reports.
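
As a rough illustration of that variation, a reduction program could pick its cache-sourcing columns from a per-family table along the lines of the sketch below. The dictionary merely echoes the z10 and z196-and-later cases described in the column table later in this article; it is not CPUMFLOG's actual logic, and the family key is a made-up name.

  # Hypothetical mapping from machine family to cache-sourcing report columns,
  # reflecting the z10 versus z196-and-later cache structures described in the
  # column table below. Not CPUMFLOG's actual logic.
  SOURCING_COLUMNS = {
      "z10":           ["L15P", "L2LP", "L2RP", "MEMP"],
      "z196_or_later": ["L2P", "L3P", "L4LP", "L4RP", "MEMP"],
  }

  def sourcing_columns_for(family):
      # 'family' is a hypothetical key; the real tool works from the CEC type
      # recorded in the monitor data.
      return SOURCING_COLUMNS[family]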

Here is an excerpt of a z15 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups.

The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.

_IntEnd_ LPU Typ ___EGHZ___ _LPARCPU__ _PrbInst__ _PrbTime_
>>Mean>>   0 IFL      5.200     91.109     46.501     34.52
>>Mean>>   1 IFL      5.200     90.743     46.004     34.45
>>Mean>>   2 IFL      5.200     89.825     43.943     32.96
>>Mean>>   3 IFL      5.200     88.818     41.230     31.04
>>Mean>>   4 IFL      5.200     87.678     37.983     28.75
>>Mean>>   5 IFL      5.200     86.669     34.081     26.78
>>Mean>>   6 IFL      5.200     85.719     31.242     24.95
>>Mean>>   7 IFL      5.200     84.996     29.672     23.30
>>Mean>>   8 IFL      5.200     90.183     43.945     33.40
>>Mean>>   9 IFL      5.200     90.237     44.972     34.39
>>Mean>>  10 IFL      5.200     89.259     42.458     32.28
>>Mean>>  11 IFL      5.200     88.161     39.437     30.64
>>Mean>>  12 IFL      5.200     87.214     36.276     28.78
>>Mean>>  13 IFL      5.200     86.433     34.080     27.26
>>Mean>>  14 IFL      5.200     85.527     31.721     25.42
>>Mean>>  15 IFL      5.200     85.063     30.312     24.24
>>Mean>>  16 IFL      5.200     79.759     26.410     19.20
>>Mean>>  17 IFL      5.200     78.977     24.978     17.82
>>Mean>>  18 IFL      5.200     77.634     21.557     15.55
>>Mean>>  19 IFL      5.200     76.675     17.426     14.10
>>MofM>> --- ---      5.200     86.034     35.801     27.30
>>AllP>> --- ---      5.200   1720.679     35.801     27.30

(continued)
___CPI____ __EICPI___ __EFCPI___ _ESCPL1M__ ___RNI____
     2.985      0.891      2.094     76.188      3.444
     3.003      0.891      2.111     76.143      3.453
     3.049      0.901      2.148     74.587      3.371
     3.107      0.914      2.193     72.694      3.275
     3.182      0.928      2.255     70.944      3.174
     3.294      0.948      2.346     69.464      3.095
     3.366      0.959      2.407     68.199      3.015
     3.380      0.966      2.414     67.080      2.939
     3.061      0.907      2.153     75.036      3.422
     3.041      0.902      2.139     76.254      3.489
     3.105      0.912      2.193     74.959      3.391
     3.203      0.928      2.275     72.981      3.306
     3.261      0.942      2.319     70.729      3.210
     3.314      0.952      2.363     69.642      3.141
     3.355      0.962      2.393     68.153      3.048
     3.369      0.966      2.403     67.239      2.999
     3.084      0.977      2.107     55.915      2.222
     3.071      0.984      2.087     54.312      2.171
     3.150      1.000      2.149     53.257      2.112
     3.266      1.023      2.243     52.640      2.080
     3.175      0.940      2.235     67.858      2.990
     3.175      0.940      2.235     67.858      2.990

(continued)
__T1MSEC__ __T1CPU___ _T1CYPTM__ _PTEPT1M__
  4631.691      2.755     28.177    100.945
  4606.388      2.745     28.116    100.946
  4765.183      2.882     28.250    100.985
  4964.039      3.052     28.398    101.023
  5173.350      3.247     28.615    101.062
  5355.564      3.414     28.733    101.096
  5559.683      3.605     28.906    101.129
  5698.106      3.746     29.057    101.148
  4663.143      2.843     28.587    100.960
  4505.766      2.740     28.536    100.950
  4686.193      2.906     28.779    100.989
  4922.121      3.081     28.699    101.017
  5128.207      3.257     28.806    101.051
  5277.667      3.403     28.976    101.087
  5455.231      3.576     29.152    101.114
  5578.054      3.685     29.224    101.138
  6159.004      3.709     24.974    101.276
  6309.778      3.831     24.933    101.293
  6485.759      4.022     25.035    101.325
  6576.780      4.135     25.070    101.337
  5325.085      3.302     27.744    101.108
106501.708      3.302     27.744    101.108

(continued)
___L1MP___ ___L2P____ ___L3P____ ___L4LP___ ___L4RP___ ___MEMP___
     2.748     52.927     23.663      8.810      0.001     14.599
     2.773     53.027     23.462      8.862      0.001     14.649
     2.880     53.081     23.992      8.718      0.001     14.208
     3.017     53.319     24.454      8.511      0.001     13.715
     3.178     53.466     25.034      8.312      0.001     13.188
     3.378     53.571     25.507      8.146      0.001     12.775
     3.530     53.595     26.137      7.909      0.001     12.357
     3.599     53.698     26.554      7.790      0.001     11.957
     2.870     52.912     23.789      8.826      0.001     14.471
     2.805     52.843     23.377      8.958      0.001     14.821
     2.926     53.013     23.874      8.811      0.001     14.302
     3.117     53.203     24.316      8.612      0.001     13.869
     3.279     53.437     24.783      8.409      0.001     13.371
     3.393     53.439     25.295      8.263      0.001     13.002
     3.511     53.653     25.781      8.032      0.001     12.533
     3.573     53.632     26.166      7.937      0.001     12.264
     3.768     53.621     31.300      7.092      0.001      7.986
     3.843     53.713     31.588      6.979      0.001      7.720
     4.036     53.681     32.059      6.857      0.001      7.402
     4.262     53.675     32.300      6.791      0.001      7.233
     3.294     53.390     26.348      8.090      0.001     12.171
     3.294     53.390     26.348      8.090      0.001     12.171

(continued)
__eMIPS___ __iMIPS___
  1587.358   1742.265
  1571.559   1731.873
  1531.870   1705.393
  1486.491   1673.634
  1432.614   1633.952
  1368.235   1578.691
  1324.236   1544.858
  1307.737   1538.587
  1532.093   1698.877
  1543.010   1709.946
  1494.774   1674.643
  1431.264   1623.468
  1390.564   1594.433
  1356.151   1569.019
  1325.748   1550.097
  1312.936   1543.486
  1344.678   1685.918
  1337.157   1693.094
  1281.742   1651.010
  1220.681   1592.017
  1409.045   1637.777
 28180.899  32755.549

(continued)
__L4LPOC__ __MEMPLC__ __MEMPNC__ __MEMPND__ __MEMPFD__
     1.843      2.344      4.669      7.586      0.000
     1.847      2.349      4.690      7.610      0.000
     1.770      2.275      4.540      7.393      0.000
     1.666      2.202      4.374      7.139      0.000
     1.552      2.110      4.204      6.874      0.000
     1.482      2.052      4.062      6.661      0.000
     1.370      1.973      3.936      6.448      0.000
     1.304      1.913      3.796      6.248      0.000
     1.825      4.635      2.319      7.516      0.000
     1.883      4.776      2.363      7.682      0.000
     1.792      4.587      2.288      7.426      0.000
     1.695      4.441      2.213      7.214      0.000
     1.600      4.292      2.130      6.949      0.000
     1.523      4.156      2.077      6.769      0.000
     1.423      4.000      2.003      6.531      0.000
     1.384      3.913      1.959      6.392      0.000
     5.570      2.067      1.994      3.925      0.000
     5.472      2.006      1.921      3.793      0.000
     5.368      1.921      1.846      3.635      0.000
     5.312      1.876      1.802      3.555      0.000
     2.462      2.960      2.926      6.285      0.000
     2.462      2.960      2.926      6.285      0.000

(continued)
___SIIS___
     0.089
     0.082
     0.083
     0.088
     0.092
     0.096
     0.097
     0.102
     0.086
     0.080
     0.083
     0.086
     0.092
     0.091
     0.095
     0.099
     0.078
     0.081
     0.081
     0.080
     0.088
     0.088

The table below gives definitions for each of the columns in the report.

Column Meaning
Basic LPU Statistics
IntEnd The hh:mm:ss of the CP Monitor interval-end time, in UTC.

The first flock of rows is marked ">>Mean>>" to indicate that the rows are the mean experience of each logical PU over the whole time range recorded in the MONWRITE file.

The special row ">>MofM>>", mean of means, is the average experience of the average logical PU over the whole time range of the MONWRITE file.

The special row ">>AllP>>", all processors, merely states the sum of the LPARCPU, T1MSEC, eMIPS, and iMIPS columns, described later.

LPU The processor address, aka logical PU number, of the PU this row describes.
Typ The type of processor: CP, IFL, etc.
EGHZ Effective clock rate of the CEC, in GHz.
LPARCPU Percent busy of this logical PU as portrayed by the counters.
PrbInst The percent of completed instructions that were problem-state instructions.
PrbTime The percent of the CPU-busy time that was spent in problem state.
Basic CPI Statistics
CPI Cycles per instruction. The average number of clock cycles that transpire between completion of instructions.
EICPI Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles instructions would take if they never, ever incurred an L1 miss. The word "infinite" comes from the wish, "If we but had infinite L1, this is how long the instructions would have taken."
EFCPI Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instructions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.
ESCPL1M Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right. (A small worked check relating CPI, EICPI, EFCPI, L1MP, and ESCPL1M appears just after this table.)
RNI Relative nest intensity. A scalar that expresses how hard the caches and memory are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.
Basic TLB Statistics
T1MSEC Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.
T1CPU Percent of CPU-busy that is attributable to TLB misses.
T1CYPTM Number of cycles a TLB miss tends to cost.
PTEPT1M Page table entry (PTE) percent of all TLB misses.
Memory Cache (L1, etc.) Behavior
L1MP L1 miss percentage. This is the percent of instructions that incur an L1 miss.
LxxP Percent of L1 misses sourced from cache level xx.

On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R").

On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book or drawer ("4L"), or L4 on some other book or drawer ("4R").

MEMP Percent of L1 misses sourced from memory.
MIPS Behavior
eMIPS Instruction completion rate, millions of instructions per elapsed second.
iMIPS Instruction completion rate, millions of instructions per CPU-second.
Drawer Behavior
L4LPOC Percent of L1 misses sourced from local L4 off-cluster L3.
MEMPLC Percent of L1 misses sourced from memory local-on-chip.
MEMPNC Percent of L1 misses sourced from memory on-cluster.
MEMPND Percent of L1 misses sourced from memory on-drawer.
MEMPFD Percent of L1 misses sourced from memory off-drawer.
Store-Into-Instruction-Stream Behavior
SIIS Percent of I-cache writes sourced with L3 intervention.
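
Two relationships follow from the CPI definitions above: CPI is the sum of EICPI and EFCPI, and EFCPI is the number of L1 misses per instruction multiplied by the sourcing cycles per miss. The small Python check below, using values copied from the LPU 0 ">>Mean>>" row of the sample report, shows that the report data is consistent with those relationships.

  # Values copied from the LPU 0 ">>Mean>>" row of the sample report above.
  cpi, eicpi, efcpi = 2.985, 0.891, 2.094
  l1mp, escpl1m = 2.748, 76.188

  # Total CPI decomposes into the "infinite L1" part and the L1-miss part.
  assert abs(cpi - (eicpi + efcpi)) < 0.001

  # Finite CPI is L1 misses per instruction times sourcing cycles per miss.
  assert abs(efcpi - (l1mp / 100.0) * escpl1m) < 0.001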


What To Do With The Information

The CPU MF counter data isn't like ordinary performance data, in that there is no z/VM or System z "knob" one can turn directly to affect the achieved values. For example, there's no "give me more L1" knob we could turn to increase the amount of L1 on the CEC if we felt there were something lacking about our L1 performance.

For this reason, the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask why we should bother looking at it at all, since there isn't much that can be done to influence it.

It turns out there are some very useful things we can do with CPU MF information, even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.

Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Reference (LSPR). The L1 miss percent L1MP and the RNI value together constitute the "LSPR hint", which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. For example, in the sample report above, the >>MofM>> row shows L1MP 3.294 and RNI 2.990; that pair is the hint that workload would take to the LSPR tables. For more information on this, see IBM's LSPR page.

One thing we can do to affect cache performance is to remember that all of the partitions running on the CEC are competing for the CEC's cache. Steps we take to keep the partitions' peak times from overlapping will help matters. If our workload is scheduled so that all partitions heat up at 9 AM and all partitions cool off at 6 PM, we might consider staggering our company's work so that the partitions heat up at different times. By extension, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a less time-zone-oriented placement, so that no single CEC has all of its partitions hot at the same time.

If our CEC is hosting a mix of z/OS partitions and other partitions, we can affect cache performance by turning on z/OS HiperDispatch in the z/OS partitions. Doing this helps PR/SM and z/OS to shrink those partitions' cache influence, because z/OS HiperDispatch switches the z/OS partitions to something called vertical mode. For more information about vertical mode partitions, consult z/OS documentation.

Another thing we can do to affect cache performance is to tune the system's configurations of logical CPUs and virtual CPUs so that those two choices are right-sized for the workload. If a z/VM partition is a logical 16-way but is running only 425% busy on average with peaks at 715%, set it to be an 8-way instead of a 16-way. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
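
As a back-of-the-envelope illustration of that sizing arithmetic, and nothing more, one might pick the smallest N-way that covers the observed peak plus a little headroom. The sketch below echoes the examples in the preceding paragraph; the 10% headroom figure is simply an assumption for the illustration.

  import math

  def suggested_nway(peak_busy_pct, headroom_pct=10.0):
      # Smallest N-way that covers the observed peak utilization plus headroom.
      return math.ceil(peak_busy_pct * (1.0 + headroom_pct / 100.0) / 100.0)

  print(suggested_nway(715))   # 8  -- the logical 16-way partition example
  print(suggested_nway(280))   # 4  -- the virtual 12-way Linux guest example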

Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.

Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.