Using CPU Measurement Facility Host Counters

Last updated: 2022-06-28, BKW

With VM64961 applied to z/VM 5.4, or on any later z/VM release, z/VM can collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical processors of the z/VM partition.

z/VM's CP Monitor facility logs out the counters in a new monitor record, D5 R13 MRPRCMFC. The MONWRITE utility journals the monitor records to disk.

In this article we describe what the counters portray, how to reduce the counters, what the calculated metrics mean, and how to use the calculated metrics to gain insight about the behavior of the z/VM partition and its logical processors.


What the Counters Portray

The System z CPU Measurement Facility offers means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.


Reference Materials

For complete information about the CPU Measurement Facility, see these documents:

  • z16
  • z15


How to Collect the Counters

To make use of the counters, one must first set up to collect them. To learn how, visit our CPU MF collection instructions page. Following the instructions leaves one with a MONWRITE file containing the D5 R13 MRPRCMFC records.


How to Reduce the Counters

In his presentation, John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z14, z15, z16, and so on) emits raw counters with different meanings and layouts, so the calculations are specific to the machine type. The output of the calculations is a set of values useful in understanding machine behavior.

The z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations Burg prescribes.

On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:

  • A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediary CMS file we call the interim file.
  • A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
  • Ancillary or support execs used by CPUMFINT or CPUMFLOG.

The process of reducing the counters, then, amounts to this:

  1. Start with a MONWRITE file that contains D5 R13 records.
  2. Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
  3. Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.

Specific invocation instructions are included in the downloadable package.
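
To make the shape of this two-pass reduction concrete, here is a minimal Python sketch of the same idea: an extraction pass that writes an interim file of per-interval, per-logical-CPU counter samples, and a reporting pass that applies formulas to each sample and prints a time-indexed log. The interim-file layout, field names, and formulas shown are simplified assumptions for illustration only; they are not the record formats or calculations used by CPUMFINT and CPUMFLOG.

  # Illustrative two-pass reduction, loosely modeled on CPUMFINT and CPUMFLOG.
  # The interim-file layout and the formulas are simplified assumptions, not
  # the actual record formats used by the downloadable package.
  import csv

  def extract(raw_samples, interim_path):
      """Pass 1: write one interim row per (interval, logical CPU) sample."""
      with open(interim_path, "w", newline="") as f:
          writer = csv.writer(f)
          writer.writerow(["int_end", "lpu", "cycles", "instructions", "l1_misses"])
          for s in raw_samples:                    # s: dict of raw counter values
              writer.writerow([s["int_end"], s["lpu"], s["cycles"],
                               s["instructions"], s["l1_misses"]])

  def report(interim_path):
      """Pass 2: apply the formulas and print a log-style report."""
      print(f"{'IntEnd':>8} {'LPU':>3} {'CPI':>7} {'L1MP':>7}")
      with open(interim_path, newline="") as f:
          for row in csv.DictReader(f):
              instr = float(row["instructions"])
              cpi = float(row["cycles"]) / instr              # cycles per instruction
              l1mp = 100.0 * float(row["l1_misses"]) / instr  # L1 miss percent
              print(f"{row['int_end']:>8} {int(row['lpu']):>3} {cpi:7.3f} {l1mp:7.3f}")

  # Hypothetical raw counter samples for two logical CPUs in one interval.
  samples = [
      {"int_end": "09:01:00", "lpu": 0, "cycles": 3.1e9, "instructions": 1.0e9, "l1_misses": 2.8e7},
      {"int_end": "09:01:00", "lpu": 1, "cycles": 3.3e9, "instructions": 1.1e9, "l1_misses": 3.4e7},
  ]
  extract(samples, "interim.csv")
  report("interim.csv")

The real package of course handles the full counter sets and the per-machine-type formulas; the point here is only the two-pass structure.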

The CPUMFLOG tool uses only the basic, problem-state, and extended counter sets in its calculations. The interim file also contains the CPACF crypto counters, provided the administrator enabled those counter sets for the partition on the SE. A separate package in our download library, D5R13CRY, reduces the CPACF crypto counters. D5R13CRY runs directly from the MONWRITE file, however, so there is no need to run CPUMFINT just to look at CPACF behavior.


Appearance of the CPUMFLOG Report

Metrics calculated from the CPU MF counters describe the performance experience of each logical processor in the partition over time. For each CP Monitor sample interval, for each logical processor, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.

The columns of the report vary slightly according to CEC type. The machine types have different cache structures and therefore warrant different sets of columns in their reports.

Here is an excerpt of a z15 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups.

The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.

_IntEnd_ LPU Typ ___EGHZ___ _LPARCPU__ _PrbInst__ _PrbTime_
>>Mean>>  0 IFL 5.200 91.109 46.501 34.52
>>Mean>>  1 IFL 5.200 90.743 46.004 34.45
>>Mean>>  2 IFL 5.200 89.825 43.943 32.96
>>Mean>>  3 IFL 5.200 88.818 41.230 31.04
>>Mean>>  4 IFL 5.200 87.678 37.983 28.75
>>Mean>>  5 IFL 5.200 86.669 34.081 26.78
>>Mean>>  6 IFL 5.200 85.719 31.242 24.95
>>Mean>>  7 IFL 5.200 84.996 29.672 23.30
>>Mean>>  8 IFL 5.200 90.183 43.945 33.40
>>Mean>>  9 IFL 5.200 90.237 44.972 34.39
>>Mean>> 10 IFL 5.200 89.259 42.458 32.28
>>Mean>> 11 IFL 5.200 88.161 39.437 30.64
>>Mean>> 12 IFL 5.200 87.214 36.276 28.78
>>Mean>> 13 IFL 5.200 86.433 34.080 27.26
>>Mean>> 14 IFL 5.200 85.527 31.721 25.42
>>Mean>> 15 IFL 5.200 85.063 30.312 24.24
>>Mean>> 16 IFL 5.200 79.759 26.410 19.20
>>Mean>> 17 IFL 5.200 78.977 24.978 17.82
>>Mean>> 18 IFL 5.200 77.634 21.557 15.55
>>Mean>> 19 IFL 5.200 76.675 17.426 14.10
>>MofM>> --- --- 5.200 86.034 35.801 27.30
>>AllP>> --- --- 5.200 1720.679 35.801 27.30

(continued)
___CPI____ __EICPI___ __EFCPI___ _ESCPL1M__ ___RNI____
2.985 0.891 2.094 76.188 3.444
3.003 0.891 2.111 76.143 3.453
3.049 0.901 2.148 74.587 3.371
3.107 0.914 2.193 72.694 3.275
3.182 0.928 2.255 70.944 3.174
3.294 0.948 2.346 69.464 3.095
3.366 0.959 2.407 68.199 3.015
3.380 0.966 2.414 67.080 2.939
3.061 0.907 2.153 75.036 3.422
3.041 0.902 2.139 76.254 3.489
3.105 0.912 2.193 74.959 3.391
3.203 0.928 2.275 72.981 3.306
3.261 0.942 2.319 70.729 3.210
3.314 0.952 2.363 69.642 3.141
3.355 0.962 2.393 68.153 3.048
3.369 0.966 2.403 67.239 2.999
3.084 0.977 2.107 55.915 2.222
3.071 0.984 2.087 54.312 2.171
3.150 1.000 2.149 53.257 2.112
3.266 1.023 2.243 52.640 2.080
3.175 0.940 2.235 67.858 2.990
3.175 0.940 2.235 67.858 2.990

(continued)
__T1MSEC__ __T1CPU___ _T1CYPTM__ _PTEPT1M__
4631.691 2.755 28.177 0.000
4606.388 2.745 28.116 0.000
4765.183 2.882 28.250 0.000
4964.039 3.052 28.398 0.000
5173.350 3.247 28.615 0.000
5355.564 3.414 28.733 0.000
5559.683 3.605 28.906 0.000
5698.106 3.746 29.057 0.000
4663.143 2.843 28.587 0.000
4505.766 2.740 28.536 0.000
4686.193 2.906 28.779 0.000
4922.121 3.081 28.699 0.000
5128.207 3.257 28.806 0.000
5277.667 3.403 28.976 0.000
5455.231 3.576 29.152 0.000
5578.054 3.685 29.224 0.000
6159.004 3.709 24.974 0.000
6309.778 3.831 24.933 0.000
6485.759 4.022 25.035 0.000
6576.780 4.135 25.070 0.000
5325.085 3.302 27.744 0.000
106501.708 3.302 27.744 0.000

(continued)
___L1MP___ ___L2P____ ___L3P____ ___L4LP___ ___L4RP___ ___MEMP___
2.748 52.927 23.663 8.810 0.001 14.599
2.773 53.027 23.462 8.862 0.001 14.649
2.880 53.081 23.992 8.718 0.001 14.208
3.017 53.319 24.454 8.511 0.001 13.715
3.178 53.466 25.034 8.312 0.001 13.188
3.378 53.571 25.507 8.146 0.001 12.775
3.530 53.595 26.137 7.909 0.001 12.357
3.599 53.698 26.554 7.790 0.001 11.957
2.870 52.912 23.789 8.826 0.001 14.471
2.805 52.843 23.377 8.958 0.001 14.821
2.926 53.013 23.874 8.811 0.001 14.302
3.117 53.203 24.316 8.612 0.001 13.869
3.279 53.437 24.783 8.409 0.001 13.371
3.393 53.439 25.295 8.263 0.001 13.002
3.511 53.653 25.781 8.032 0.001 12.533
3.573 53.632 26.166 7.937 0.001 12.264
3.768 53.621 31.300 7.092 0.001 7.986
3.843 53.713 31.588 6.979 0.001 7.720
4.036 53.681 32.059 6.857 0.001 7.402
4.262 53.675 32.300 6.791 0.001 7.233
3.294 53.390 26.348 8.090 0.001 12.171
3.294 53.390 26.348 8.090 0.001 12.171

(continued)
__eICR____ __iICR____
1587.358 1742.265
1571.559 1731.873
1531.870 1705.393
1486.491 1673.634
1432.614 1633.952
1368.235 1578.691
1324.236 1544.858
1307.737 1538.587
1532.093 1698.877
1543.010 1709.946
1494.774 1674.643
1431.264 1623.468
1390.564 1594.433
1356.151 1569.019
1325.748 1550.097
1312.936 1543.486
1344.678 1685.918
1337.157 1693.094
1281.742 1651.010
1220.681 1592.017
1409.045 1637.777
28180.899 32755.549

(continued)
___SIIS___ __ICWL3PMI__
0.089 9.102
0.082 8.385
0.083 9.031
0.088 10.238
0.092 11.647
0.096 13.191
0.097 14.333
0.102 15.667
0.086 9.187
0.080 8.144
0.083 9.073
0.086 10.277
0.092 11.914
0.091 12.470
0.095 13.722
0.099 14.836
0.078 12.930
0.081 13.744
0.081 14.764
0.080 15.542
0.088 11.731
0.088 11.731

(continued)
____LSPR____
Avg
Avg
Avg
High
High
High
High
High
Avg
Avg
Avg
High
High
High
High
High
High
High
High
High
High
High

(continued)
__L4LPOC__ __MEMPLC__ __MEMPNC__ __MEMPND__ __MEMPFD__
1.843 2.344 4.669 7.586 0.000
1.847 2.349 4.690 7.610 0.000
1.770 2.275 4.540 7.393 0.000
1.666 2.202 4.374 7.139 0.000
1.552 2.110 4.204 6.874 0.000
1.482 2.052 4.062 6.661 0.000
1.370 1.973 3.936 6.448 0.000
1.304 1.913 3.796 6.248 0.000
1.825 4.635 2.319 7.516 0.000
1.883 4.776 2.363 7.682 0.000
1.792 4.587 2.288 7.426 0.000
1.695 4.441 2.213 7.214 0.000
1.600 4.292 2.130 6.949 0.000
1.523 4.156 2.077 6.769 0.000
1.423 4.000 2.003 6.531 0.000
1.384 3.913 1.959 6.392 0.000
5.570 2.067 1.994 3.925 0.000
5.472 2.006 1.921 3.793 0.000
5.368 1.921 1.846 3.635 0.000
5.312 1.876 1.802 3.555 0.000
2.462 2.960 2.926 6.285 0.000
2.462 2.960 2.926 6.285 0.000

(continued)
__DF_DPMI___ _DF_D012PMI_ __DF_012PD__ ___DF_CPD___ __DF_CSOA___ __DF_POBC___
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
12.072 4.040 33.466 32805.483 36.437 22.532
13.978 4.796 34.312 30136.801 38.922 24.200
0.000 0.000 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000 0.000
8.232 2.732 33.188 33704.138 35.448 15.822
5.932 2.010 33.891 31472.059 37.089 9.873
11.020 3.708 33.647 32225.789 36.838 19.133
10.474 3.439 32.830 34785.777 33.905 19.347
11.451 3.833 33.469 32758.509 35.881 20.199
13.980 4.805 34.368 29956.053 38.960 24.368
12.671 4.235 33.425 32946.864 35.684 24.233
13.493 4.540 33.646 32237.048 36.492 25.292
12.699 4.313 33.960 31251.932 36.754 23.189
12.081 4.016 33.240 33512.292 34.258 23.978
12.668 4.197 33.136 33824.184 33.967 24.054
10.047 3.295 32.795 34941.855 32.984 16.463
11.670 3.918 33.573 32463.611 36.107 20.822
11.670 3.918 33.573 32463.611 36.107 20.822

The table below gives definitions for each of the columns in the report. A short worked example showing how the basic CPI columns relate to one another follows the table.

Column Meaning

Basic Logical Processor Statistics
IntEnd The hh:mm:ss of the CP Monitor interval-end time, in system local time.

The first group of rows is marked ">>Mean>>" to indicate that those rows give the mean experience of each logical processor over the whole time range recorded in the MONWRITE file.

The special row ">>MofM>>", mean of means, is the average experience of the average logical processor over the whole time range of the MONWRITE file.

The special row ">>AllP>>", all processors, merely states the sums of the LPARCPU, T1MSEC, eICR, and iICR columns, described later.

LPU The processor address, aka logical processor number, of the logical processor this row describes.
Typ The type of processor: CP, IFL, etc.
EGHZ Effective clock rate of the CEC, in GHz.
LPARCPU Percent busy of this logical processor as portrayed by the counters.
PrbInst The percent of completed instructions that were problem-state instructions.
PrbTime The percent of the CPU-busy time that was spent in problem state.

Basic CPI Statistics
CPI Cycles per instruction. The average number of clock cycles that transpire between completion of instructions. This is not the same as the average number of cycles it takes for an instruction to run, from its start to its finish.
EICPI Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles that would transpire between completion of instructions if no instruction ever incurred an L1 miss. The word "infinite" comes from the wish, "If we had infinite L1, this is how the machine would perform."
EFCPI Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instruction completions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.
ESCPL1M Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right.
RNI Relative nest intensity. A scalar that expresses how hard the caches and memory are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.

Basic TLB Statistics
T1MSEC Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.
T1CPU Percent of CPU-busy that is attributable to TLB misses.
T1CYPTM Number of cycles a TLB miss tends to cost.
PTEPT1M PTE percent of all TLB misses. For z14 and later, this metric is deprecated. However, so as not to disturb the columnar nature of the report, on such machine types the metric is reported as zero.

Memory Cache (L1, etc.) Behavior
L1MP L1 miss percentage. This is the percent of instructions that incur an L1 miss.
LxxP Percent of L1 misses sourced from cache level xx.

On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R").

On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book ("4L"), or L4 on some other book ("4R").

MEMP Percent of L1 misses sourced from memory.

Instruction Completion Behavior
eICR Instruction completion rate, millions of instructions per elapsed second.
iICR Instruction completion rate, millions of instructions per CPU-second.

Store-Into-Instruction-Stream Behavior
SIIS Percent of I-cache writes achieved with L3 intervention
ICWL3PMI I-cache writes achieved with L3 intervention per million instructions completed

LSPR Workload Hint
LSPR Low, Avg, or High, per this LSPR article

Drawer Behavior (z14 and z15 only)
L4LPOC Percent of L1 misses sourced from local L4 off-cluster L3
MEMPLC Percent of L1 misses sourced from memory local-on-chip
MEMPNC Percent of L1 misses sourced from memory on-cluster
MEMPND Percent of L1 misses sourced from memory on-drawer
MEMPFD Percent of L1 misses sourced from memory off-drawer

Deflate Behavior (DFLTCC, z15 and later only)
DF_DPMI DFLTCC instructions completed per million total instructions completed
DF_D012PMI DFLTCC CC=0,1,2 instructions completed per million total instructions completed
DF_012PD Percent of completed DFLTCCs that had CC=0,1,2
DF_CPD CPI of DFLTCC instructions
DF_CSOA Per completed DFLTCC, cycles spent obtaining access to DFLTCC coprocessor
DF_POBC Percent of CPU-busy cycles spent doing DFLTCC
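
To make the relationships among the basic CPI columns concrete, the following short Python sketch recomputes them from one set of hypothetical raw counter values. The variable names and the numbers are invented for illustration; only the arithmetic relationships, which follow directly from the column definitions above, are the point. RNI is omitted because its weights are machine-specific.

  # Hypothetical raw counter values for one logical processor over one
  # 60-second CP Monitor interval.  Names and numbers are illustrative only.
  cycles       = 2.70e11   # clock cycles used while this logical CPU was busy
  instructions = 8.45e10   # instructions completed
  l1_misses    = 2.54e9    # L1 cache misses
  sourcing_cyc = 1.77e11   # cycles spent resolving L1 misses
  interval_sec = 60.0      # CP Monitor interval length
  busy_sec     = 52.0      # CPU-busy seconds in the interval (LPARCPU ~ 87%)

  cpi     = cycles / instructions             # CPI: cycles per completed instruction
  efcpi   = sourcing_cyc / instructions       # EFCPI: CPI elongation caused by L1 misses
  eicpi   = cpi - efcpi                       # EICPI: what CPI would be with "infinite" L1
  l1mp    = 100.0 * l1_misses / instructions  # L1MP: percent of instructions missing L1
  escpl1m = sourcing_cyc / l1_misses          # ESCPL1M: sourcing cycles per L1 miss

  eicr = instructions / interval_sec / 1e6    # eICR: million instructions per elapsed second
  iicr = instructions / busy_sec / 1e6        # iICR: million instructions per CPU-second

  print(f"CPI={cpi:.3f}  EICPI={eicpi:.3f}  EFCPI={efcpi:.3f}")
  print(f"L1MP={l1mp:.3f}%  ESCPL1M={escpl1m:.1f} cycles per miss")
  print(f"eICR={eicr:.1f}  iICR={iicr:.1f}")

Note that EFCPI is simply L1MP/100 times ESCPL1M, and that iICR differs from eICR only by the LPARCPU factor.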


What To Do With The Information

The CPU MF counters data isn't like ordinary performance data in that there is no z/VM or System z "knob" one can directly turn to affect the achieved values. For example, there's no "give me more L1" knob that we could turn to increase the amount of L1 on the CEC if we felt there were something lacking about our L1 performance.

For this reason the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask: if there isn't much we can do to influence the numbers, why bother looking at them at all?

It turns out there are some very useful things we can do with CPU MF information even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.

Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Report (LSPR). The L1 miss percent (L1MP) and the RNI value together constitute the "LSPR hint", which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. Later versions of our CPUMF tool even print a column that states which workload hint applies. For more information on this, see IBM's LSPR page.
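
As a sketch of how the hint can be derived from those two numbers, the fragment below maps an (L1MP, RNI) pair to Low, Avg, or High. The cut-off values are placeholders for illustration; the authoritative mapping is the table on IBM's LSPR page, and later versions of the CPUMF tool already print the hint for you.

  def lspr_hint(l1mp, rni):
      """Map an (L1MP percent, RNI) pair to an LSPR workload hint.

      The cut-off values below are illustrative placeholders; consult the
      current table on IBM's LSPR page for real workload characterization.
      """
      if l1mp < 3.0:
          return "Avg" if rni >= 0.75 else "Low"
      if l1mp <= 6.0:
          if rni >= 1.0:
              return "High"
          return "Avg" if rni >= 0.6 else "Low"
      return "High" if rni >= 0.75 else "Avg"

  # The first >>Mean>> row of the sample report had L1MP 2.748 and RNI 3.444.
  print(lspr_hint(2.748, 3.444))   # -> "Avg" under these placeholder cut-offs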

One thing we can do to affect cache performance is to keep in mind that all of the partitions running on the CEC compete for the CEC's cache. Steps we take to keep the partitions' peak times from overlapping will help matters. If our workload is scheduled so that all partitions heat up at 9 AM and cool off at 6 PM, we might consider staggering the company's work so that the partitions heat up at different times. By extension, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a placement less aligned with time zones, so that no single CEC has all of its partitions hot at the same time.

Another thing we can do is run with HiperDispatch enabled in all the partitions that support it. For z/OS this means turning on the HiperDispatch feature. For z/VM this means running with CP SET SRM POLARIZATION VERTICAL. Doing this helps PR/SM to shrink those partitions' cache influence. For more information, consult operating system documentation.

Another thing we can do to affect cache performance is to tune the system's configurations of logical processors and virtual processors so that those two choices are right-sized for the workload. If a z/VM partition has 16 logical processors but is running only 425% busy on average with peaks at 715%, set it to have 8 logical processors instead. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
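
A quick way to sanity-check a processor count against measured utilization is to size for the observed peak plus a little headroom. The sketch below does that arithmetic; the 10-percent headroom figure is an arbitrary assumption for illustration, not a sizing rule, and it ignores considerations such as single-thread speed and failover capacity.

  import math

  def right_size(peak_busy_pct, headroom=0.10):
      """Suggest a processor count from peak utilization in percent-of-one-processor.

      headroom is a hypothetical cushion above the observed peak; pick your own.
      """
      needed = (peak_busy_pct / 100.0) * (1.0 + headroom)
      return max(1, math.ceil(needed))

  print(right_size(715))   # the 16-way z/VM partition example above -> 8
  print(right_size(280))   # the virtual 12-way Linux guest example -> 4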

On versions of z/VM that have VM66063 or later, running with CP SET SRM UNPARKING MEDIUM can help to control cache effects. The medium unparking model leaves unneeded VL cores parked even though PR/SM might have the capacity to run them. For more information, see our unparking article.

Much is made of the phenomenon called "store into instruction stream" (SIIS) and its potential to affect performance. I-cache stores that cause the L3 to intervene can dramatically decrease performance, so we want to know whether they are happening, and if so, whether they are happening enough to worry about. Later versions of the CPUMF tool print a column called "ICWL3PMI" which tabulates I-cache stores with L3 intervention per million instructions completed. For more information, read our SIIS article.

Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.

Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.