
Using CPU Measurement Facility Host Counters

With VM64961 applied to z/VM 5.4, or with any later z/VM release, z/VM can collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical PUs of the z/VM partition.

z/VM's CP Monitor facility logs out the counters in a new Monitor record, D5 R13 MRPRCMFC (Processor domain, record 13). The MONWRITE utility journals the monitor records to disk.

In this article we describe what the counters portray, how to reduce them, what the calculated metrics mean, and how to use those metrics to gain insight into the behavior of the z/VM partition and its logical PUs.


What the Counters Portray

The System z CPU Measurement Facility offers a means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.


Reference Materials

For complete information about the CPU Measurement Facility, see these documents:


How to Collect the Counters

To make use of the counters, one must first set up to collect them. To learn how, visit our CPU MF collection instructions page. Following the instructions correctly yields a MONWRITE file containing the D5 R13 MRPRCMFC records.


How to Reduce the Counters

In his presentation, John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z10, z196, etc.) emits raw counters of different meaning and layout, so the calculations are specific to the machine type. The output of the calculations is a set of values useful in understanding machine behavior.

z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations prescribed by Burg.

On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:

  • A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediary CMS file we call the interim file.
  • A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
  • Ancillary or support execs used by CPUMFINT or CPUMFLOG.

The process of reducing the counters, then, amounts to this:

  1. Start with a MONWRITE file that contains D5 R13 records.
  2. Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
  3. Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.

Specific invocation instructions are included in the downloadable package.

The CPUMFLOG tool uses only the basic counters and the extended counters in its calculations. The interim file also contains the problem-state counters and the crypto counters, provided the administrator enabled those counter sets for the partition on the SE. Those interested in analyzing the crypto counters or problem-state counters can do so by applying the formulas and techniques described in the Burg presentation.


Appearance of the CPUMFLOG Report

Metrics calculated from the CPU MF counters describe the performance experience of each logical PU in the partition over time. For each CP Monitor sample interval, for each logical PU, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.

The columns of the report vary slightly according to CEC type. The various models have different cache structures and therefore warrant different sets of columns in their reports.

Here is an excerpt of a z13 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups. The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.

_IntEnd_ LPU Typ ___EGHZ___ _LPARCPU__ _PrbState_
>>Mean>>  0 IFL 5.000 30.701 19.447
>>Mean>>  1 IFL 5.000 30.267 19.681
>>Mean>>  2 IFL 5.000 32.119 21.226
>>Mean>>  3 IFL 5.000 31.946 21.515
>>Mean>>  4 IFL 5.000 32.230 20.491
>>Mean>>  5 IFL 5.000 31.992 20.595
>>Mean>>  6 IFL 5.000 27.172 16.554
>>Mean>>  7 IFL 5.000 27.085 16.680
>>Mean>>  8 IFL 5.000 32.400 22.006
>>Mean>>  9 IFL 5.000 32.146 22.012
>>Mean>> 10 IFL 5.000 28.362 18.591
>>Mean>> 11 IFL 5.000 28.141 19.048
>>Mean>> 12 IFL 5.000 21.781 11.165
>>Mean>> 13 IFL 5.000 21.365 12.398
>>Mean>> 14 IFL 5.000 28.704 20.258
>>Mean>> 15 IFL 5.000 28.679 20.512
>>Mean>> 16 IFL 5.000 24.721 15.806
>>Mean>> 17 IFL 5.000 24.520 16.340
>>Mean>> 18 IFL 5.000 32.779 23.221
>>Mean>> 19 IFL 5.000 32.713 23.320
>>Mean>> 20 IFL 5.000 30.353 20.989
>>Mean>> 21 IFL 5.000 30.163 21.960
>>Mean>> 22 IFL 5.000 23.400 14.224
>>Mean>> 23 IFL 5.000 23.135 14.856
>>Mean>> 24 IFL 5.000 27.431 5.381
>>Mean>> 25 IFL 5.000 27.753 5.112
>>Mean>> 26 IFL 5.000 30.309 10.575
>>Mean>> 27 IFL 5.000 30.079 10.084
>>Mean>> 28 IFL 5.000 28.925 8.231
>>Mean>> 29 IFL 5.000 28.894 8.547
>>Mean>> 30 IFL 5.000 30.955 12.240
>>Mean>> 31 IFL 5.000 30.723 12.064
>>MofM>> --- --- 5.000 28.811 16.708
>>AllP>> --- --- 5.000 921.943 16.708

(continued)
___CPI____ __EICPI___ __EFCPI___ _ESCPL1M__ ___RNI____
3.751 1.611 2.141 35.431 0.673
3.723 1.603 2.120 34.984 0.663
3.691 1.606 2.085 35.052 0.662
3.676 1.596 2.080 34.857 0.658
3.693 1.605 2.089 34.733 0.658
3.680 1.597 2.083 34.533 0.654
3.706 1.581 2.125 34.528 0.672
3.671 1.567 2.104 34.066 0.664
3.697 1.610 2.087 35.268 0.665
3.698 1.609 2.088 35.181 0.661
3.697 1.582 2.114 34.840 0.669
3.666 1.571 2.095 34.518 0.662
3.700 1.537 2.163 34.334 0.704
3.528 1.500 2.028 32.326 0.670
3.596 1.564 2.032 34.015 0.656
3.580 1.554 2.026 33.903 0.654
3.615 1.550 2.065 33.335 0.658
3.586 1.542 2.044 33.018 0.652
3.653 1.602 2.051 35.141 0.659
3.646 1.591 2.056 35.108 0.659
3.600 1.572 2.028 34.073 0.653
3.565 1.561 2.004 33.921 0.649
3.645 1.549 2.096 33.704 0.675
3.619 1.535 2.084 33.466 0.671
3.708 1.590 2.119 30.896 0.653
3.680 1.577 2.103 30.675 0.649
3.730 1.625 2.105 32.059 0.651
3.728 1.615 2.114 31.846 0.649
3.726 1.608 2.117 31.738 0.648
3.693 1.593 2.100 31.552 0.645
3.712 1.623 2.089 32.168 0.647
3.705 1.609 2.096 32.032 0.646
3.669 1.585 2.085 33.654 0.659
3.669 1.585 2.085 33.654 0.659

(continued)
__T1MSEC__ __T1CPU___ _T1CYPTM__ _PTEPT1M__
2453.155 11.118 69.570 57.754
2429.022 11.269 70.208 58.006
2594.269 11.268 69.750 58.295
2601.502 11.369 69.801 58.126
2566.926 10.992 69.006 57.550
2566.724 11.095 69.142 57.427
2069.909 10.483 68.807 56.671
2073.932 10.540 68.827 56.524
2631.340 11.263 69.343 58.217
2628.462 11.384 69.616 58.128
2196.796 10.592 68.372 56.794
2195.996 10.683 68.448 56.559
1551.299 9.487 66.600 53.496
1535.451 9.522 66.247 52.992
2374.719 11.269 68.106 56.582
2371.102 11.323 68.474 56.681
1942.316 10.566 67.241 55.038
1935.619 10.644 67.420 55.042
2726.860 11.508 69.168 57.984
2755.979 11.622 68.978 57.540
2477.460 11.094 67.960 57.357
2471.313 11.206 68.384 57.352
1772.049 10.224 67.503 55.092
1761.271 10.318 67.767 55.208
2062.040 9.742 64.795 51.092
2077.506 9.737 65.036 51.040
2385.742 10.550 67.016 54.334
2386.567 10.639 67.043 54.202
2219.675 10.088 65.729 52.826
2228.459 10.162 65.880 52.595
2501.015 10.886 67.367 54.978
2490.013 10.969 67.673 54.999
2282.327 10.770 67.978 56.013
73034.450 10.770 67.978 56.013

(continued)
___L1MP___ ___L2P____ ___L3P____ ___L4LP___ ___L4RP___ ___MEMP___
6.042 81.744 10.984 5.027 0.003 2.242
6.060 82.096 10.834 4.833 0.002 2.234
5.947 81.889 10.929 4.993 0.003 2.185
5.968 82.149 10.763 4.887 0.003 2.198
6.014 81.910 11.013 4.891 0.002 2.183
6.032 82.151 10.867 4.792 0.002 2.187
6.155 81.767 11.064 4.910 0.003 2.256
6.175 82.145 10.836 4.761 0.003 2.255
5.918 82.059 10.734 4.987 0.003 2.217
5.936 82.237 10.604 4.947 0.002 2.210
6.069 81.764 11.054 4.947 0.003 2.233
6.068 82.098 10.838 4.832 0.002 2.230
6.300 80.969 11.304 5.396 0.004 2.328
6.274 82.185 10.744 4.780 0.003 2.289
5.973 82.037 10.878 4.907 0.003 2.176
5.975 82.200 10.773 4.840 0.002 2.184
6.194 81.933 11.007 4.870 0.003 2.188
6.189 82.208 10.854 4.748 0.002 2.188
5.836 81.780 11.104 4.939 0.003 2.175
5.855 81.957 10.966 4.884 0.003 2.190
5.952 81.920 10.928 5.022 0.002 2.128
5.909 82.159 10.786 4.915 0.002 2.137
6.219 81.834 10.902 5.000 0.003 2.261
6.226 82.186 10.650 4.880 0.003 2.281
6.857 82.297 10.729 4.776 0.008 2.191
6.856 82.383 10.715 4.712 0.008 2.182
6.566 82.012 11.105 4.698 0.007 2.179
6.637 82.240 10.938 4.629 0.006 2.187
6.671 82.047 10.994 4.810 0.008 2.141
6.654 82.225 10.872 4.756 0.008 2.140
6.494 81.947 11.149 4.756 0.007 2.141
6.544 82.177 10.944 4.721 0.006 2.152
6.194 82.031 10.902 4.865 0.004 2.199
6.194 82.031 10.902 4.865 0.004 2.199

(continued)
__eMIPS___ __iMIPS___
409.195 1332.820
406.444 1342.885
435.130 1354.747
434.505 1360.139
436.313 1353.737
434.637 1358.583
366.622 1349.282
368.914 1362.041
438.156 1352.326
434.687 1352.214
383.603 1352.508
383.814 1363.908
294.326 1351.288
302.798 1417.240
399.099 1390.378
400.591 1396.824
341.954 1383.243
341.923 1394.478
448.700 1368.850
448.588 1371.294
421.522 1388.730
423.027 1402.466
320.941 1371.563
319.616 1381.537
369.859 1348.345
377.060 1358.639
406.280 1340.472
403.372 1341.052
388.171 1342.005
391.206 1353.919
416.934 1346.885
414.569 1349.366
392.580 1362.617
12562.553 43603.755

The table below gives definitions for each of the columns in the report.

Column Meaning
Basic LPU Statistics
IntEnd The hh:mm:ss of the CP Monitor interval-end time, in UTC.

The first flock of rows is marked ">>Mean>>" to indicate that the rows are the mean experience of each logical PU over the whole time range recorded in the MONWRITE file.

The special row ">>MofM>>", mean of means, is the average experience of the average logical PU over the whole time range of the MONWRITE file.

The special row ">>AllP>>", all processors, merely states the sum of the LPARCPU, T1MSEC, eMIPS, and iMIPS columns, described later.

LPU The processor address, aka logical PU number, of the PU this row describes.
Typ The type of processor: CP, IFL, etc.
EGHZ Effective clock rate of the CEC, in GHz.
LPARCPU Percent busy of this logical PU as portrayed by the counters.
PrbState Percent of CPU-busy spent in problem state.
Basic CPI Statistics
CPI Cycles per instruction. The average number of clock cycles that transpire between completion of instructions.
EICPI Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles instructions would take if they never, ever incurred an L1 miss. The word "infinite" comes from the wish, "If we but had infinite L1, this is how long the instructions would have taken."
EFCPI Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instructions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.
ESCPL1M Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right.
RNI Relative nest intensity. A scalar that expresses how hard the caches are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.
Basic TLB Statistics
T1MSEC Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.
T1CPU Percent of CPU-busy that is attributable to TLB misses.
T1CYPTM Number of cycles a TLB miss tends to cost.
PTEPT1M Page table entry (PTE) percent of all TLB misses.
Memory Cache (L1, etc.) Behavior
L1MP L1 miss percentage. This is the percent of instructions that incur an L1 miss.
LxxP Percent of L1 misses sourced from cache level xx.

On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R").

On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book ("4L"), or L4 on some other book ("4R").

MEMP Percent of L1 misses sourced from memory.
MIPS Behavior
eMIPS Instruction completion rate, millions of instructions per elapsed second.
iMIPS Instruction completion rate, millions of instructions per CPU-second.
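
The basic CPI and MIPS columns are arithmetically tied together by the definitions above: CPI is the sum of EICPI and EFCPI, EFCPI is (to a close approximation) the L1 miss rate per instruction (L1MP/100) times the sourcing cost ESCPL1M, and eMIPS is iMIPS scaled by LPARCPU. The following Python sketch recomputes those relationships from the >>Mean>> row for LPU 0 of the sample z13 report; the dictionary keys, the close() helper, and the 1% tolerance are illustrative choices of ours, not part of the CPUMFLOG package.

  # Plausibility check on a CPUMFLOG row, using the column definitions above:
  #   CPI   = EICPI + EFCPI             (total CPI = infinite CPI + finite CPI)
  #   EFCPI = (L1MP / 100) * ESCPL1M    (L1 misses per instruction * cycles per miss)
  #   eMIPS = iMIPS * (LPARCPU / 100)   (per elapsed second = per CPU-second * utilization)

  row = {                               # >>Mean>> row for LPU 0 of the sample z13 report
      "LPARCPU": 30.701, "CPI": 3.751, "EICPI": 1.611, "EFCPI": 2.141,
      "ESCPL1M": 35.431, "L1MP": 6.042, "eMIPS": 409.195, "iMIPS": 1332.820,
  }

  def close(a, b, tol=0.01):
      """True if a and b agree within a 1% relative tolerance."""
      return abs(a - b) <= tol * max(abs(a), abs(b))

  assert close(row["CPI"], row["EICPI"] + row["EFCPI"])
  assert close(row["EFCPI"], row["L1MP"] / 100.0 * row["ESCPL1M"])
  assert close(row["eMIPS"], row["iMIPS"] * row["LPARCPU"] / 100.0)
  print("CPI, EFCPI, and eMIPS are consistent with the other columns")

RNI, by contrast, cannot be rederived this simply; it is a machine-specific weighted combination of the sourcing percentages, with the weights given in the Burg presentation.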


What To Do With The Information

CPU MF counter data isn't like ordinary performance data, in that there is no z/VM or System z "knob" one can directly turn to affect the achieved values. For example, there's no "give me more L1" knob we could turn to increase the amount of L1 on the CEC if we felt something were lacking in our L1 performance.

For this reason, the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask: if there isn't much that can be done to influence it, why bother looking at it at all?

It turns out there are some very useful things we can do with CPU MF information, even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.

Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Report (LSPR). The L1 miss percent L1MP and the RNI value together constitute the "LSPR hint", which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. For more information, see IBM's LSPR page.
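
Conceptually, the LSPR hint is a two-dimensional lookup: the L1MP value selects a band, and within that band the RNI value selects a LOW, AVERAGE, or HIGH category. The Python sketch below shows only the shape of that lookup; every band boundary in it is a placeholder we made up for illustration, so take the real boundary values from IBM's LSPR page or the Burg presentation, not from this code.

  # Shape of the LSPR workload-hint lookup. CAUTION: every threshold below is
  # an invented placeholder; substitute the published LSPR boundary values.

  def lspr_hint(l1mp, rni):
      """Map L1 miss percent and relative nest intensity to a workload category."""
      if l1mp < 3.0:                    # low-L1MP band (placeholder boundary)
          low_cut, high_cut = 0.75, 1.00
      elif l1mp <= 6.0:                 # middle band (placeholder boundary)
          low_cut, high_cut = 0.60, 1.30
      else:                             # high-L1MP band (placeholder boundary)
          low_cut, high_cut = 0.50, 1.20
      if rni < low_cut:
          return "LOW"
      if rni <= high_cut:
          return "AVERAGE"
      return "HIGH"

  # The >>MofM>> row of the sample z13 report: L1MP = 6.194, RNI = 0.659.
  print(lspr_hint(6.194, 0.659))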

One thing we can do to affect cache performance is to remember that all of the partitions running on the CEC are competing for the CEC's cache. Steps we take to keep the partitions' peak times from overlapping will help matters. If our workload is scheduled so that all partitions heat up at 9 AM and all partitions cool off at 6 PM, we might consider staggering our company's work so that the partitions heat up at different times. Extending the idea, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a less time-oriented placement, so that any given CEC doesn't have all of its partitions hot at the same time.

If our CEC is hosting a mix of z/OS partitions and other partitions, we can affect cache performance by turning on z/OS HiperDispatch in the z/OS partitions. Doing this helps PR/SM and z/OS to shrink those partitions' cache influence, because z/OS HiperDispatch switches the z/OS partitions to something called vertical mode. For more information about vertical mode partitions, consult z/OS documentation.

Another thing we can do to affect cache performance is to tune the system's configurations of logical CPUs and virtual CPUs so that those two choices are right-sized for the workload. If a z/VM partition is a logical 16-way but is running only 425% busy on average with peaks at 715%, set it to be an 8-way instead of a 16-way. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
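
The arithmetic behind those right-sizing numbers is simply to cover the observed peak rather than the configured maximum. Here is a minimal sketch in Python using the utilization figures from the examples above; the function name and the headroom parameter are our own illustration, not a z/VM or Linux facility.

  import math

  def suggested_nway(peak_pct, headroom_cpus=0):
      """Smallest N-way that covers the observed peak utilization, where
      peak_pct is total utilization across all CPUs (100 = one CPU fully busy)."""
      return math.ceil(peak_pct / 100.0) + headroom_cpus

  print(suggested_nway(715))                   # logical 16-way peaking at 715% -> 8-way
  print(suggested_nway(280))                   # virtual 12-way peaking at 280% -> 3-way
  print(suggested_nway(280, headroom_cpus=1))  # or a 4-way with one CPU of headroom

Whether to leave a CPU of headroom is a judgment call; the point is that the configured N-way should track the measured peak.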

Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.

Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.