Using CPU Measurement Facility Host Counters

Last updated: 2025-05-16, FACKRUDDIN MD

With VM64961 for z/VM 5.4 or with later z/VM releases, z/VM can now collect and record the System z CPU Measurement Facility host counters. These counters record the hardware performance experience of the logical processors of the z/VM partition.

z/VM's CP Monitor facility logs out the counters in a new monitor record, Domain 5 Record 13 (D5 R13), MRPRCMFC. The MONWRITE utility journals the monitor records to disk.

In this article we describe what the counters portray, how to reduce the counters, what the calculated metrics mean, and how to use the calculated metrics to gain insight about the behavior of the z/VM partition and its logical processors.


What the Counters Portray

The System z CPU Measurement Facility offers means by which a System z CPU records its internal performance experience for later extraction by software. The host counters component of CPU MF counts internal CPU events such as instructions completed, clock cycles used, and cache misses experienced.


Reference Materials

For complete information about the CPU Measurement Facility, see these documents:

  • z16
  • z15


How to Collect the Counters

    To make use of the counters, one must first set up to collect them. To learn how, visit our CPU MF collection instructions page. Following the instructions correctly yields a MONWRITE file containing the D5 R13 MRPRCMFC records.
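
    As a rough sketch only (the collection instructions page is authoritative, and the command operands shown here are our assumptions), the setup amounts to authorizing the counter sets for the partition on the SE and then having CP Monitor sample processor data while MONWRITE journals it:

        /* Assumed collection steps; confirm against the instructions page.   */
        /* 1. In the partition's activation profile on the HMC/SE, authorize  */
        /*    the counter facility (basic, problem-state, extended sets).     */
        /* 2. Then, from an authorized z/VM user, sample the processor        */
        /*    domain and journal the monitor data with MONWRITE:              */
        'CP MONITOR SAMPLE ENABLE PROCESSOR'         /* emits D5 R13 samples  */
        'MONWRITE MONDCSS *MONITOR DISK'   /* operands assumed; see MONWRITE doc */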


How to Reduce the Counters

    In his presentation John Burg describes the calculations needed to derive interesting metrics from the raw counter values. Each CEC type (z14, z15, z16, etc.) emits raw counters of different meaning and layout, so the calculations are specific to machine type. The output of the calculations is a set of values useful in understanding machine behavior.

    z/VM Performance Toolkit contains no support for analyzing the raw counter values. In other words, Perfkit has not been updated to do the calculations prescribed by Burg.

    On this web site we have posted a reduction tool one can use to do the Burg calculations. This package contains these items:

    • A first exec, CPUMFINT, that extracts the raw counters and other data from a MONWRITE file, writing the extracted data to an intermediary CMS file we call the interim file.
    • A second exec, CPUMFLOG, that reads the interim file, applies the Burg formulas, and produces a formatted, time-indexed log report as output.
    • Ancillary or support execs used by CPUMFINT or CPUMFLOG.

    The process of reducing the counters, then, amounts to this (a rough invocation sketch follows the list):

    1. Start with a MONWRITE file that contains D5 R13 records.
    2. Use the CPUMFINT tool to extract counter data from the MONWRITE file. CPUMFINT takes the MONWRITE file as input and produces the interim file as output. The interim file will have CMS filetype CPUMFINT.
    3. Use the CPUMFLOG tool to process the interim file. The CPUMFLOG tool applies the Burg formulas, does the appropriate calculations, and writes a report. The report file will have CMS filetype $CPUMFLG.
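
    For orientation only, the flow looks roughly like the sketch below. The operand order is an assumption on our part, not documented syntax; the invocation instructions shipped in the package govern the real usage.

        /* Hypothetical invocation sketch; operands are illustrative only.    */
        'CPUMFINT MYMON DATA A'        /* MONWRITE file in; interim file out  */
                                       /*   (filetype CPUMFINT)               */
        'CPUMFLOG MYMON CPUMFINT A'    /* interim file in; report file out    */
                                       /*   (filetype $CPUMFLG)               */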

    Specific invocation instructions are included in the downloadable package.

    The CPUMFLOG tool uses only the basic, problem-state, and extended counters in its calculations. The interim file also contains the CPACF crypto counters, provided the administrator enabled those counter sets for the partition on the SE. A separate package in our download library, D5R13CRY, reduces the CPACF crypto counters. However, D5R13CRY runs directly off the MONWRITE file, so to see CPACF behavior there is no need to use the CPUMFINT tool.


Appearance of the CPUMFLOG Report

    Metrics calculated from the CPU MF counters describe the performance experience of each logical processor in the partition over time. For each CP Monitor sample interval, for each logical processor, CPUMFLOG writes a report row calculated from the counter values for that interval. The resulting tabular report bears a vague resemblance to a Perfkit xxxxLOG report.

    The columns of the report will vary slightly according to CEC type. The various types have different cache structures and therefore warrant accordingly different sets of columns in their report outputs.

    Here is an excerpt of a z16 report. The report is very wide; on this web page, for page rendering purposes, we have broken the columns into groups.

    The workload here was entirely contrived for internal lab purposes; the values in the report mean absolutely nothing as far as customer workload expectations are concerned.

  • z16 CPU MF Output Log Report, Nov 2024

    The table below gives definitions for each of the columns in the report; a small worked example showing how the CPI-related columns fit together follows the table.

    Column Meaning

    Basic Logical Processor Statistics
    IntEnd The hh:mm:ss of the CP Monitor interval-end time, in system local time.

    The first flock of rows is marked ">>Mean>>" to indicate that the rows are the mean experience of each logical processor over the whole time range recorded in the MONWRITE file.

    The special row ">>MofM>>", mean of means, is the average experience of the average logical processor over the whole time range of the MONWRITE file.

    The special row ">>AllP>>", all processors, merely states the sums of the LPARCPU, T1MSEC, eICR, and iICR columns, described later.

    LPU The processor address, aka logical processor number, of the logical processor this row describes.
    Typ The type of processor: CP, IFL, etc.
    EGHZ Effective clock rate of the CEC, in GHz.
    LPARCPU Percent busy of this logical processor as portrayed by the counters.
    PrbInst The percent of completed instructions that were problem-state instructions.
    PrbTime The percent of the CPU-busy time that was spent in problem state.

    Basic CPI Statistics
    CPI Cycles per instruction. The average number of clock cycles that transpire between completion of instructions. This is not the same as the average number of cycles it takes for an instruction to run, from its start to its finish.
    EICPI Estimated instruction complexity CPI, sometimes also known as "infinite CPI". This is the number of clock cycles that would transpire between completion of instructions if no instruction ever incurred an L1 miss. The word "infinite" comes from the wish, "If we had infinite L1, this is how the machine would perform."
    EFCPI Estimated cache miss CPI, sometimes also known as "finite CPI". This is the number of clock cycles instruction completions are being delayed because of L1 misses. The word "finite" comes from the lament, "Because our L1 is finite, this is how much our CPI is elongating." If we had infinite L1, this number would be zero.
    ESCPL1M Estimated sourcing cycles per L1 miss. When an L1 miss happens, this is how many clock cycles it takes to make things right.
    RNI Relative nest intensity. A scalar that expresses how hard the caches and memory are working to keep up with the demands of the CPUs. Higher numbers indicate higher intensity. Each CEC type's RNI formula is weighted in such a way that RNI values are comparable across CEC types.

    Basic TLB Statistics
    T1MSEC Miss rate of the Translation Lookaside Buffer (TLB), in misses per millisecond.
    T1CPU Percent of CPU-busy that is attributable to TLB misses.
    T1CYPTM Number of cycles a TLB miss tends to cost.
    PTEPT1M PTE percent of all TLB misses. For z14 and later, this metric is deprecated. However, so as not to disturb the columnar nature of the report, on such machine types the metric is reported as zero.

    Memory Cache (L1, etc.) Behavior
    L1MP L1 miss percentage. This is the percent of instructions that incur an L1 miss.
    LxxP Percent of L1 misses sourced from cache level xx.

    On z10, the levels are L1.5 ("15"), L2 on this book ("2L"), or L2 on some other book ("2R").

    On z196 and later, the levels are L2 ("2"), L3 ("3"), L4 on this book ("4L"), or L4 on some other book ("4R").

    MEMP Percent of L1 misses sourced from memory.

    Instruction Completion Behavior
    eICR Instruction completion rate, millions of instructions per elapsed second.
    iICR Instruction completion rate, millions of instructions per CPU-second.

    Store-Into-Instruction-Stream Behavior
    SIIS Percent of I-cache writes achieved with L3 intervention
    ICWL3PMI I-cache writes achieved with L3 intervention per million instructions completed

    LSPR Workload Hint
    LSPR Low, Avg, or High, per this LSPR article

    Drawer Behavior (z14 and z15 only)
    L4LPOC Percent of L1 misses sourced from local L4 off-cluster L3
    MEMPLC Percent of L1 misses sourced from memory local-on-chip
    MEMPNC Percent of L1 misses sourced from memory on-cluster
    MEMPND Percent of L1 misses sourced from memory on-drawer
    MEMPFD Percent of L1 misses sourced from memory off-drawer

    Deflate Behavior (DFLTCC, z15 and later only)
    DF_DPMI DFLTCC instructions completed per million total instructions completed
    DF_D012PMI DFLTCC CC=0,1,2 instructions completed per million total instructions completed
    DF_012PD Percent of completed DFLTCCs that had CC=0,1,2
    DF_CPD CPI of DFLTCC instructions
    DF_CSOA Per completed DFLTCC, cycles spent obtaining access to DFLTCC coprocessor
    DF_POBC Percent of CPU-busy cycles spent doing DFLTCC

    NN Behavior (NNPA, z16 and later only)
    NN_NPMI NNPA instructions completed per million total instructions completed
    NN_N012PMI NNPA CC=0,1,2 instructions completed per million total instructions completed
    NN_012PN Percent of completed NNPA that had CC=0, 1, or 2
    NN_CPN Total cycles of processing per completed NNPA
    NN_CSOA Cycles spent obtaining access per completed NNPA
    NN_POBC Percent of busy cycles spent doing NNPAs

    AIU Behavior (AIU, z16 and later only)
    W_AIU_CPU Waiting for access to AIU (LPAR CPU Units) OR (Cycles spent obtaining access to the Integrated Accelerator for AI)
    C_AIU_CPU Executing AIU (LPAR CPU Units) OR (Cycles spent executing on the AIU)
    AIU_CPU Total AIU CPU (LPAR CPU Units)
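
    To see how a few of these columns fit together, here is a small worked example in REXX, using made-up counter-style inputs and the relationships implied by the definitions above (total CPI splits into EICPI plus EFCPI, and ESCPL1M is the miss-delay cycles spread over the L1 misses). The numbers are purely illustrative.

        /* Worked example with invented inputs; not from any real measurement. */
        cycles = 2.6E9     /* CPU cycles consumed in the interval               */
        instr  = 1.3E9     /* instructions completed in the interval            */
        l1miss = 3.9E7     /* L1 misses in the interval                         */
        srccyc = 7.8E8     /* cycles spent sourcing those L1 misses             */

        cpi     = cycles / instr         /* CPI:     2.0 cycles per instruction  */
        efcpi   = srccyc / instr         /* EFCPI:   0.6, the cache-miss part    */
        eicpi   = cpi - efcpi            /* EICPI:   1.4, the "infinite L1" part */
        l1mp    = 100 * l1miss / instr   /* L1MP:    3, percent of instructions  */
        escpl1m = srccyc / l1miss        /* ESCPL1M: 20 cycles per L1 miss       */
        say 'CPI='cpi 'EICPI='eicpi 'EFCPI='efcpi 'L1MP='l1mp'% ESCPL1M='escpl1m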


What to Do with the Information

    The CPU MF counter data isn't like ordinary performance data, in that there is no z/VM or System z "knob" one can directly turn to affect the achieved values. For example, there's no "give me more L1" knob we could turn to increase the amount of L1 on the CEC if we felt there were something lacking about our L1 performance.

    For this reason the CPU MF report is at risk of being labelled "tourist information" or "gee-whiz information". Some analysts might ask why we should bother looking at it at all, given that there isn't much that can be done to influence it.

    It turns out there are some very useful things we can do with CPU MF information even though we don't have cache adjusting knobs at our immediate disposal. In the rest of this article, we briefly explore some of them.

    Probably the most useful thing to do with the CPU MF report is to use it as your workload's characterization index into the IBM Large Systems Performance Reference (LSPR). The L1 miss percent L1MP and the RNI value together constitute the "LSPR hint", which in turn reveals which portion of the LSPR to consult when projecting your own workload's scaling or migration characteristics. Later versions of our CPUMF tool even print a column that states which workload hint applies. For more information, see IBM's LSPR page.

    One thing we can do to affect cache performance is to keep in mind that all of the partitions running on the CEC are competing for the CEC's cache. Steps we take to keep the partitions' peak times from overlapping will help matters. If our workload is scheduled so that all partitions heat up at 9 AM and all partitions cool off at 6 PM, we might consider whether we could stagger our company's work so that the partitions heat up at different times. As an extension, if we had put all of the Europe partitions on one CEC, all of the North America partitions on a second, and all of the Asia partitions on a third, we might instead consider a less time-oriented placement, so that any given CEC doesn't have all of its partitions hot at the same time.

    Another thing we can do is run with HiperDispatch enabled in all the partitions that support it. For z/OS this means turning on the HiperDispatch feature. For z/VM this means running with CP SET SRM POLARIZATION VERTICAL. Doing this helps PR/SM to shrink those partitions' cache influence. For more information, consult operating system documentation.
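
    For z/VM, a minimal sketch (issued from a user with the necessary privilege class) might look like the following; QUERY SRM then displays the setting in effect:

        /* Switch the partition to vertical polarization and confirm it. */
        'CP SET SRM POLARIZATION VERTICAL'
        'CP QUERY SRM'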

    Another thing we can do to affect cache performance is to tune the system's configurations of logical processors and virtual processors so that those two choices are right-sized for the workload. If a z/VM partition has 16 logical processors but is running only 425% busy on average with peaks at 715%, set it to have 8 logical processors instead. The same thing applies to virtual servers. If that big Linux guest runs only 115% busy on average with peaks of 280%, it probably should not be configured as a virtual 12-way. Set it to be a virtual 3-way or 4-way instead.
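
    As a purely illustrative heuristic (the right headroom depends on your workload), the arithmetic amounts to rounding the observed peak up to whole engines:

        /* Illustrative sizing arithmetic only; choose headroom to taste. */
        peakbusy = 715                     /* observed peak, percent busy  */
        engines  = (peakbusy + 99) % 100   /* whole engines needed at peak */
        say 'A peak of' peakbusy'% needs at least' engines 'logical CPUs'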

    On z/VM levels having VM66063, or on later z/VM levels, running with CP SET SRM UNPARKING MEDIUM can help to control cache effects. The medium unparking model leaves unneeded vertical-low (VL) cores parked even when PR/SM might have the capacity to run them. For more information, see our unparking article.

    Much is made of the phenomenon called "store into instruction stream" (SIIS) and its potential to affect performance. I-cache stores that cause the L3 to intervene can dramatically decrease performance, so we want to know whether they are happening, and if so, whether they are happening enough to worry about. Later versions of the CPUMF tool print a column called "ICWL3PMI" which tabulates I-cache stores with L3 intervention per million instructions completed. For more information, read our SIIS article.

    Speaking of tuning Linux virtual machines, customers report varying degrees of success with using the cpuplugd daemon to shut off unneeded virtual CPUs during off-peak times. If you have large N-way Linux guests, consider trying cpuplugd in a test environment, and if the tests work out for you, consider putting it into production.

    Just as CPU counts can be right-sized, memory can also be right-sized. Take another look at those UPAGELOG reports for your virtual servers and the I/O rates to your virtual servers' swap extents. If your virtual servers are ignoring their swap extents, you can probably afford to decrease their memory sizes.