Managing PR/SM Overhead

(Last revised: 2017-07-06, BKW)

Introduction

In a recent PMR a customer using z/VM showed us a Perfkit listing citing about 18 IFL cores' worth of unchargeable PR/SM overhead, sometimes also called "LPAR management time", in the shared IFL processor pool. Here's the Perfkit excerpt. The column of interest is the one labelled %LPmgt.

1FCX302  Run 2017/06/26 16:24:06         PHYSLOG
                                         Real Core Utilization Log
From 2017/06/26 10:10:00
To   2017/06/26 10:46:00
For   2160 Secs 00:36:00                 "This is a performance report"
_____________________________________________________________________________________

Interval      <PhCore> Shrd  Total
End Time Type Conf Ded Log. Weight %LgclC %Ovrhd LCoT/L %LPmgt %Total TypeT/L
>>Mean>> IFL    86   0   95   1000 3240.9 244.65  1.075 1804.8 5290.4   1.632
>>Mean>> >Sum   86   0   95   1000 3240.9 244.65  1.075 1804.8 5290.4   1.632

In this article we will go over what causes unchargeable PR/SM overhead, what you can do to abate it, and what tradeoffs you make in doing this tuning.

Causes

Broadly speaking, unchargeable PR/SM overhead is time the PR/SM hypervisor spends managing the dispatching of logical cores onto physical cores. When the CPC is out of tune or poorly configured, a couple of undesirable things start to happen:

The dispatch queue within PR/SM tends to grow, causing more computational complexity for the PR/SM dispatcher.
The operating systems running in the LPARs are at increased risk of having their logical processors undispatched at crucial moments, namely, while those logical processors hold spin locks vital to the operating system's operation. When this happens, a logical processor trying to acquire the lock senses that the logical processor holding the lock is not running. It therefore issues an API call to PR/SM, "diagnose 9C", asking PR/SM to run the lock-holding logical processor so that it can finish what it is doing and release the lock. This sequence causes one or more PR/SM dispatch events.

The purpose of tuning the CPC is to mitigate these undesirable effects.

Relevant Perfkit Reports

As you look at unchargeable PR/SM overhead, its causes, and its remedies, the following Perfkit reports will help you:

FCX302 PHYSLOG prints a table of core utilization by time and by core type. For each type-pool (CPs, IFLs, etc.), the report tells you the number of cores, the number being used for shared LPARs, and some utilization information. Here's a brief excerpt.

1FCX302  Run 2017/06/26 16:24:06         PHYSLOG
                                         Real Core Utilization Log
From 2017/06/26 10:10:00
To   2017/06/26 10:46:00
For   2160 Secs 00:36:00                 "This is a performance report"
______________________________________________________________________________

Interval      <PhCore> Shrd  Total
End Time Type Conf Ded Log. Weight %LgclC %Ovrhd LCoT/L %LPmgt %Total TypeT/L
>>Mean>> IFL    86   0   95   1000 3240.9 244.65  1.075 1804.8 5290.4   1.632
>>Mean>> >Sum   86   0   95   1000 3240.9 244.65  1.075 1804.8 5290.4   1.632
10:11:00 IFL    86   0  116   1000 4238.0 340.31  1.080 1714.0 6292.3   1.485
10:11:00 >Sum   86   0  116   1000 4238.0 340.31  1.080 1714.0 6292.3   1.485
10:12:00 IFL    86   0  115   1000 4164.1 343.12  1.082 1911.5 6418.7   1.541
10:12:00 >Sum   86   0  115   1000 4164.1 343.12  1.082 1911.5 6418.7   1.541

Some helpful definitions:
- %LgclC is percent-busy core utilization spent running logical cores doing their own work.
- %Ovrhd is percent-busy core utilization spent in the PR/SM hypervisor doing work induced by the direct actions of the LPARs and chargeable to the LPARs.
- %LPmgt is percent-busy core utilization spent in the PR/SM hypervisor doing overhead work not chargeable to any specific LPAR.
- %Total is the sum of %LgclC, %Ovrhd, and %LPmgt.
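
As a quick check of these definitions, the following Python sketch recomputes %Total from the >>Mean>> IFL row of the excerpt and converts %LPmgt into cores' worth. The variable and function names are illustrative only, and the sketch assumes that 100% busy corresponds to one core.

```python
# Values copied from the >>Mean>> IFL row of the PHYSLOG excerpt.
pct_lgclc = 3240.9           # %LgclC: logical cores doing their own work
pct_ovrhd = 244.65           # %Ovrhd: PR/SM overhead chargeable to LPARs
pct_lpmgt = 1804.8           # %LPmgt: PR/SM overhead not chargeable to any LPAR
pct_total_reported = 5290.4  # %Total as printed in the report


def cores_worth(pct_busy: float) -> float:
    """Convert percent-busy to cores' worth, assuming 100% = one core."""
    return pct_busy / 100.0


# %Total is defined as the sum of the three components.
pct_total = pct_lgclc + pct_ovrhd + pct_lpmgt

print(f"%Total recomputed: {pct_total:.2f} (report shows {pct_total_reported})")
print(f"Unchargeable PR/SM overhead: {cores_worth(pct_lpmgt):.1f} cores' worth")
print(f"Share of %Total that is %LPmgt: {pct_lpmgt / pct_total:.1%}")
```

The recomputed sum agrees with the printed %Total to within the report's rounding, and the %LPmgt value corresponds to roughly 18 cores' worth of unchargeable overhead.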
FCX306 LSHARACT reports the LPARs' weights, entitlements, logical core counts, and core utilizations. Here is an excerpt:

1FCX306  Run 2017/06/26 16:24:06         LSHARACT
                                         Logical Partition Share
From 2017/06/26 10:10:00
To   2017/06/26 10:46:00
For   2160 Secs 00:36:00                 "This is a performance report"
______________________________________________________________________________________________

LPAR Data, Collected in Partition L001

Core counts:       CP  ZAAP   IFL   ICF  ZIIP
Dedicated           0     0     0     0     0
Shared physical     0     0    86     0     0
Shared logical      0     0    95     0     0

____ .         .     .    .      .        .   .       .        .      .      .      .
Core Partition  Core Load   LPAR                                      <CoreTotal,%> Core
Type Name      Count  Max Weight Entlment Cap TypeCap GrpCapNm GrpCap   Busy Excess Conf
...  P001        ...  ...      0      ... ...     ... ...         ...    ...    ...    .
...  P002        ...  ...      0      ... ...     ... ...         ...    ...    ...    .
IFL  L001         12 1200     80    688.0  No     ... ...         ...  542.4     .0    o
IFL  L002         24 2400    230   1978.0  No     ... ...         ...  281.8     .0    o
IFL  L003         24 2400    230   1978.0  No     ... ...         ...  624.9     .0    o
IFL  L004         28 2800    230   1978.0  No     ... ...         ...  817.5     .0    o
IFL  L005         28 2800    230   1978.0  No     ... ...         ... 1273.0     .0    o

Some helpful definitions:
- Core Count is the number of logical cores online in the LPAR.
- LPAR Weight is the LPAR's weight.
- Entlment is the LPAR's entitlement, computed from the size of the shared physical pool, the LPAR's weight, and the other LPARs' weights.
- Busy is the percent-busy core utilization for the LPAR, averaged over the life of the Perfkit listing.
- Excess shows you the amount of core utilization the LPAR used over its entitlement.
- Core Conf shows you how the LPAR's number of online logical cores seems to relate to the LPAR's entitlement, with the following meanings:
  - o means the LPAR is overconfigured, that is, there are too many logical cores compared to its entitlement.
  - u means the LPAR is underconfigured, that is, there are too few logical cores compared to its entitlement.
  - - means the LPAR appears to have just about the right number of logical cores compared to its entitlement.
FCX239 PROCSUM shows the rate at which the reporting LPAR's z/VM Control Program is issuing Diag 9C calls into PR/SM. The columns you are looking for are out on the right of the report.
The values are rates per second. DSP, SYN, and HVR refer to CP modules HCPDSP, HCPSYN, and HCPHVR respectively. Ideally, no LPAR would be issuing any of these.
To calculate the total rate at which all your z/VM LPARs are calling into PR/SM via diagnose 9C, add up the rates found in their respective Perfkit listings.

Best Practices

The overall objective for best practices to reduce PR/SM overhead is to configure the CPC in such a way that all of the following conditions hold:

Each LPAR consumes the bulk of its CPU power on its entitled logical cores, and
Each LPAR has no more than two vertical-low logical cores online.

To achieve this, attend to all of the following points:

Match entitlements to needs for power: for each LPAR, make sure the LPAR's entitlement is consistent with its actual need for computing power. If the workload running in the LPAR requires about 23 cores' worth of power to run correctly, the LPAR's entitlement should be set to about 23 cores' worth, that is, about 2300%.
How do we set an LPAR's entitlement? Within a type-pool, the LPARs' entitlements are controlled by the LPARs' weights. An LPAR's entitlement is just its weight, divided by the sum of the weights, multiplied by the total power of the shared type-pool. For example:
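
As a minimal sketch of that formula, the Python below recomputes the IFL entitlements from the weights and the shared physical core count shown in the LSHARACT excerpt earlier in this article. The function name `entitlements` is illustrative, and the total power of the shared type-pool is taken to be 100% per shared physical core.

```python
from typing import Dict


def entitlements(weights: Dict[str, int], shared_physical_cores: int) -> Dict[str, float]:
    """Return each LPAR's entitlement in percent (100% = one physical core).

    entitlement = weight / sum(weights) * total power of the shared type-pool
    """
    total_weight = sum(weights.values())
    pool_power_pct = shared_physical_cores * 100.0
    return {
        lpar: weight / total_weight * pool_power_pct
        for lpar, weight in weights.items()
    }


# IFL weights and shared physical IFL core count from the LSHARACT excerpt.
ifl_weights = {"L001": 80, "L002": 230, "L003": 230, "L004": 230, "L005": 230}

for lpar, pct in entitlements(ifl_weights, shared_physical_cores=86).items():
    print(f"{lpar}: weight {ifl_weights[lpar]:>3} -> entitlement {pct:.1f}%")
```

With 86 shared physical IFL cores and weights summing to 1000, this reproduces the Entlment column of the report: 688.0% for L001 and 1978.0% for each of L002 through L005.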

You can manage the LPARs' weights from the HMC or from the SE.
For a lot more information about the relationship between weight and entitlement, read this presentation.
Match logical core counts to entitlements: make sure the logical core counts for the LPARs are in line with the LPARs' entitlements. For example, we would not want an LPAR with entitlement 150% (1.5 cores' worth) to have 20 logical cores online. A good rule of thumb is to have only two to three more logical cores online than the LPAR's entitlement can cover. For example, if the LPAR has entitlement 2335%, we would want there to be no more than 26 logical cores online in the LPAR.
A good way to achieve the above is to define the LPAR with enough logical cores to cover its maximum demand and then to use the z/VM command CP VARY CORE (or, if running non-SMT, CP VARY PROCESSOR) to manage the online core count to match entitlement and utilization.
To find out the core IDs you need for your CP VARY CORE commands, issue CP QUERY PROCESSOR. The core IDs come out on the output. If you don't see any core IDs, your system is running non-SMT and you should use CP VARY PROCESSOR instead.
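
The rule of thumb for logical core counts can be sketched in a few lines of Python. This is one reading of the rule, not an official formula: it takes the number of cores the entitlement can cover, adds the upper end of the "two to three more" margin, and rounds down to a whole core. The function name and the `extra_cores` parameter are hypothetical.

```python
import math


def max_online_cores(entitlement_pct: float, extra_cores: int = 3) -> int:
    """Upper bound on online logical cores for an LPAR.

    Assumption: the bound is the cores' worth of entitlement plus a small
    margin (two to three cores per the rule of thumb), rounded down.
    """
    return math.floor(entitlement_pct / 100.0 + extra_cores)


# The two cases mentioned in the text.
print(max_online_cores(2335.0))  # entitlement 2335%
print(max_online_cores(150.0))   # entitlement 150%
```

For entitlement 2335% this yields 26, matching the figure above. For entitlement 150% it yields 4, which shows how far out of line 20 online logical cores would be.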
Don't be afraid to change: Your LPARs' workloads are likely to change now and then. Perhaps Monday is a peak day for one of your LPARs, or perhaps end-of-month processing is happening in some LPAR. Keep track of your workloads' requirements for CPU power and then manage the entitlements and logical core counts to match the workloads' demands.

An Example

The LSHARACT excerpt above shows a few things we might change if we were going to try to align with the best practices we've just finished discussing. For convenience, we'll repeat the excerpt:

1FCX306  Run 2017/06/26 16:24:06         LSHARACT
                                         Logical Partition Share
From 2017/06/26 10:10:00
To   2017/06/26 10:46:00
For   2160 Secs 00:36:00                 "This is a performance report"
______________________________________________________________________________________________

LPAR Data, Collected in Partition L001

Core counts:       CP  ZAAP   IFL   ICF  ZIIP
Dedicated           0     0     0     0     0
Shared physical     0     0    86     0     0
Shared logical      0     0    95     0     0

____ .         .     .    .      .        .   .       .        .      .      .      .
Core Partition  Core Load   LPAR                                      <CoreTotal,%> Core
Type Name      Count  Max Weight Entlment Cap TypeCap GrpCapNm GrpCap   Busy Excess Conf
...  P001        ...  ...      0      ... ...     ... ...         ...    ...    ...    .
...  P002        ...  ...      0      ... ...     ... ...         ...    ...    ...    .
IFL  L001         12 1200     80    688.0  No     ... ...         ...  542.4     .0    o
IFL  L002         24 2400    230   1978.0  No     ... ...         ...  281.8     .0    o
IFL  L003         24 2400    230   1978.0  No     ... ...         ...  624.9     .0    o
IFL  L004         28 2800    230   1978.0  No     ... ...         ...  817.5     .0    o
IFL  L005         28 2800    230   1978.0  No     ... ...         ... 1273.0     .0    o

Now, let's notice a few things:

If LPAR L002 needs to run only 281% busy,
1. ... why does it have 24 logical cores online?
2. ... why is its entitlement 1978%?
Same kinds of questions for LPAR L003.
Same kinds of questions for LPAR L004.
Same kinds of questions for LPAR L005.

Assuming the CPU demands of LPARs L002 to L005 are fairly steady at the average values depicted in the LSHARACT excerpt, what might we change?

How about reducing LPAR L002 to about five cores online?
How about reducing LPAR L003 to about eight cores online?
How about reducing LPAR L004 to about ten cores online?
How about reducing LPAR L005 to about fifteen cores online?

Steps like these would help to bring this CPC into line with the best practices for reducing unchargeable PR/SM overhead.
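
The suggested core counts can be reproduced with a small Python sketch. The formula here is an assumption chosen because it matches the suggestions above: round the mean Busy value to whole cores and add a two-core margin. It presumes, as the text does, that demand stays close to the averages in the excerpt.

```python
# name: (online logical cores, mean Busy %) from the LSHARACT excerpt.
lpars = {
    "L002": (24, 281.8),
    "L003": (24, 624.9),
    "L004": (28, 817.5),
    "L005": (28, 1273.0),
}


def suggested_online_cores(busy_pct: float, extra_cores: int = 2) -> int:
    """Assumed rule: mean busy rounded to whole cores, plus a small margin."""
    return round(busy_pct / 100.0) + extra_cores


for name, (online, busy_pct) in lpars.items():
    target = suggested_online_cores(busy_pct)
    print(f"{name}: {online} online, {busy_pct:.1f}% busy -> about {target} cores")
```

A workload with pronounced peaks would need a larger margin than this, so the numbers are a starting point for discussion and not a prescription.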

Tradeoffs

Managing the relationship among physical cores, logical cores, utilization, and entitlement requires awareness of the tradeoffs or compromises being made. In this section we discuss some of those tradeoffs.

One tradeoff we make in tuning these systems is between dispatch latency and the overhead associated with additional parallelism. At one end of this spectrum is the configuration that results from strictly following the PR/SM tuning recommendations above. The objective of those configuration guidelines is to reduce overhead in the PR/SM hypervisor by getting rid of what some might call "unnecessary" logical cores. When the ratio of logical cores to physical cores is close to 1, PR/SM dispatching overhead is reduced. Few, heavily busy logical cores are easier for PR/SM to handle than many, lightly busy logical cores.

The consequence of tuning PR/SM in this way is that the burden of managing parallelism gets pushed up into the operating system running in the LPAR. Suppose, for example, that the CPC is sufficiently lightly loaded that PR/SM will let our z/VM LPAR draw 20 logical cores' worth of power if it wants to do so. If the z/VM hypervisor, running SMT-2, has 40 guest virtual CPUs to dispatch, and those guest virtual CPUs will require only six logical cores' worth of power altogether, what is the correct compromise? Consider these scenarios:

Should z/VM try to run the 40 virtual CPUs on only six logical cores? This will relieve PR/SM of dispatch overhead. But it will also result in the virtual CPUs queueing in z/VM at the logical CPUs, possibly increasing z/VM scheduling overhead, virtual CPU dispatch latency, and transaction response time.
At the other end of the spectrum, what if z/VM were to use 20 logical cores to run the 40 virtual CPUs? This would make z/VM's dispatching job easier, and it would afford the virtual CPUs instant access to logical CPUs. But PR/SM's dispatching job would be more difficult, because it is managing more logical cores.
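
The arithmetic behind these two scenarios can be sketched as follows. The function names are illustrative, and the sketch assumes SMT-2 gives z/VM two logical CPUs per logical core and that the six cores' worth of demand spreads evenly across the cores in use.

```python
def vcpus_per_logical_cpu(virtual_cpus: int, cores: int, threads_per_core: int = 2) -> float:
    """Average number of virtual CPUs competing for each logical CPU."""
    return virtual_cpus / (cores * threads_per_core)


def average_core_busy_pct(demand_pct: float, cores: int) -> float:
    """Average percent-busy per logical core if demand spreads evenly."""
    return demand_pct / cores


VIRTUAL_CPUS = 40
DEMAND_PCT = 600.0  # six logical cores' worth of power

for cores in (6, 20):
    ratio = vcpus_per_logical_cpu(VIRTUAL_CPUS, cores)
    busy = average_core_busy_pct(DEMAND_PCT, cores)
    print(f"{cores:>2} cores: {ratio:.2f} virtual CPUs per logical CPU, "
          f"about {busy:.0f}% busy per core")
```

On six cores, more than three virtual CPUs share each logical CPU and every core runs close to fully busy, so virtual CPUs queue inside z/VM. On 20 cores, each virtual CPU has a logical CPU to itself, but PR/SM must dispatch 20 lightly loaded logical cores.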

Another tradeoff we make is between dispatch latency and performance of the CPC's caches. Consider again the scenario above: six logical cores, or 20 logical cores?

The former will help to reduce the z/VM LPAR's tendency to contaminate the caches being used by the other LPARs, because z/VM is running on fewer logical cores. This can help improve CPI, but it can increase dispatch latency for virtual CPUs.
The latter tends to increase the z/VM LPAR's ability to contaminate caches being used by other LPARs, because z/VM is running its workload on more logical cores. This can harm CPI, but it can help improve dispatch latency for virtual CPUs.

A third tradeoff is between the staff and automation needed to manage entitlement, logical core count, and utilization, and the results that management achieves. How good is "good enough" and what is the cost of that achievement? Only the customer can answer this.

Clearly this is a compromise whose correct answer comes only from observing the success metrics for the workload and for the customer's business and then tuning to optimize those success metrics.

z/VM HiperDispatch

When the LPAR is vertical and z/VM is running in an LPAR for which Global Performance Data Control (GPDC) is enabled, z/VM manages its unparked logical core count according to what we call the capacity floor forecast. Every few seconds, z/VM uses its GPDC privilege to ask PR/SM for information about the CPC: the numbers of physical cores of each type, the configurations of the activated LPARs, and those activated LPARs' core utilizations. Using that information, z/VM projects a floor (a minimum) on how much computing power PR/SM is likely to be able to deliver to the z/VM LPAR in the next few seconds, if z/VM wants to use it. z/VM then runs with enough logical cores unparked to be able to consume that power if necessary.

The purpose of the above strategy is to give z/VM the opportunity to achieve dispatch parallelism for its guests without running on logical cores that PR/SM might be unable to power. If PR/SM is likely to be able to power 20 logical cores over the next few seconds, z/VM runs with 20 logical cores unparked and then makes use of all 20 of them to run guests. z/VM does this even if it might result in each of the 20 cores being very lightly loaded. This strategy can help to reduce the amount of time a dispatchable guest virtual CPU waits for access to a logical CPU. This reduction in dispatch latency can help to reduce transaction response time, which is a success metric for some workloads.

Though the above behavior can help decrease virtual CPU dispatch latency, it can also inadvertently promote PR/SM overhead. There is another way, though, to configure z/VM HiperDispatch. With this second technique, z/VM HiperDispatch behaves in a way that helps to reduce PR/SM overhead.

When the LPAR is vertical and z/VM is running in an LPAR for which GPDC is disabled, z/VM cannot see the configuration of the CPC, nor can it see the configurations and core utilizations of the other LPARs. This means z/VM cannot compute a capacity floor forecast. So, instead, z/VM computes a consumption ceiling forecast for its own LPAR. The computation yields a consumption ceiling above which it is very unlikely the workload will rise within the next few seconds. z/VM then runs with enough logical cores unparked to be able to power the workload as long as its utilization remains below the forecast ceiling. Embedded here is the assumption that PR/SM will fully power each unparked logical core. There is no way for z/VM to know this for certain, because z/VM cannot see the condition of the CPC.

The consumption ceiling forecast takes into account both the average utilization over the recent past and the variability present in the samples collected to compute the average. A very steady workload will produce ceiling forecasts that are close to the mean. An erratically behaving workload will produce ceiling forecasts that are somewhat higher than the mean.

After computing the consumption ceiling forecast, z/VM adds a "pad" value specified by the system programmer. The pad value is controlled by the command CP SET SRM CPUPAD. The purpose of the pad value is to give the system programmer an opportunity to specify a safety margin over and above the forecast consumption ceiling.
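
To make the shape of this calculation concrete, here is a hedged Python sketch. It is not z/VM's actual algorithm, which this article does not spell out. As a stand-in, it takes the ceiling to be the sample mean plus a multiple of the sample standard deviation, adds the pad, and rounds up to whole cores. The function name, the `spread_factor` parameter, and the treatment of the pad as a percent-busy value are all assumptions.

```python
import math
import statistics
from typing import Sequence


def unparked_cores_needed(samples_pct: Sequence[float],
                          cpupad_pct: float = 0.0,
                          spread_factor: float = 2.0) -> int:
    """Illustrative ceiling forecast: mean + spread_factor * stdev + pad.

    samples_pct: recent utilization samples in percent (100% = one core).
    cpupad_pct:  safety margin, assumed here to be in the same units.
    """
    mean = statistics.fmean(samples_pct)
    spread = statistics.stdev(samples_pct) if len(samples_pct) > 1 else 0.0
    ceiling_pct = mean + spread_factor * spread
    return max(1, math.ceil((ceiling_pct + cpupad_pct) / 100.0))


# Hypothetical samples: same mean, different variability.
steady = [600.0, 600.0, 600.0, 600.0, 600.0]
erratic = [400.0, 800.0, 500.0, 700.0, 600.0]

print(unparked_cores_needed(steady))
print(unparked_cores_needed(erratic))
print(unparked_cores_needed(steady, cpupad_pct=100.0))
```

Both sample sets average 600%, yet the erratic one calls for more unparked cores, which mirrors the behavior described above. The pad then raises the result by a fixed margin regardless of variability.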

All of the information z/VM collects to forecast capacity floors and consumption ceilings comes out in monitor records. The Perfkit FCX299 PUCFGLOG report displays this information, along with the calculated projections and the configuration decisions made.

A drawback of turning off GPDC for the z/VM LPAR is that CPC-wide traits, such as the list of activated LPARs and the core utilizations of the shared physical core pools, will not show up in Monitor. This in turn denies the performance reporting products, such as Perfkit, the opportunity to format and report on that information. One possible workaround is to run one very small z/VM LPAR with GPDC enabled, running no actual workload, whose only purpose is to collect CPC-wide performance data.

Summary

The PR/SM hypervisor requires CPU power for its own ends. One of those ends is the dispatching of LPARs' logical cores. As the ratio of logical cores to physical cores grows, PR/SM dispatching complexity increases, which can increase PR/SM's CPU consumption. A growing ratio can also cause operating systems running in LPARs occasionally to observe that one or more of their logical cores is not dispatched on a physical core. If this happens at an inopportune moment, the operating system's reaction to the condition can increase PR/SM overhead.

There exist CPC tuning guidelines that can help to decrease PR/SM overhead. The theme of the guidelines is to match the LPARs' entitlements and logical core counts to the workloads' needs for power. LPAR weight controls LPAR entitlement, so matching entitlements to the workloads' needs means being willing to adjust LPAR weights. Matching logical core counts to workloads' demands means being willing to issue CP VARY CORE occasionally.

Certain Perfkit reports help the system administrator to see the lay of the land as regards PR/SM overhead, weight, entitlement, logical core count, and utilization. Key reports are FCX302 PHYSLOG and FCX306 LSHARACT.

z/VM HiperDispatch can help you to run your z/VM LPAR in a fashion that helps to reduce PR/SM overhead. Running in vertical mode with Global Performance Data Control disabled causes z/VM to run with a number of unparked logical cores that seems about right for the demands of the workload. However, running this way removes the system programmer's ability to see CPC-wide configuration and behavior data in the performance reporting products such as Perfkit.

Ultimately, tuning is about maximizing the success metrics for the workload. Optimizing the behavior of PR/SM is not necessarily the same as optimizing the business result. Tradeoffs will have to be made.