Reconciling CPU Utilization in an LPAR Environment

Updated: 19 July 2000

The correlation between the VMPRF and RTM CPU Utilization values can be somewhat interesting and challenging at times. Add in the LPAR environment, and the complexity can turn into perplexity, and might take away some of the fun!

The CP INDICATE command and RTM's Logical CPU busy value report how busy the VM system is as seen by VM itself. For example, CP INDICATE may report 100% busy when in reality the system has used only 15% of the base machine (because LPAR or the host VM did not want to give more cycles to the VM system queried with INDICATE). Capacity people may say "100% is wrong, it should be 15%", while end users may say the 100% is right because it is what they feel: their VM system is busy and there are no available resources. So what seems wrong at first may have value after all.

The following information should help explain the differences between various views of processor utilization.


Points to Remember

  • The VMPRF PRF017 report includes LPAR overhead. The Physical Processor utilization is expressed out of 100% of the CEC's physical capacity, while the Logical Processor utilization is expressed out of 100% of the partition's logical capacity.
  • In RTM, the %CPU field on the SLOG, GENERAL, and CPU LOGICAL displays is based on a logical view of processor utilization. %CPU for these screens is shown as a percentage busy out of (N * 100%), where N is the number of logical processors defined for the VM partition.
  • The RTM logical processor value does not account for time when the processor is taken away from VM to run other partitions or to enforce LPAR capping.
  • The following VMPRF reports also show processor utilization:
    • PRF002 System_Summary_By_Time shows utilization out of 100% for the number of processors VM is running on. It is based on processor time used and elapsed (wallclock) time, so it is not skewed by competition from other partitions. It does not include any LPAR overhead.
    • PRF003 Processors_By_Time and PRF015 Processors_Complex_By_Time both show the utilization for a single processor out of 100% of a processor. It is based on processor time used and elapsed (wallclock) time, so it is not skewed by competition from other partitions. It does not include any LPAR overhead.
  • The INDICATE command AVGPROC field gives the processor utilization out of 100% for the number of processors VM is running on. It is based on processor time used and voluntary wait time. Therefore, it is skewed when running in an LPAR when the partition is using shared processors. It will further differ by the fact that it reports a smoothed average.
  • The processor utilization values shown in the RTM USER, ULOG, and GENERAL displays for individual users are shown as a percentage busy of a processor, based on actual processor time used over elapsed (wall clock) time. They correspond with the physical CPU reports.
  • RTM Physical CPU can be seen with the D CPU PHYSICAL command or in the D CPLOG command.
  • Most of the comments included here for LPAR also apply when running as a second level guest of another VM/ESA system. However, first level overhead is not reported the way LPAR overhead is reported. The math can also get more complicated if you have more virtual processors defined for the second level VM/ESA system than there are real processors (this cannot be done in LPAR).
  • Spin Time (%SP) as reported on the RTM GENERAL and CPU displays is inaccurate in an LPAR environment. CP computes this value based on wall clock time. If the processor is preempted while in a spin loop, the involuntary wait time is also included in the spin time counter.

Example:

The relationship between the different measurements is probably best explained with an example. The following table shows 7 logical partitions running on a real 3-way CEC. Partitions A and B are not capped; all the others are.


VMPRF PRF017 and RTM SLOG correlations

The PRF017 columns come from the VMPRF PRF017 LPAR Report; the RTM columns come from RTM. All utilization and wait values are percentages.

 Partition  Logical  Weight  PRF017      PRF017      RTM      RTM       Voluntary  Involuntary
 Name       PUs              Proc Util   Proc Util   Logical  Physical  Wait       Wait
                             Logical     Physical    %CPU     %CPU      (%VW)      (%IW)
 A          3        578     34.55       34.55       105      100       185        16
 B          2        26      1.31        0.87        2.5      2.4       194        3.2
 C          1        95      13.20       4.40        15       13        76         11
 D          1        138     28.69       9.56        35       28        52         20
 E          1        79      14.06       4.69        16       14        70         17
 F          1        42      12.76       4.25        66       13        6.5        81
 G          1        42      1.94        0.65        2.0      1.9       95         3.4

Relationship of VMPRF PRF017 Logical to Physical Processor Utilization

Both of these fields use the total CPU used by the partition as the numerator. The difference is what is used as the denominator. The total power of the logical number of processors for this partition is used for the logical utilization, while the total power of the number of physical processors is used for the physical utilization. The relationship is:

                                                    Number Logical PUs
 Physical Proc Util =  PRF017 Logical Proc Util  * ---------------------
                                                    Number Physical PUs
 
 
                                                    Number Physical PUs
 Logical Proc Util =  PRF017 Physical Proc Util  * ---------------------
                                                    Number Logical PUs

Partition A has the same value (34.55%) for both the VMPRF Logical Utilization and the Physical Utilization. This is because partition A has the same number of Logical processors as there are Physical processors.

The other partitions all have fewer logical processors than there are real processors. Therefore, their logical processor utilization will be higher than the physical processor utilization on the PRF017 report. Partition C is a logical 1-way. It used the CPU equivalent to 4.40% of a real 3-way machine. However, from the logical view, Partition C used 13.20% of a real 1-way machine.
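The two conversions above can be checked against the table with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example and are not part of VMPRF or RTM, and the constant 3 is the number of physical PUs in the example CEC.

```python
# Illustrative helpers (hypothetical names, not part of VMPRF or RTM).

def physical_from_logical(logical_util, logical_pus, physical_pus):
    """PRF017 Physical Proc Util from PRF017 Logical Proc Util."""
    return logical_util * logical_pus / physical_pus


def logical_from_physical(physical_util, logical_pus, physical_pus):
    """PRF017 Logical Proc Util from PRF017 Physical Proc Util."""
    return physical_util * physical_pus / logical_pus


PHYSICAL_PUS = 3  # the example CEC is a real 3-way

# (partition, logical PUs, PRF017 logical util) taken from the table
rows = [("A", 3, 34.55), ("B", 2, 1.31), ("C", 1, 13.20), ("D", 1, 28.69)]

for name, logical_pus, logical_util in rows:
    physical = physical_from_logical(logical_util, logical_pus, PHYSICAL_PUS)
    print(name, round(physical, 2))
```

For partition C this gives 4.4, matching the 4.40 in the PRF017 Physical column; partition A is unchanged because it has as many logical as physical processors.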

Relationship of VMPRF PRF017 Logical to RTM Physical Utilization

One of the differences between the VMPRF PRF017 logical processor utilization and the RTM Physical processor utilization (from D CPU PHYSICAL) is that VMPRF includes LPAR overhead. The other difference is that the PRF017 value is out of a 100% maximum or total, while the RTM physical processor value is out of an N*100% maximum or total, where N is the number of logical processors. This means a relationship of:

                                                    RTM Physical CPU
 PRF017 Logical Util = Partition LPAR Overhead + ----------------------
                                                   Number Logical PUs
 
 
 RTM Phys. CPU =  (PRF Logical Proc Util - LPAR Overhead) * Num Logical PUs

In the example above, partitions C, D, E, F, and G have only 1 logical processor, so N*100% = 100%. Therefore their PRF017 Logical Processor Utilizations are all very similar to the RTM Physical %CPU; they are actually slightly higher because LPAR overhead is included. Partition F's RTM value (13) compared to PRF's (12.76) is an exception, most likely due to rounding and intervals being slightly off. Partition A's RTM physical value (100) divided by the number of logical processors (3) yields 33.3, which is slightly lower than the 34.55 reported by PRF for logical utilization, again due to LPAR overhead. Partition B, a logical 2-way, follows the same pattern with values of 1.2 and 1.31.
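Rearranging the relationship above gives a rough estimate of the LPAR overhead implied by each row of the table. The sketch below is an illustration only: the function name is hypothetical, and because the RTM values in the table are rounded and the two products may not share exactly the same intervals, the result should be read as an approximation rather than a measured overhead.

```python
# Hypothetical helper: residual between PRF017 logical utilization and
# RTM physical %CPU, which the article attributes to LPAR overhead.

def implied_lpar_overhead(prf_logical_util, rtm_physical_cpu, logical_pus):
    """PRF017 Logical Util minus (RTM Physical %CPU / number of logical PUs)."""
    return prf_logical_util - rtm_physical_cpu / logical_pus


# (partition, logical PUs, PRF017 logical util, RTM physical %CPU) from the table
rows = [
    ("A", 3, 34.55, 100),
    ("B", 2, 1.31, 2.4),
    ("C", 1, 13.20, 13),
    ("D", 1, 28.69, 28),
    ("E", 1, 14.06, 14),
    ("F", 1, 12.76, 13),
    ("G", 1, 1.94, 1.9),
]

for name, logical_pus, prf_logical, rtm_physical in rows:
    residual = implied_lpar_overhead(prf_logical, rtm_physical, logical_pus)
    print(name, round(residual, 2))
```

With the table values the residual is small and positive for every partition except F, where it comes out slightly negative. A negative residual cannot be real overhead, so it points to rounding or interval mismatch in that row.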

Relationship of RTM Logical to RTM Physical Utilization

RTM records the CPU timers from VM associated with running user or system work (%US, %EM, and %SY) and also active wait time (%WT). From these values, RTM computes logical %CPU for the GENERAL, SLOG, and CPU LOGICAL displays by:

 
                        US +  EM +  SY
 RTM Logical %CPU = -----------------------
                      US +  EM +  SY +  WT
Where the result is expressed out of a total (maximum) of N * 100%.

This works fine when VM is running natively. However, in an LPAR or as a second-level guest, where the processor can be taken out from under VM, the logical %CPU can be misleading because the sum used in the denominator (US + EM + SY + WT) can be less than wall clock time. To compute physical %CPU, RTM uses the wall clock time and adds another counter for the missing time, which it calls Involuntary Wait time (%IW), to go along with voluntary wait (%VW); both are shown on the D CPU PHYSICAL screen. In the physical %CPU calculation, IW is included in the denominator.

 
                            US +  EM +  SY
 RTM Physical %CPU = ---------------------------
                      US +  EM +  SY +  VW + IW

So the more involuntary wait there is, the greater the potential difference between RTM's logical and physical processor utilization values. The relationship between the two can be approximated by the following:

 
                      RTM Logical %CPU * (N*100 - IW)
 RTM Physical %CPU = ----------------------------------
                                 N*100
 
 
                     RTM Physical %CPU * N*100
 RTM Logical %CPU = ----------------------------
                          N*100  - IW

In our example, partition A has an RTM physical value of 100%, a logical value of 105%, and an involuntary wait (%IW) of 16%. So out of 300% (since it is a logical 3-way), 16% of the wall clock time is unaccounted for in the regular CPU timers. This 16% explains the difference. An %IW of 16% on a 3-way means that over a minute each processor was missing 3.2 seconds (a total of 9.6 seconds for all 3 logical PUs). Physically, partition A was 100% busy out of 300%, or 1/3 busy, which is 60 seconds of processor time over a minute (remember, it is a 3-way). Logically, VM does not know about the 9.6 seconds it lost; RTM thinks it had only 60 - 3.2 = 56.8 seconds available per PU. The logical computation for partition A is therefore 60 / 56.8 = 105.6%. The reported logical value was 105; rounding (loss of significance) explains the slight difference.
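The approximation between RTM logical and physical %CPU can be tried directly on the partition A numbers. The following is a minimal sketch with hypothetical function names; all inputs are on the RTM scale, that is, percentages out of N*100.

```python
# Hypothetical helpers implementing the approximate RTM relationships.

def rtm_physical_from_logical(logical_cpu, involuntary_wait, logical_pus):
    """Physical %CPU ~= Logical %CPU * (N*100 - IW) / (N*100)."""
    total = logical_pus * 100.0
    return logical_cpu * (total - involuntary_wait) / total


def rtm_logical_from_physical(physical_cpu, involuntary_wait, logical_pus):
    """Logical %CPU ~= Physical %CPU * (N*100) / (N*100 - IW)."""
    total = logical_pus * 100.0
    if involuntary_wait >= total:
        raise ValueError("involuntary wait must be less than N*100")
    return physical_cpu * total / (total - involuntary_wait)


# Partition A: logical 3-way, physical %CPU 100, %IW 16
print(round(rtm_logical_from_physical(100, 16, 3), 1))   # 105.6
# Going the other way from the reported (rounded) logical value of 105
print(round(rtm_physical_from_logical(105, 16, 3), 1))   # 99.4
```

The second result is 99.4 rather than 100 only because the reported logical value of 105 has already been rounded.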

Partition F is more interesting. It is a logical 1-way with an %IW of 81%. This means that over a minute there were 48.6 seconds during which the VM CPU timers were not running and 7.8 seconds that VM considered busy (the remaining 3.6 seconds (6%) were true active wait). From a logical view, this results in 7.8 / 11.4 = 68% busy, which is close to the logically reported value of 66%. Note that partition F is also capped and has a low weight that normalizes to 12.6% of one physical processor. So it is clear that it often reached its cap and was limited.
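The per-minute arithmetic for partition F, and the 12.6% figure derived from its weight, can be reproduced as follows. This is a sketch under stated assumptions: the function names are hypothetical, the voluntary wait is taken as the remainder of the interval (as in the text) rather than from the reported %VW, and the weight share simply normalizes one partition's weight against the sum of all weights in the table.

```python
# Hypothetical helpers reproducing the partition F arithmetic.

def interval_breakdown(physical_cpu, involuntary_wait, logical_pus, interval_s=60.0):
    """Return (busy_s, involuntary_s, voluntary_s, logical_cpu) for one interval.

    Percentages are on the RTM scale (out of logical_pus * 100).
    Voluntary wait is whatever remains after busy and involuntary wait.
    """
    total_s = logical_pus * interval_s
    busy_s = physical_cpu / 100.0 * interval_s
    involuntary_s = involuntary_wait / 100.0 * interval_s
    voluntary_s = total_s - busy_s - involuntary_s
    # VM's own timers only see busy and voluntary wait time.
    logical_cpu = busy_s / (busy_s + voluntary_s) * logical_pus * 100.0
    return busy_s, involuntary_s, voluntary_s, logical_cpu


def weight_share_of_one_pu(weight, total_weight, physical_pus):
    """Weight normalized to a percentage of one physical processor."""
    return weight / total_weight * physical_pus * 100.0


WEIGHTS = [578, 26, 95, 138, 79, 42, 42]  # partitions A..G from the table

busy, involuntary, voluntary, logical = interval_breakdown(13, 81, 1)
print(round(busy, 1), round(involuntary, 1), round(voluntary, 1), round(logical, 1))
print(round(weight_share_of_one_pu(42, sum(WEIGHTS), 3), 1))
```

With a physical %CPU of 13 sitting right at a weight-derived share of 12.6% of one processor, the numbers agree with the observation that partition F was running against its cap.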


Taking the above configuration and values into account, one should be reminded that the values calculated can be affected by interval times as well as other system variables. This example is meant to explain the similarities and relationships between the values displayed for the two IBM products.