Reconciling CPU Utilization in an LPAR Environment
Updated: 19 July 2000

Matching up the VMPRF and RTM CPU utilization values can be interesting and challenging at times. Add the LPAR environment, and the complexity can turn into perplexity and take away some of the fun! The CP INDICATE command (and RTM's logical CPU busy) reports how busy the VM system is as seen by VM itself. For example, CP INDICATE may report 100% busy while the system has actually used only 15% of the base machine, because LPAR or the host VM did not give more cycles to the VM system queried with INDICATE. Capacity planners may say "100% is wrong, it should be 15%", while end users may say 100% is right because it reflects what they feel: their VM system is busy and no resources are available. So what seems wrong at first may have value after all. The following information should help explain the differences between the various views of processor utilization.

Points to Remember
Example:
The relationship between different measurements is probably best
explained with an example.
The following table shows 7 logical partitions running on a real
3-way CEC.
Only partitions A and B are uncapped; all the others are capped.
Relationship of VMPRF PRF017 Logical to Physical Processor Utilization

Both of these fields use the total CPU used by the partition as the numerator; they differ only in the denominator. Logical utilization divides by the total capacity of the partition's logical processors, while physical utilization divides by the total capacity of the physical processors. The relationship is:
$$\text{Physical Proc Util} = \text{PRF017 Logical Proc Util} \times \frac{\text{Number Logical PUs}}{\text{Number Physical PUs}}$$

$$\text{Logical Proc Util} = \text{PRF017 Physical Proc Util} \times \frac{\text{Number Physical PUs}}{\text{Number Logical PUs}}$$
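As a minimal sketch of the two formulas above, the helpers below rescale a PRF017 utilization from one denominator to the other. The function names are hypothetical and are not part of VMPRF; inputs and outputs are assumed to be percentages. The usage lines check the partition C and partition A figures quoted in this example.

```python
# Hypothetical helpers (not part of VMPRF): rescale a PRF017 utilization
# from one denominator to the other. Inputs and outputs are percentages.

def prf017_physical_from_logical(logical_util: float, n_logical: int, n_physical: int) -> float:
    # Same numerator; the denominator changes from n_logical to n_physical PUs of capacity.
    return logical_util * n_logical / n_physical


def prf017_logical_from_physical(physical_util: float, n_logical: int, n_physical: int) -> float:
    # Inverse of the function above.
    return physical_util * n_physical / n_logical


# Partition C: a logical 1-way on a real 3-way CEC.
print(round(prf017_physical_from_logical(13.20, 1, 3), 2))  # 4.4
print(round(prf017_logical_from_physical(4.40, 1, 3), 2))   # 13.2
# Partition A: a logical 3-way on a real 3-way, so the two views agree.
print(round(prf017_physical_from_logical(34.55, 3, 3), 2))  # 34.55
```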
Partition A has the same value (34.55%) for both the VMPRF logical utilization and the physical utilization because it has as many logical processors as there are physical processors. The other partitions all have fewer logical processors than there are real processors, so their logical processor utilization on the PRF017 report is higher than their physical processor utilization. For example, partition C is a logical 1-way. It used CPU equivalent to 4.40% of a real 3-way machine, but from the logical view it used 13.20% of a real 1-way machine (4.40% × 3 / 1).

Relationship of VMPRF PRF017 Logical to RTM Physical Utilization

The VMPRF PRF017 logical processor utilization and the RTM physical processor utilization (from D CPU PHYSICAL) differ in two ways. First, VMPRF includes LPAR overhead. Second, the PRF017 value is out of a maximum of 100%, while the RTM physical value is out of a maximum of N*100%, where N is the number of logical processors. This gives the relationship:
$$\text{PRF017 Logical Util} = \text{Partition LPAR Overhead} + \frac{\text{RTM Physical CPU}}{\text{Number Logical PUs}}$$

$$\text{RTM Physical CPU} = (\text{PRF017 Logical Util} - \text{Partition LPAR Overhead}) \times \text{Number Logical PUs}$$
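The sketch below expresses the two relationships above in Python and adds a hypothetical helper that solves the first one for the overhead term. The names are illustrative assumptions, not VMPRF or RTM interfaces. Because the implied overhead also absorbs rounding and interval mismatches, it can even come out negative, as it does for partition D's figures.

```python
# Hypothetical helpers (not part of VMPRF or RTM). All values are percentages:
# PRF017 logical utilization is out of 100%, RTM physical CPU out of N*100%.

def prf017_logical_from_rtm(rtm_physical_cpu: float, n_logical: int, lpar_overhead: float) -> float:
    return lpar_overhead + rtm_physical_cpu / n_logical


def rtm_physical_from_prf017(prf_logical_util: float, n_logical: int, lpar_overhead: float) -> float:
    return (prf_logical_util - lpar_overhead) * n_logical


def implied_lpar_overhead(prf_logical_util: float, rtm_physical_cpu: float, n_logical: int) -> float:
    # Solve the first relationship for the overhead term. Rounding and
    # mismatched measurement intervals also end up in this number.
    return prf_logical_util - rtm_physical_cpu / n_logical


# Partition A: PRF017 logical 34.55, RTM physical 100 (out of 300), logical 3-way.
overhead_a = implied_lpar_overhead(34.55, 100.0, 3)
print(round(overhead_a, 2))                                      # 1.22
print(round(prf017_logical_from_rtm(100.0, 3, overhead_a), 2))   # 34.55
print(round(rtm_physical_from_prf017(34.55, 3, overhead_a), 2))  # 100.0
# Partition D: PRF017 12.76 vs RTM 13 on a logical 1-way gives a negative
# "overhead", pointing to rounding or interval mismatch rather than real overhead.
print(round(implied_lpar_overhead(12.76, 13.0, 1), 2))           # -0.24
```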
In the example above, partitions C, D, E, F, and G have only 1 logical processor, so N*100% = 100%. Their PRF017 logical processor utilizations are therefore all very close to the RTM physical %CPU, and slightly higher because LPAR overhead is included. Partition D is an exception: its RTM value (13) is higher than PRF's (12.76), most likely because of rounding and slightly mismatched intervals. Partition A's RTM physical value (100) divided by its 3 logical processors yields 33.3, slightly lower than the 34.55 that PRF reports for logical utilization, again because of LPAR overhead. Partition B, a logical 2-way, follows the same pattern with values of 1.2 (RTM physical divided by 2) and 1.31 (PRF017 logical).

Relationship of RTM Logical to RTM Physical Utilization

RTM records the VM CPU timers associated with running user or system work (%US, %EM, and %SY) and also active wait time (%WT). From these values, RTM computes logical %CPU for the GENERAL, SLOG, and CPU LOGICAL displays as:
$$\text{RTM Logical \%CPU} = \frac{US + EM + SY}{US + EM + SY + WT}$$

where the total or maximum is N*100%.
This works fine when VM is running natively. However, in an LPAR or at second level, where the processor can be taken away from VM, the logical %CPU can be misleading because the denominator (US + EM + SY + WT) can be less than wall-clock time. To compute physical %CPU, RTM uses the wall-clock time and adds another counter for the missing time, which it calls Involuntary Wait (%IW), alongside voluntary wait (%VW); both are shown on the D CPU PHYSICAL screen. In the physical %CPU calculation, IW is included in the denominator:
$$\text{RTM Physical \%CPU} = \frac{US + EM + SY}{US + EM + SY + VW + IW}$$
So the more involuntary wait there is, the greater the potential difference between RTM's logical and physical processor utilization values. The relationship between the two can be approximated by the following:
$$\text{RTM Physical \%CPU} = \frac{\text{RTM Logical \%CPU} \times (N \times 100 - IW)}{N \times 100}$$

$$\text{RTM Logical \%CPU} = \frac{\text{RTM Physical \%CPU} \times N \times 100}{N \times 100 - IW}$$
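A minimal sketch of the two approximations above, with hypothetical helper names (not RTM code). All percentages are assumed to be on RTM's 0 to N*100 scale. The usage lines plug in the partition A and partition F figures discussed in this example; partition F's physical 13% is its 7.8 busy seconds out of 60.

```python
# Hypothetical helpers (not RTM code): convert between RTM logical and
# physical %CPU using %IW. Percentages are on the 0..N*100 scale.

def rtm_physical_from_logical(logical_cpu: float, iw: float, n_logical: int) -> float:
    full = n_logical * 100.0
    return logical_cpu * (full - iw) / full


def rtm_logical_from_physical(physical_cpu: float, iw: float, n_logical: int) -> float:
    full = n_logical * 100.0
    return physical_cpu * full / (full - iw)


# Partition A: logical 3-way, %IW 16.
print(round(rtm_logical_from_physical(100.0, 16.0, 3), 1))  # 105.6
print(round(rtm_physical_from_logical(105.0, 16.0, 3), 1))  # 99.4
# Partition F: logical 1-way, %IW 81, physically 13% busy.
print(round(rtm_logical_from_physical(13.0, 81.0, 1), 1))   # 68.4
```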
In our example, partition A's RTM physical value is 100% while its logical value is 105%, and its involuntary wait (%IW) is 16%. Out of 300% (since it is a logical 3-way), 16% of the wall-clock time is unaccounted for by the regular CPU timers, and this 16% explains the difference. An %IW of 16% on a 3-way means that over a minute each processor was missing 3.2 seconds (a total of 9.6 seconds for all 3 logical PUs). Physically, partition A was 100% out of 300%, or 1/3 busy: 60 seconds of processor time over a minute (remember, it is a 3-way), or 20 seconds per PU. Logically, VM does not know about the 9.6 seconds it lost, so RTM thinks each PU had only 56.8 = 60 - 3.2 seconds available. RTM therefore computes partition A as 60 / 56.8 = 105.6% busy (20 busy seconds out of 56.8 on each of the three PUs). The reported logical value was 105; lost precision explains the slight difference.

Partition F is more interesting. It is a logical 1-way with an %IW of 81%. This means that over a minute there were 48.6 seconds when the VM CPU timers were not running and 7.8 seconds that VM considered busy; the remaining 3.6 seconds (6%) were true active wait. From a logical view, this results in 7.8 / 11.4 = 68% busy, which is close to the reported logical value of 66%. Note that partition F is also capped and has a low weight that normalizes to 12.6% of one physical processor, so it evidently reached its cap and was often limited.

With the above configuration and values in mind, remember that the calculated values can be affected by interval times as well as other system variables. This example is meant to explain the similarities and relationships between the values displayed by the two IBM products.
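The per-minute accounting used for partitions A and F can be reproduced with a short sketch. It assumes, as the walkthrough does, a 60-second interval and percentages on RTM's 0 to N*100 scale; `interval_breakdown` is a hypothetical helper, not part of RTM or VMPRF.

```python
# Hypothetical helper (not RTM code): the per-interval seconds accounting
# used in the text. physical_cpu and iw are percentages on the 0..N*100 scale.

def interval_breakdown(physical_cpu: float, iw: float, n_logical: int, interval_s: float = 60.0) -> dict:
    full = n_logical * 100.0
    total_s = n_logical * interval_s        # processor-seconds in the interval
    busy_s = total_s * physical_cpu / full  # time VM's CPU timers counted as busy
    iw_s = total_s * iw / full              # time VM never saw, all PUs together
    available_s = total_s - iw_s            # time VM believes it had
    return {
        "busy_s": busy_s,
        "iw_s": iw_s,
        "iw_per_pu_s": iw_s / n_logical,
        "available_per_pu_s": available_s / n_logical,
        "logical_cpu": full * busy_s / available_s,
    }


for name, phys, iw, n in [("A", 100.0, 16.0, 3), ("F", 13.0, 81.0, 1)]:
    b = interval_breakdown(phys, iw, n)
    print(name, {k: round(v, 1) for k, v in b.items()})
```

For partition A this gives 3.2 missing seconds per PU, 56.8 available seconds per PU, and a logical value of about 105.6%; for partition F it gives 7.8 busy seconds, 48.6 seconds of involuntary wait, and about 68.4% logical.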