|
Contents | Previous | Next
Improved Processor Scalability
Abstract
With z/VM 5.3, up to 32 CPUs are supported with a single VM image.
Prior to this release, z/VM supported up to 24 CPUs.
In addition to functional changes that enable z/VM 5.3 to run with
more processors configured, a new locking infrastructure has been
introduced that improves system efficiency for
large n-way configurations. A performance study was
conducted to compare the system efficiency of z/VM 5.3 to z/VM 5.2.
While z/VM 5.3 is
more efficient than z/VM 5.2 for all of the n-way measurement points
included in this study, the efficiency improvement is substantial at
large n-way
configurations. With a 24-way LPAR configuration, a 19% throughput
improvement was observed.
This section reviews performance experiments that were conducted to
verify the significant improvement in efficiency with z/VM 5.3 when
running with large n-way configurations.
Prior to z/VM 5.3, the VM Control Program (CP) scheduler lock had to
always be held exclusive. With z/VM 5.3, a new scheduler lock
infrastructure has been implemented. The new infrastructure includes
a new Processor Local Dispatch Vector (PLDV) lock, one per processor.
The new infrastructure enables obtaining the scheduler lock in shared
mode in combination with the individual PLDV lock for a processor
in exclusive mode when system
conditions allow. This new lock design reduces contention for
the scheduler lock, enabling z/VM to more efficiently manage large n-way
configurations. A study that was done comparing z/VM 5.3 to
z/VM 5.2 with the same workload using the same LPAR
configurations is reviewed. The results show that processor scaling with
z/VM 5.3 is much improved for large n-way configurations.
Background
Motivated by customers' needs to consolidate large numbers of guest
systems onto a single VM image, the design of the scheduler lock has been
incrementally enhanced to reduce lock contention. With z/VM 4.3,
CP timer management scalability
was improved by eliminating master processor
serialization and other design changes were made to reduce large system
effects. With z/VM 4.4, more
scheduler lock improvements were made. A
new CP timer request block lock was
introduced to manage timer request serialization (TRQBK lock), removing
that burden from the CP scheduler lock.
With z/VM 5.1,
24-way support was announced.
Now, with z/VM 5.3,
scheduler lock contention has been reduced even further with the
introduction of a new lock infrastructure that enables the scheduler lock
to be held shared when conditions allow. With these additional enhancements,
32 CPUs are supported with a single VM image.
Method
A 2094-109 z9
system was used to conduct experiments in an LPAR configured with 10GB
of central storage and 25GB of expanded storage.
The breakout of central storage and expanded storage for this evaluation
was arbitrary. Similar results are expected with other breakouts because
the measurements were obtained in a non-paging environment.
The LPAR's n-way configuration was varied for the evaluation. The
hardware configuration included shared processors and processor capping
for all measurements. z/VM 5.2 measurements were used as
the baseline for the comparison. z/VM 5.2 baseline measurements were
done with the LPAR configured as a 6-way, 12-way, 18-way, 24-way, and
30-way.
z/VM 5.3 measurements were done for each of these n-way environments. In
addition, a 32-way measurement was done, since that is the
largest configuration supported by z/VM 5.3.
Processor capping creates a maximum limit for system processing power
allocated to the LPAR. By running with processor capping enabled,
any effects that are measured as the
n-way is varied can be attributed to the n-way changes rather than a
combination of n-way effects and large system effects. Processing
capacity was held at approximately 6 full processors for this study.
The software application workload used for this evaluation was a version of
the Apache workload without storage constraints.
The Linux guests that were acting as clients
were configured as virtual uniprocessor machines with 1GB of storage.
The Linux guests that were acting as web servers
were configured as virtual 5-way machines with 128MB of storage.
The number of Linux web clients and web servers was increased as the n-way
was increased in order to generate enough dispatchable units of work to
keep the processors busy.
The Application Workload Modeler (AWM) was used to simulate client requests for
the Apache workload measurements. Hardware instrumentation data, AWM data,
and Performance Toolkit for VM data were collected for each measurement.
Results and Discussion
For this study, if system efficiency is not affected by the n-way changes,
the expected result for the Internal Throughput Rate Ratio (ITRR) is that
it will increase proportionally as the n-way increases.
For example, if the
number of CPUs is doubled, the ITRR would double if system efficiency is not
affected by the n-way change.
Figure 1. Large N-Way Effects on ITR Ratio
Figure 1
shows the comparison of ITRR between z/VM 5.2 and z/VM 5.3. It
also shows the line for processor scaling, using the 6-way 5.3
measurement as the baseline.
Figure 1
illustrates the dramatic improvement with z/VM 5.3 scalability
with larger n-way configurations. The processor ratio line shows
the line for perfect scaling. While the z/VM 5.3 system
does not scale perfectly, this is expected as software
multi-processor locking will always have some impact on system
efficiency. The loss of system
efficiency is more pronounced for larger n-way configurations because that
is where scheduler lock contention is the greatest.
It should be noted that z/VM 5.2 only supports up to 24 CPUs for a single
VM image. The chart shows a
30-way
configuration to illustrate the dramatic improvement in efficiency with
z/VM 5.3. This also explains why support was limited to 24 CPUs with
z/VM 5.2.
Table 1 shows a summary of the measurement data
collected when
running with z/VM 5.3 with LPAR n-way configurations of 6-way, 12-way,
18-way, 24-way, 30-way, and 32-way.
Table 1. Comparison of System Efficiency with z/VM 5.3 as N-Way Increases
| Shared CPUs | 6 | 12 | 18 | 24 td> |
30 | 32 |
| AWM Clients | 3 | 5 | 7 | 9 | 11
| 12 |
| AWM Servers | 5 | 9 | 13 | 17 | 21 |
23 |
| Run ID | AVPK5187 | AVPK5184 | AVPK5183 | AVPK5182
| AVPK5188 | AVPK5186 |
| Tx/sec (p) | 4674.21 | 4120.00 | 3841.05 | 3315.00 | 3177.00 | 3117.00 |
| ITR (p) | 4697.70 | 8015.56 | 11297.21 | 13102.77 | 15805.97 | 16492.06 |
| Total Util/Proc (p) | 99.5 | 51.4 | 34.0 | 25.3 | 20.1 | 18.9 |
| Processors (p) | 6 | 12 | 18 | 24 | 30 | 32 |
| Total CPU/Tx (p) | 1.277 | 1.497 | 1.593 | 1.832 | 1.898 | 1.940 |
| CP CPU/Tx (p) | 0.290 | 0.341 | 0.417 | 0.485 | 0.482 | 0.462 |
| Emul CPU/Tx (p) | 0.987 | 1.156 | 1.176 | 1.347 | 1.416 | 1.478 |
| Pct Spin Time (p) | .371 | 5.233 | 9.883 | 13.08 | 16.48 | 14.53 |
| Sch Pct Spin Time (p) | .315 | 4.763 | 9.034 | 12.58 | 16.05 | 14.08 |
| TRQ Pct Spin Time (p) | .054 | .463 | .833 | .488 | .433 | .442 |
| Privops/Tx (p) | 50.818 | 54.376 | 52.766 | 54.109 | 53.331 | 52.557 |
| Diagnoses/Tx (p) | 1.840 | 4.631 | 2.022 | 2.172 | 2.339 | 2.069 |
| RATIOS | | | | | | |
| Tx/sec (p) | 1.000 | 0.881 | 0.822 | 0.709 | 0.680 | 0.667 |
| ITR (p) | 1.000 | 1.706 | 2.405 | 2.789 | 3.365 | 3.511 |
| Total Util/Proc (p) | 1.000 | 0.517 | 0.342 | 0.254 | 0.202 | 0.190 |
| Processors (p) | 1.000 | 2.000 | 3.000 | 4.000 | 5.000 | 5.333 |
| Total CPU/Tx (p) | 1.000 | 1.172 | 1.247 | 1.435 | 1.486 | 1.519 |
| CP CPU/Tx (p) | 1.000 | 1.176 | 1.438 | 1.672 | 1.662 | 1.593 |
| Emul CPU/Tx (p) | 1.000 | 1.171 | 1.191 | 1.365 | 1.435 | 1.497 |
| Pct Spin Time (p) | 1.000 | 14.105 | 26.639 | 35.256 | 44.420 | 39.164 |
| Sch Pct Spin Time (p) | 1.000 | 15.121 | 28.679 | 39.937 | 50.952 | 44.698 |
| TRQ Pct Spin Time (p) | 1.000 | 8.574 | 15.426 | 9.037 | 8.019 | 8.185 |
| Privops/Tx (p) | 1.000 | 1.070 | 1.038 | 1.065 | 1.049 | 1.034 |
| Diagnoses/Tx (p) | 1.000 | 2.517 | 1.099 | 1.180 | 1.271 | 1.124 |
|
Notes: z9 machine; 10GB central storage; 25GB expanded storage;
Apache web serving workload with uniprocessor clients and 5-way servers;
Non-paging environment
with processor capping in effect to maintain processing capacity constant.
(h) hardware instrumentation data; (p) Performance Toolkit data
|
Table 1
highlights the key measurement points that were used
in this performance study. Some of the same trends found here were also
found in the 24-Way Support evaluated in the
z/VM 5.1 performance report.
Reference the z/VM 5.1 table comparison
of system efficiency.
The CPU time per transaction (Total CPU/Tx) increases as the n-way
increases. Both CP and the Linux guests (represented by emulation)
contribute to the increase. However, the CP CPU/Tx numbers are
lower than they were with z/VM 5.1 (although this metric was not included
in the z/VM 5.1 table). In fact, there is a slight downward trend
in the z/VM 5.3 numbers with the 30-way and 32-way configurations. The
reduction in CP's CPU time per transaction is a result of the
improvements to the scheduler lock design and other enhancements
incorporated into z/VM 5.3.
Another trend discussed in the 24-Way Support with z/VM 5.1 is the fact
that the Linux guest virtual MP machines are spinning on locks within the
Linux system. This spinning results in Diagnose X'44's being generated.
For further information concerning Diagnose X'44's, please refer to the
discussion in the 24-Way Support section in the z/VM 5.1 Performance
Report.
Finally, the 24-Way Support in the z/VM 5.1 Performance Report discusses
the make up of the CP CPU time per transaction. Two components that are
included there are formal spin time and non-formal spin time. With
z/VM 5.3, a breakout by lock type of
formal spin time is included in monitor records and is
now presented in the Performance Toolkit with new screen FCX265 - Spin Lock
Log By Time. A snapshot of the ">>Mean>>" portion of that screen
is shown below.
The scheduler lock is "SRMSLOCK" in the Spin Lock Log screen shown below.
The new lock infrastructure discussed in
the Introduction of this section
is used for all of the formal locks. However, at this time,
only the scheduler lock exploits the shared mode enabled by the new
design. The new infrastructure
may be exploited for other locks in the future as appropriate.
FCX265 Run 2007/05/21 14:33:24 LOCKLOG
Spin Lock Log, by Time
_____________________________________________________________________________________
<------------------- Spin Lock Activity -------------------->
<----- Total -----> <--- Exclusive ---> <----- Shared ---->
Interval Locks Average Pct Locks Average Pct Locks Average Pct
End Time LockName /sec usec Spin /sec usec Spin /sec usec Spin
>>Mean>> SRMATDLK 61.0 48.39 .009 61.0 48.39 .009 .0 .000 .000
>>Mean>> RSAAVCLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSA2GCLK .0 3.563 .000 .0 3.563 .000 .0 .000 .000
>>Mean>> BUTDLKEY .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPTMFLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSA2GLCK .0 .551 .000 .0 .551 .000 .0 .000 .000
>>Mean>> HCPRCCSL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSASXQLK .0 1.867 .000 .0 1.867 .000 .0 .000 .000
>>Mean>> HCPRCCMA .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RCCSFQL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSANOQLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> NSUNLSLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPPGDML .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> NSUIMGLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> FSDVMLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> DCTLLOK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> SYSDATLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSACALLK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> RSAAVLLK .0 .328 .000 .0 .328 .000 .0 .000 .000
>>Mean>> HCPPGDAL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPPGDTL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPPGDSL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPPGDPL .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> SRMALOCK .0 .000 .000 .0 .000 .000 .0 .000 .000
>>Mean>> HCPTRQLK 675.5 209.2 .442 675.5 209.2 .442 .0 .000 .000
>>Mean>> SRMSLOCK 30992 145.4 14.08 30991 145.4 14.08 .7 949.1 .002
Summary and Conclusions
With the workload used for this evaluation, there is a gradual decrease
in system efficiency which is more pronounced at large n-way configurations.
The specific
workload used will have a significant effect on the efficiency with which
z/VM can manage large numbers of processor engines.
As stated in the 24-Way Support
section in the z/VM 5.1 report, when z/VM
is running in large n-way LPAR configurations, z/VM overhead will be lower
for workloads with fewer, more CPU-intensive guests than for workloads with
many lightly loaded guests. Some workloads (such as CMS workloads) require
master processor serialization. Workloads of this type will not be able to
fully utilize as many CPUs because of
master processor serialization. Also, application workloads that
use a single
virtual machine and are not capable of using multiple processors (such as
DB2 for VM and VSE, SFS, and RACF) may
not be able to take full advantage of a large n-way configuration.
This evaluation focused on analyzing the effects of increasing the n-way
configuration while holding CPU processing capacity relatively constant. In
production environments, n-way increases will typically also result in
processing capacity increases. Before exploiting large n-way configurations,
the specific workload characteristics should be considered in terms of how it
will perform with the work dispatched across more CPUs as well as utilizing the
larger processing capacity.
Contents | Previous | Next
|