
Workload and Resource Distribution

Abstract

In z/VM 6.2, up to four z/VM systems can be connected together into a cluster called a z/VM Single System Image cluster (SSI cluster). An SSI cluster is a multisystem environment in which the z/VM member systems can be managed as a single resource pool. System resources and workloads can be distributed across the members of an SSI cluster to improve resource efficiency, to achieve workload balancing, and to prepare for Live Guest Relocation (LGR). This multisystem environment can also be used to let a workload consume resources beyond what a single z/VM system can supply.

Distributing system resources and workloads across an SSI cluster improved processor efficiency in a CPU-constrained environment. A virtual-I/O-constrained environment running in an SSI cluster benefitted from the increased number of exposures to the shared DASD volumes. A memory-constrained environment running in an SSI cluster benefitted from improved processor efficiency and a reduced memory overcommitment ratio.

A workload designed to use 1 TB of real memory across a four-member SSI cluster scaled linearly and was not influenced by the SSI cluster environment running in the background.

SSI state transitions did not influence individual workloads in the SSI cluster.

Introduction

With z/VM 6.2, up to four z/VM images can be connected together to form an SSI cluster. The new support allows for resource and workload balancing, system configuration in preparation for LGR, and growth of a defined workload beyond the limits of a single z/VM system.

This article evaluates the performance benefit of distributing different workloads and their system resources across an SSI cluster. It also demonstrates that z/VM image performance is not influenced by SSI state transitions. Lastly, it demonstrates how a workload and its resources scale to 1 TB of real memory across a four-member SSI cluster.

Background

With one z/VM image, a workload can use up to 256 GB of real memory and 32 processors. The system administrator can divide an existing workload and system resources across an SSI cluster or the system administrator can build a workload to use resources beyond the current z/VM system limits, notably real memory and real processors.

Table 1 shows z/VM system limits for a workload distributed across an SSI.

Table 1. z/VM System Limits

z/VM System Limits 1-Member 2-Member 3-Member 4-Member
Real Memory 256 GB * 512 GB * 768 GB 1 TB *
IFLs 32 64 96 128
Note: * System limits measured in this report

Members of an SSI cluster have states that describe the status of each member within the cluster. Valid states are Down, Joining, Joined, Leaving, Isolated, Suspended, and Unknown. A member that is shut down and then IPLed will transition through four of the seven states, namely, Leaving, Down, Joining, and Joined. Deactivating ISFC links transitions the member from a Joined state to a Suspended or Unknown state. Reactivating the ISFC links transitions the member back to a Joined state.
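As a quick illustration of the transitions just described, the following Python sketch (illustrative only, not part of the report's measurement tooling) encodes just the two documented paths: the shutdown-and-IPL path and the ISFC link deactivation/reactivation path. It is not an exhaustive model of CP's state machine; Isolated, for example, is not reachable here.

    # Illustrative sketch of the SSI member state transitions described above.
    VALID_STATES = {"Down", "Joining", "Joined", "Leaving",
                    "Isolated", "Suspended", "Unknown"}

    # A Joined member that is shut down and then IPLed passes through
    # four of the seven states:
    SHUTDOWN_IPL_PATH = ["Joined", "Leaving", "Down", "Joining", "Joined"]

    # Deactivating ISFC links moves a Joined member to Suspended or Unknown;
    # reactivating the links returns it to Joined.
    ISFC_DEACTIVATE = {"Joined": {"Suspended", "Unknown"}}
    ISFC_REACTIVATE = {"Suspended": "Joined", "Unknown": "Joined"}

    assert all(state in VALID_STATES for state in SHUTDOWN_IPL_PATH)
    print(SHUTDOWN_IPL_PATH[1:])   # ['Leaving', 'Down', 'Joining', 'Joined']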

The current state of each member in an SSI cluster can be verified with a new CP command, QUERY SSI. The following is an example of the output of a QUERY SSI command:

SSI Name: PRFASSI
SSI Mode: Stable
Cross-System Timeouts: Enabled
SSI Persistent Data Record (PDR) device: PFASSI on 7000

SLOT SYSTEMID STATE            PDR HEARTBEAT        RECEIVED HEARTBEAT
   1 SYSTEM01 Joined           11/01/2011 14:51:03  11/01/2011 14:51:03
   2 SYSTEM02 Down (not IPLed)
   3 SYSTEM03 Down (not IPLed)
   4 SYSTEM04 Down (not IPLed)

In this SSI cluster, member SYSTEM01 is up and in the Joined state. The remaining members of the SSI cluster are not IPLed and are in the Down state.

In this article, the term SSI cluster member, or simply member, refers to a system that is a member of the SSI cluster.

Method

Workload Distribution Measurements

Three separate Apache workloads were used to evaluate the benefits of distributing workloads and system resources across an SSI cluster.

  • The first base environment studied was a CPU-constrained environment in which the workload was using 100% of the IFLs.
  • The second base environment studied was a virtual I/O environment in which the workload was bound by DASD I/O.
  • The third base environment studied was a z/VM memory-constrained environment in which the workload was 1.2x memory overcommitted.
For each set of comparison measurements, the workload and resources of the base environment were evenly distributed across a two-member SSI cluster and then across a four-member SSI cluster.

Table 2 contains the common configuration parameters for each of the three Apache workloads as the workloads are distributed across the SSI clusters. These choices keep the number of CPUs and amount of memory constant across the configurations.

Table 2. Common configuration parameters for workload distribution

Resources Defined per member in SSI cluster 1-member 2-member 4-member
Central storage 43 GB 22 GB 11 GB
XSTOR 8 GB 8 GB (4 GB *) 8 GB (2 GB *)
Processors 12 6 3
Note: * XSTOR setting for the memory-constrained workload

System model: 2097-742

Table 2.1, Table 2.2, and Table 2.3 contain the specific configuration parameters for the CPU-constrained, virtual-I/O-constrained, and memory-constrained workloads respectively.

Table 2.1 Specific configuration parameters for CPU-constrained workload

Resources Defined per SSI cluster member 1-member 2-member 4-member
Server virtual machines 16 8 4
Client virtual machines 8 4 2
Sessions * 32 16 8
Note: * based on Apache SSI-mode workload

Client connections per server = 1; Number of 1 MB HTML files = 1200; Server virtual memory = 1 GB; Client virtual memory = 1 GB; Server virtual processors = 1; Client virtual processors = 1

Table 2.2 Specific configuration parameters for virtual-I/O-constrained workload

Resources Defined per SSI cluster member 1-member 2-member 4-member
Server virtual machines 16 8 4
Client virtual machines 4 2 1
Sessions * 32 16 8
Note: * based on Apache SSI-mode workload

Client connections per server = 1; Number of 1 MB HTML files = 3000; Server virtual memory = 256 MB; Client virtual memory = 1 GB; Server virtual processors = 1; Client virtual processors = 3; System MDC OFF so as to force virtual I/O

Table 2.3 Specific configuration parameters for memory-constrained workload

Resources Defined per SSI cluster member 1-member 2-member 4-member
Server virtual machines 48 24 12
Client virtual machines 8 4 2
Sessions * 480 240 120
Note: * based on Apache SSI-mode workload

Client connections per server = 5; Number of 1 MB HTML files = 1200; Server virtual memory = 1 GB; Client virtual memory = 1 GB; Server virtual processors = 1; Client virtual processors = 1
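
The per-member values in Tables 2 through 2.3 follow directly from dividing the one-member base configuration evenly across the cluster members. The following is a minimal sketch of that arithmetic, using the memory-constrained workload as the example; the dictionary simply restates the one-member columns of Tables 2 and 2.3.

    # One-member base environment for the memory-constrained workload
    # (values restated from Tables 2 and 2.3).
    BASE = {
        "central_storage_gb": 43,
        "processors": 12,
        "server_virtual_machines": 48,
        "client_virtual_machines": 8,
        "sessions": 480,
    }

    def per_member(base, members):
        """Distribute the base workload and resources evenly across the members."""
        return {name: round(value / members) for name, value in base.items()}

    for n in (1, 2, 4):
        print(n, per_member(BASE, n))
    # 2-member: 22 GB, 6 processors, 24 servers, 4 clients, 240 sessions per member
    # 4-member: 11 GB, 3 processors, 12 servers, 2 clients, 120 sessions per member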

Scaling Measurements

Three Apache measurements were completed to evaluate the z/VM Control Program's ability to scale to 1 TB of real memory across a four-member SSI cluster. Each SSI cluster member was configured to use 256 GB of real memory, which is the maximum supported memory for a z/VM system. Table 3 contains the configuration parameters for each measurement.

Table 3. SSI Apache configuration for 256 GB, 512 GB, and 1 TB measurements

Parameters * 1-member 2-member 4-member
Processor model z10 z10, z10 z10, z10, z196, z196
Central storage 256 GB 512 GB 1 TB
XSTOR 32 GB 64 GB 128 GB
Processors 3 6 12
Server virtual machines 24 48 96
Client virtual machines 4 8 16
Sessions 24 96 384
Note: * Total for the SSI cluster

z10: 2097-742; z196: 2817-744

Client connections per server = 1; Number of 1 MB HTML files = 10K; Server virtual memory = 10 GB; Client virtual memory = 1 GB; Server virtual processors = 1; Client virtual processors = 1

State-Change Measurements

Two measurements were defined to demonstrate that the SSI state changes do not influence a workload running in any one of the members of the cluster.

  • In the first measurement, as one member in a four-member SSI cluster was running an Apache workload, the other three members were changing their SSI states by continuously shutting down and re-IPLing their z/VM systems.
  • In the second measurement, as one member in the four-member SSI cluster was running an Apache workload, the same member was continuously deactivating and activating ISFC links.

Results and Discussion

Distributed Workload: CPU-constrained Apache

Table 4 compares a CPU-constrained environment in a one-, two-, and four-member SSI cluster.

Table 4. CPU-constrained workload distributed across an SSI cluster

Number of SSI cluster Members 1 2 4
Run ID A1NWA190 A2NWA120 A1NWA101
Tx/sec (c) 982.30 1160.93 1239.56
Throughput ratio (c) 1.00 1.18 1.26
ITR (p) 1020.47 1196.92 1306.85
ITR ratio (c) 1.00 1.17 1.28
Real processors per SSI cluster member (p) 12 6 3
Total util/proc (p) 99.3 99.7 98.2
Cycles/instruction (h) 5.07 4.28 4.15
Instructions/tx (h) 10587476 10573644 10544818
Cache miss cycles/instruction (h) 2.79 2.07 1.92
Note: (p) = Data taken from Performance Toolkit; (c) = Data was calculated; (h) = Data taken from hardware instrumentation; Cycles/instruction = Processor cycles per instruction; Instructions/tx = Number of instructions issued per transaction; Cache miss cycles/instruction = Number of cycles an instruction stalled due to cache misses;

Compared to the one-member SSI cluster, the total throughput in the two-member and four-member SSI clusters increased by 18% and 26% respectively. The internal throughput increased by a similar amount (17% and 28%). While the total processor utilization remained nearly 100% busy and the number of instructions per transaction remained constant as the workload and resources were distributed across the SSI cluster, the processor cycles per instruction decreased. The benefit is attributed to increased processor efficiency in small N-way configurations. According to the z/VM LSPR ITR Ratios for IBM Processors study, in a CPU-constrained environment, as the total number of processors per z/VM system decreases, the efficiency of each processor in a z/VM system increases.
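
Because instructions per transaction stayed essentially flat, the drop in cycles per instruction translates almost directly into the throughput gain. The following is a back-of-the-envelope check using the Table 4 values; it is a sketch for illustration, not output from the report's tooling.

    # Total processor cycles per transaction = (cycles/instruction) * (instructions/tx),
    # using the hardware-instrumentation values from Table 4.
    runs = {
        "1-member": (5.07, 10_587_476),
        "2-member": (4.28, 10_573_644),
        "4-member": (4.15, 10_544_818),
    }

    base_cycles = runs["1-member"][0] * runs["1-member"][1]
    for name, (cpi, instr_per_tx) in runs.items():
        cycles_per_tx = cpi * instr_per_tx
        print(f"{name}: {cycles_per_tx / 1e6:.1f}M cycles/tx, "
              f"{base_cycles / cycles_per_tx:.2f}x base efficiency")
    # 1-member: 53.7M cycles/tx, 1.00x base efficiency
    # 2-member: 45.3M cycles/tx, 1.19x base efficiency
    # 4-member: 43.8M cycles/tx, 1.23x base efficiency

The 1.19x and 1.23x figures are in the same range as the measured throughput and ITR ratios in Table 4.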

Distributed Workload: Virtual-I/O-Constrained Apache

Table 5 compares a virtual-I/O-constrained environment in a one-, two- and four-member SSI cluster.

Table 5. Virtual-I/O-constrained workload distributed across an SSI cluster

Number of Members 1 2 4
Run ID A0VWA310 A2VWA120 A1VWA100
Tx/sec (c) 419.07 543.26 612.12
Throughput ratio (c) 1.00 1.30 1.46
Total Virtual I/O Rate (p) 2030 2683 2983
Total Virtual I/O Rate Ratio (p) 1.00 1.33 1.47
Note: (p) = Data taken from Performance Toolkit; (c) = Data was calculated;

In the one-member measurement, the workload is limited by virtual I/O.

Compared to the one-member SSI cluster, the throughput in the two-member and four-member SSI clusters increased by 30% and 46% respectively. As the workload and resources were distributed across the two-member and four-member SSI clusters, the total virtual I/O rate increased by 33% and 47%.

One of the volumes shared among the members of the SSI cluster is user volume LNX026. Table 5.1 compares real I/O for DASD volume LNX026 for a one-, two-, and four-member SSI cluster.

Table 5.1 Real I/O for DASD Volume LNX026

Number of Members 1 2 4
Run ID A0VWA310 A2VWA120 A1VWA100
Aggregate I/O rate (c) 417 552 610
Avg service time (msec) (c) 1.9 2.1 2.5
Volume %util (c) 79 116 153
Avg wait time (msec) (c) 4.4 2.5 1.2
Avg response time (msec) (c) 6.3 4.6 3.7
Note: (c) = Data was calculated;

Distributing the I/O load for the shared volumes across four device numbers (one per member) lets the DASD subsystem overlap I/Os. As a result, I/O response time decreases and volume I/O rate increases. This is the same effect as PAV would have. By distributing the virtual I/O workload and resources across an SSI cluster, volume I/O rate increased, thus increasing the total throughput.
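
The Table 5.1 values are internally consistent with the usual device-level relationships: volume utilization is approximately the I/O rate multiplied by the service time, and response time is service time plus wait time. The following small sketch checks this; it is illustrative only, and the times are taken to be in milliseconds.

    # (I/O rate per second, service time ms, wait time ms) from Table 5.1
    lnx026 = {
        "1-member": (417, 1.9, 4.4),
        "2-member": (552, 2.1, 2.5),
        "4-member": (610, 2.5, 1.2),
    }

    for name, (rate, service_ms, wait_ms) in lnx026.items():
        utilization_pct = rate * service_ms / 1000 * 100   # busy time as a percent
        response_ms = service_ms + wait_ms
        print(f"{name}: ~{utilization_pct:.1f}% busy, {response_ms:.1f} ms response")
    # 1-member: ~79.2% busy, 6.3 ms response
    # 2-member: ~115.9% busy, 4.6 ms response
    # 4-member: ~152.5% busy, 3.7 ms response

A volume utilization above 100% is possible here because the figure aggregates the I/O issued through the per-member device exposures to the same physical volume, which is the overlap described above.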

Distributed Workload: Memory-Constrained Apache

Table 6 compares a real-memory-constrained environment in a one-, two- and four-member SSI cluster.

Table 6. Memory-constrained workload distributed across an SSI cluster

Number of Members 1 2 4
Run ID A0SWB040 A2SWA120 A1SWA100
Tx/sec (c) 786.23 929.38 1037.24
Throughput ratio (c) 1.00 1.18 1.32
Total util/proc (p) 100.0 100.0 95.4
ITR (p) 800.0 968.2 1100.6
ITR ratio (c) 1.00 1.21 1.38
Real processors per SSI cluster member (p) 12 6 3
Cycles/instruction (h) 6.19 5.09 4.41
Instructions/tx (h) 10851821 10810178 11108661
Cache miss cycles/instruction (h) 3.97 2.93 2.26
Resident pages <2 GB (p) 0 520335 520304
Avlst <2G AvailFrms (p) 520000 320 570
Avlst >2G AvailFrms (p) 2011 2372 2048
LXA_SERV DASD Paging (p) 92.9 36.1 29.7
Note: (p) = Data taken from Performance Toolkit; (c) = Data was calculated; (h) = Data taken from hardware instrumentation; Cycles/instruction = Processor cycles per instruction; Instructions/tx = Number of instructions issued per transaction; Cache miss cycles/instruction = Number of cycles an instruction stalled due to cache misses; Resident Pages <2 GB = the total number of user pages below 2 GB; Avlst <2G AvailFrms = Number of free frames available below 2 GB; Avlst >2G AvailFrms = Number of free frames available above 2 GB; LXA_SERV DASD Paging = Average number of user pages paged to paging space

In the one-member SSI cluster measurement, the workload is limited by real memory.

Compared to the one-member SSI cluster, the throughput for the two-member and four-member SSI clusters increased by 18% and 32% respectively. The internal throughput increased by 21% and 38%. While the total processor utilization remained nearly 100% busy and the number of instructions per transaction remained nearly constant as the workload and resources were distributed across the SSI cluster, the processor cycles per instruction decreased. The majority of the improvement is attributed to increased processor efficiency in small N-way configurations, as described in the z/VM LSPR ITR Ratios for IBM Processors study and noted for the CPU-constrained Apache workload.

Part of the improvement is due to the two-member and four-member measurements using frames below 2 GB; thus the Linux servers were paging less in the two-member and four-member SSI cluster environments. With z/VM 6.2.0, a new memory management algorithm was introduced to exclude the use of frames below 2 GB in certain memory configurations when it would be advantageous to do so. This was added to eliminate storage management searches for frames below 2 GB that severely impacted system performance. For more information on the storage management improvements, see Storage Management Improvements.

In the one-member SSI cluster measurement, the number of resident pages below 2 GB is zero, while the number of available frames below 2 GB is 520000. This indicates that CP is not using, and will not use, the frames below 2 GB. This factor should be taken into consideration when calculating memory over-commitment ratios.
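
One hedged way to fold this consideration into an over-commitment estimate is to subtract the memory CP will not use from the real memory in the denominator. The helper function and the sample sizes below are hypothetical and for illustration only; whether XSTOR is counted in the denominator is a site convention and is ignored here.

    def overcommit_ratio(total_virtual_gb, central_storage_gb,
                         cp_excludes_below_2g=False):
        """Total guest virtual memory divided by the real memory CP will actually use."""
        usable_gb = central_storage_gb - (2 if cp_excludes_below_2g else 0)
        return total_virtual_gb / usable_gb

    # Hypothetical member: 40 GB of guest virtual memory, 32 GB of central storage.
    print(round(overcommit_ratio(40, 32), 2))                              # 1.25
    print(round(overcommit_ratio(40, 32, cp_excludes_below_2g=True), 2))   # 1.33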

Scaling a 1 TB Apache Workload Across a Four-Member SSI Cluster

Table 7 compares a 1 TB workload spread across a four-member SSI cluster.

Table 7. 1 TB Apache workload distributed across a four-member SSI cluster

Number of Members 1 2 4
Run ID A0T6A240 A3T6A250 A3T6A260
Tx/sec (c) 219.37 417.61 980.15
Throughput ratio (c) 1.00 1.90 4.47
Total central storage 256 GB 512 GB 1 TB
MDC XSTOR pages (p) 0 905542 0
Total Number of Processors 3 6 12
Total util/proc (p) 100.0 100.0 99.6
Note: (p) = Data taken from Performance Toolkit; (c) = Data was calculated;

Compared to the one-member SSI cluster measurement, the throughput in the two-member measurement was 1.9 times that of the one-member measurement. This was slightly lower than the expected 2.0 times because of the XSTOR pages used for MDC in the two-member SSI cluster measurement. In the four-member SSI cluster measurement the throughput was 4.47 times that of the one-member SSI cluster measurement, exceeding linear scaling; the additional benefit can be attributed to the use of a z196 for two of the four members.
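
Restating the comparison as arithmetic on the Table 7 throughput values (a sketch only; the 2.0x and 4.0x references assume identical processor models, which Table 3 shows was not the case for the four-member run):

    # Throughput (Tx/sec) from Table 7.
    tx = {"1-member": 219.37, "2-member": 417.61, "4-member": 980.15}
    base = tx["1-member"]

    for name, members in (("2-member", 2), ("4-member", 4)):
        print(f"{name}: measured {tx[name] / base:.2f}x vs {members}.0x "
              f"for perfectly linear scaling on identical members")
    # 2-member: measured 1.90x vs 2.0x for perfectly linear scaling on identical members
    # 4-member: measured 4.47x vs 4.0x for perfectly linear scaling on identical members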

Table 7.1 compares the throughput in each member of the 1 TB workload spread across a four-member SSI cluster.

Table 7.1 Throughput for 1 TB Apache workload distributed across a four-member SSI cluster

Processor Model z10 z10 z196 z196
Throughput ratio (c) 1.00 1.00 1.34 1.33
Note: (c) = Data was calculated;

Previous performance measurements demonstrated that the z196-to-z10 performance ratio varied from 1.36 to 1.89. Overall, the one-, two-, and four-member SSI cluster measurements scaled linearly up to 1 TB, as expected.

Effect of SSI State Changes

Table 8 studies SSI state changes Joined, Leaving, Down, and Joining.

Table 8. CPU-constrained Apache workload during SSI state transitions

SSI state transitions no yes
Run ID A1NWA190 A1NWA191
Tx/sec (c) 982.30 980.90
Number of processors 12 12
Total util/proc (p) 99.3 100.0
Note: Member 1: Running Apache Workload

Members 2, 3, and 4: Continuous SSI State Transitions: Joined, Leaving, Down, and Joining

The base case is running a CPU-constrained workload on one member of a four-member SSI cluster. The other three members of the cluster are initially in a Joined state and idle. Throughout the measurement, the three idle members were continuously shut down and re-IPLed. Compared to the base case, the throughput in the new measurement did not change. All 12 processors continued to run nearly 100% busy. In this experiment, a workload running on one member was not influenced by the state transitions occurring in the other members of the cluster.

Table 9 studies SSI state changes Joined and Suspended or Unknown.

Table 9. CPU-constrained Apache workload during SSI state transitions

SSI state transitions no yes
Run ID A1NWA190 A0NWB080
Tx/sec (c) 982.30 1021.79
Number of Processors 12 12
Total util/proc (p) 99.3 100.0
Note: Member 1: Running Apache Workload

Member 1: Continuous SSI State Transitions: Joined and Unknown or Suspended

The base case is running a CPU-constrained workload on one member of a four-member SSI cluster. The other three members of the cluster are in a Joined state and idle throughout the measurement. Compared to the base case, the throughput in the new measurement did not change significantly. All 12 processors continued to run nearly 100% busy. Therefore, a workload running on one member is not influenced by the state transitions occurring in that member.

Summary and Conclusions

Overall, the SSI cluster environment itself did not degrade workload performance, and distributing resources and workloads across an SSI cluster provided measurable benefits.

Distributing a CPU-constrained workload across an SSI cluster improved the processor efficiency of the individual processors in each z/VM image. This allowed for more real work to get done.

In the virtual-I/O-constrained environment, compared to the one-member SSI cluster measurement, the two-member and four-member SSI cluster measurements increased the number of device exposures available to the workload. This increased the total virtual I/O rate, thus increasing the total workload throughput.

In the memory-constrained environment, a majority of the improvement was attained by improving individual processor efficiency as the workload and resources were distributed across the members. Additionally, a new memory management algorithm caused CP not to use frames below 2 GB in the one-member SSI cluster measurement. The two-member and four-member SSI cluster measurements used frames below 2 GB and this provided a small advantage.

In the set of measurements that scaled up to 1 TB in an SSI environment, as workload and resources were added with each member, the workload throughput increased linearly.

SSI state transitions do not influence workload performance running on individual members.
