IBM: z/VM Performance Report

Memory Management Serialization Contention Relief and 2 TB Central Storage Support

Abstract

Memory Management Serialization Contention Relief (hence, the enhancement) provides performance improvements to the memory management subsystem. It enables workload scaling up to the new z/VM 6.4 maximum supported central storage size of 2 TB.

Spin lock contention in the memory management subsystem has been a barrier to supporting central storage sizes above 1 TB. With z/VM 6.4 extensive changes were made to lock structures resulting in reduced spin lock contention. With these lock structure changes, along with other memory management subsystem changes, IBM measurements demonstrate the ability to scale workloads up to the new z/VM 6.4 maximum supported central storage size of 2 TB.

Background

As systems increase in memory size and number of processors, workloads can grow, putting more demand on the frame manager. The frame manager is the collection of modules in the z/VM memory management subsystem that maintains lists of available frames, manages requests for frames, and coalesces frames as they are returned.

Prior to z/VM 6.4 there were two locks providing serialized access to the lists of available frames: one lock for below-2-GB frames and one lock for above-2-GB frames. Contention for these locks, particularly on the lock for frames above 2 GB (RSA2GLCK), was noticeable with various workloads. This contention was limiting the growth of real memory z/VM could support.

The primary change made in the frame manager for z/VM 6.4 was to organize central storage into available list zones. An available list zone represents a range of central storage, much like the below-2-GB available lists and the above-2-GB available lists represented a range of central storage in prior releases. Management of the available frames within a zone is serialized by a lock unique to that zone. Zone locks are listed as AVZAnnnn and AVZBnnnn in monitor record D0 R23 MRSYTLCK, where nnnn is the zone number. The number and size of zones is determined internally by z/VM and can depend on the maximum potential amount of central storage, the number of attached processors, and whether the zone represents central storage above 2 GB or below 2 GB.

Other improvements to the frame manager include:

The ability to queue frame returns so a task returning a frame is not delayed when the zone lock is not available.

Improved defer logic which will check the global Invalid But Resident (IBR) aging list for a satisfactory frame before deferring the requestor.

Enhanced available list replenishment logic which can run simultaneously to the global IBR aging list replenishment.

More appropriate thresholds for available frame requests.

Another area where improvements have been made is in Page Table Resource Manager (PTRM) page allocations. In heavy paging environments significant lock contention was observed with a single PTRM address space allocation lock. The contention is now avoided by using CPU address to spread PTRM allocations across the 128 PTRM address spaces.

All of these items combined have enabled a new z/VM 6.4 maximum supported central storage size of 2 TB.

Method

A scan of previous IBM measurements revealed the Sweet Spot Priming workload (which uses the Virtual Storage Exerciser) experienced the highest level of spin lock contention on the available lists locks. Based on that finding the Sweet Spot Priming workload was chosen to measure the effectiveness of the enhancement.

In addition, both the Sweet Spot Priming workload and the Apache Scaling workload were used to measure the scalability of workloads up to the new z/VM 6.4 maximum supported central storage size of 2 TB.

Sweet Spot Priming Workload Using VIRSTOEX

The Sweet Spot Priming workload was designed to place high demand on the memory management subsystem. It consists of four sets of users which "prime" their virtual memory by changing data in a predetermined number of pages. This may be viewed as analogous to a customer application reading a database into memory. The workload is designed to overcommit memory by approximately 28%. The four sets of users are logged on sequentially. Each group completes its priming before the next group is logged on. The first three sets of users do not cause paging during priming. For the first three sets of users, only elapsed time is of interest. Each user touches a fixed number of pages based on virtual machine size. The fourth set of users does cause paging during priming. For the fourth set of users, ETR is defined as thousands of pages written to paging DASD per second.

Sweet Spot Priming workload measurements were used to evaluate the reduced lock contention and illustrate the improved performance of the workload as a result. The number of CPUs was held constant while central storage size and the virtual memory size of the users were increased to maintain a constant memory overcommitment level.

A modified z/VM 6.3 Control Program was used to obtain measurements with central storage sizes larger than the z/VM 6.3 maximum supported central storage size of 1 TB.

Table 1 shows the Sweet Spot Priming workload configurations used.

Table 1. Sweet Spot Priming Workload Configurations.
Central Storage in GB	256	512	768	1024	1280	1536	1792	2048
IFL Cores	64	64	64	64	64	64	64	64
CM1x Users	88	88	88	88	88	88	88	88
CM2x Users	88	88	88	88	88	88	88	88
CM3x Users	88	88	88	88	88	88	88	88
CM4x Users	88	88	88	88	88	88	88	88
CMx Virtual Memory Size in GB	6.5	11	15.5	20	24.5	29	33.5	38
Notes: The number of CPUs was held constant while central storage size and virtual memory size of the users were increased to maintain a constant memory overcommitment level. CEC model 2964-NC9. Dedicated LPAR with 64 IFL cores in non-SMT mode. The VIRSTOEX program only touches virtual memory above 2 GB.

Apache Scaling Workload Using Linux

The Apache Scaling workload was used to illustrate a Linux-based webserving workload that scales up to a central storage size of 2 TB. The Apache Scaling workload has a small amount of memory overcommitment. The memory overcommitment was kept small to avoid a large volume of paging that would cause the DASD paging subsystem to become the limiting factor for the workload as it were scaled up with cores, memory, and AWM clients and servers.

To allow comparisons with central storage sizes above 1 TB a modified z/VM 6.3 Control Program was used to obtain measurements with central storage sizes larger than the z/VM 6.3 maximum supported central storage size of 1 TB.

Table 2 shows the Apache Scaling workload configurations used.

Table 2. Apache Scaling Workload Configurations
Central storage in GB	512	1024	1536	2048
IFL cores	8	16	24	32
AWM clients (1 GB)	4	8	12	16
AWM servers (30 GB)	24	48	72	96
Notes: AWM clients and servers are arranged in groups to keep the total number of client/server sessions manageable. Each AWM client has a session with 6 servers. CEC model 2964-NC9. Dedicated LPAR with IFL cores in non-SMT mode.

Results and Discussion

Sweet Spot Priming Workload Measurements Results

Table 3 contains selected results of z/VM 6.3 measurements. Table 4 contains selected results of z/VM 6.4 measurements. Table 5 contains comparisons of selected results of z/VM 6.4 measurements to z/VM 6.3 measurements.

Overall spin lock contention decreased from z/VM 6.3 to z/VM 6.4. This was the expected result of the enhancement. See Figure 1.

ETR was relatively the same in both z/VM 6.3 and z/VM 6.4. This was the expected result. The workload is limited by test system DASD paging infrastructure. See Figure 4.

ITR increased from z/VM 6.3 to z/VM 6.4. See Figure 4.

Processor utilization decreased from z/VM 6.3 to z/VM 6.4. See Figure 5.

Elapsed time for CM1, CM2, and CM3 users (non-paging) priming phases decreased from z/VM 6.3 to z/VM 6.4. Elapsed time scaled linearly as central storage size increased. See Figure 2.

Elapsed time for CM4 users (paging) priming phase remained relatively the same from z/VM 6.3 to z/VM 6.4. The workload is limited by test system DASD paging infrastructure.

Elapsed time for logoff of all users decreased from z/VM 6.3 to z/VM 6.4. Elapsed time scaled linearly as central storage size increased. See Figure 3.

Figure 1 illustrates the spin lock percent busy for the CM4 users (paging) priming phase by central storage size.

Figure 1. Sweet Spot CM4 Users (paging) Priming Spin Lock %Busy by Central Storage Size.

Figure 2 illustrates the elapsed time for the CM1, CM2, and CM3 users (non-paging) priming phases by central storage size.

Figure 2. Sweet Spot CM1/CM2/CM3 Users (non-paging) Priming Time by Central Storage Size.

Figure 3 illustrates the elapsed time for the logoff phase by central storage size.

Figure 3. Sweet Spot Users Logoff Time by Central Storage Size.

Figure 4 illustrates the external and internal transaction rate for the CM4 users (paging) priming phase by central storage size.

Figure 4. Sweet Spot CM4 Users (paging) Priming ETR and ITR by Central Storage Size.

Figure 5 illustrates the percent busy per processor for the CM4 users (paging) phase by central storage size.

Figure 5. Sweet Spot CM4 Users (paging) Priming %Busy per Processor by Central Storage Size.

Table 3 shows the Sweet Spot Priming workload results on z/VM 6.3.

Table 3. Sweet Spot Priming Workload z/VM 6.3 results.
Central Storage in GB	256	512	768	1024	1280	1536	1792	2048
Runid	SSXS6287	SSXS6288	SSXS6289	SSXS628A	SSXS628B	SSXS628C	SSXS628D	SSXS628E
ETR	179.02	178.65	178.25	178.59	178.35	178.07	178.24	178.23
ITR	179.7	179.4	179.1	179.3	181.4	179.0	179.1	179.1
Total Busy per Processor	99.6	99.6	99.5	99.6	98.3	99.5	99.5	99.5
System Busy per Processor	60.3	59.9	60.4	60.1	60.2	60.1	59.4	60.2
Emulation Busy per Processor	0.2	0.2	0.2	0.2	0.2	0.3	0.2	0.2
CP Busy per Processor	99.4	99.4	99.3	99.4	98.1	99.2	99.3	99.3
T/V Ratio	404.3	404.3	409.2	408.9	406.3	395.1	412.5	412.6
Total Spin Lock Busy	5432.05	5424.83	5401.44	5436.85	5287.71	5380.59	5403.33	5372.19
RSA2GLCK Spin Lock Busy	4829.77	4809.88	4767.82	4829.73	4609.50	4718.28	4770.21	4694.73
CM1 Users Priming Time	17	34	50	68	81	99	110	124
CM2 Users Priming Time	18	34	47	58	73	91	110	125
CM3 Users Priming Time	17	31	45	64	79	92	108	123
CM4 Users Priming Time	116	234	338	452	565	678	789	903
Logoff Time	104	212	293	384	508	628	745	879
Notes: CEC model 2964-NC9. Dedicated LPAR with 64 IFL cores in non-SMT mode.

Table 4 shows the Sweet Spot Priming workload results on z/VM 6.4.

Table 4. Sweet Spot Priming Workload z/VM 6.4 results.
Central Storage in GB	256	512	768	1024	1280	1536	1792	2048
Runid	SSYS8154	SSYS8155	SSYS8156	SSYS8151	SSYS8157	SSYS8158	SSYS8159	SSYS8150
ETR	182.64	178.16	179.50	183.30	177.78	169.57	182.95	190.30
ITR	208.0	207.6	211.2	216.4	213.9	204.5	217.0	228.7
Total Busy per Processor	87.8	85.8	85.0	84.7	83.1	82.9	84.3	83.2
System Busy per Processor	65.7	60.8	61.1	59.4	60.1	55.6	56.0	59.8
Emulation Busy per Processor	0.2	0.2	0.2	0.2	0.2	0.2	0.3	0.3
CP Busy per Processor	87.6	85.6	84.8	84.5	82.9	82.7	84.0	82.9
T/V Ratio	356.8	356.8	351.3	341.1	350.6	360.0	336.2	329.3
Total Spin Lock Busy	848.49	793.67	791.25	804.14	802.47	733.87	737.75	830.40
AVZLOCKS Spin Lock Busy	28.90	34.52	34.79	38.38	31.04	38.20	41.31	33.18
CM1 Users Priming Time	8	16	24	32	40	49	56	64
CM2 Users Priming Time	9	16	25	33	41	49	56	65
CM3 Users Priming Time	8	16	24	32	40	49	57	65
CM4 Users Priming Time	116	229	338	444	563	706	767	845
Logoff Time	68	129	206	299	360	446	524	634
Notes: CEC model 2964-NC9. Dedicated LPAR with 64 IFL cores in non-SMT mode.

Table 5 shows the comparison of z/VM 6.4 results to z/VM 6.3 results.

Table 5. Sweet Spot Priming Workload z/VM 6.4 results compared to z/VM 6.3 results.
Central Storage in GB	256	512	768	1024	1280	1536	1792	2048
ETR	+2.0%	+0.3%	+0.7%	+2.6%	-0.3%	-4.8%	+2.6%	+6.8%
ITR	+15.7%	+15.7%	+17.9%	+20.7%	+17.9%	+14.2%	+21.2%	+27.7%
Total Busy per Processor	-11.8%	-13.9%	-14.6%	-15.0%	-15.5%	-16.7%	-15.3%	-16.4%
System Busy per Processor	+9.0%	+1.5%	+1.2%	-1.2%	-0.2%	-7.5%	-5.7%	-0.7%
Emulation Busy per Processor	0.0%	0.0%	0.0%	0.0%	0.0%	-33.3%	+50.0%	+50.0%
CP Busy per Processor	-11.9%	-13.9%	-14.6%	-15.0%	-15.5%	-16.6%	-15.4%	-16.5%
T/V Ratio	-11.7%	-11.7%	-14.1%	-16.6%	-13.7%	-8.9%	-18.5%	-20.2%
Total Spin Lock Busy	-84.4%	-85.4%	-85.4%	-85.2%	-84.8%	-86.4%	-86.3%	-84.5%
Available List Spin Locks Busy	-99.4%	-99.3%	-99.3%	-99.2%	-99.3%	-99.2%	-99.1%	-99.3%
CM1 Users Priming Time	-52.9%	-52.9%	-52.0%	-52.9%	-50.6%	-50.5%	-49.1%	-48.4%
CM2 Users Priming Time	-50.0%	-52.9%	-46.8%	-43.1%	-43.8%	-46.2%	-49.1%	-48.0%
CM3 Users Priming Time	-52.9%	-48.4%	-46.7%	-50.0%	-49.4%	-46.7%	-47.2%	-47.2%
CM4 Users Priming Time	0.0%	-2.1%	0.0%	-1.8%	-0.4%	+4.1%	-2.8%	-6.4%
Logoff Time	-34.6%	-39.2%	-29.7%	-22.1%	-29.1%	-29.0%	-29.7%	-27.9%
Notes: All deltas are percentage difference of z/VM 6.4 measurements compared to z/VM 6.3 measurements. CEC model 2964-NC9. Dedicated LPAR with 64 IFL cores in non-SMT mode.

Apache Scaling Workload Measurements Results

Table 6 contains selected results of z/VM 6.3 measurements. Table 7 contains selected results of z/VM 6.4 measurements. Table 8 contains comparisons of the selected results of z/VM 6.4 measurements to z/VM 6.3 measurements.

Both ETR and ITR improved from z/VM 6.3 to z/VM 6.4 with the largest difference occurring between 1 TB and 2 TB of central storage. See Figure 6.

Processor utilization was mostly unchanged from z/VM 6.3 to z/VM 6.4. There was a small variation at the lower increments of central storage but almost none from 1 TB to 2 TB. See Table 6, Table 7, and Table 8.

Figure 6 illustrates the external and internal transaction rate for the Apache Scaling workload by central storage size.

Figure 6. Apache Scaling workload - ETR and ITR by Central Storage Size.

Table 6 shows the Apache Scaling workload results on z/VM 6.3.

Table 6. Apache Scaling workload z/VM 6.3 results.
IFL Cores	8	16	24	32
Central Storage in GB	512	1024	1536	2048
Run ID	A51X628Z	A1TX628Z	A15X628Z	A2TX628Z
ETR	1955.76	2853.14	3330.71	3751.85
ITR	2087.30	3166.60	3725.60	4312.50
Total Busy per Processor	93.7	90.1	89.4	87.0
Emulation Busy per Processor	68.6	66.3	61.9	59.9
CP Busy per Processor	25.1	23.8	27.5	27.1
T/V Ratio	1.37	1.36	1.44	1.45
Notes: CEC model 2964-NC9. Dedicated LPAR with IFL cores in non-SMT mode.

Table 7 shows the Apache Scaling workload results on z/VM 6.4.

Table 7. Apache Scaling Workload z/VM 6.4 results.
IFL Cores	8	16	24	32
Central Storage in GB	512	1024	1536	2048
Run ID	A51Y715Z	A1TY715Z	A15Y715Z	A2TY715Z
ETR	2024.75	2884.61	3449.17	3808.00
ITR	2135.80	3201.60	3845.20	4392.20
Total Busy per Processor	94.8	90.1	89.7	86.7
Emulation Busy per Processor	68.6	66.1	62.3	59.6
CP Busy per Processor	26.2	24.0	27.4	27.1
T/V Ratio	1.38	1.36	1.44	1.45
Notes: CEC model 2964-NC9. Dedicated LPAR with IFL cores in non-SMT mode.

Table 8 shows the Apache Scaling workload results of z/VM 6.4 compared to z/VM 6.3.

Table 8. Apache Scaling Workload z/VM 6.4 results compared to z/VM 6.3 results.
IFL Cores	8	16	24	32
Central Storage in GB	512	1024	1536	2048
ETR	+3.5%	+1.1%	+3.6%	+1.5%
ITR	+2.3%	+1.1%	+3.2%	+2.0%
Total Busy per Processor	+1.2%	0.0%	+0.3%	-0.3%
Emulation Busy per Processor	0.0%	-0.3%	+0.6%	-0.5%
CP Busy per Processor	+4.4%	+0.8%	-0.4%	0.0%
T/V Ratio	+0.7%	0.0%	0.0%	0.0%
Notes: All deltas are percentage difference of z/VM 6.4 measurements compared to z/VM 6.3 measurements. CEC model 2964-NC9. Dedicated LPAR with IFL cores in non-SMT mode.

Summary and Conclusions

Memory Management Serialization Contention Relief provides performance improvements as central storage size is increased. The results of IBM measurements demonstrate spin lock contention is reduced and workloads scale up to the new z/VM 6.4 maximum supported central storage size of 2 TB.

Contents | Previous | Next