
Simultaneous Multithreading (SMT)

Abstract

z/VM for z13 lets z/VM dispatch work on up to two threads (logical CPUs) of an IFL processor core. This enhancement is supported only for IFLs. Depending on the characteristics of the workload, results in measured workloads ranged from 0.64x to 1.36x in ETR and from 1.01x to 1.97x in ITR.

In an SMT environment, individual virtual CPUs might have lower performance than they have when running on single-threaded cores. Studies have shown that for workloads sensitive to the behavior of individual virtual CPUs, adding virtual processors or adding more servers to the workload can return the ETR to the levels achieved when running without SMT. In general, whether these techniques will work is very much a property of the structure of the workload.

Introduction

This article provides a performance evaluation of select z/VM workloads running in an SMT-2 environment on the new IBM z13.

Prior to the IBM z13, the words IFL, core, logical PU, logical CPU, CPU, and thread were often used interchangeably. This is no longer the case with the IBM z13 and the introduction of SMT to z Systems. With z/VM for z13, z/VM can now dispatch work on up to two threads of a z13 IFL core. Though IBM z13 SMT support includes both IFLs and zIIPs, z/VM supports SMT only on IFLs.

Two threads of the same core share the cache and the execution unit. Each thread has separate registers, timing facilities, translation lookaside buffer (TLB) entries, and program status word (PSW).

Enabling SMT-2 in z/VM

In z/VM, SMT-2 is disabled by default. To enable two threads per IFL core, include the following statement in the system configuration file.

    MULTITHreading ENAble

Whether or not z/VM opts in for SMT, its LPAR's units of dispatchability continue to be logical CPUs. When z/VM does not opt in for SMT, PR/SM dispatches the partition's logical CPUs on single-threaded physical cores. When z/VM opts in for SMT, PR/SM dispatches the partition's logical CPUs on threads of a multithreaded core. PR/SM assures that when both threads of a multithreaded physical core are in use, they are always both running logical CPUs of the same LPAR.

Once enabled, SMT-2 applies to the whole logical partition, and disabling it requires an IPL.

Vertical Polarization

Enabling the z/VM SMT facility requires that z/VM be configured to run with HiperDispatch vertical polarization mode enabled. The rationale is that vertical polarization provides tighter core affinity and therefore better cache affinity. To configure the LPAR to run in vertical polarization mode, include the following statement in the system configuration file.

    SRM POLARization VERTical

Reshuffle Algorithm

Enabling the z/VM SMT facility requires that z/VM be configured to use the work balancing algorithm reshuffle. The alternative work balancing algorithm, rebalance, is not supported with SMT-2 as performance studies have shown the rebalance algorithm is effective for only a very limited class of workloads. To configure the LPAR to use the reshuffle work balancing algorithm, include the following statement in the system configuration file.

    SRM DSPWDMethod REShuffle

Threads of a Core Draw Work from a Single Dispatch Vector (DV)

z/VM maintains dispatch vectors on a core basis, not on a thread basis. There are several benefits of having threads of a core draw from the same dispatch vector. Threads of the same core share the same L1 and L2 cache, so there is limited cache penalty in moving a guest virtual CPU between threads of a core. Further, because of features of the reshuffle algorithm, there is a tendency to place guest virtual CPUs together in the same DV. Having the threads of a core draw from the same DV might increase the likelihood that different virtual CPUs of the same guest will be dispatched concurrently on threads of the same core. Last, having threads of a core draw from a shared DV helps reduce stealing. Giving each thread its own DV would cause VMDBKs to be spread more thinly across DVs, making the system more likely to steal. By associating the two threads with the same DV, work is automatically balanced between them without the need for stealing.
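To illustrate the stealing argument, the following toy Python model (entirely hypothetical, and in no way CP's actual dispatcher) contrasts two threads drawing from one shared queue against each thread owning its own queue. With per-thread queues, a thread whose queue runs dry must steal; with a shared queue, the two threads balance automatically.

    import random
    from collections import deque

    random.seed(1)
    work = list(range(20))   # toy stand-ins for runnable virtual CPUs

    # Per-thread queues: a thread with an empty queue must steal.
    q = [deque(work[0::2]), deque(work[1::2])]
    steals = 0
    for _ in work:
        t = random.choice([0, 1])    # whichever thread frees up next
        if q[t]:
            q[t].popleft()
        else:
            steals += 1              # cross-queue steal required
            q[1 - t].popleft()
    print("per-thread queues, steals:", steals)

    # Shared DV: both threads pop from one queue; stealing never arises.
    shared = deque(work)
    for _ in work:
        shared.popleft()
    print("shared DV, steals: 0")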

Thread Affinity

There is a TLB penalty when a virtual CPU moves between threads, whether or not the threads are on the same core. To minimize this penalty, z/VM implements thread affinity, which makes an effort to keep a virtual CPU on the same thread of a core as long as the virtual CPU stays in the core's DV.

Preemption

Preemption controls whether the virtual CPU currently dispatched on a logical processor will be preempted when new work of higher priority is added to that logical processor's DV. Preemption is disabled with SMT-2. This lets the current virtual CPU remain on the logical processor, where it achieves better processor efficiency because it continues to benefit from existing L1, L2, and TLB content.

Minor Time Slice

With SMT-2, the default virtual machine minor time slice (DSPSLICE) is increased to 10 milliseconds, to let a virtual CPU run longer on a thread. This helps the virtual CPU benefit from buildup in the L1, L2, and TLB, and in part compensates for the lower throughput of a thread versus a whole core.

Time Slice Early

Time Slice Early is a new function that lets CP improve processor efficiency. When SMT-2 is enabled and a virtual CPU loads a wait PSW, CP ends the virtual CPU's minor time slice if the time slice is at least 50% complete. This helps assure that a virtual CPU is not holding a guest spinlock at what would otherwise be the end of its minor time slice.

In-Chip Steal Barrier

In z/VM 6.3, the HiperDispatch enhancement introduced the notion of steal barriers. For a logical CPU to steal a VMDBK cross-chip or cross-book, certain severity criteria had to be met: the longer the topological drag, the more severe the situation needed to be before a logical CPU would steal. This strategy kept VMDBKs from being dragged long topological distances unless the situation was dire enough. With SMT, the notion of steal barriers has been extended to include within-chip steals.

MT1-Equivalent Time versus Raw Time

Raw time is a measure of the CPU time each virtual CPU spent dispatched. When a virtual CPU runs on a single-threaded core, raw time measures usage of a core; when a virtual CPU runs on a thread of a multithreaded core, raw time measures usage of a thread. MT1-equivalent time is a measure of effective capacity consumed, taking into account the effects of multithreading. MT1-equivalent time approximates the time that would have been consumed if the workload had been run with multithreading disabled, that is, with all core resources available to one thread. The effect of the adjustment is to "discount" the raw time to compensate for the slowdown induced by the activity on the other thread of the core.
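As a rough illustration of the discounting idea, here is a minimal sketch; the discount factor is invented for the example, and CP derives the real adjustment from hardware measurements rather than from any fixed factor.

    # Hypothetical sketch of discounting raw thread time to MT1-equivalent
    # time. The discount factor is invented; CP derives the real
    # adjustment from hardware measurements.

    def mt1_equivalent(raw_seconds, discount_factor):
        # discount_factor < 1.0 reflects the slowdown a thread experiences
        # while the other thread of its core is also active.
        return raw_seconds * discount_factor

    # 10 seconds of raw time on a busy two-threaded core might represent
    # only about 6.5 seconds of single-threaded core capacity.
    print(mt1_equivalent(10.0, 0.65))   # 6.5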

Live Guest Relocation (LGR)

When a non-SMT system and an SMT-2 system are joined in an SSI, a guest that is eligible for LGR can be relocated between the systems.

SMT Not Virtualized to Guest

SMT is not virtualized to the guest; it is functionally transparent, so the guest does not need to be SMT-aware to gain value.

Multithreading Metrics

The following new metrics are available in CP monitor record MRSYTPRP, D0 R2. A small sketch after the list illustrates how two of them, average thread density and core busy time, can be computed.

  • Average thread density (TD): This represents the average number of threads in use while the core was in use. Periods of time when neither thread was in use are ignored in the TD calculation. When the number of threads per core is two, the thread density is a value between one and two, inclusive.
  • Core productivity: This is a metric that represents a ratio of how much work* was accomplished to the maximum amount of work* that could have been accomplished. With single-threaded cores, productivity is 100% because whenever the core is dispatched, it is executing as many instructions as possible. With SMT-2, if both threads are executing instructions all of the time during the interval, productivity is 100%.
  • Core busy time: This measures the amount of time work* was dispatched on at least one thread of the core in an interval.
  • MT utilization: This measures how much of the maximum core capacity was used. It is a combination of busy time and productivity.
  • Capacity factor: This metric represents the ratio of how much work* was accomplished on the core to the amount of work* that would have been accomplished if only one thread had been active. For example, if a workload running on two-threaded cores had a capacity factor of 130%, it meant that the cores were able to accomplish 1.3x (130%) the amount of work* that would have been accomplished on single-threaded cores. For single-threaded cores capacity factor is always 100%.
  • Maximum capacity factor: This metric represents a ratio of the maximum amount of work* that can be accomplished if two threads were active per core to the maximum amount of work* that would have been accomplished if only one thread had been active per core. For single-threaded cores, maximum capacity factor is always 100%. Capacity factor is equal to maximum capacity factor only when the core ran at thread density two for the entire interval.

* The term work is used to describe a relative instruction completion rate. It is not intended to describe how much work a workload is actually accomplishing.
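To make thread density and core busy time concrete, here is a small Python sketch over invented per-sample thread counts; it is a toy illustration, not the monitor's actual calculation.

    # Hypothetical samples: number of threads of a two-threaded core
    # dispatched at each sample point (invented data).
    samples = [0, 1, 2, 2, 1, 2, 0, 2]

    busy = [s for s in samples if s > 0]      # core busy: >= 1 thread in use
    core_busy_pct = 100.0 * len(busy) / len(samples)
    thread_density = sum(busy) / len(busy)    # averaged over busy time only

    print(core_busy_pct)               # 75.0 -> core busy 75% of the interval
    print(round(thread_density, 2))    # 1.67 -> avg threads in use while busy

    # The text describes MT utilization as a combination of core busy time
    # and productivity; taking their product is one plausible reading
    # (an assumption, not a documented formula).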

Performance Toolkit Updates for SMT

To help support z/VM's operation in SMT mode, IBM updated these Perfkit reports.

  • FCX154 / SYSSET includes the state of the multithreading mode plus multithreading settings since the last IPL, per processor type.
  • FCX180 / SYSCONF includes the state of the multithreading mode.
  • FCX287 / TOPOLOG includes the state of the multithreading mode.
  • FCX303 / DSVBK includes core and thread information per logical processor.
  • FCX304 / PRCLOG includes core and thread information per logical processor.

Perfkit does not report the new D0 R2 multithreading metrics.

Method

z13 SMT-2 was evaluated by direct comparison to non-SMT. Each individual comparison used an identical logical partition configuration, z/VM system, and Linux level for both SMT-2 and non-SMT. Changes in the number of users, number of virtual processors, and number of guests are described in the individual sections.

Specialized Virtual Storage Exerciser, Apache, and DayTrader workloads were used to evaluate the characteristics of SMT-2 in a variety of configurations with a wide variety of workloads. The Master Processor Exerciser was used to evaluate the effect of multithreading on applications having a z/VM master processor requirement.

Results varied widely for the measured workloads.

Best results occurred for applications having highly parallel activity and no single point of serialization. This will be demonstrated by the results of an Apache workload with a sufficient number of MP clients and MP servers and without any specific limitations that would prevent productive use of all the available processor cycles.

No improvement is expected for applications having a single point of serialization. Specific serializations in any given workload might not be easily identified. This will be demonstrated by the results of an Apache workload with a limited number of UP clients and by an application serialized by the z/VM master processor.

Specific configurations chosen for comparison included storage sizes from 12 GB to 1 TB and dedicated logical processors from 1 to 64. Only eight specific experiments are discussed in this article.

New z/VM monitor data available with the SMT-2 support is described in z/VM Performance Management.

Results and Discussion

With SMT-2, calculated ITR values might not be as meaningful as values calculated for non-SMT. The ITR calculation reflects the efficiency of the logical processors at their current utilization, but with SMT-2, thread efficiency generally decreases as thread density increases. The results demonstrate a wide range of thread densities.
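The ITR values reported in the tables below are consistent with dividing ETR by total processor utilization. The following check against Table 1's non-SMT row is an observation from the reported numbers, not a documented formula.

    # ITR appears consistent with ETR / (total util per processor / 100),
    # checked against Table 1's non-SMT row.
    etr = 7396.28          # transactions/sec
    util_pct = 95.6        # total util/proc
    itr = etr / (util_pct / 100.0)
    print(round(itr, 2))   # 7736.69, matching the reported ITR

Because thread efficiency declines as thread density rises, an ITR computed this way can overstate the remaining capacity of an SMT-2 partition.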

SMT-2 Ideal Application

Table 1 contains a comparison of selected values between SMT-2 and non-SMT for an Apache workload with ideal SMT-2 characteristics.

The workload consists of highly parallel activity with no single point of serialization. There are 2 AWM clients and 2 Apache servers, each defined with 4 virtual processors. This provides 16 virtual processors to drive the 4 logical processors with non-SMT or the 8 logical processors with SMT-2. There are 16 AWM connections between each client and each server, therefore 64 concurrent sessions. These should be sufficient to keep the 16 virtual processors busy. This configuration provides a demonstration of the value that can be obtained for a workload that has ideal SMT characteristics.

For this workload SMT-2 provided a 36% increase in transaction rate, a 53% increase in ITR, a 25% decrease in average response time, and an 11% decrease in processor utilization.

Average thread density for the SMT-2 measurement was 1.83.

Table 1. SMT-2 Ideal Application

Run ID AMPDGLD0 AMPDGLD1 Delta Pct
Multithreading disabled enabled    
Logical processors 4 8 4 100.0
ETR 7396.28 10072.69 2676.41 36.2
ITR 7736.69 11906.25 4169.56 53.9
Total util/proc 95.6 84.6 -11.0 -11.5
AWM avg resp time 0.008674 0.006445 -0.002229 -25.7
AWM client util 151.381 272.698 121.317 80.1
Apache server util 39.254 64.810 25.556 65.1
SMT-2 avg thread density na 1.83    
Notes: z/VM for z13; 2964-NC9; 4 dedicated IFL cores; 30 GB central storage; storage-rich; Apache workload; 2 AWM clients (4 virtual CPUs, 1 GB); 2 Apache servers (4 virtual CPUs, 512 MB); 16 AWM connections to each server; 2 URL files; 15 KB avg URL size; Linux SLES11 SP3.

Maximum z/VM 6.3 Storage Configuration

Table 2 contains a comparison of selected values between SMT-2 and non-SMT for an Apache workload using the maximum supported 1 TB of storage. This workload provides a good demonstration of the value of SMT-2 for a workload with no specific serialization.

The workload has 4 AWM clients, each defined with 4 virtual processors. Average utilization with non-SMT is 92% of a virtual processor, which provides enough excess capacity for the expected increase with SMT-2. The 16 AWM client virtual processors are enough to support the 8 logical processors with non-SMT or the 16 logical processors with SMT-2.

The workload has 128 Apache servers, each with 1 virtual processor. Average utilization with non-SMT is only 2.5%, which provides enough excess capacity for the expected increase with SMT-2. The 128 Apache server virtual processors are enough to support the 8 logical processors with non-SMT or the 16 logical processors with SMT-2.

Each of the 128 Apache servers has 10 GB of virtual storage and is primed with 10000 URL files. Each URL file is 1 MB so all the virtual storage in each Apache server participates in the measurement. The 128 fully populated Apache servers exceed the 1 TB of central storage, thus providing heavy DASD paging activity for this workload.

There is a single AWM client connection to each Apache server, thus creating 512 parallel sessions to supply work to the 128 Apache server virtual processors.

There are 224 paging devices to handle the paging activity.

For this workload SMT-2 provided a 21% increase in transaction rate, a 30% increase in ITR, an 18% decrease in average response time, and a 7% decrease in processor utilization.

Average thread density for the SMT-2 measurement was 1.82.

DASD paging rate increased 16%.

Although there was a high percentage increase in spin lock time, it was not a major factor in the results.

Table 2. SMT-2 Maximum Storage

Run ID A1TDGLD0 A1TDGLD1 Delta Pct
Multithreading disabled enabled    
Logical processors 8 16 8 100.0
ETR 854.79 1036.09 181.30 21.2
ITR 972.46 1268.16 295.70 30.4
AWM avg resp time 0.541012 0.442838 -0.098174 -18.1
Total util/proc 87.9 81.7 -6.2 -7.1
Apache server util 2.40 5.12 2.72 113.3
AWM client util 92.37 149.46 57.09 61.8
SMT-2 avg thread density na 1.82    
DASD page rate 99245.0 115000.0 15755.0 15.9
Reported sys spin %busy 0.6 3.6 3.0 500.0
Notes: z/VM for z13; 2964-NC9; 8 dedicated IFL cores; 1 TB central storage; Apache workload; 4 AWM clients (4 virtual CPUs, 4 GB); 128 Apache servers (1 virtual CPU, 10 GB); 1 AWM connection to each server; 10000 URL files; 1 MB avg URL size; Linux SLES11 SP1; Four 8.0 Gbps FICON switched channels; 2107-E8 control unit; 224 3390-54 paging volumes.

Maximum SMT-2 Processor Configuration

Table 3 contains a comparison of selected values between SMT-2 and non-SMT for a DayTrader workload using the maximum supported 32 cores in SMT-2.

For this workload, comparisons are made at approximately 95% logical processor utilization. The number of servers is changed to create the desired logical processor utilization.

This workload consists of a single AWM client connected to the desired number of DayTrader servers through a local VSWITCH in the same logical partition.

For this experiment, the AWM client has 4 virtual processors, 1 GB of virtual storage, and a relative share setting of 10000; each DayTrader server has 2 virtual processors, 1 GB of virtual storage, and a relative share setting of 100.

There are 46 DayTrader servers in the non-SMT measurement and 116 DayTrader servers in the SMT-2 measurement. The increased number of servers affects the measured AWM response time.

This workload provides a good demonstration of the value of SMT-2 for a workload with no specific serialization.

For this workload SMT-2 provided a 12% increase in transaction rate, a 19% increase in ITR, a 172% increase in average response time, and a 5.3% decrease in processor utilization.

Average thread density for the SMT-2 measurement was 1.89.

Table 3. SMT-2 Maximum Processor

Run ID DT1AU03 DT2AU12 Delta Pct
Multithreading disabled enabled    
DayTrader servers 46 116 70 152.2
ETR 2447.40 2763.64 316.24 12.9
ITR 2512.73 2997.44 484.71 19.3
AWM avg resp time 0.021897 0.059584 0.037687 172.1
Total util/proc 97.4 92.2 -5.2 -5.3
SMT-2 avg thread density na 1.89    
Notes: z/VM 6.3 with VM65586; 2964-NC9; 32 dedicated IFL cores; 256 GB central storage; storage-rich; DayTrader workload; 1 AWM client (4 virtual CPUs, 1 GB, 10000 relative share); DayTrader servers (2 virtual CPUs, 1 GB, 100 relative share); 1 AWM connection to each server; Linux RedHat 6.0.

LINUX-only Mode Partition with a Single Processor Serialization Application

The first two data columns in Table 4 contain a comparison of selected values between SMT-2 and non-SMT for an Apache workload that is serialized by the number of virtual processors available for the AWM clients.

Because this workload has a single point of serialization, the results indicate that it is not a good candidate for SMT-2 without mitigation. The workload consists of 3 AWM clients, each with a single virtual processor. With non-SMT, average client utilization was 70%, so with the reduced per-thread capacity of SMT-2 one would predict each client needing more than 100% of a thread.

There are 12 Apache servers, each with a single virtual processor. With non-SMT the average server utilization is only 6%, so no serialization is expected there.

For this workload SMT-2 provided a 35% decrease in transaction rate, a 1% increase in ITR, a 36% decrease in processor utilization, and a 45% increase in AWM response time.

Average thread density for the SMT-2 measurement was 1.28.

With SMT-2 the AWM client virtual processors reached 100% utilization at a lower workload throughput than with non-SMT.

Serialization for this workload can be removed by adding virtual processors to the existing clients or by adding more client virtual machines. The third and fourth columns of Table 4 contain results for these two methods of removing the serialization.

For the measurement in the third data column of Table 4, an additional virtual processor was added to the existing 3 AWM clients. This increases the total AWM client virtual processors from 3 to 6. Overall results for this experiment show a 64% increase in transaction rate, an 8% increase in ITR, a 50% increase in processor utilization, and a 38% decrease in AWM response time. Average thread density for this measurement was 1.95. These results are now better than the original non-SMT measurement.

For the measurement in the fourth data column of Table 4, 3 additional AWM clients were added to the original SMT-2 configuration. This increases the total AWM client virtual processors from 3 to 6. It also increases the number of AWM sessions from 36 to 72. The increased sessions will tend to increase the AWM response time. Compared to the original SMT-2 measurement, overall results for this experiment show a 100% increase in transaction rate, a 28% increase in ITR, a 56% increase in processor utilization, and a 17% increase in AWM response time. Average thread density for this measurement was 1.99. These results are now better than the original non-SMT measurement.

The multithreading metrics for these measurements are discussed below.

  • The SMT productivity value indicates whether all the SMT capacity of the core has been consumed. A value less than 1.0 indicates that it might be possible to get additional core throughput by increasing core utilization. A core busy value near 100% with a low thread density has more remaining capacity than one with high core utilization and high thread density. If core throughput is higher when both threads are in use than when only one thread is in use, then core throughput might increase as thread density increases.
  • SMT capacity factor values above 1.0 provide an indication of how much higher core throughput is when both threads are active compared to when only one thread is active. It varies relative to levels of contention for core resources between the instruction streams running on the two threads, levels of cache contention, and other factors that influence core efficiency. SMT capacity factor is not an indication of how well the overall workload performs when SMT is enabled, as can be seen by comparing the SMT capacity factor and ETR values for the set of runs. The mitigation approaches in these cases resulted in significant changes in the mix of work running on the cores so that a comparison of the SMT capacity factors in isolation could be misleading. From an overall workload perspective the ETR is a more important metric.
  • An SMT maximum capacity factor value larger than the SMT capacity factor indicates that the core might be able to accomplish more work* as thread density increases. Both values are based on data collected only while at least one thread is active. As the thread density approaches either of its limits, the values might become less reliable because of limited data for one of the cases. This might have been a factor in this workload because core utilization is very high.

    * The term work is used to describe a relative instruction completion rate. It is not intended to describe how much work a workload is actually accomplishing.

Table 4. Serialized Application

Run ID APNDGLD0 APNDGLD1 APNDGLDF APNDGLDD
Multithreading disabled enabled enabled enabled
Logical processors 3 6 6 6
AWM clients 3 3 3 6
AWM client virt proc 1 1 2 1
Total client virt proc 3 3 6 6
ETR ratio 1.000 0.649 1.065 1.303
ITR ratio 1.000 1.017 1.102 1.305
ETR 7787.11 5051.55 8292.57 10143.94
ITR 8128.51 8267.68 8955.26 10610.82
AWM avg resp time 0.004897 0.007102 0.004388 0.008332
Total util/proc 95.8 61.1 92.6 95.6
AWM client util 70.4 95.4 143.7 73.6
Apache server util 6.19 6.53 10.2 10.8
SMT-2 IFL core busy % 95.8 95.5 95.1 95.7
SMT-2 Avg IFL thread density na 1.28 1.95 1.99
SMT-2 productivity na 0.90 0.98 1.00
SMT-2 capacity factor na 1.04 1.49 1.07
SMT-2 maximum capacity factor na 1.15 1.51 1.07
Notes: z/VM for z13; 2964-NC9; 3 dedicated IFL cores; 12 GB central storage; storage-rich; Apache workload; 12 Apache servers (1 virtual CPU, 10 GB); 1 AWM connection to each server; 2 URL files; Linux SLES11 SP1; 15 KB avg URL size.

LINUX-only Mode Partition with a z/VM Master Processor Serialization Application

Table 5 contains a comparison of selected values between SMT-2 and non-SMT for a workload that is serialized by the z/VM master processor.

The Master Processor Exerciser was used to evaluate the effect of multithreading on applications having a z/VM master processor requirement. The workload consists of an application that requires use of the z/VM master processor in each transaction. In a LINUX-only mode partition, both the master and the non-master portions of the workload execute on logical IFL processors; therefore the master logical processor is one thread of an IFL core. Because this workload has a serialization point, it is a good workload for studying the effect SMT can have on serialized workloads.

For this workload SMT-2 provided a 17% decrease in transaction rate, a 97% increase in ITR, and a 58% decrease in processor utilization.

This is a good example of an SMT-2 ITR value that is not very meaningful.

z/VM master processor utilization decreased 2.8%.

Average thread density for the SMT-2 measurement was 1.20.

Table 5. Linux-only Partition Master Application

Run ID STXS210E STXS210F Delta Pct
Multithreading disabled enabled    
Logical processors 4 8 4 100.0
Master util/proc 100.1 97.3 -2.8 -2.8
Total util/proc 80.2 33.4 -46.8 -58.4
ETR 1572.70 1291.00 -281.70 -17.9
ITR 1960.97 3865.27 1904.30 97.1
SMT-2 avg thread density na 1.20    
Notes: z/VM for z13; 2964-NC9; 4 dedicated IFL cores; 30 GB central storage; storage-rich; 8 VIRSTOMP users (8 virtual CPUs, 144 GB); VIRSTOMP parameters (I=1024 KB,E=144 GB,C=8,B=1).

z/VM-mode Partition with a z/VM Master Processor Serialization Application

Table 6 contains a comparison of selected values between SMT-2 and non-SMT for a workload that is serialized by the z/VM master processor.

The same Master Processor Exerciser used with the LINUX-only mode partition was used to evaluate the effect of multithreading on applications having a z/VM master processor requirement in a z/VM-mode partition. The workload consists of an application that requires use of the z/VM master processor in each transaction. In a z/VM-mode partition, the z/VM master processor is on a logical CP processor, which is always on a non-SMT core, but the non-master portion of the workload executes on logical IFL processors, which run on SMT-2 cores. Because this workload has a serialization point, it is a good workload for studying the effect SMT can have on serialized workloads.

For this workload SMT-2 provided a 28% decrease in transaction rate, a 63% increase in ITR, and a 56% decrease in processor utilization.

z/VM master processor utilization decreased 27%. No specific reason is yet known for this low master processor utilization.

Although no detail is provided in this article, the results in Table 5 and Table 6 provide a valid comparison between a LINUX-only mode partition and a z/VM-mode partition.

Average thread density for the SMT-2 measurement was 1.16.

Table 6. z/VM-Mode Partition Master Application

Run ID STXS210G STXS210H Delta Pct
Multithreading disabled enabled    
Logical processors 6 10 4 66.7
Logical IFLs 4 8 4 100.0
Logical CP 2 2 0 0.0
ETR 337.50 240.40 -97.10 -28.8
ITR 881.20 1439.52 558.32 63.4
Total util/proc 38.3 16.7 -21.6 -56.4
Master util/proc 100.1 72.2 -27.9 -27.9
CP Total util/proc 50.3 36.3 -14.0 -27.8
IFL Total util/proc 32.3 11.8 -20.5 -63.5
SMT-2 Avg IFL thread density na 1.16    
Notes: z/VM for z13; 2964-NC9; 4 dedicated IFL cores; 2 dedicated CP cores; 30 GB central storage; storage-rich; 8 VIRSTOMP users (8 virtual CPUs, 144 GB); VIRSTOMP parameters (I=1024 KB,E=144 GB,C=8,B=1).

z/VM Apache CPU Pooling Workload

Table 7 contains a comparison of selected values between SMT-2 and non-SMT for a CPU pooling workload.

See CPU Pooling for information about the Apache CPU pooling workload and previous results.

A workload with both CAPACITY-limited CPU pools and LIMITHARD-limited CPU pools was selected because it provided the most comprehensive view.

With SMT-2, CAPACITY-limited CPU pools are limited based on the utilization of threads rather than the utilization of cores, so reduced capacity is expected.

With SMT-2, LIMITHARD-limited CPU pools are based on a percentage of the available resources, so when the number of logical processors doubles, their maximum utilization will double.

The measured workload has 6 AWM clients that are not part of any CPU pool. Each AWM client has 1 virtual processor. There are 16 Apache servers, each with 4 virtual processors. The 16 Apache servers are divided into four CPU pools, two limited by capacity and two limited by LIMITHARD. Each CPU pool has four Apache servers.

The CAPACITY-limited CPU pools are entitled to 40% of a core in both the non-SMT and SMT-2 environments. However, in SMT-2 the limiting algorithm used thread time, so the pools were effectively limited to 40% of a thread. The LIMITHARD-limited CPU pools are entitled to 5% of the available resources, which is 40% of a core with non-SMT and 80% of a thread with SMT-2. Thus the entitlements of the CPU pools were identical with non-SMT but are no longer identical with SMT-2.
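Worked out numerically (the pool limits and processor counts come from the text; the arithmetic below is just a restatement):

    # Entitlement arithmetic for the Table 7 pools (figures from the text).
    cores, threads_per_core = 8, 2

    # LIMITHARD pools: 5% of the partition's logical processors.
    limithard_nonsmt = 0.05 * cores                     # 0.40 of a core
    limithard_smt2 = 0.05 * cores * threads_per_core    # 0.80 of a thread
    print(limithard_nonsmt, limithard_smt2)             # 0.4 0.8

    # CAPACITY pools: a fixed 40% entitlement. With SMT-2 the limiter
    # counted thread time, so the same 40% applied to a thread (roughly
    # half a core's capacity), not to a core.
    capacity_nonsmt = 0.40    # of a core
    capacity_smt2 = 0.40      # of a thread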

In the non-SMT measurement, utilizations for the 4 pools were identical and equal to their entitled amount. In the SMT-2 measurement, utilizations differ widely between the two types of CPU pools.

With SMT-2, utilizations of the CAPACITY-limited CPU pools decreased 2%. With SMT-2, utilizations of the LIMITHARD-limited CPU pools increased 29%.

With SMT-2, none of the 4 CPU pools consumed their entitled utilization. The primary reason for not reaching their entitled utilization in this experiment is serialization in the AWM clients. Average utilization of the 6 AWM virtual processors approached 100% in the SMT-2 measurement and prevented the Apache servers from reaching their entitled utilization.

For this workload SMT-2 provided a 5.5% decrease in transaction rate, a 57% increase in ITR, a 40% decrease in processor utilization, and an 8.7% increase in AWM response time.

Average thread density for the SMT-2 measurement was 1.20.

Results indicate that caution is needed for CPU pooling workloads with SMT-2.

Table 7. CPU Pooling

Run ID APLDGLD2 APLDGLD6 Delta Pct
Multithreading disabled enabled    
Logical processors 8 16 8 100.0
CPUPOOL1 limit type LIMITHARD LIMITHARD    
CPUPOOL2 limit type LIMITHARD LIMITHARD    
CPUPOOL3 limit type CAPACITY CAPACITY    
CPUPOOL4 limit type CAPACITY CAPACITY    
CPUPOOL1 max sh 4.99878 4.99878 0.00000 0.0
CPUPOOL2 max sh 4.99878 4.99878 0.00000 0.0
CPUPOOL3 max sh 39.99939 39.99939 0.00000 0.0
CPUPOOL4 max sh 39.99939 39.99939 0.00000 0.0
CPUPOOL1 mean %util 38.78863 50.28028 11.49165 29.6
CPUPOOL2 mean %util 38.81830 49.96539 11.14709 28.7
CPUPOOL3 mean %util 38.83480 37.89933 -0.93547 -2.4
CPUPOOL4 mean %util 38.83230 38.03735 -0.79495 -2.0
CPUPOOL1 max %util 39.99057 54.70999 14.71942 36.8
CPUPOOL2 max %util 39.99059 58.76215 18.77156 46.9
CPUPOOL3 max %util 39.99990 39.99968 -0.00022 0.0
CPUPOOL4 max %util 39.99989 40.00057 0.00068 0.0
ETR 10869.13 10269.44 -599.69 -5.5
ITR 13535.65 21350.19 7814.54 57.7
AWM avg resp time 0.009208 0.010010 0.000802 8.7
Total util/proc 80.3 48.1 -32.2 -40.1
AWM client util 80.269 97.000 16.731 20.8
SMT-2 avg thread density na 1.20    
Notes: z/VM for z13; 2964-NC9; 8 dedicated IFL cores; 128 GB central storage; storage-rich; Apache workload; Linux SLES11 SP1; 6 AWM clients (1 virtual CPU, 1 GB); 16 Apache servers (1 virtual CPU, 10 GB); 1 AWM connection to each server; 4 CPU pools (4 Apache servers in each); 10000 URL files; 1 MB avg URL size.

Live Guest Relocation Workload

Table 8 contains a comparison of selected values between SMT-2 and non-SMT for a 25-user live guest relocation workload. This workload provides a good demonstration of various characteristics of the SSI infrastructure on the z13 and factors that affect live guest relocation with SMT-2.

See Live Guest Relocation for information about the live guest relocation workload and previous results.

This evaluation was completed by relocating 25 identical Linux guests. Each guest had 2 virtual processors and 4 GB of virtual storage. Each guest was running the PING, PFAULT, and BLAST applications. PING provides network I/O. PFAULT uses processor cycles and randomly references storage, thereby constantly changing storage pages. BLAST generates application I/O.

Relocations were done synchronously using the SYNC option of the VMRELOCATE command.

The measurement was completed in a two-member SSI cluster with identical configurations and connected by an ISFC logical link made up of 16 CTCs on four FICON CTC 8 Gb CHPIDs.

There was no other active work on either the source or destination system.

Compared to the non-SMT measurement, average quiesce time increased 25.9% and total relocation time increased 9.6%.

Average thread density for the SMT-2 measurement was 1.71.

Several independent factors influenced these relocation results.

  • Some of the running applications showed an improvement with SMT-2. The BLAST application showed a 34% increase in completions, and the PFAULT application showed a 71% increase in completions. The PING completions were nearly identical.

  • The total number of relocation passes decreased 15%. With non-SMT, the number of passes varied over time: the early relocating guests were near the maximum (16), the number slowly declined, and the last few guests to relocate were near the maximum number of passes allowed when no progress is observed (8). With SMT-2, nearly every guest was close to 8 passes, indicating that progress was not being made. The applications' ability to change more pages is likely the reason for the change in passes.

  • Despite the decrease in the total number of passes, the total number of relocated pages increased 12%. SMT-2 provided the ability for the applications to change pages faster and thus more pages were relocated in the intermediate passes (2 through N-2).

  • The total number of pages relocated during quiesce increased 51% and is thus a factor for the increased quiesce time. Because this results from the ability of the application to change pages faster, perhaps increased quiesce time is an expected and acceptable effect.

  • Quiesce time is also affected by serialized code paths running on a thread rather than on a core.

Table 8. Live Guest Relocation

Run ID GLDS1153 GLDS1155 Delta Pct
Multithreading disabled enabled    
Logical processors 4 8 4 100.0
Avg relocate time (usec) 5772582 6326926 554344 9.6
Avg quiesce time (usec) 707176 890503 183327 25.9
PFAULT completions 13610610 23389467 9778857 71.8
BLAST completions 695 933 238 34.2
PING completions 520 522 2 0.4
Total memory move passes 241 203 -38 -15.7
Total pages moved 27418394 30959480 3541086 12.9
Pages moved during quiesce 1731296 2627289 895993 51.7
SMT-2 avg thread density na 1.71    
Notes: z/VM for z13; 2964-NC9; 4 dedicated IFL cores; 51 GB central storage; storage-rich; Linux workload; 25 Linux servers (2 virtual CPUs, 4 GB); Linux SLES10 SP2; Linux server applications (PFAULT, PING, BLAST); ISFC logical link (16 CTCs, 4 FICON CHPIDs, 8 Gb).

Summary and Conclusions

  • Results in measured workloads varied widely.
  • Best results were observed for applications having highly parallel activity and no single point of serialization.
  • No improvements were observed for applications having a single point of serialization.
  • To overcome serialization, adjust the workload where possible.
  • Workloads that have a heavy dependency on the z/VM master processor are not good candidates for SMT-2. In z/VM Performance Toolkit, the master processor can be identified from FCX100 CPU and FCX180 SYSCONF.
  • Results indicate that caution is needed for CPU pooling workloads with SMT-2.
  • The multithreading metrics provide information about how well the cores perform when SMT is enabled, but they do not take into consideration other factors that influence workload throughput. The values reported by the metrics are directly related to core utilization, thread density, and ITR. There is no direct relationship with ETR or with transaction response time.
  • As core utilization and thread density increase, core efficiency might decrease, so using ITR to extrapolate remaining partition capacity might be overly optimistic. The workloads reported above were steady-state workloads.
  • Measuring workload throughput and response time is the best way to know whether SMT is providing value to the workload.
