As customers begin to deploy the z13 and z/VM for z13, they might want to consider the following items.
The z13: Notable Characteristics
The IBM z13 offers processor capacity improvements over the IBM zEC12. Understanding important aspects of how the machine is built will help customers to realize capacity gains in their own installations.
One way the z13 achieves capacity improvements is that its caches are larger than the zEC12's. Compared to the zEC12, the z13's L1 I-cache is 50% larger, its L1 D-cache is 33% larger, its L2 is 100% larger, its L3 is 33% larger, and its L4 is 25% larger. A workload with suitable memory reference habits will benefit from the increased cache sizes.
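As a rough sanity check on those percentages, here is a small Python sketch. The absolute sizes used below are assumptions drawn from IBM's published hardware specifications for these machines (L1 and L2 per core, L3 per chip, L4 per node), not from this article; verify them against current IBM documentation.

```python
# Illustrative arithmetic only: cache sizes in KB, taken from IBM's
# published specifications (an assumption of this sketch, not a fact
# stated in the article above).
zec12 = {"L1-I": 64, "L1-D": 96, "L2": 2 * 1024, "L3": 48 * 1024, "L4": 384 * 1024}
z13   = {"L1-I": 96, "L1-D": 128, "L2": 4 * 1024, "L3": 64 * 1024, "L4": 480 * 1024}

for level in zec12:
    growth = 100 * (z13[level] - zec12[level]) / zec12[level]
    print(f"{level}: {growth:.0f}% larger on z13")
```

The computed growth figures match the percentages quoted in the text.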
Another way the z13 achieves capacity improvements is that its CPU cores are multithreaded. Where on the zEC12 each CPU core ran exactly one stream of instructions, on the z13 each CPU core can run two instruction streams concurrently. This strategy lets the instruction streams share the resources of the core. The theory is that the two streams will usually not need exactly the same core resources at exactly the same instant. It follows that while one thread is using one part of the core, the other thread can use another part of it. This can result in higher productivity from the core, if the instruction streams' behaviors are sufficiently symbiotic.
Previous System z machines such as the zEC12 used a three-level topology to connect the cores and memory to one another: cores were on chips, chips were on nodes, and the nodes fitted into the machine's frame. On the z13 there is an additional layer in the hierarchy: cores are on chips, chips are on nodes, nodes fit into drawers, and the drawers are in turn connected together. This means that depending upon drawer boundaries, off-node L4 or off-node memory can be either on-drawer, which is closer, or off-drawer, which is farther away and therefore has longer access times.
In the next section we'll explore how to take these factors into account so as to achieve good performance from the z13.
How to Get Performance from the z13
To get good performance from the z13, it's necessary to think about how the machine works and then to adapt the workload's traits to exploit the machine's strengths.
For example, consider cache. A workload that stays within the z13's cache has a good chance of running well on z13. Use the CPU Measurement Facility host counters and the z/VM Download Library's CPUMF tool to observe your workload's behavior with respect to cache. See our CPU MF article for help on understanding a z/VM CPU MF host counters report. If the workload is spilling out of cache, perhaps rebalancing work among LPARs will help. One workload of ours that did very well on z13 had an L1 miss rate of about 1% and resolved about 85% of its L1 misses at L3 or higher.
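For a concrete sense of the two statistics just mentioned, the sketch below shows how an L1 miss rate and the fraction of misses resolved at L3 or closer might be computed from raw counts. The counter values and names here are hypothetical, chosen to reproduce the workload cited above; the actual CPU MF host counters and the columns of the CPUMF tool's report differ, so consult that tool's documentation.

```python
# Hypothetical counts; real CPU MF host counter names and report
# columns differ -- see the CPUMF tool's documentation.
instructions      = 1_000_000_000
l1_misses         = 10_000_000   # demand misses out of the L1
resolved_l2_or_l3 = 8_500_000    # misses satisfied at L2 or L3 (i.e., "L3 or higher")

l1_miss_rate = 100 * l1_misses / instructions
pct_near     = 100 * resolved_l2_or_l3 / l1_misses

print(f"L1 miss rate: {l1_miss_rate:.1f}% of instructions")
print(f"Resolved at L3 or closer: {pct_near:.0f}% of L1 misses")
```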
The amount of performance improvement a customer will see in moving from zEC12 to z13 depends heavily on the workload's cache footprint. A workload that stayed well within zEC12's cache might see only modest improvement on z13, because it will get no help from the increased z13 cache sizes. At the other end of the spectrum, a workload that grossly overflows cache on both machines similarly might see no benefit from z13. The best case is likely to be the workload that didn't fit well into zEC12 cache but does fit well into the increased caches on z13. Again, make use of the CPU MF host counters to observe your workload's cache behavior.
Another factor about z13 cache relates to multithreading. Yes, the L1s and L2s are larger than they were on zEC12. But when multithreading is enabled, the two threads of a core share the L1 and the L2. Switching the z13 from non-SMT mode to SMT-2 mode might well cause a change in the performance of the L1 or of the L2. This behavior is very much a function of the behavior of the workload.
Speaking of cache, it's good to mention here that one of the purposes of running vertically is to improve the behavior of the CPC's cache hierarchy, sometimes informally called the nest. When a partition uses vertical mode, PR/SM endeavors to place the partition's logical CPUs close to one another in the machine topology, and it also tries not to move a logical CPU in the topology from one dispatch to the next. These points are especially true for high-entitlement logical CPUs, called vertical highs, notated Vh. If you have not yet tried running vertically, consider at least trying it. Before you do, make sure you have good, workable measurements of your workload's behavior from horizontal mode. Then switch your partition to vertical, collect the same measurements, make a comparison, and decide for yourself how to proceed.
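For reference, a partition's polarization can be inspected and changed with the CP SRM controls; the sequence below is a sketch, and the exact syntax and authorization requirements should be verified against the CP Commands and Utilities Reference for your z/VM release.

```
cp query srm                          (display current SRM settings, including polarization)
cp set srm polarization vertical      (switch the partition to vertical mode)
cp set srm polarization horizontal    (switch back if your measurements favor horizontal)
```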
One consequence of multithreaded operation is that although the core might complete more instructions per unit of time, the two instruction streams themselves might respectively experience lower instruction completion rates than they might have experienced had they run alone on the core. This is akin to how a two-lane highway with speed limit 45 MPH can move more cars per second than can a one-lane highway with speed limit 60 MPH. In the two-lane case, the cars have slowed down, but the highway is doing more work.
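The highway analogy can be made concrete with a little arithmetic. The flow model below, in which throughput is simply proportional to lanes times speed, is deliberately simplistic and purely illustrative; real traffic, like real SMT, behaves less linearly.

```python
# Simplistic flow model: cars moved per unit time taken as
# (number of lanes) x (speed). The direction of the tradeoff is
# the point, not the exact numbers.
one_lane_throughput = 1 * 60   # one lane at 60 MPH
two_lane_throughput = 2 * 45   # two lanes at 45 MPH

print(two_lane_throughput > one_lane_throughput)  # the highway does more work
print(45 < 60)                                    # but each individual car is slower
```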
To get the most out of a multithreaded z13, the workload will need to be configured in such a way that it can get benefit out of a large number of instruction streams that might well individually be slower than previous machines' streams. A workload whose throughput hangs entirely on the throughput of a single software thread -- think virtual CPU of a z/VM guest -- might not do as well on a multithreaded z13 as it did on zEC12. But if the workload can be parallelized, so that a number of instruction streams concurrently contribute to its throughput, the workload might do better, core for core. To do well with a multithreaded z13, customers will need to examine the arrangement and configuration of their deployments and remove single-thread barriers.
Another consequence of multithreaded operation is that as the core approaches 200% busy -- that is, neither thread ever loads a wait PSW -- the opportunity for the threads' instruction streams to fit together synergistically can decrease. Customers might find that while they could run a single-threaded zEC12 core to very high percent-busy without concern, running a two-threaded z13 core to very high percent-busy might not produce desirable results. Watch the workload's performance as compared to percent-busy as reported by z/VM Performance Toolkit's FCX304 PRCLOG and make an adjustment if needed.
A further consequence of multithreaded operation is that owing to the reduced capacity of each thread, more threads -- read more logical CPUs -- might be required to achieve capacity equivalent to an earlier machine. Be aware, though, that adding logical CPUs increases parallelism and therefore has the potential to increase spin lock contention. Customers should pay attention to FCX265 LOCKLOG and FCX239 PROCSUM and contact IBM if spin lock contention rises to unacceptable levels.
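To sketch why more logical CPUs might be needed, suppose, purely hypothetically, that SMT-2 gives a core 1.25 times the capacity of a single-threaded core, so that each thread delivers roughly 0.625 of a single-threaded core. Matching the capacity of N single-threaded cores then takes more than N threads:

```python
import math

# Hypothetical figures for illustration only; the actual SMT benefit
# varies widely by workload (IBM cites roughly 10% to 30%).
smt_benefit  = 1.25             # SMT-2 core capacity vs. a single-threaded core
per_thread   = smt_benefit / 2  # capacity of one thread, about 0.625
old_capacity = 8                # capacity of 8 single-threaded cores

threads_needed = math.ceil(old_capacity / per_thread)
print(threads_needed)  # 13 threads, i.e. 7 SMT-2 cores, to match 8 single-threaded cores
```

More threads means more logical CPUs, which is exactly the parallelism that can aggravate spin lock contention.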
A single drawer of a z13 can hold at most 36 customer cores. Depending upon model number, the limit might be smaller. This means that as the number of cores defined for an LPAR increases, the LPAR might end up spanning a drawer boundary. Whether a drawer boundary poses a problem for the workload is very much a property of the workload's memory reference habits. If locality of reference is very good and use of global memory areas such as spin lockwords is very light, the drawer boundary might pose no problem at all. As the workload moves away from those traits, the drawer boundary might begin to pose a problem. Customers interested in using LPARs that cross drawer boundaries should pay very close attention to workload performance and CPU MF host counters reports to make sure the machine is running as desired. The FCX287 TOPOLOG report of z/VM Performance Toolkit details the topology of the LPAR. When interpreting TOPOLOG on a z13, keep in mind that nodes 1 and 2 are on drawer 1, nodes 3 and 4 are on drawer 2, and so on.
Owing to how the z13 assigns core types (CP, IFL, zIIP, etc.) to the physical cores of the machine, customers using mixed-engine LPARs might find the LPAR has been placed across drawers. This can be true even when the number of cores defined for the LPAR is small. Again, the FCX287 TOPOLOG report will reveal this. If a mixed-engine z/VM LPAR is not performing as expected, contact IBM.
During its runs of laboratory workloads IBM gained some experience with factors that might cause variability in what IBM calls the SMT benefit, that is, the capacity of a multithreaded z13 core compared to the capacity of a single-threaded z13 core. One factor that emerged was the percent of CPU-busy that was spent resolving Translation Lookaside Buffer (TLB) misses. In the z/VM CPU MF host counters reports produced by the CPUMF tool, the column T1CPU tabulates this value. It was our experience that as T1CPU increased, the SMT benefit decreased. For example, in one of our 16-core experiments, we saw an ITRR of 1.26 when we turned on multithreading. T1CPU for those workloads was about 8%. In another pair of 16-core experiments, we saw an ITRR of 1.08 when we turned on multithreading. T1CPU for those workloads was about 25%. Factors like this are why IBM marketing materials advertise the SMT benefit as likely falling into the range of 10% to 30%. In CPU MF host counters data z/VM customers have sent us, T1CPU tends to land in the neighborhood of 17% with standard deviation 6%. In other words, T1CPU tends to vary a lot.
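ITRR here is simply the ratio of internal throughput rates (ITR) measured with and without multithreading. The ITR values below are hypothetical, chosen only so that the ratios reproduce the two experiments described above.

```python
# Hypothetical ITR measurements; only the ratios correspond to the
# experiments described in the text.
runs = [
    {"t1cpu": 8,  "itr_smt1": 100.0, "itr_smt2": 126.0},
    {"t1cpu": 25, "itr_smt1": 100.0, "itr_smt2": 108.0},
]
for r in runs:
    itrr = r["itr_smt2"] / r["itr_smt1"]
    print(f"T1CPU {r['t1cpu']}%: ITRR {itrr:.2f}")
```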
As part of its evaluation of the z13 exploitation PTF, IBM did runs in SMT-2 mode with the recent CPU Pooling feature activated. We found that in SMT-2 mode there were some cases where the pool was slightly overlimited, that is, the guests in the pool were held back to an aggregate consumption slightly less than was specified on the command. We found this only for CAPACITY-limited CPU pools. IBM continues to study this issue. In the meantime, customers who experience this can compensate by adjusting the specified limit upward slightly so that the desired behavior is obtained. And speaking of adjusting limits, remember that anytime you feel you need to adjust your CPU Pooling limits, make sure the adjustments you make are within the limits of your capacity license.
In z/VM for z13, z/VM HiperDispatch no longer parks logical CPUs because of elevated T/V ratio. Provided Global Performance Data Control is enabled, the number of unparked logical CPUs is now determined solely on the basis of how much capacity it appears the LPAR will have at its disposal. When Global Performance Data Control is disabled, the number of unparked logical CPUs is determined by projected load ceiling plus CPUPAD.
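Conceptually, the calculation for the disabled case resembles the following sketch. The variable names, units, and rounding are illustrative assumptions; z/VM's actual projection and unpark logic is internal and more sophisticated than this.

```python
import math

# Illustrative only: not z/VM's actual algorithm. Utilization figures
# are in percent, where 100 represents one fully busy logical CPU.
projected_load_ceiling = 480   # projected peak CPU utilization, in percent
cpupad                 = 100   # CPUPAD setting, in percent of a CPU
logical_cpus_defined   = 8

unparked = min(logical_cpus_defined,
               math.ceil((projected_load_ceiling + cpupad) / 100))
print(unparked)  # 6 of the 8 defined logical CPUs unparked
```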
The Most Important Thing to Remember
Longtime z/VM performance expert Bill Bitner has a standard answer he gives when a customer asks whether his system is exhibiting good performance. Bill will often reply, "Well, that depends. What do you mean by 'performance', and what do you mean by 'good'?" Bill's answer is right on target, and with the coming of z13, perhaps it's even more so.
Kidding aside, understanding whether your workload is getting value out of z13 is entirely about whether you have taken time to do all of these things:
- Establish meaningful measures of performance for your workload;
- Establish success thresholds for those measures of performance;
- Routinely collect the values of those measures;
- Routinely compare the collected values to your success thresholds;
- Take corrective action if the collected values fall short.
Your measures of success might be as simple as transaction rate and transaction response time. If your business requires it, you might define different or additional measures. Whatever measures you pick, routinely collect and evaluate them. In this way you have the best chance of getting the performance you expect.