Dispatching: Latency vs. Single-Processor Speed

Last revised: 2023-01-23, BKW

Abstract

In z/VM, dispatching of virtual CPUs often involves trade-offs between dispatch latency and single-processor speed. This article discusses some of those trade-offs and the knobs relevant to them.

SMT-2

Many factors influence the behavior of a workload. One such factor is the availability of parallelism in the partition. We especially began grappling with this on the z13 when we introduced SMT-2. On z13 and later, a partition of N logical IFL cores can have either N "faster" logical IFL processors or 2N "slower" logical IFL processors, according to the SMT setting. Students of SMT are well aware of this. They are also aware that the magnitude of the difference is very much a function of the workload.

Different workloads will respond differently to the two SMT settings. Workloads that do better when single-processor speed is higher will want non-SMT or SMT-1. Such workloads are those consisting of a fairly small number of independent, CPU-speed-sensitive virtual CPUs. Workloads that do better when the parallelism opportunity is higher will want SMT-2. Such workloads are those consisting of a larger number of virtual CPUs, said virtual CPUs either having dependencies upon one another's progress or being sensitive to dispatch delay rather than to processor speed.

With respect to the SMT facility, there are three possible SMT configurations: non-SMT, SMT-1, or SMT-2. Some specialists use the phrases "opt-out" or "opt-in" for the SMT facility; non-SMT is said to be "opted-out" while the other two are said to be "opted-in".

Whether to opt in for the SMT facility is controlled in the z/VM system configuration file. To opt out, code MULTITHREADING DISABLE in the config file. To opt in, code MULTITHREADING ENABLE. For the operands of ENABLE, refer to z/VM CP Planning and Administration. If the config file contains no MULTITHREADING statement, the system comes up opted-out. The opt-in choice can be changed only by changing the config file and re-IPLing.

When one opts in for SMT, one can change the system between SMT-1 and SMT-2 without an IPL. To do this, use the CP SET MULTITHREADING command. z/VM supports SMT-2 on only the IFL cores of the partition. When the administrator selects SMT-2, all core types will run at SMT-1 except the IFL cores, which will run at SMT-2.

One might think non-SMT and SMT-1 are functionally identical. This is not quite true. A non-SMT z/VM partition can use 80 logical cores. An SMT-1 z/VM partition can use only 40 logical cores. Observers will notice that in SMT-1 all logical processors have even-numbered identifiers: 0, 2, 4, and so on. The odd-numbered identifiers are reserved for use with SMT-2, and they will be found on only IFL cores.
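The numbering can be pictured with a small Python sketch. It assumes that logical core k owns identifiers 2k and 2k+1, which is consistent with the even-only numbering described above but is an illustration rather than a statement of CP internals. The function name and the core-type strings are hypothetical.

```python
def logical_cpu_ids(core_index, core_type, smt_level):
    """Return the logical processor IDs in use on one logical core.

    Assumption for illustration: core k owns IDs 2k (even) and 2k+1 (odd).
    smt_level is 1 or 2 and applies only to an opted-in (SMT) partition.
    """
    if smt_level not in (1, 2):
        raise ValueError("smt_level must be 1 or 2")
    even_id = 2 * core_index
    # Only IFL cores run two threads, and only when SMT-2 is selected.
    if smt_level == 2 and core_type == "IFL":
        return [even_id, even_id + 1]
    return [even_id]


if __name__ == "__main__":
    cores = ["CP", "IFL", "IFL"]  # hypothetical mixed partition
    for level in (1, 2):
        ids = [logical_cpu_ids(k, t, level) for k, t in enumerate(cores)]
        print(f"SMT-{level}: {ids}")
```

Under this assumption, the odd identifiers appear only on the IFL cores and only at SMT-2, while every other case uses the even identifier alone.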

Dispatch Heuristics

One of the factors that influences single-processor speed is how well the cache does in supporting the executing instruction stream. As the cache does better, CPI decreases and so the apparent single-processor speed rises. As the cache does worse, CPI increases and so the apparent single-processor speed falls. Workloads sensitive to single-processor speed will be sensitive to CPI.
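A short worked example makes the CPI relationship concrete. The clock rate and CPI values below are made-up illustration values, not measurements of any machine.

```python
def instructions_per_second(clock_hz, cpi):
    """Apparent single-processor speed: cycles per second / cycles per instruction."""
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    return clock_hz / cpi


if __name__ == "__main__":
    clock_hz = 5.0e9  # hypothetical clock rate
    for label, cpi in (("warm cache", 2.0), ("cold cache", 4.0)):
        rate = instructions_per_second(clock_hz, cpi)
        print(f"{label}: CPI {cpi} -> {rate:.2e} instructions/second")
```

With these illustrative numbers, doubling CPI halves the apparent speed even though the clock rate has not changed.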

On what does cache effectiveness depend? One factor is the amount of dispatch time a given virtual CPU spends on a given logical CPU before some other virtual CPU takes its turn on that logical CPU. As the time the virtual CPU spends on the logical CPU increases, the cache of that logical CPU "warms up" and becomes more effective. If that logical CPU switches virtual CPUs too frequently, the cache does not warm up sufficiently and so it does not support the workload as much as it otherwise might.

In bringing the IBM z13 to market, IBM learned that the single-processor behavior of that machine was very sensitive to the amount of support the cache gave to the workload, especially in SMT-2 configurations. After some experimentation, IBM built into the z/VM dispatcher some SMT-level-sensitive governance over a dispatching parameter called "the minor time slice." Owing to what it found, IBM configured z/VM so that in SMT-2 configurations the minor time slice is longer than it is in SMT-1 or non-SMT configurations. On the workloads we evaluated, this proved to be a good trade-off. The longer minor time slice gave the cache more chance to warm up, and single-processor behavior improved correspondingly. On our workloads, that better warm-up resulted in better overall performance.
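The intuition behind a longer minor time slice can be sketched with a toy model in Python. The model assumes that every dispatch loses a fixed amount of useful time to re-warming the cache. That assumption, the function name, and all of the numbers are hypothetical; real cache behavior and the real dispatcher are far more complicated.

```python
def useful_fraction(slice_us, warmup_us):
    """Fraction of a time slice left after a fixed cache warm-up cost.

    Toy model: every dispatch loses warmup_us of useful work re-warming the cache.
    """
    if slice_us <= 0 or warmup_us < 0:
        raise ValueError("slice_us must be positive and warmup_us non-negative")
    return max(0.0, (slice_us - warmup_us) / slice_us)


if __name__ == "__main__":
    warmup_us = 25  # hypothetical warm-up cost per dispatch
    for slice_us in (100, 400):  # hypothetical shorter and longer slices
        frac = useful_fraction(slice_us, warmup_us)
        print(f"slice {slice_us} us: useful fraction {frac:.4f}")
```

In this model the longer slice amortizes the warm-up cost over more useful work. The other side of the trade-off is that a virtual CPU waiting behind a longer slice may wait longer for its turn.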

Length of the minor time slice is not the only factor that influences the amount of support the cache gives to the workload. Another factor is something we might call "drag" or "switching." Suppose virtual CPU 2 has recently run on logical CPU 2 and is in-queue on logical CPU 2 waiting to run there again. Now suppose logical CPU 1 becomes available. Should logical CPU 1 "steal" virtual CPU 2 and run it, or not? The act of stealing might decrease virtual CPU 2's dispatch latency, but the steal also drags virtual CPU 2 to a logical processor where the cache might not be as warmed up. Dispatch latency might be improved, but single-processor speed might be harmed. The "right answer" here is not something the z/VM dispatcher can discern. Rather the "right answer" is what would work best for the workload. Does the workload want single-processor speed or does it want prompt dispatch? That is the question.

Built into the z/VM dispatcher are some configurable constants called "steal barriers." These constants, which we determined empirically over our usual regression workloads, inform the dispatcher how to make the trade-off between dispatch promptness and single-processor speed. Returning to the above example, how long should virtual CPU 2 have to wait for logical CPU 2 to service it before logical CPU 1 decides, "OK, he's waited long enough, I'll steal"? In addition to being a function of workload, the "right answer" is also a function of the topological distance between logical CPU 1 and logical CPU 2. A cross-drawer steal might be a lot more cache-painful than a within-chip steal, and so the former's constant, its delay or patience factor, should be larger.
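The idea of a distance-dependent patience factor can be sketched in Python as follows. This is only a toy model of the decision described above. The barrier values, the distance names, and the function are invented for illustration and do not reflect the actual constants or logic inside z/VM.

```python
# Hypothetical patience values in microseconds, keyed by topological distance.
# The real steal barriers are internal to z/VM; these numbers are invented.
STEAL_BARRIER_US = {
    "same_chip": 50,
    "same_drawer": 200,
    "cross_drawer": 800,
}


def should_steal(waited_us, distance):
    """Return True if the waiting virtual CPU has waited past the barrier for this distance."""
    return waited_us >= STEAL_BARRIER_US[distance]


if __name__ == "__main__":
    waited_us = 100  # how long the virtual CPU has been waiting in-queue
    for distance in STEAL_BARRIER_US:
        print(f"{distance}: steal after {waited_us} us? {should_steal(waited_us, distance)}")
```

Smaller barriers favor dispatch promptness, and larger barriers favor keeping a virtual CPU near its warm cache.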

We have recently seen some client SMT-2 workloads where dispatch latency was clearly more important than single-processor speed. At those clients we had good success in configuring z/VM to use the SMT-1 dispatcher constants even though the clients' configurations were SMT-2.

To recognize such situations, look at Perfkit FCX304 PRCLOG and compare it to Perfkit FCX301 DSVBKACT. The former gives CPU-busy as a function of logical CPU and of time. The latter tells us how deep the logical CPUs' dispatch queues tend to be. If FCX304 shows only moderate CPU-busy but FCX301 shows nontrivial dispatch queue depths, there is a good chance the queues are deep because logical CPUs are deciding to let waiting virtual CPUs wait rather than stealing them. In such a situation one might ask whether the correct trade-off is being made.
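The comparison can be expressed as a simple screening check in Python. The sketch assumes the CPU-busy and queue-depth values have already been transcribed by hand from the two reports; it does not parse Perfkit output. The thresholds are hypothetical and would need to be chosen for the workload at hand.

```python
from statistics import mean


def looks_like_intentional_delay(cpu_busy_pct, queue_depths,
                                 busy_limit_pct=60.0, depth_limit=1.0):
    """Flag a set of intervals where CPUs are only moderately busy yet queues are deep.

    cpu_busy_pct: per-interval CPU-busy percentages (e.g. transcribed from FCX304).
    queue_depths: per-interval mean dispatch queue depths (e.g. from FCX301).
    Thresholds are hypothetical and should be chosen per workload.
    """
    if not cpu_busy_pct or not queue_depths:
        raise ValueError("both sample lists must be non-empty")
    return mean(cpu_busy_pct) < busy_limit_pct and mean(queue_depths) > depth_limit


if __name__ == "__main__":
    busy = [40.0, 45.0, 50.0]   # made-up CPU-busy samples, percent
    depths = [2.0, 3.0, 2.5]    # made-up queue-depth samples
    print(looks_like_intentional_delay(busy, depths))
```

A True result is only a prompt to look more closely, not a diagnosis.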

Another set of data to consider comes not from Perfkit but rather from the application's behavior as reported in its logs. How is transaction response time doing? Is it too long? Is it failing to meet expectation? If the answers are "yes" and CPU-busy is only moderate, intentional dispatch delay might be the reason.

The SMT-1 dispatcher constants are reachable through a CP command. Here are the CP commands one can issue:
CP SET SYSCONTROL DISPATCH MODLEVEL 0 ; use the SMT-1 settings
CP SET SYSCONTROL DISPATCH MODLEVEL 1 ; use the SMT-2 settings

In truth, MODLEVEL 0 and MODLEVEL 1 are nicknames for packages of individual settings. Each package contains about ten settings. MODLEVEL 0 refers to the settings package used for non-SMT or SMT-1. MODLEVEL 1 refers to the settings package used for SMT-2.

To check which settings package your system is using, issue CP QUERY SYSCONTROL. If you see TSEARLY 0, your system is using MODLEVEL 0, the settings package for non-SMT or SMT-1. If you see any other value for TSEARLY, your system is using MODLEVEL 1, the settings package for SMT-2.

So, now to your question. Should you change the MODLEVEL setting? If you are on non-SMT or SMT-1, probably not. If you are on SMT-2, you probably should not change MODLEVEL without asking IBM its opinion. But if your system has moderate CPU utilization, and nontrivial dispatch queue depths, and you know from experience that your workload is more sensitive to dispatch latency than it is to single-processor speed, MODLEVEL 0 might be right for you. Feel free to collect MONWRITE data, open a case, and ask for guidance. Be prepared to discuss your application's success metrics and why you think your application's behavior is not meeting expectation.
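The guidance above can be restated as a small checklist in Python. This sketch only encodes the conditions named in this article. The function name, the argument names, and the returned wording are hypothetical, and the result is no substitute for asking IBM.

```python
def modlevel_suggestion(smt_level, moderate_cpu_busy,
                        nontrivial_queue_depth, latency_sensitive):
    """Restate the article's guidance as a checklist.

    smt_level: "non-SMT", 1, or 2.
    The three flags are the observer's own yes/no judgments.
    """
    if smt_level != 2:
        return "probably leave MODLEVEL alone"
    if moderate_cpu_busy and nontrivial_queue_depth and latency_sensitive:
        return "MODLEVEL 0 might be right; collect MONWRITE data and ask IBM"
    return "probably leave MODLEVEL alone unless IBM advises otherwise"


if __name__ == "__main__":
    print(modlevel_suggestion(2, True, True, True))
    print(modlevel_suggestion(1, True, True, True))
```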

Summary

Workload performance is a function of dispatch latency, single-processor speed, and other factors. In this article we considered the first two. Dispatching virtual CPUs often requires making trade-offs between them. The SMT setting and the MODLEVEL setting are knobs the system programmer can twist to affect dispatch latency and single-processor speed.

As always, the final arbiter of whether performance is acceptable is the behavior of the application. Keep track of transaction response time and other application-centric performance metrics. Tune to optimize application behavior, not to optimize some metric reported in a system performance product such as Perfkit.