The following items improve performance:
- Scheduler Lock Improvement
- Queued I/O Assist
- Dispatcher Detection of Long-Term I/O
- Virtual Disk in Storage Frames Can Now Reside above 2G
- z/VM Virtual Switch
- TCP/IP Stack Improvements
- TCP/IP Device Layer MP Support
A number of CP functions use the scheduler lock to achieve the required multiprocessor serialization. Because of this, the scheduler lock can limit the capacity of high n-way configurations. With z/VM 4.4.0, the timer management functions no longer use the scheduler lock but instead use a new timer request block lock, thus reducing contention for the scheduler lock. Measurement results for three environments that were constrained by scheduler lock contention showed throughput improvements of 8%, 73%, and 270%. See Scheduler Lock Improvement and Linux Guest Crypto on z990 for results and further discussion.
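The general technique here is lock splitting: moving one class of work off a heavily contended global lock onto its own dedicated lock. The sketch below illustrates the idea in C with POSIX mutexes; the names and structure are invented for illustration and do not reflect CP's actual implementation.

```c
#include <pthread.h>

/* Illustrative sketch only: before the change, timer-queue maintenance
 * took the same global scheduler lock the dispatcher uses; afterwards it
 * takes a dedicated timer request block lock, so dispatching and timer
 * work no longer serialize on the same lock word. */

pthread_mutex_t scheduler_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t timer_lock     = PTHREAD_MUTEX_INITIALIZER;

int dispatch_ops = 0;   /* protected by scheduler_lock */
int timer_ops    = 0;   /* protected by timer_lock (new scheme)  */

/* Old scheme: every timer update contends with the dispatcher. */
void add_timer_request_old(void)
{
    pthread_mutex_lock(&scheduler_lock);
    timer_ops++;
    pthread_mutex_unlock(&scheduler_lock);
}

/* New scheme: timer updates take their own lock instead. */
void add_timer_request_new(void)
{
    pthread_mutex_lock(&timer_lock);
    timer_ops++;
    pthread_mutex_unlock(&timer_lock);
}

/* Dispatcher work still uses the scheduler lock, but now shares it
 * with fewer callers. */
void dispatch_guest(void)
{
    pthread_mutex_lock(&scheduler_lock);
    dispatch_ops++;
    pthread_mutex_unlock(&scheduler_lock);
}
```

On a high n-way machine, the win is that `add_timer_request_new` and `dispatch_guest` can proceed concurrently on different processors, where the old pairing forced them to queue on one lock.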
IBM introduced Queued Direct I/O (QDIO), a shared-memory I/O architecture for IBM zSeries computers, with its OSA Express networking adapter. Later I/O devices, such as HiperSockets and the Fibre Channel Protocol (FCP) adapter, also use QDIO.
In extending QDIO for HiperSockets, IBM revised the interrupt scheme so as to lighten the interrupt delivery process. The older, heavyweight interrupts, called PCI interrupts, were still used for the OSA Express and FCP adapters, but HiperSockets used a new, lighter interrupt scheme called adapter interrupts.
The new IBM eServer zSeries 990 (z990) and z/VM Version 4 Release 4 cooperate to provide important performance improvements to the QDIO architecture as regards QDIO interrupts. First, the z990 OSA Express adapter now uses adapter interrupts. This lets OSA Express adapters and FCP channels join HiperSockets in using these lighter interrupts and experiencing the attendant performance gains. Second, z990 millicode, when instructed by z/VM CP to do so, can deliver adapter interrupts directly to a running z/VM guest without z/VM CP intervening and without the running guest leaving SIE. This is similar to traditional IOASSIST for V=R guests, with the bonus that it applies to V=V guests. Third, when an adapter interrupt needs to be delivered to a nonrunning guest, the z990 informs z/VM CP of the identity of the nonrunning guest, rather than forcing z/VM CP to examine the QDIO data structures of all guests to locate the guest for which the interrupt is intended. This reduces z/VM CP processing per adapter interrupt.
Together, these three improvements benefit any guest that can process adapter interruptions. This includes all users of HiperSockets, and it also includes any guest operating system that adds adapter interruption support for OSA Express and FCP channels.
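The third improvement amounts to replacing a linear scan of every guest's QDIO structures with a direct lookup keyed by the identity the machine now reports. A minimal sketch of that difference, with invented names and an invented flag standing in for the real QDIO interruption indicators:

```c
/* Hypothetical sketch of the third improvement: locating the guest that
 * an adapter interrupt is intended for.  Names are invented; CP's real
 * data structures are not published in this form. */

#define MAX_GUESTS 64

struct guest {
    int has_pending_qdio;   /* stand-in for the guest's QDIO indicators */
};

struct guest guests[MAX_GUESTS];

/* Old scheme: CP must examine every guest's QDIO state to find the
 * interrupt's target -- O(number of guests) per interrupt. */
struct guest *find_target_by_scan(void)
{
    for (int i = 0; i < MAX_GUESTS; i++)
        if (guests[i].has_pending_qdio)
            return &guests[i];
    return 0;
}

/* New scheme: the machine reports the target guest's identity, so CP
 * does a constant-time lookup instead of scanning. */
struct guest *find_target_by_id(int reported_id)
{
    return &guests[reported_id];
}
```

With hundreds of network-enabled guests, eliminating the per-interrupt scan is where the reduction in CP processing per adapter interrupt comes from.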
Measurement results for data transfer between two Linux guests showed 2% to 5% reductions in total CPU requirements for HiperSockets connectivity and 8% to 18% CPU reductions for the Gigabit Ethernet case. See Queued I/O Assist for further information.
Traditionally, the z/VM CP dispatcher has contained an algorithm intended to hold a virtual machine in the dispatch list if the virtual machine had an I/O operation outstanding. The intent was to avoid dismantling virtual machine resources if an I/O interrupt was imminent.
When this algorithm was designed, the pending I/O operation almost always belonged to a disk drive. The I/O interrupt came very shortly after the I/O operation was started. Avoiding dismantling the virtual machine while the I/O was in flight was almost always the right decision.
Recent uses of z/VM to host large numbers of network-enabled guests, such as Linux guests, have shown a flaw in this algorithm. Guests that use network devices very often start a long-running I/O operation to the network device and then fall idle. A READ CCW applied to a CTC adapter is one example of an I/O that could turn out to be long-running. Depending on the kind of I/O device being used and the intensity of the network activity, the long-running I/O might complete in seconds, minutes, or perhaps even never complete at all.
As mentioned earlier, holding a virtual machine in the dispatch list tends to protect the physical resources being used to support its execution. Chief among these physical resources is real storage. As long as the z/VM system has plenty of real storage to support its workload, the idea that some practically-idle guests are remaining in the dispatch list and using real storage has little significance. However, as workload grows and real storage starts to become constrained, protection of the real storage being used by idling guests becomes a problem. Because Linux guests tend to try to use all of the virtual storage allocated to them, holding an idle Linux guest in the dispatch list is problematic.
The PTFs associated with APAR VM63282 change later releases of z/VM to exempt certain I/O devices from causing the guest to be held in the dispatch list while an I/O is in flight. When the appropriate PTF is applied, outstanding I/O to the following kinds of I/O devices no longer prevents the guest from being dropped from the dispatch list:
- Real or virtual CTCA (includes CTC, 3088, ESCON, and FICON)
- Real or virtual message processor devices
- Real OSA devices (LCS or QDIO mode)
- Real HiperSockets devices
- VM guest LAN devices (QDIO or HiperSockets)
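The change can be thought of as a device-class test applied to each outstanding I/O: network-style devices, whose I/Os may run indefinitely, no longer count toward holding the guest in the dispatch list. The sketch below is illustrative only; the device classes are invented stand-ins for the tests CP makes on its real device blocks.

```c
/* Illustrative sketch of the VM63282 behavior change. */

enum dev_type { DEV_DASD, DEV_CTCA, DEV_MSGPROC, DEV_OSA,
                DEV_HIPERSOCKETS, DEV_GUEST_LAN };

/* Return 1 if an outstanding I/O to this kind of device should no
 * longer hold the guest in the dispatch list. */
int io_exempt_from_dispatch_hold(enum dev_type t)
{
    switch (t) {
    case DEV_CTCA:          /* real or virtual CTCA */
    case DEV_MSGPROC:       /* message processor devices */
    case DEV_OSA:           /* real OSA, LCS or QDIO mode */
    case DEV_HIPERSOCKETS:  /* real HiperSockets */
    case DEV_GUEST_LAN:     /* VM guest LAN, QDIO or HiperSockets */
        return 1;           /* network I/O: possibly long-running */
    default:
        return 0;           /* e.g. DASD: interrupt expected shortly */
    }
}

/* A guest stays in the dispatch list only if some outstanding I/O is
 * to a non-exempt device. */
int hold_in_dispatch_list(const enum dev_type *outstanding, int n)
{
    for (int i = 0; i < n; i++)
        if (!io_exempt_from_dispatch_hold(outstanding[i]))
            return 1;
    return 0;
}
```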
The PTF numbers are:
IBM evaluated this algorithm change on a storage-constrained Linux guest HTTP serving workload. We found that the change did tend to increase the likelihood that a network-enabled guest would drop from the dispatch list.
As a side effect, we found that virtual machine state sampling data generated by the CP monitor tends to be more accurate when this PTF is applied. Monitor reports a virtual machine's state as "I/O active" before it reports other states. This is consistent with the historical view that an outstanding I/O is a short-lived phenomenon. Removing very-long-running I/Os from the sampler's field of view helps the CP monitor facility more accurately report virtual machine states.
Last, it is appropriate to note that other phenomena besides an outstanding I/O operation will tend to hold a guest in the dispatch list. Chief among these is something called the CP "test-idle timer". When a virtual machine falls idle -- that is, it has no instructions to run and it has no I/Os outstanding -- CP leaves the virtual machine in the dispatch list for 300 milliseconds (ms) before deciding to drop it from the dispatch list. Like the outstanding I/O algorithm, the intent of the test-idle timer is to prevent CP from excessively disturbing the real storage allocated to a guest that might run again "soon". Some guest operating systems, such as TPF and Linux, employ a timer tick (every 200 ms in the case of TPF, every 10 ms in the case of Linux without the timer patch) even when they are basically otherwise idle. This ticking timer tends to subvert CP's test-idle logic and leave such guests in queue when they could perhaps cope well with leaving. IBM is aware of this situation and is considering whether a change in the area of test-idle is appropriate. In the meantime, system programmers wanting to do their own experiments with the value of the test-idle timer can get IBM's SRMTIDLE package from our download page.
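The interaction between a guest's timer tick and the 300 ms test-idle window can be sketched numerically. The model below is a simplification with invented names, not CP's logic: it just shows that a guest whose tick interval is shorter than the test-idle value never accumulates enough idle time to be dropped.

```c
/* Simplified model of the test-idle decision.  Times in milliseconds. */

#define TEST_IDLE_MS 300   /* how long CP waits before dropping an idle guest */

/* Drop the guest once it has been continuously idle for TEST_IDLE_MS. */
int drop_from_dispatch_list(long now_ms, long last_active_ms)
{
    return (now_ms - last_active_ms) >= TEST_IDLE_MS;
}

/* Simulate an otherwise-idle guest whose timer ticks every tick_ms,
 * each tick counting as activity.  Return 1 if the guest is ever
 * dropped within horizon_ms. */
int ever_dropped_with_tick(int tick_ms, long horizon_ms)
{
    long last_active = 0;
    for (long now = 0; now <= horizon_ms; now++) {
        if (drop_from_dispatch_list(now, last_active))
            return 1;
        if (now % tick_ms == 0)
            last_active = now;   /* timer tick resets the idle clock */
    }
    return 0;
}
```

Under this model, a 10 ms tick (Linux without the timer patch) resets the idle clock 30 times per test-idle window, so the guest stays in the dispatch list indefinitely, while a guest that ticks less often than every 300 ms is dropped.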
Page frames used by the Virtual Disk in Storage facility can now reside above 2G in central storage. This change can improve the performance of z/VM systems that use virtual disks in storage and are currently constrained by the 2G line.
The z/VM Virtual Switch can be used to eliminate the need for a virtual machine to serve as a TCP/IP router between a set of virtual machines in a VM Guest LAN and a physical LAN that is reached through an OSA-Express adapter. With virtual switch, the router function is instead accomplished directly by CP. This can eliminate most of the CPU time that was used by the virtual machine router it replaces, resulting in a significant reduction in total system CPU time. Decreases ranging from 19% to 33% were observed for the measured environments when a TCP/IP VM router was replaced with virtual switch. Decreases ranging from 46% to 70% were observed when a Linux router was replaced with virtual switch. See z/VM Virtual Switch for results and further discussion.
CPU usage of the TCP/IP stack virtual machine was reduced substantially. A 16% reduction in total CPU time per MB was observed for the streaming workload (represents FTP-like bulk data transfer) and a 5% reduction was observed for the RR workload (represents Telnet activity). The largest improvements were observed for the CRR workload, which represents webserving workloads where each transaction includes a connect/disconnect pair. In that case, CPU/transaction decreased by 81%. See TCP/IP Stack Performance Improvements for measurement results and further discussion.
These improvements target the case where the TCP/IP stack is for a host system and are focused on the upper layers of the TCP/IP stack (TCP and UDP). As such, they complement the improvements made in z/VM 4.3.0, which were directed primarily at the case where TCP/IP VM is used as a router and were focused on the lower layers of the TCP/IP stack. See TCP/IP Stack Performance Improvements for results and discussion of those z/VM 4.3.0 improvements.
Prior to this release, TCP/IP VM had no virtual MP support and, as a result, any given TCP/IP stack virtual machine could only run on one real processor at a time. With TCP/IP 440, support has been added to allow device-specific processing to be done on virtual processors other than the base processor, which remains dedicated to the rest of the stack functions. CP can then dispatch these virtual processors on separate real processors when they are available. This increases the rate of work the stack virtual machine can handle before its base processor becomes fully utilized. For the measured Gigabit Ethernet and HiperSockets cases, throughput changes ranging from a 2% decrease to a 24% improvement were observed. See TCP/IP Device Layer MP Support for results and further discussion.
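The shape of this improvement is a familiar one: the base processor hands device-specific work to another processor instead of doing it inline. The sketch below models the two virtual processors as threads; the queue, names, and structure are invented for illustration and are not TCP/IP VM's implementation.

```c
#include <pthread.h>

/* Hypothetical sketch of device-layer offload: the base processor
 * enqueues device-specific work, and a separate device processor
 * (modeled as a thread) drains it, freeing the base processor for the
 * remaining stack functions. */

struct device_queue {
    pthread_mutex_t lock;
    int pending;     /* device work items handed off by the base CPU */
    int completed;   /* items processed by the device CPU */
};

/* Base processor: after protocol (TCP/UDP) work, hand device-specific
 * work off rather than processing it inline. */
void base_processor_enqueue(struct device_queue *q, int items)
{
    pthread_mutex_lock(&q->lock);
    q->pending += items;
    pthread_mutex_unlock(&q->lock);
}

/* Device processor: runs on its own (virtual) CPU and drains the queue. */
void *device_processor(void *arg)
{
    struct device_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    q->completed += q->pending;
    q->pending = 0;
    pthread_mutex_unlock(&q->lock);
    return 0;
}
```

When CP can place the two virtual processors on different real processors, the device work and the protocol work overlap, which is the source of the increased capacity before the base processor saturates.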