Queued Direct I/O Support
Starting with the IBM G5 CMOS family of processors, a new type of I/O facility called the Queued Direct I/O (QDIO) Hardware Facility is available. This facility resides on the OSA-Express adapter and supports the Gigabit Ethernet, ATM, and Fast Ethernet cards.
The QDIO facility is a functional element of S/390 and zSeries processors that allows a program (TCP/IP) to exchange data directly with an I/O device without performing traditional S/390 I/O instructions. Data transfer is initiated and performed by both the I/O device and TCP/IP referencing main storage directly through a set of data queues. Once TCP/IP establishes and activates the data queues, minimal processor intervention is required to perform the direct exchange of data. All data transfers are controlled and synchronized via a state-change-signaling protocol (state machine).
Because data transfer is controlled by a state machine rather than by a Start Subchannel instruction, the high hardware and software overhead associated with starting I/O and processing I/O interrupts for each data transfer is virtually eliminated. The overhead reduction realized with this direct memory-mapped interface gives TCP/IP the capability to support Gigabit-class network speeds without substantially impacting other system processing.
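The sketch below is not the QDIO architecture or its actual control blocks; it is a minimal illustration, under assumed buffer states (EMPTY, PRIMED, DONE), of how a stack and an adapter can exchange data by flipping buffer states in shared storage instead of issuing an I/O instruction and fielding an interrupt per transfer.

```python
# Conceptual sketch only: NOT the actual QDIO control structures.
# A shared per-buffer state table lets the "stack" and the "adapter"
# hand buffers back and forth with plain state changes.

from enum import Enum

class BufState(Enum):
    EMPTY = 0    # owned by the stack, available for reuse
    PRIMED = 1   # handed to the adapter for transfer
    DONE = 2     # transfer complete, buffer ready to be reclaimed

class Queue:
    def __init__(self, nbufs=4):
        self.state = [BufState.EMPTY] * nbufs
        self.data = [None] * nbufs

def stack_send(q, slot, payload):
    """Stack side: place data in a buffer and flip its state to PRIMED."""
    assert q.state[slot] is BufState.EMPTY
    q.data[slot] = payload
    q.state[slot] = BufState.PRIMED       # no I/O instruction, just a state change

def adapter_poll(q):
    """Adapter side: scan for PRIMED buffers, 'transfer' them, mark DONE."""
    for slot, st in enumerate(q.state):
        if st is BufState.PRIMED:
            # real hardware would DMA the buffer onto the LAN here
            q.state[slot] = BufState.DONE

def stack_reap(q):
    """Stack side: reclaim DONE buffers for reuse."""
    freed = 0
    for slot, st in enumerate(q.state):
        if st is BufState.DONE:
            q.data[slot] = None
            q.state[slot] = BufState.EMPTY
            freed += 1
    return freed

if __name__ == "__main__":
    q = Queue()
    stack_send(q, 0, b"packet 1")
    stack_send(q, 1, b"packet 2")
    adapter_poll(q)
    print("buffers reclaimed:", stack_reap(q))   # -> 2
```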
This section presents and discusses measurement results that assess the performance of the Gigabit Ethernet using the QDIO support included in TCP/IP Level 3A0.
Methodology: The workload driver is an internal tool that can be used to simulate bulk data transfers such as those done by FTP and Tivoli Storage Manager. The data are driven from the application layer of the TCP/IP protocol stack, so the entire networking infrastructure, including the OSA hardware and the TCP/IP protocol code, is measured. The driver moves data between client-side memory and server-side memory, eliminating outside bottlenecks such as DASD or tape.
A client-server pair was used in which the client received 10MB of data (inbound) or sent 10MB of data (outbound) with a one-byte response. Additional client-server pairs were added until maximum throughput was attained.
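The internal workload driver itself is not shown in this report; the sketch below is only a simplified stand-in for the measured traffic pattern (inbound case: the client receives 10MB and sends a one-byte response, memory to memory). The host address, port, and chunk size are illustrative assumptions.

```python
# Minimal stand-in for the bulk-transfer pattern (not the internal IBM driver).
import socket
import threading

TEN_MB = 10 * 1024 * 1024          # assumed to mean 10 * 2**20 bytes
HOST, PORT = "127.0.0.1", 5051     # hypothetical local test endpoint
ready = threading.Event()

def server():
    with socket.create_server((HOST, PORT)) as srv:
        ready.set()
        conn, _ = srv.accept()
        with conn:
            conn.sendall(b"\0" * TEN_MB)   # push 10MB from server memory
            conn.recv(1)                   # wait for the one-byte response

def client():
    ready.wait()
    with socket.create_connection((HOST, PORT)) as sock:
        received = 0
        while received < TEN_MB:
            chunk = sock.recv(65536)
            if not chunk:
                break
            received += len(chunk)         # data discarded: memory-to-memory
        sock.sendall(b"\1")                # one-byte response
        print(f"received {received} bytes")

if __name__ == "__main__":
    t = threading.Thread(target=server)
    t.start()
    client()
    t.join()
```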
Figure 1. QDIO Performance Run Environment
Figure 1 shows two virtual machines on a single 9672-ZZ7 processor under CP: VM_s (the server machine) and VM_c (the client machine). Each runs a TCP/IP stack with four server or client instances (1-4), and each stack attaches to its own OSA adapter (OSA_s and OSA_c). Both adapters connect to the same Gigabit Ethernet LAN.
Each performance run, starting with 1 client-server pair and progressing to 4 client-server pairs, consisted of starting the server(s) on VM_s and then starting the client(s) on VM_c. The client(s) repeatedly sent 10MB of data (outbound case) or received 10MB of data (inbound case) for 200 seconds. Monitor data were collected for 150 seconds of that period, and only on the client machine. Another internal tool, called PSTATS, was used to gather performance data for the OSA adapters.
At least 3 measurement trials were taken for each case, and a representative trial was chosen for the results shown here. A complete set of runs was done with the maximum transmission unit (MTU) set to 1500, and another set with the MTU set to 8992. The CP monitor data for each measurement were reduced by VMPRF and by an exec that extracts pertinent fields from the TCP/IP APPLDATA monitor records (subrecords x'00' and x'04').
Throughput Results: The throughput results are summarized in Figure 2.
Figure 2. QDIO Throughput
The graphs show maximum throughputs ranging from 30 MB/sec to 48 MB/sec, depending on the case, compared to the maximum of about 125 MB/sec for the Gigabit Ethernet transport itself. CPU utilization in the TCPIP stack virtual machine is the primary constraint: the stack virtual machine can only run on one processor at a time. This limitation can be avoided by distributing the network across two or more TCPIP stack virtual machines. If the stack machine bottleneck were removed, the next bottleneck would likely be the adapter PCI bus capacity, assuming both stack machines share the adapter.
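A quick back-of-the-envelope check of the figures quoted above (decimal megabytes assumed for the link rate):

```python
# Gigabit Ethernet link rate vs. the observed throughput range.
line_rate_MBps = 1_000_000_000 / 8 / 1_000_000   # ~125 MB/sec
for observed in (30, 48):
    print(f"{observed} MB/sec is {observed / line_rate_MBps:.0%} of the link rate")
# -> 30 MB/sec is 24% of the link rate
# -> 48 MB/sec is 38% of the link rate
```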
With 4 clients, the throughput rate appeared to be leveling off, suggesting that running 5 clients would not give better throughput. We verified this by running 5 clients for the Inbound 1500 MTU case, and it was indeed true.
Better throughput was realized with 8992 MTU because operation is more efficient with the larger packet size: fewer packets need to be exchanged, and much of the overhead is incurred on a per-packet basis.
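A rough illustration of the per-packet effect, assuming a TCP payload of MTU minus 40 bytes of IP and TCP headers (an assumption of this sketch, not a measured value):

```python
# Approximate packet counts for one 10MB transfer at each MTU.
TEN_MB = 10 * 1024 * 1024
for mtu in (1500, 8992):
    packets = TEN_MB / (mtu - 40)       # assumed 40-byte IP+TCP header
    print(f"MTU {mtu}: about {packets:,.0f} packets per 10MB transfer")
# -> MTU 1500: about 7,182 packets per 10MB transfer
# -> MTU 8992: about 1,171 packets per 10MB transfer
```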
Detailed Results: The following four tables contain additional results for each of the cases. Each table consists of two parts: the absolute results, followed by the results expressed as ratios relative to the 1 client-server pair base case. The names used for the fields in the tables are the same as the names in the monitor data from which they were obtained. Each field is explained below; a small worked example of the derived CPU fields follows the table.
| Field | Description |
|---|---|
| MB/sec (workload driver) | The workload driver produces a report giving how many megabytes of data were sent (or received) per second during the run. It represents throughput from the application point of view and is the measure shown in the Figure 2 graphs. |
| MB/sec (monitor) | Calculated by adding ioInOctets and ioOutOctets (see footnote 1) to get total bytes received and sent per second, then dividing by 1Meg. It represents total throughput, including header bytes. |
| TCPIP CPU Utilization | Calculated from the TCPIP entry in the USER_RESOURCE_UTILIZATION VMPRF report (see footnote 2). Total CPU seconds for the TCPIP stack virtual machine divided by total elapsed time gives the percentage of the time used by TCPIP. |
| TCPIP total CPU msec/MB | Calculated from two of the previous fields (TCPIP CPU Utilization divided by MB/sec (workload driver)) to show the number of TCPIP CPU milliseconds used per megabyte of data. |
| TCPIP virtual CPU msec/MB | Calculated in the same manner as TCPIP total CPU msec/MB, using virtual CPU seconds for the TCPIP stack from the USER_RESOURCE_UTILIZATION VMPRF report. |
| TCPIP CP CPU msec/MB | Total CPU msec/MB minus virtual CPU msec/MB, yielding the time spent in CP on TCP/IP's behalf. |
| OSA_c and OSA_s CPU (pstats) | The CPU utilization, as reported by PSTATS, for the CPU on the OSA adapters. Several readings were taken during each run; these numbers are the average. |
| OSA_c and OSA_s PCI (pstats) | The utilization, as reported by PSTATS, of the OSA adapter PCI bus. Again, several readings were taken during each run and then averaged. |
| FpspSize | The number of locked 4K pages held by the Fixed Page Storage Manager (see footnote 3), reported in TCP/IP Application Monitor record sub-record x'04'. |
| FpspInUse | The number of locked 4K pages currently allocated to users of the fixed page storage pool, reported in TCP/IP Application Monitor record sub-record x'04'. |
| ioInOctets/ioDirectRead | Both values are found in TCP/IP Application Monitor record sub-record x'00'. This calculation shows the number of bytes per QDIO inbound data transfer. |
| ioOutOctets/ioDirectWrite | Both values are found in TCP/IP Application Monitor record sub-record x'00'. This calculation shows the number of bytes per QDIO outbound data transfer. |
| ioDirectReads/MB | The same TCP/IP Application Monitor record information as above, divided by MB/sec from the workload driver, showing the number of QDIO inbound data transfers per megabyte. |
| ioDirectWrites/MB | The same TCP/IP Application Monitor record information as above, divided by MB/sec from the workload driver, showing the number of QDIO outbound data transfers per megabyte. |
| QDIOpolls/MB | QDIOpolls is found in TCP/IP Application Monitor record sub-record x'00' and contains the number of QDIO polling operations. This calculated value shows the number of polls per megabyte. |
| ioPCI/MB | ioPCI is found in TCP/IP Application Monitor record sub-record x'00' and contains the number of PCI interrupts. This calculated value shows the number of interrupts per megabyte. |
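As a worked example of the derived CPU fields, the following sketch reproduces the 1-client Outbound 1500 MTU values from Table 1 (it assumes utilization is expressed as a percentage of one processor):

```python
# Derived CPU fields, using Table 1, 1-client values.
tcpip_cpu_util_pct = 95.3          # TCPIP CPU Utilization (percent)
mb_per_sec_driver = 29.98          # MB/sec (workload driver), same run

# utilization% * 10 gives CPU msec consumed per elapsed second
total_msec_per_mb = tcpip_cpu_util_pct * 10 / mb_per_sec_driver
print(f"TCPIP total CPU msec/MB ~ {total_msec_per_mb:.1f}")      # table shows 31.8

# CP time is what remains after subtracting the virtual-machine time
virt_msec_per_mb = 23.8            # TCPIP virtual CPU msec/MB from the table
print(f"TCPIP CP CPU msec/MB    ~ {total_msec_per_mb - virt_msec_per_mb:.1f}")  # 8.0
```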
Table 1. QDIO Run: Outbound 1500 MTU
Absolute results:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Run ID | QNO11 | QNO21 | QNO32 | QNO42 |
| MB/sec (workload driver) | 29.98 | 31.4 | 33.39 | 33.58 |
| MB/sec (monitor) | 31.82 | 33.38 | 35.45 | 35.54 |
| TCPIP CPU Utilization | 95.3 | 94.7 | 94.7 | 94.7 |
| TCPIP total CPU msec/MB | 31.8 | 30.1 | 28.4 | 28.2 |
| TCPIP virtual CPU msec/MB | 23.8 | 22.3 | 20.6 | 20.3 |
| TCPIP CP CPU msec/MB | 8.0 | 7.9 | 7.8 | 7.9 |
| OSA_c_CPU (pstats) | 55 | 42 | 35 | 37 |
| OSA_c_PCI (pstats) | 49 | 51 | 53 | 52 |
| OSA_s_CPU (pstats) | 80 | 81 | 82 | 82 |
| OSA_s_PCI (pstats) | 56 | 59 | 62 | 65 |
| FpspSize | 295 | 295 | 303 | 303 |
| FpspInUse | 163 | 154 | 162 | 151 |
| ioInOctets/ioDirectRead | 86 | 167 | 220 | 216 |
| ioOutOctets/ioDirectWrite | 17288 | 17681 | 18922 | 19710 |
| ioDirectReads/MB | 41.0 | 29.2 | 19.3 | 18.5 |
| ioDirectWrites/MB | 64.2 | 62.8 | 58.6 | 56.1 |
| QDIOpolls/MB | 100.4 | 56.9 | 36.7 | 36.0 |
| ioPCI/MB | 36.2 | 25.0 | 15.9 | 14.8 |

Ratios relative to the 1 client-server pair base case:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MB/sec (workload driver) | 1.000 | 1.047 | 1.114 | 1.120 |
| MB/sec (monitor) | 1.000 | 1.049 | 1.114 | 1.117 |
| TCPIP CPU Utilization | 1.000 | 0.994 | 0.994 | 0.994 |
| TCPIP total CPU msec/MB | 1.000 | 0.947 | 0.893 | 0.887 |
| TCPIP virtual CPU msec/MB | 1.000 | 0.937 | 0.866 | 0.853 |
| TCPIP CP CPU msec/MB | 1.000 | 0.988 | 0.975 | 0.988 |
| OSA_c_CPU (pstats) | 1.000 | 0.764 | 0.636 | 0.673 |
| OSA_c_PCI (pstats) | 1.000 | 1.041 | 1.082 | 1.061 |
| OSA_s_CPU (pstats) | 1.000 | 1.013 | 1.025 | 1.025 |
| OSA_s_PCI (pstats) | 1.000 | 1.054 | 1.107 | 1.161 |
| FpspSize | 1.000 | 1.000 | 1.027 | 1.027 |
| FpspInUse | 1.000 | 0.945 | 0.994 | 0.926 |
| ioInOctets/ioDirectRead | 1.000 | 1.942 | 2.558 | 2.512 |
| ioOutOctets/ioDirectWrite | 1.000 | 1.023 | 1.095 | 1.140 |
| ioDirectReads/MB | 1.000 | 0.712 | 0.471 | 0.451 |
| ioDirectWrites/MB | 1.000 | 0.978 | 0.913 | 0.874 |
| QDIOpolls/MB | 1.000 | 0.567 | 0.366 | 0.359 |
| ioPCI/MB | 1.000 | 0.691 | 0.439 | 0.409 |

Note: Gigabit Ethernet; 9672-ZZ7; z/VM 3.1.0 with 31-bit CP; TCP/IP 3A0; databufferpoolsize 32760; databufferlimits 10 10
Not included in these tables, but observed during the runs, were the following average inbound and outbound packet sizes (in bytes):
| Average packet size | Inbound Case | Outbound Case |
|---|---|---|
| inbound packet, 1500 MTU | 1532 | 84 |
| outbound packet, 1500 MTU | 84 | 1532 |
| inbound packet, 8992 MTU | 9016 | 84 |
| outbound packet, 8992 MTU | 84 | 9015 |
TCPIP utilization is either as high as it can go, or close to it, even for one client. The reason throughput increases as clients are added is that stack efficiency increases with more clients. This is due to piggybacking effects, as seen in ioDirectReads/MB and ioDirectWrites/MB: for example, one client requires 41.0 direct reads for every megabyte, while four clients require only 18.5.
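A rough cross-check of these Table 1 numbers, assuming the driver's megabytes are binary (2**20 bytes): the extra bytes seen by the monitor line up with the per-write octet counts.

```python
# Consistency check for the 1-client Outbound 1500 MTU run in Table 1.
bytes_per_write = 17288      # ioOutOctets/ioDirectWrite
writes_per_mb = 64.2         # ioDirectWrites/MB
driver_mb = 29.98            # MB/sec (workload driver)
monitor_mb = 31.82           # MB/sec (monitor), includes protocol headers

# bytes actually sent per driver megabyte (inbound ACK traffic is negligible)
sent_per_driver_mb = bytes_per_write * writes_per_mb / 2**20
print(f"bytes sent per driver MB: {sent_per_driver_mb:.2f} MB")   # ~1.06
print(f"monitor/driver ratio:     {monitor_mb / driver_mb:.2f}")  # ~1.06
```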
The performance statistics for the adapter cards show that the CPU utilizations for both OSA_c and OSA_s are either flat or improving, while the statistics for the PCI bus show increases that are approximately proportional to the throughput rate.
Table 2. QDIO Run: Inbound 1500 MTU
Absolute results:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Run ID | QNI11 | QNI22 | QNI32 | QNI42 |
| MB/sec (workload driver) | 13.3 | 28.11 | 29.92 | 30.38 |
| MB/sec (monitor) | 13.79 | 30.00 | 32.44 | 32.23 |
| TCPIP CPU Utilization | 36.7 | 80.0 | 85.3 | 84.7 |
| TCPIP total CPU msec/MB | 27.6 | 28.5 | 28.5 | 27.9 |
| TCPIP virtual CPU msec/MB | 19.5 | 20.5 | 20.5 | 19.7 |
| TCPIP CP CPU msec/MB | 8.0 | 8.0 | 8.0 | 8.1 |
| OSA_c_CPU (pstats) | 20 | 61 | 67 | 73 |
| OSA_c_PCI (pstats) | 35 | 54 | 49 | 61 |
| OSA_s_CPU (pstats) | 16 | 31 | 37 | 32 |
| OSA_s_PCI (pstats) | 32 | 47 | 52 | 52 |
| FpspSize | 303 | 303 | 303 | 303 |
| FpspInUse | 152 | 167 | 167 | 152 |
| ioInOctets/ioDirectRead | 25176 | 21995 | 23948 | 25823 |
| ioOutOctets/ioDirectWrite | 84 | 145 | 144 | 134 |
| ioDirectReads/MB | 43.1 | 50.7 | 47.3 | 43.0 |
| ioDirectWrites/MB | 31.1 | 27.9 | 25.7 | 24.7 |
| QDIOpolls/MB | 75.7 | 68.7 | 66.8 | 64.8 |
| ioPCI/MB | 32.0 | 31.6 | 31.1 | 30.2 |

Ratios relative to the 1 client-server pair base case:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MB/sec (workload driver) | 1.000 | 2.114 | 2.250 | 2.284 |
| MB/sec (monitor) | 1.000 | 2.175 | 2.352 | 2.337 |
| TCPIP CPU Utilization | 1.000 | 2.180 | 2.324 | 2.308 |
| TCPIP total CPU msec/MB | 1.000 | 1.033 | 1.033 | 1.011 |
| TCPIP virtual CPU msec/MB | 1.000 | 1.051 | 1.051 | 1.010 |
| TCPIP CP CPU msec/MB | 1.000 | 1.000 | 1.000 | 1.013 |
| OSA_c_CPU (pstats) | 1.000 | 3.050 | 3.350 | 3.650 |
| OSA_c_PCI (pstats) | 1.000 | 1.543 | 1.400 | 1.743 |
| OSA_s_CPU (pstats) | 1.000 | 1.938 | 2.313 | 2.000 |
| OSA_s_PCI (pstats) | 1.000 | 1.469 | 1.625 | 1.625 |
| FpspSize | 1.000 | 1.000 | 1.000 | 1.000 |
| FpspInUse | 1.000 | 1.099 | 1.099 | 1.000 |
| ioInOctets/ioDirectRead | 1.000 | 0.874 | 0.951 | 1.026 |
| ioOutOctets/ioDirectWrite | 1.000 | 1.726 | 1.714 | 1.595 |
| ioDirectReads/MB | 1.000 | 1.176 | 1.097 | 0.998 |
| ioDirectWrites/MB | 1.000 | 0.897 | 0.826 | 0.794 |
| QDIOpolls/MB | 1.000 | 0.908 | 0.882 | 0.856 |
| ioPCI/MB | 1.000 | 0.988 | 0.972 | 0.944 |

Note: Gigabit Ethernet; 9672-ZZ7; z/VM 3.1.0 with 31-bit CP; TCP/IP 3A0; databufferpoolsize 32760; databufferlimits 10 10
The inbound case had much lower throughput with one client than the outbound case. The reason for this is not understood at this time. The multi-client inbound throughputs are more similar to the corresponding outbound throughputs.
Unlike the outbound run, TCPIP efficiency did not increase appreciably with an increasing number of clients. Instead, the increase in throughput comes from the overlapping of requests from multiple clients.
Table 3. QDIO Run: Outbound 8992 MTU
Absolute results:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Run ID | QJO11 | QJO21 | QJO31 | QJO41 |
| MB/sec (workload driver) | 35.39 | 43.63 | 47.53 | 48.13 |
| MB/sec (monitor) | 35.98 | 44.14 | 47.94 | 48.39 |
| TCPIP CPU Utilization | 88.0 | 92.7 | 91.3 | 88.7 |
| TCPIP total CPU msec/MB | 24.9 | 21.2 | 19.2 | 18.4 |
| TCPIP virtual CPU msec/MB | 16.6 | 13.1 | 11.1 | 10.4 |
| TCPIP CP CPU msec/MB | 8.3 | 8.1 | 8.1 | 8.0 |
| OSA_c_CPU (pstats) | 22 | 18 | 11 | 9 |
| OSA_c_PCI (pstats) | 54 | 63 | 65 | 67 |
| OSA_s_CPU (pstats) | 24 | 28 | 23 | 24 |
| OSA_s_PCI (pstats) | 52 | 61 | 67 | 68 |
| FpspSize | 279 | 279 | 347 | 421 |
| FpspInUse | 151 | 167 | 160 | 220 |
| ioInOctets/ioDirectRead | 86 | 126 | 156 | 160 |
| ioOutOctets/ioDirectWrite | 16734 | 25995 | 33137 | 36660 |
| ioDirectReads/MB | 40.3 | 25.7 | 13.9 | 10.0 |
| ioDirectWrites/MB | 63.5 | 40.7 | 31.9 | 28.7 |
| QDIOpolls/MB | 97.5 | 61.6 | 31.5 | 21.0 |
| ioPCI/MB | 32.8 | 21.8 | 11.4 | 8.1 |

Ratios relative to the 1 client-server pair base case:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MB/sec (workload driver) | 1.000 | 1.233 | 1.343 | 1.360 |
| MB/sec (monitor) | 1.000 | 1.227 | 1.332 | 1.345 |
| TCPIP CPU Utilization | 1.000 | 1.053 | 1.038 | 1.008 |
| TCPIP total CPU msec/MB | 1.000 | 0.851 | 0.771 | 0.739 |
| TCPIP virtual CPU msec/MB | 1.000 | 0.789 | 0.669 | 0.627 |
| TCPIP CP CPU msec/MB | 1.000 | 0.976 | 0.976 | 0.964 |
| OSA_c_CPU (pstats) | 1.000 | 0.818 | 0.500 | 0.409 |
| OSA_c_PCI (pstats) | 1.000 | 1.167 | 1.204 | 1.241 |
| OSA_s_CPU (pstats) | 1.000 | 1.167 | 0.958 | 1.000 |
| OSA_s_PCI (pstats) | 1.000 | 1.173 | 1.288 | 1.308 |
| FpspSize | 1.000 | 1.000 | 1.244 | 1.509 |
| FpspInUse | 1.000 | 1.106 | 1.060 | 1.457 |
| ioInOctets/ioDirectRead | 1.000 | 1.465 | 1.814 | 1.860 |
| ioOutOctets/ioDirectWrite | 1.000 | 1.553 | 1.980 | 2.191 |
| ioDirectReads/MB | 1.000 | 0.638 | 0.345 | 0.248 |
| ioDirectWrites/MB | 1.000 | 0.641 | 0.502 | 0.452 |
| QDIOpolls/MB | 1.000 | 0.632 | 0.323 | 0.215 |
| ioPCI/MB | 1.000 | 0.665 | 0.348 | 0.247 |

Note: Gigabit Ethernet; 9672-ZZ7; z/VM 3.1.0 with 31-bit CP; TCP/IP 3A0; databufferpoolsize 32760; databufferlimits 10 10
The 8992 MTU runs gave higher efficiencies than the 1500 MTU case. Notice that the number of direct reads per megabyte went from 18.5 for 4 clients with 1500 MTU to 10.0 for the same number of clients with 8992 MTU. Similar results were seen for the direct writes.
These efficiencies can also be seen in the performance statistics for the adapter card, as well as in TCPIP total CPU milliseconds per megabyte: the numbers are significantly lower than the corresponding numbers for the 1500 MTU case.
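For example, comparing the 4-client outbound cases in Tables 1 and 3:

```python
# TCPIP total CPU msec/MB, 4-client outbound runs (Tables 1 and 3).
msec_per_mb_1500 = 28.2
msec_per_mb_8992 = 18.4
saving = 1 - msec_per_mb_8992 / msec_per_mb_1500
print(f"8992 MTU uses {saving:.0%} less TCPIP CPU per megabyte")   # ~35%
```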
In addition to the differences noted, there are also similarities between the 8992 MTU case and the 1500 MTU case. Each shows that the TCPIP stack is the limiting factor and that increasing the number of clients brings a proportional increase in efficiency.
Table 4. QDIO Run: Inbound 8992 MTU
Absolute results:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Run ID | QJI11 | QJI21 | QJI31 | QJI41 |
| MB/sec (workload driver) | 34.6 | 44.68 | 44.72 | 44.61 |
| MB/sec (monitor) | 35.07 | 45.12 | 45.20 | 45.07 |
| TCPIP CPU Utilization | 77.3 | 92.7 | 91.3 | 91.3 |
| TCPIP total CPU msec/MB | 22.4 | 20.7 | 20.4 | 20.5 |
| TCPIP virtual CPU msec/MB | 14.8 | 13.3 | 12.8 | 12.9 |
| TCPIP CP CPU msec/MB | 7.5 | 7.5 | 7.6 | 7.6 |
| OSA_c_CPU (pstats) | 18 | 23 | 21 | 20 |
| OSA_c_PCI (pstats) | 51 | 63 | 65 | 64 |
| OSA_s_CPU (pstats) | 10 | 10 | 8 | 7 |
| OSA_s_PCI (pstats) | 52 | 63 | 63 | 63 |
| FpspSize | 421 | 421 | 421 | 421 |
| FpspInUse | 152 | 184 | 168 | 167 |
| ioInOctets/ioDirectRead | 15888 | 16177 | 15304 | 14872 |
| ioOutOctets/ioDirectWrite | 84 | 105 | 129 | 152 |
| ioDirectReads/MB | 66.7 | 65.4 | 69.2 | 71.1 |
| ioDirectWrites/MB | 28.6 | 17.0 | 11.0 | 8.3 |
| QDIOpolls/MB | 81.7 | 59.5 | 51.5 | 49.8 |
| ioPCI/MB | 35.9 | 28.3 | 25.5 | 24.7 |

Ratios relative to the 1 client-server pair base case:

| Client-Server Pairs | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| MB/sec (workload driver) | 1.000 | 1.291 | 1.292 | 1.289 |
| MB/sec (monitor) | 1.000 | 1.287 | 1.289 | 1.285 |
| TCPIP CPU Utilization | 1.000 | 1.199 | 1.181 | 1.181 |
| TCPIP total CPU msec/MB | 1.000 | 0.924 | 0.911 | 0.915 |
| TCPIP virtual CPU msec/MB | 1.000 | 0.899 | 0.865 | 0.872 |
| TCPIP CP CPU msec/MB | 1.000 | 1.000 | 1.013 | 1.013 |
| OSA_c_CPU (pstats) | 1.000 | 1.278 | 1.167 | 1.111 |
| OSA_c_PCI (pstats) | 1.000 | 1.235 | 1.275 | 1.255 |
| OSA_s_CPU (pstats) | 1.000 | 1.000 | 0.800 | 0.700 |
| OSA_s_PCI (pstats) | 1.000 | 1.212 | 1.212 | 1.212 |
| FpspSize | 1.000 | 1.000 | 1.000 | 1.000 |
| FpspInUse | 1.000 | 1.211 | 1.105 | 1.099 |
| ioInOctets/ioDirectRead | 1.000 | 1.018 | 0.963 | 0.936 |
| ioOutOctets/ioDirectWrite | 1.000 | 1.250 | 1.536 | 1.810 |
| ioDirectReads/MB | 1.000 | 0.981 | 1.037 | 1.066 |
| ioDirectWrites/MB | 1.000 | 0.594 | 0.385 | 0.290 |
| QDIOpolls/MB | 1.000 | 0.728 | 0.630 | 0.610 |
| ioPCI/MB | 1.000 | 0.788 | 0.710 | 0.688 |

Note: Gigabit Ethernet; 9672-ZZ7; z/VM 3.1.0 with 31-bit CP; TCP/IP 3A0; databufferpoolsize 32760; databufferlimits 10 10
A single client with 8992 MTU has the same low throughput and utilization characteristics as the 1500 MTU case, although not as pronounced.
Maximum throughput was essentially reached at two clients. Note also that TCPIP CPU utilization does not increase beyond 2 clients.
Footnotes:
1. ioInOctets and ioOutOctets are found in the TCP/IP Application Monitor Data, sub-record x'00'. The description of these, and of all other data found in the TCP/IP Application Monitor Data, can be found in the z/VM: Performance book, Appendix F.
2. This is described in Chapter 6 of the Performance Reporting Facility User's Guide and Reference.
3. This service, new to TCP/IP Level 3A0, provides a storage pool of 4K pages that have been locked by Diagnose 98. QDIO uses the Fixed Page Storage Manager (FPSM) to provide a cache of locked pages for fast I/O buffer management. Because a typical QDIO data transfer is faster than traditional I/O, the active life of a data buffer is short, so the same buffers are quickly reused for the next data transfer. An optional statement in the TCP/IP configuration file, FIXEDPAGESTORAGEPOOL, can be used to manually tune the FPSM. (A conceptual sketch of this kind of buffer pool follows.)
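The sketch below is purely conceptual and does not reflect the actual FPSM interfaces or control blocks; it only illustrates the idea of a pool of pre-locked 4K pages that are handed out for I/O buffers and quickly returned for reuse, so each transfer avoids a lock/unlock cost.

```python
# Conceptual buffer-pool sketch (not the actual FPSM).
PAGE = 4096

class FixedPagePool:
    def __init__(self, npages):
        # in the real facility these pages are locked via Diagnose 98;
        # here they are just preallocated bytearrays
        self.free = [bytearray(PAGE) for _ in range(npages)]
        self.in_use = 0

    def get(self):
        """Hand out a page; grow the pool if it is drained."""
        self.in_use += 1
        return self.free.pop() if self.free else bytearray(PAGE)

    def put(self, page):
        """Return a page so it can be reused immediately."""
        self.in_use -= 1
        self.free.append(page)

pool = FixedPagePool(npages=32)
buf = pool.get()          # the in-use count (cf. FpspInUse) ticks up here
# ... a QDIO-style transfer would fill or drain buf here ...
pool.put(buf)             # and back down here, page ready for the next transfer
print(len(pool.free), pool.in_use)   # -> 32 0
```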