TCP/IP Device Layer MP Support
In addition to the TCP/IP performance enhancements described in section TCP/IP Stack Performance Improvements, support was added to TCP/IP 440 to allow individual device drivers to be associated with particular virtual processors. Prior to this release, TCP/IP VM did not have any virtual MP support and, as a result, any given TCP/IP stack virtual machine could run on only one real processor at a time. With TCP/IP 440, device-specific processing can be done on virtual processors other than the base processor. This can be used to offload some processing from the base processor, which is used by the remaining stack functions, increasing the rate of work the stack virtual machine can handle before the base processor becomes fully utilized. A new CPU option on the DEVICE configuration statement designates the CPU on which the driver for a particular device will be dispatched. If no specification is provided, or if the designated CPU is not in the configuration, the base processor, which must be CPU 0, is used.
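As an illustration, a QDIO device could be assigned to virtual CPU 1 with a DEVICE statement along the following lines. This is a minimal sketch of a PROFILE TCPIP excerpt; the device number, the device and link names, and the exact placement of the CPU keyword are assumptions for illustration, not details taken from the measured configuration.

   ; Dispatch the driver for this OSA-Express (QDIO) device on virtual CPU 1.
   ; If CPU is omitted, or the designated CPU is not in the configuration,
   ; the driver runs on the base processor (CPU 0).
   DEVICE OSD1  OSD  2E20  PORTNAME PORT1  NONROUTER  CPU 1
   LINK   QDIO1 QDIOETHERNET OSD1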
This section summarizes the results of a performance evaluation comparing TCP/IP 440 with and without the device layer MP support active.
An internal tool was used to drive connect-request-response (CRR) and streaming workloads. In the CRR workload, the client connected, sent 64 bytes to the server, received an 8KB response from the server, and then disconnected. In the streaming workload, the client sent 20 bytes to the server and the server responded with 20MB.
The measurements were done on a 2064-109 using 2 LPARs. Each LPAR had 3 dedicated processors, 1GB of central storage, and 2GB of expanded storage. In the measurement environment, each LPAR had an equal number of client and server virtual machines defined. The client(s) from one LPAR communicated with the server(s) on the other LPAR.
Both Gigabit Ethernet (QDIO) and HiperSockets were used for communication between the TCP/IP stacks running on each of the LPARs. For the QDIO measurements, maximum transmission unit (MTU) sizes of 1492 and 8992 were used. For HiperSockets, MTU sizes of 8K, 16K, 32K, and 56K were used. Performance runs were made using 1, 10, 20, and 50 client-server pairs for each workload.
Each QDIO and HiperSockets scenario was run first with CPU 0 and then with CPU 1 specified for the device on the DEVICE configuration statement of the TCP/IP stack on each LPAR. A complete set of runs, consisting of 3 trials for each case, was done. CP monitor data was captured for one of the LPARs during the measurement and reduced using VMPRF. In addition, Performance Toolkit for VM data was captured for the same LPAR and used to report the CPU utilization for each virtual CPU.
Results:
The following tables show the comparison between results on TCP/IP 440 with (CPU 1) and without (CPU 0) the device layer MP support active for a set of the measurements taken. MB/sec (megabytes per second) and trans/sec (transactions per second) were supplied by the workload driver and show the throughput rate. All other values are from CP monitor data, derived from CP monitor data, or from Performance Toolkit for VM data.
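The %diff rows compare each CPU 1 result with the corresponding CPU 0 result. They are consistent with the usual relative-change calculation against the CPU 0 (base processor) case:

   %diff = ((CPU 1 value - CPU 0 value) / CPU 0 value) x 100

   For example, Table 1, 1 client-server pair, MB/sec:
   ((69.32 - 61.31) / 61.31) x 100 = 13.06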
Table 1. QDIO - Streaming 1492
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0sn013 | qf0sn101 | qf0sn201 | qf0sn502 |
MB/sec               |    61.31 |    74.83 |    77.05 |    77.06 |
tot_cpu_msec/MB      |    12.77 |    17.08 |    18.92 |    20.71 |
emul_msec/MB         |     8.66 |    11.43 |    12.77 |    14.05 |
cp_msec/MB           |     4.11 |     5.65 |     6.15 |     6.66 |
tot_cpu_util         |    48.97 |    68.97 |    74.62 |    79.52 |
vcpu0_util           |    48.70 |    69.00 |    74.58 |    79.47 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1sn013 | qf1sn103 | qf1sn202 | qf1sn502 |
MB/sec               |    69.32 |    77.44 |    80.41 |    83.87 |
tot_cpu_msec/MB      |    14.98 |    19.21 |    20.37 |    21.35 |
emul_msec/MB         |    10.13 |    13.09 |    13.92 |    14.70 |
cp_msec/MB           |     4.85 |     6.12 |     6.45 |     6.65 |
tot_cpu_util         |    67.50 |    85.90 |    92.31 |    98.81 |
vcpu0_util           |       NA |       NA |    68.95 |    75.87 |
vcpu1_util           |       NA |       NA |    22.85 |    23.33 |
%diff MB/sec         |    13.06 |     3.49 |     4.36 |     8.84 |
%diff tot_cpu_msec   |    17.25 |    12.51 |     7.65 |     3.11 |
%diff emul_msec      |    16.93 |    14.60 |     8.97 |     4.61 |
%diff cp_msec        |    17.93 |     8.28 |     4.92 |    -0.06 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 2. QDIO - Streaming 8992
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0sj011 | qf0sj102 | qf0sj202 | qf0sj501 |
MB/sec               |    59.79 |    98.04 |    98.14 |    96.06 |
tot_cpu_msec/MB      |    10.23 |    12.21 |    12.41 |    12.87 |
emul_msec/MB         |     6.37 |     7.53 |     7.61 |     7.90 |
cp_msec/MB           |     3.86 |     4.68 |     4.80 |     4.97 |
tot_cpu_util         |    36.36 |    66.92 |    68.97 |    70.51 |
vcpu0_util           |    36.04 |    66.78 |    68.74 |    70.56 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1sj012 | qf1sj102 | qf1sj202 | qf1sj502 |
MB/sec               |    67.06 |   105.90 |   108.18 |   106.90 |
tot_cpu_msec/MB      |    12.98 |    14.59 |    12.87 |    13.69 |
emul_msec/MB         |     8.10 |     9.21 |     8.13 |     8.67 |
cp_msec/MB           |     4.88 |     5.38 |     4.74 |     5.02 |
tot_cpu_util         |    53.59 |    90.26 |    81.90 |    85.78 |
vcpu0_util           |    38.53 |    65.93 |    62.48 |    70.63 |
vcpu1_util           |    15.70 |    24.82 |    19.88 |    20.43 |
%diff MB/sec         |    12.16 |     8.02 |    10.23 |    11.28 |
%diff tot_cpu_msec   |    26.75 |    19.49 |     3.68 |     6.44 |
%diff emul_msec      |    27.07 |    22.31 |     6.75 |     9.75 |
%diff cp_msec        |    26.21 |    14.97 |    -1.19 |     1.16 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 3. QDIO - CRR 1492
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0cn013 | qf0cn102 | qf0cn203 | qf0cn503 |
trans/sec            |   148.02 |   443.49 |   535.09 |   706.13 |
tot_cpu_msec/trans   |     1.71 |     1.67 |     1.70 |     1.82 |
emul_msec/trans      |     1.18 |     1.16 |     1.18 |     1.30 |
cp_msec/trans        |     0.53 |     0.51 |     0.52 |     0.52 |
tot_cpu_util         |    10.00 |    23.33 |    26.94 |    36.15 |
vcpu0_util           |     9.84 |    23.62 |    26.92 |    36.02 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1cn013 | qf1cn102 | qf1cn202 | qf1cn502 |
trans/sec            |   153.33 |   436.44 |   556.68 |   711.28 |
tot_cpu_msec/trans   |     2.02 |     1.85 |     1.86 |     2.02 |
emul_msec/trans      |     1.35 |     1.27 |     1.28 |     1.43 |
cp_msec/trans        |     0.67 |     0.58 |     0.58 |     0.59 |
tot_cpu_util         |    13.85 |    30.77 |    38.06 |    50.79 |
vcpu0_util           |     9.05 |    22.25 |    28.20 |    38.28 |
vcpu1_util           |     4.48 |     7.73 |     9.22 |    11.27 |
%diff trans/sec      |     3.59 |    -1.59 |     4.03 |     0.73 |
%diff tot_cpu_msec   |    18.37 |    11.53 |     9.45 |    10.85 |
%diff emul_msec      |    14.85 |     9.93 |     8.42 |     9.66 |
%diff cp_msec        |    26.24 |    15.16 |    11.79 |    13.80 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 4. QDIO - CRR 8992
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0cj013 | qf0cj101 | qf0cj201 | qf0cj502 |
trans/sec            |   146.65 |   453.78 |   522.41 |   716.41 |
tot_cpu_msec/trans   |     1.27 |     1.70 |     1.86 |     1.68 |
emul_msec/trans      |     0.88 |     1.31 |     1.46 |     1.27 |
cp_msec/trans        |     0.39 |     0.39 |     0.40 |     0.41 |
tot_cpu_util         |    10.00 |    23.33 |    26.94 |    36.15 |
vcpu0_util           |     9.84 |    23.62 |    26.92 |    36.02 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1cj012 | qf1cj102 | qf1cj201 | qf1cj501 |
trans/sec            |   155.80 |   465.32 |   587.90 |   781.14 |
tot_cpu_msec/trans   |     1.40 |     1.95 |     2.03 |     1.85 |
emul_msec/trans      |     0.94 |     1.50 |     1.57 |     1.37 |
cp_msec/trans        |     0.46 |     0.45 |     0.46 |     0.48 |
tot_cpu_util         |    10.00 |    49.74 |    64.10 |    64.87 |
vcpu0_util           |     7.09 |    47.16 |    50.56 |    53.02 |
vcpu1_util           |     3.07 |     6.04 |     7.98 |    10.54 |
%diff trans/sec      |     6.24 |     2.54 |    12.54 |     9.04 |
%diff tot_cpu_msec   |    10.83 |    14.60 |     9.43 |    10.47 |
%diff emul_msec      |     7.26 |    14.27 |     7.33 |     7.70 |
%diff cp_msec        |    18.90 |    15.70 |    17.19 |    19.13 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 5. HiperSocket - Streaming 8K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0sj011 | hf0sj103 | hf0sj201 | hf0sj501 |
MB/sec               |   139.12 |   129.58 |   118.43 |   104.12 |
tot_cpu_msec/MB      |     7.57 |    10.17 |    10.77 |    11.90 |
emul_msec/MB         |     4.51 |     6.16 |     6.54 |     7.38 |
cp_msec/MB           |     3.06 |     4.01 |     4.23 |     4.52 |
tot_cpu_util         |    73.30 |    78.61 |    75.56 |    72.86 |
vcpu0_util           |    73.45 |    78.48 |    75.40 |    72.66 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1sj012 | hf1sj103 | hf1sj201 | hf1sj501 |
MB/sec               |   163.03 |   160.08 |   138.46 |   127.96 |
tot_cpu_msec/MB      |     8.56 |     9.65 |    10.77 |    11.67 |
emul_msec/MB         |     5.23 |     5.62 |     6.33 |     6.96 |
cp_msec/MB           |     3.33 |     4.03 |     4.44 |     4.71 |
tot_cpu_util         |   104.44 |   115.13 |   109.49 |   106.15 |
vcpu0_util           |    75.74 |    89.86 |    87.14 |    86.04 |
vcpu1_util           |    28.26 |    25.04 |    21.90 |    20.20 |
%diff MB/sec         |    17.19 |    23.54 |    16.91 |    22.90 |
%diff tot_cpu_msec   |    13.05 |    -5.04 |     0.02 |    -1.88 |
%diff emul_msec      |    15.96 |    -8.71 |    -3.19 |    -5.60 |
%diff cp_msec        |     8.77 |     0.60 |     5.00 |     4.17 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 6. HiperSocket - Streaming 56K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0s5012 | hf0s5102 | hf0s5201 | hf0s5501 |
MB/sec               |   135.67 |   139.91 |   131.64 |   112.84 |
tot_cpu_msec/MB      |     7.47 |     8.51 |     8.61 |     8.88 |
emul_msec/MB         |     4.40 |     4.93 |     4.99 |     5.21 |
cp_msec/MB           |     3.07 |     3.58 |     3.62 |     3.67 |
tot_cpu_util         |    68.06 |    75.90 |    73.08 |    66.15 |
vcpu0_util           |    68.02 |    75.88 |    73.08 |    66.08 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1s5013 | hf1s5101 | hf1s5203 | hf1s5503 |
MB/sec               |   168.47 |   160.55 |   144.95 |   130.51 |
tot_cpu_msec/MB      |     7.90 |     9.21 |    10.03 |    11.27 |
emul_msec/MB         |     4.61 |     5.19 |     5.75 |     6.51 |
cp_msec/MB           |     3.29 |     4.02 |     4.28 |     4.76 |
tot_cpu_util         |    96.67 |   108.61 |   105.00 |   103.08 |
vcpu0_util           |    72.52 |    87.86 |    86.40 |    85.63 |
vcpu1_util           |    23.57 |    20.04 |    18.22 |    17.10 |
%diff MB/sec         |    24.18 |    14.75 |    10.11 |    15.66 |
%diff tot_cpu_msec   |     5.79 |     8.22 |    16.53 |    26.84 |
%diff emul_msec      |     4.81 |     5.33 |    15.28 |    24.84 |
%diff cp_msec        |     7.18 |    12.19 |    18.23 |    29.69 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 7. HiperSocket - CRR 8K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0cj013 | hf0cj103 | hf0cj201 | hf0cj501 |
trans/sec            |   175.93 |   457.63 |   546.08 |   704.48 |
tot_cpu_msec/trans   |     1.33 |     1.56 |     1.47 |     1.74 |
emul_msec/trans      |     0.94 |     1.17 |     1.08 |     1.32 |
cp_msec/trans        |     0.39 |     0.39 |     0.39 |     0.42 |
tot_cpu_util         |     8.97 |    31.94 |    30.83 |    48.72 |
vcpu0_util           |     8.62 |    31.60 |    30.62 |    44.30 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1cj012 | hf1cj101 | hf1cj202 | hf1cj501 |
trans/sec            |   185.14 |   486.05 |   601.41 |   789.53 |
tot_cpu_msec/trans   |     1.61 |     2.02 |     1.95 |     1.76 |
emul_msec/trans      |     1.12 |     1.56 |     1.51 |     1.30 |
cp_msec/trans        |     0.49 |     0.46 |     0.44 |     0.46 |
tot_cpu_util         |    14.44 |    53.85 |    62.22 |    61.11 |
vcpu0_util           |    10.78 |    47.28 |    53.28 |    47.14 |
vcpu1_util           |     3.26 |     5.74 |     7.26 |     9.96 |
%diff trans/sec      |     5.24 |     6.21 |    10.13 |    12.07 |
%diff tot_cpu_msec   |    20.61 |    29.76 |    32.81 |     1.50 |
%diff emul_msec      |    19.21 |    33.08 |    39.66 |    -1.53 |
%diff cp_msec        |    23.95 |    19.69 |    13.82 |    11.08 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 8. HiperSocket - CRR 56K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0c5013 | hf0c5101 | hf0c5201 | hf0c5502 |
trans/sec            |   174.87 |   460.10 |   544.76 |   706.31 |
tot_cpu_msec/trans   |     1.32 |     1.61 |     1.45 |     1.65 |
emul_msec/trans      |     0.94 |     1.22 |     1.05 |     1.24 |
cp_msec/trans        |     0.38 |     0.39 |     0.40 |     0.41 |
tot_cpu_util         |     9.17 |    34.17 |    28.80 |    44.62 |
vcpu0_util           |     8.62 |    32.16 |    28.12 |    43.97 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1c5012 | hf1c5101 | hf1c5201 | hf1c5503 |
trans/sec            |   185.71 |   485.49 |   594.50 |   787.43 |
tot_cpu_msec/trans   |     1.66 |     1.96 |     2.00 |     1.80 |
emul_msec/trans      |     1.18 |     1.52 |     1.54 |     1.33 |
cp_msec/trans        |     0.48 |     0.44 |     0.46 |     0.47 |
tot_cpu_util         |    15.28 |    52.78 |    62.56 |    63.59 |
vcpu0_util           |    11.38 |    47.88 |    46.60 |    52.46 |
vcpu1_util           |     3.56 |     5.89 |     7.80 |    10.20 |
%diff trans/sec      |     6.20 |     5.52 |     9.13 |    11.49 |
%diff tot_cpu_msec   |    25.96 |    21.63 |    38.50 |    10.94 |
%diff emul_msec      |    24.98 |    24.67 |    47.10 |     7.51 |
%diff cp_msec        |    28.40 |    12.14 |    15.81 |    13.74 |
Note: 2064-109; LPAR with 3 dedicated processors |
In general, the costs per MB or per transaction are higher because of the overhead of implementing the virtual MP support. However, throughput, as reported by MB/sec or trans/sec, is greater in almost all cases measured because the stack virtual machine can now use more than one processor. In addition, overall between 10% and 30% of the workload is moved from CPU 0 (the base processor) to CPU 1. The workload moved from CPU 0 to CPU 1 represents the device-specific processing, which can now be done in parallel with the stack functions that must be done on the base processor. The best case above is HiperSocket - Streaming with an 8K MTU size (Table 5). In that case, the percentage of the workload moved from CPU 0 to CPU 1 ranged from 19% for 50 client-server pairs to 27% for one client-server pair. In addition, throughput increased more than 16% in all cases, while the percent change in CPU consumption ranged from an increase of just over 13% with one client-server pair to a decrease of more than 5% for 10 client-server pairs.
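The "percentage of the workload moved" figures quoted above can be reproduced from the virtual CPU utilizations in Table 5, assuming the moved fraction is taken as vcpu1_util relative to the total stack utilization:

   fraction moved = vcpu1_util / (vcpu0_util + vcpu1_util)

   1 client-server pair:   28.26 / (75.74 + 28.26) = 0.27  (about 27%)
   50 client-server pairs: 20.20 / (86.04 + 20.20) = 0.19  (about 19%)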