TCP/IP Device Layer MP Support
In addition to the TCP/IP performance enhancements described in section TCP/IP Stack Performance Improvements, support was added to TCP/IP 440 to allow individual device drivers to be associated with particular virtual processors. Prior to this release, TCP/IP VM did not have any virtual MP support and, as a result, any given TCP/IP stack virtual machine could run on only one real processor at a time. With TCP/IP 440, device-specific processing can be done on virtual processors other than the base processor. This can be used to offload some processing from the base processor, which is used by the remaining stack functions, increasing the rate of work the stack virtual machine can handle before the base processor becomes fully utilized. A new CPU option on the DEVICE configuration statement designates the CPU on which the driver for a particular device will be dispatched. If no specification is provided, or if the designated CPU is not in the configuration, the base processor, which must be CPU 0, is used.
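As an illustration, a QDIO device could be assigned to virtual CPU 1 with a DEVICE statement along the following lines. This is a minimal sketch of a PROFILE TCPIP excerpt; the device number, the device and link names, and the exact placement of the CPU keyword are assumptions for illustration, not details taken from the measured configuration.

   ; Dispatch the driver for this OSA-Express (QDIO) device on virtual CPU 1.
   ; If CPU is omitted, or the designated CPU is not in the configuration,
   ; the driver runs on the base processor (CPU 0).
   DEVICE OSD1  OSD  2E20  PORTNAME PORT1  NONROUTER  CPU 1
   LINK   QDIO1 QDIOETHERNET OSD1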
This section summarizes the results of a performance evaluation comparing TCP/IP 440 with and without the device layer MP support active.
An internal tool was used to drive connect-request-response (CRR) and streaming workloads. In the CRR workload, the client connected, sent 64 bytes to the server, received an 8KB response from the server, and then disconnected. In the streaming workload, the client sent 20 bytes to the server and the server responded with 20MB.
The measurements were done on a 2064-109 using 2 LPARs. Each LPAR had 3 dedicated processors, 1GB of central storage, and 2GB of expanded storage. In the measurement environment, each LPAR had an equal number of client and server virtual machines defined. The client(s) from one LPAR communicated with the server(s) on the other LPAR.
Both Gigabit Ethernet (QDIO) and HiperSockets were used for communication between the TCP/IP stacks running on each of the LPARs. For the QDIO measurements, maximum transmission unit (MTU) sizes of 1492 and 8992 were used. For HiperSockets, MTU sizes of 8K, 16K, 32K, and 56K were used. Performance runs were made using 1, 10, 20, and 50 client-server pairs for each workload.
Each QDIO and HiperSockets scenario was run first with CPU 0 and then with CPU 1 specified for the device on the DEVICE configuration statement of the TCP/IP stack on each LPAR. A complete set of runs, consisting of 3 trials for each case, was done. CP monitor data was captured for one of the LPARs during the measurement and reduced using VMPRF. In addition, Performance Toolkit for VM data was captured for the same LPAR and used to report the CPU utilization for each virtual CPU.
Results:
The following tables show the comparison between results on TCP/IP 440 with (CPU 1) and without (CPU 0) the device layer MP support active for a set of the measurements taken. MB/sec (megabytes per second) and trans/sec (transactions per second) were supplied by the workload driver and show the throughput rate. All other values are from CP monitor data, derived from CP monitor data, or from Performance Toolkit for VM data.
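The %diff rows compare each CPU 1 result with the corresponding CPU 0 result. They are consistent with the usual relative-change calculation against the CPU 0 (base processor) case:

   %diff = ((CPU 1 value - CPU 0 value) / CPU 0 value) x 100

   For example, Table 1, 1 client-server pair, MB/sec:
   ((69.32 - 61.31) / 61.31) x 100 = 13.06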
Table 1. QDIO - Streaming 1492
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0sn013 | qf0sn101 | qf0sn201 | qf0sn502 |
MB/sec               |    61.31 |    74.83 |    77.05 |    77.06 |
tot_cpu_msec/MB      |    12.77 |    17.08 |    18.92 |    20.71 |
emul_msec/MB         |     8.66 |    11.43 |    12.77 |    14.05 |
cp_msec/MB           |     4.11 |     5.65 |     6.15 |     6.66 |
tot_cpu_util         |    48.97 |    68.97 |    74.62 |    79.52 |
vcpu0_util           |    48.70 |    69.00 |    74.58 |    79.47 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1sn013 | qf1sn103 | qf1sn202 | qf1sn502 |
MB/sec               |    69.32 |    77.44 |    80.41 |    83.87 |
tot_cpu_msec/MB      |    14.98 |    19.21 |    20.37 |    21.35 |
emul_msec/MB         |    10.13 |    13.09 |    13.92 |    14.70 |
cp_msec/MB           |     4.85 |     6.12 |     6.45 |     6.65 |
tot_cpu_util         |    67.50 |    85.90 |    92.31 |    98.81 |
vcpu0_util           |       NA |       NA |    68.95 |    75.87 |
vcpu1_util           |       NA |       NA |    22.85 |    23.33 |
%diff MB/sec         |    13.06 |     3.49 |     4.36 |     8.84 |
%diff tot_cpu_msec   |    17.25 |    12.51 |     7.65 |     3.11 |
%diff emul_msec      |    16.93 |    14.60 |     8.97 |     4.61 |
%diff cp_msec        |    17.93 |     8.28 |     4.92 |    -0.06 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 2. QDIO - Streaming 8992
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0sj011 | qf0sj102 | qf0sj202 | qf0sj501 |
MB/sec               |    59.79 |    98.04 |    98.14 |    96.06 |
tot_cpu_msec/MB      |    10.23 |    12.21 |    12.41 |    12.87 |
emul_msec/MB         |     6.37 |     7.53 |     7.61 |     7.90 |
cp_msec/MB           |     3.86 |     4.68 |     4.80 |     4.97 |
tot_cpu_util         |    36.36 |    66.92 |    68.97 |    70.51 |
vcpu0_util           |    36.04 |    66.78 |    68.74 |    70.56 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1sj012 | qf1sj102 | qf1sj202 | qf1sj502 |
MB/sec               |    67.06 |   105.90 |   108.18 |   106.90 |
tot_cpu_msec/MB      |    12.98 |    14.59 |    12.87 |    13.69 |
emul_msec/MB         |     8.10 |     9.21 |     8.13 |     8.67 |
cp_msec/MB           |     4.88 |     5.38 |     4.74 |     5.02 |
tot_cpu_util         |    53.59 |    90.26 |    81.90 |    85.78 |
vcpu0_util           |    38.53 |    65.93 |    62.48 |    70.63 |
vcpu1_util           |    15.70 |    24.82 |    19.88 |    20.43 |
%diff MB/sec         |    12.16 |     8.02 |    10.23 |    11.28 |
%diff tot_cpu_msec   |    26.75 |    19.49 |     3.68 |     6.44 |
%diff emul_msec      |    27.07 |    22.31 |     6.75 |     9.75 |
%diff cp_msec        |    26.21 |    14.97 |    -1.19 |     1.16 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 3. QDIO - CRR 1492
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0cn013 | qf0cn102 | qf0cn203 | qf0cn503 |
trans/sec            |   148.02 |   443.49 |   535.09 |   706.13 |
tot_cpu_msec/trans   |     1.71 |     1.67 |     1.70 |     1.82 |
emul_msec/trans      |     1.18 |     1.16 |     1.18 |     1.30 |
cp_msec/trans        |     0.53 |     0.51 |     0.52 |     0.52 |
tot_cpu_util         |    10.00 |    23.33 |    26.94 |    36.15 |
vcpu0_util           |     9.84 |    23.62 |    26.92 |    36.02 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1cn013 | qf1cn102 | qf1cn202 | qf1cn502 |
trans/sec            |   153.33 |   436.44 |   556.68 |   711.28 |
tot_cpu_msec/trans   |     2.02 |     1.85 |     1.86 |     2.02 |
emul_msec/trans      |     1.35 |     1.27 |     1.28 |     1.43 |
cp_msec/trans        |     0.67 |     0.58 |     0.58 |     0.59 |
tot_cpu_util         |    13.85 |    30.77 |    38.06 |    50.79 |
vcpu0_util           |     9.05 |    22.25 |    28.20 |    38.28 |
vcpu1_util           |     4.48 |     7.73 |     9.22 |    11.27 |
%diff trans/sec      |     3.59 |    -1.59 |     4.03 |     0.73 |
%diff tot_cpu_msec   |    18.37 |    11.53 |     9.45 |    10.85 |
%diff emul_msec      |    14.85 |     9.93 |     8.42 |     9.66 |
%diff cp_msec        |    26.24 |    15.16 |    11.79 |    13.80 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 4. QDIO - CRR 8992
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | qf0cj013 | qf0cj101 | qf0cj201 | qf0cj502 |
trans/sec            |   146.65 |   453.78 |   522.41 |   716.41 |
tot_cpu_msec/trans   |     1.27 |     1.70 |     1.86 |     1.68 |
emul_msec/trans      |     0.88 |     1.31 |     1.46 |     1.27 |
cp_msec/trans        |     0.39 |     0.39 |     0.40 |     0.41 |
tot_cpu_util         |    10.00 |    23.33 |    26.94 |    36.15 |
vcpu0_util           |     9.84 |    23.62 |    26.92 |    36.02 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | qf1cj012 | qf1cj102 | qf1cj201 | qf1cj501 |
trans/sec            |   155.80 |   465.32 |   587.90 |   781.14 |
tot_cpu_msec/trans   |     1.40 |     1.95 |     2.03 |     1.85 |
emul_msec/trans      |     0.94 |     1.50 |     1.57 |     1.37 |
cp_msec/trans        |     0.46 |     0.45 |     0.46 |     0.48 |
tot_cpu_util         |    10.00 |    49.74 |    64.10 |    64.87 |
vcpu0_util           |     7.09 |    47.16 |    50.56 |    53.02 |
vcpu1_util           |     3.07 |     6.04 |     7.98 |    10.54 |
%diff trans/sec      |     6.24 |     2.54 |    12.54 |     9.04 |
%diff tot_cpu_msec   |    10.83 |    14.60 |     9.43 |    10.47 |
%diff emul_msec      |     7.26 |    14.27 |     7.33 |     7.70 |
%diff cp_msec        |    18.90 |    15.70 |    17.19 |    19.13 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 5. HiperSocket - Streaming 8K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0sj011 | hf0sj103 | hf0sj201 | hf0sj501 |
MB/sec               |   139.12 |   129.58 |   118.43 |   104.12 |
tot_cpu_msec/MB      |     7.57 |    10.17 |    10.77 |    11.90 |
emul_msec/MB         |     4.51 |     6.16 |     6.54 |     7.38 |
cp_msec/MB           |     3.06 |     4.01 |     4.23 |     4.52 |
tot_cpu_util         |    73.30 |    78.61 |    75.56 |    72.86 |
vcpu0_util           |    73.45 |    78.48 |    75.40 |    72.66 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1sj012 | hf1sj103 | hf1sj201 | hf1sj501 |
MB/sec               |   163.03 |   160.08 |   138.46 |   127.96 |
tot_cpu_msec/MB      |     8.56 |     9.65 |    10.77 |    11.67 |
emul_msec/MB         |     5.23 |     5.62 |     6.33 |     6.96 |
cp_msec/MB           |     3.33 |     4.03 |     4.44 |     4.71 |
tot_cpu_util         |   104.44 |   115.13 |   109.49 |   106.15 |
vcpu0_util           |    75.74 |    89.86 |    87.14 |    86.04 |
vcpu1_util           |    28.26 |    25.04 |    21.90 |    20.20 |
%diff MB/sec         |    17.19 |    23.54 |    16.91 |    22.90 |
%diff tot_cpu_msec   |    13.05 |    -5.04 |     0.02 |    -1.88 |
%diff emul_msec      |    15.96 |    -8.71 |    -3.19 |    -5.60 |
%diff cp_msec        |     8.77 |     0.60 |     5.00 |     4.17 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 6. HiperSocket - Streaming 56K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0s5012 | hf0s5102 | hf0s5201 | hf0s5501 |
MB/sec               |   135.67 |   139.91 |   131.64 |   112.84 |
tot_cpu_msec/MB      |     7.47 |     8.51 |     8.61 |     8.88 |
emul_msec/MB         |     4.40 |     4.93 |     4.99 |     5.21 |
cp_msec/MB           |     3.07 |     3.58 |     3.62 |     3.67 |
tot_cpu_util         |    68.06 |    75.90 |    73.08 |    66.15 |
vcpu0_util           |    68.02 |    75.88 |    73.08 |    66.08 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1s5013 | hf1s5101 | hf1s5203 | hf1s5503 |
MB/sec               |   168.47 |   160.55 |   144.95 |   130.51 |
tot_cpu_msec/MB      |     7.90 |     9.21 |    10.03 |    11.27 |
emul_msec/MB         |     4.61 |     5.19 |     5.75 |     6.51 |
cp_msec/MB           |     3.29 |     4.02 |     4.28 |     4.76 |
tot_cpu_util         |    96.67 |   108.61 |   105.00 |   103.08 |
vcpu0_util           |    72.52 |    87.86 |    86.40 |    85.63 |
vcpu1_util           |    23.57 |    20.04 |    18.22 |    17.10 |
%diff MB/sec         |    24.18 |    14.75 |    10.11 |    15.66 |
%diff tot_cpu_msec   |     5.79 |     8.22 |    16.53 |    26.84 |
%diff emul_msec      |     4.81 |     5.33 |    15.28 |    24.84 |
%diff cp_msec        |     7.18 |    12.19 |    18.23 |    29.69 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 7. HiperSocket - CRR 8K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0cj013 | hf0cj103 | hf0cj201 | hf0cj501 |
trans/sec            |   175.93 |   457.63 |   546.08 |   704.48 |
tot_cpu_msec/trans   |     1.33 |     1.56 |     1.47 |     1.74 |
emul_msec/trans      |     0.94 |     1.17 |     1.08 |     1.32 |
cp_msec/trans        |     0.39 |     0.39 |     0.39 |     0.42 |
tot_cpu_util         |     8.97 |    31.94 |    30.83 |    48.72 |
vcpu0_util           |     8.62 |    31.60 |    30.62 |    44.30 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1cj012 | hf1cj101 | hf1cj202 | hf1cj501 |
trans/sec            |   185.14 |   486.05 |   601.41 |   789.53 |
tot_cpu_msec/trans   |     1.61 |     2.02 |     1.95 |     1.76 |
emul_msec/trans      |     1.12 |     1.56 |     1.51 |     1.30 |
cp_msec/trans        |     0.49 |     0.46 |     0.44 |     0.46 |
tot_cpu_util         |    14.44 |    53.85 |    62.22 |    61.11 |
vcpu0_util           |    10.78 |    47.28 |    53.28 |    47.14 |
vcpu1_util           |     3.26 |     5.74 |     7.26 |     9.96 |
%diff trans/sec      |     5.24 |     6.21 |    10.13 |    12.07 |
%diff tot_cpu_msec   |    20.61 |    29.76 |    32.81 |     1.50 |
%diff emul_msec      |    19.21 |    33.08 |    39.66 |    -1.53 |
%diff cp_msec        |    23.95 |    19.69 |    13.82 |    11.08 |
Note: 2064-109; LPAR with 3 dedicated processors |
Table 8. HiperSocket - CRR 56K
Client/Server pairs  |        1 |       10 |       20 |       50 |
CPU 0 - runid        | hf0c5013 | hf0c5101 | hf0c5201 | hf0c5502 |
trans/sec            |   174.87 |   460.10 |   544.76 |   706.31 |
tot_cpu_msec/trans   |     1.32 |     1.61 |     1.45 |     1.65 |
emul_msec/trans      |     0.94 |     1.22 |     1.05 |     1.24 |
cp_msec/trans        |     0.38 |     0.39 |     0.40 |     0.41 |
tot_cpu_util         |     9.17 |    34.17 |    28.80 |    44.62 |
vcpu0_util           |     8.62 |    32.16 |    28.12 |    43.97 |
vcpu1_util           |     0.00 |     0.00 |     0.00 |     0.00 |
CPU 1 - runid        | hf1c5012 | hf1c5101 | hf1c5201 | hf1c5503 |
trans/sec            |   185.71 |   485.49 |   594.50 |   787.43 |
tot_cpu_msec/trans   |     1.66 |     1.96 |     2.00 |     1.80 |
emul_msec/trans      |     1.18 |     1.52 |     1.54 |     1.33 |
cp_msec/trans        |     0.48 |     0.44 |     0.46 |     0.47 |
tot_cpu_util         |    15.28 |    52.78 |    62.56 |    63.59 |
vcpu0_util           |    11.38 |    47.88 |    46.60 |    52.46 |
vcpu1_util           |     3.56 |     5.89 |     7.80 |    10.20 |
%diff trans/sec      |     6.20 |     5.52 |     9.13 |    11.49 |
%diff tot_cpu_msec   |    25.96 |    21.63 |    38.50 |    10.94 |
%diff emul_msec      |    24.98 |    24.67 |    47.10 |     7.51 |
%diff cp_msec        |    28.40 |    12.14 |    15.81 |    13.74 |
Note: 2064-109; LPAR with 3 dedicated processors |
In general, the costs per MB or per transaction are higher because of the overhead of implementing the virtual MP support. However, throughput, as reported by MB/sec or trans/sec, is greater in almost all cases measured because the stack virtual machine can now use more than one processor. In addition, overall between 10% and 30% of the workload is moved from CPU 0 (the base processor) to CPU 1. The workload moved from CPU 0 to CPU 1 represents the device-specific processing, which can now be done in parallel with the stack functions that must be done on the base processor. The best case above is HiperSocket - Streaming with an 8K MTU size (Table 5). In that case, the percentage of the workload moved from CPU 0 to CPU 1 ranged from 19% for 50 client-server pairs to 27% for one client-server pair. In addition, throughput increased more than 16% in all cases, while the percent change in CPU consumption ranged from an increase of just over 13% with one client-server pair to a decrease of more than 5% for 10 client-server pairs.
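The "percentage of the workload moved" figures quoted above can be reproduced from the virtual CPU utilizations in Table 5, assuming the moved fraction is taken as vcpu1_util relative to the total stack utilization:

   fraction moved = vcpu1_util / (vcpu0_util + vcpu1_util)

   1 client-server pair:   28.26 / (75.74 + 28.26) = 0.27  (about 27%)
   50 client-server pairs: 20.20 / (86.04 + 20.20) = 0.19  (about 19%)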