
TCP/IP Device Layer MP Support

Discussion 

In addition to the TCP/IP performance enhancements described in section TCP/IP Stack Performance Improvements, support was added to TCP/IP 440 to allow individual device drivers to be associated with particular virtual processors. Prior to this release, TCP/IP VM did not have any virtual MP support and, as a result, a given TCP/IP stack virtual machine could run on only one real processor at a time. With TCP/IP 440, device-specific processing can be done on virtual processors other than the base processor. This offloads some processing from the base processor, which is used by the remaining stack functions, and increases the rate of work the stack virtual machine can handle before the base processor becomes fully utilized. A new option, CPU, on the DEVICE configuration statement designates the virtual CPU on which the driver for a particular device is dispatched. If no CPU is specified, or if the designated CPU is not in the virtual machine's configuration, the base processor, which must be CPU 0, is used.
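
For example, a DEVICE statement in the stack's configuration file could designate CPU 1 for a QDIO device. The fragment below is a sketch only: the device name, device address, and port name are placeholders, and the exact placement of the CPU operand should be confirmed against the TCP/IP configuration documentation for the release in use.

   ; Illustrative PROFILE TCPIP fragment: dispatch this device's driver on
   ; virtual CPU 1; the rest of the stack stays on the base processor (CPU 0)
   DEVICE GIGDEV  OSD  2E00  PORTNAME GIGPORT  CPU 1
   LINK   GIGLNK  QDIOETHERNET  GIGDEV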

Methodology 

This section summarizes the results of a performance evaluation comparing TCP/IP 440 with and without the device layer MP support active.

An internal tool was used to drive connect-request-response (CRR) and streaming workloads. The CRR workload consisted of the client connecting, sending 64 bytes to the server, the server responding with 8K, and the client then disconnecting. The streaming workload consisted of the client sending 20 bytes to the server and the server responding with 20MB.

The measurements were done on a 2064-109 using 2 LPARs. Each LPAR had 3 dedicated processors, 1GB of central storage, and 2GB of expanded storage. Each LPAR had an equal number of client and server virtual machines defined; the clients on one LPAR communicated with the servers on the other LPAR.

Both Gigabit Ethernet (QDIO) and HiperSockets were used for communication between the TCP/IP stacks running on the two LPARs. For the QDIO measurements, maximum transmission unit (MTU) sizes of 1492 and 8992 were used; for HiperSockets, MTU sizes of 8K, 16K, 32K, and 56K were used. Performance runs were made with 1, 10, 20, and 50 client-server pairs for each workload.

Each QDIO and HiperSockets scenario was run first with CPU 0 and then with CPU 1 specified for the device on the TCP/IP DEVICE configuration statement of the stack on each LPAR. A complete set of runs, consisting of 3 trials for each case, was done. CP monitor data was captured for one of the LPARs during each measurement and reduced using VMPRF. In addition, Performance Toolkit for VM data was captured for the same LPAR and used to report the CPU utilization of each virtual CPU.

Results

The following tables compare results on TCP/IP 440 with (CPU 1) and without (CPU 0) the device layer MP support active for a set of the measurements taken. MB/sec (megabytes per second) and trans/sec (transactions per second) were supplied by the workload driver and show the throughput rate. All other values are from CP monitor data, derived from CP monitor data, or from Performance Toolkit for VM data.

Table 1. QDIO - Streaming 1492

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         qf0sn013  qf0sn101  qf0sn201  qf0sn502
MB/sec                   61.31     74.83     77.05     77.06
tot_cpu_msec/MB          12.77     17.08     18.92     20.71
emul_msec/MB              8.66     11.43     12.77     14.05
cp_msec/MB                4.11      5.65      6.15      6.66
tot_cpu_util             48.97     68.97     74.62     79.52
vcpu0_util               48.70     69.00     74.58     79.47
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         qf1sn013  qf1sn103  qf1sn202  qf1sn502
MB/sec                   69.32     77.44     80.41     83.87
tot_cpu_msec/MB          14.98     19.21     20.37     21.35
emul_msec/MB             10.13     13.09     13.92     14.70
cp_msec/MB                4.85      6.12      6.45      6.65
tot_cpu_util             67.50     85.90     92.31     98.81
vcpu0_util                  NA        NA     68.95     75.87
vcpu1_util                  NA        NA     22.85     23.33

%diff MB/sec             13.06      3.49      4.36      8.84
%diff tot_cpu_msec       17.25     12.51      7.65      3.11
%diff emul_msec          16.93     14.60      8.97      4.61
%diff cp_msec            17.93      8.28      4.92     -0.06

Note: 2064-109; LPAR with 3 dedicated processors

Table 2. QDIO - Streaming 8992

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         qf0sj011  qf0sj102  qf0sj202  qf0sj501
MB/sec                   59.79     98.04     98.14     96.06
tot_cpu_msec/MB          10.23     12.21     12.41     12.87
emul_msec/MB              6.37      7.53      7.61      7.90
cp_msec/MB                3.86      4.68      4.80      4.97
tot_cpu_util             36.36     66.92     68.97     70.51
vcpu0_util               36.04     66.78     68.74     70.56
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         qf1sj012  qf1sj102  qf1sj202  qf1sj502
MB/sec                   67.06    105.90    108.18    106.90
tot_cpu_msec/MB          12.98     14.59     12.87     13.69
emul_msec/MB              8.10      9.21      8.13      8.67
cp_msec/MB                4.88      5.38      4.74      5.02
tot_cpu_util             53.59     90.26     81.90     85.78
vcpu0_util               38.53     65.93     62.48     70.63
vcpu1_util               15.70     24.82     19.88     20.43

%diff MB/sec             12.16      8.02     10.23     11.28
%diff tot_cpu_msec       26.75     19.49      3.68      6.44
%diff emul_msec          27.07     22.31      6.75      9.75
%diff cp_msec            26.21     14.97     -1.19      1.16

Note: 2064-109; LPAR with 3 dedicated processors

Table 3. QDIO - CRR 1492

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         qf0cn013  qf0cn102  qf0cn203  qf0cn503
trans/sec               148.02    443.49    535.09    706.13
tot_cpu_msec/trans        1.71      1.67      1.70      1.82
emul_msec/trans           1.18      1.16      1.18      1.30
cp_msec/trans             0.53      0.51      0.52      0.52
tot_cpu_util             10.00     23.33     26.94     36.15
vcpu0_util                9.84     23.62     26.92     36.02
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         qf1cn013  qf1cn102  qf1cn202  qf1cn502
trans/sec               153.33    436.44    556.68    711.28
tot_cpu_msec/trans        2.02      1.85      1.86      2.02
emul_msec/trans           1.35      1.27      1.28      1.43
cp_msec/trans             0.67      0.58      0.58      0.59
tot_cpu_util             13.85     30.77     38.06     50.79
vcpu0_util                9.05     22.25     28.20     38.28
vcpu1_util                4.48      7.73      9.22     11.27

%diff trans/sec           3.59     -1.59      4.03      0.73
%diff tot_cpu_msec       18.37     11.53      9.45     10.85
%diff emul_msec          14.85      9.93      8.42      9.66
%diff cp_msec            26.24     15.16     11.79     13.80

Note: 2064-109; LPAR with 3 dedicated processors

Table 4. QDIO - CRR 8992

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         qf0cj013  qf0cj101  qf0cj201  qf0cj502
trans/sec               146.65    453.78    522.41    716.41
tot_cpu_msec/trans        1.27      1.70      1.86      1.68
emul_msec/trans           0.88      1.31      1.46      1.27
cp_msec/trans             0.39      0.39      0.40      0.41
tot_cpu_util             10.00     23.33     26.94     36.15
vcpu0_util                9.84     23.62     26.92     36.02
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         qf1cj012  qf1cj102  qf1cj201  qf1cj501
trans/sec               155.80    465.32    587.90    781.14
tot_cpu_msec/trans        1.40      1.95      2.03      1.85
emul_msec/trans           0.94      1.50      1.57      1.37
cp_msec/trans             0.46      0.45      0.46      0.48
tot_cpu_util             10.00     49.74     64.10     64.87
vcpu0_util                7.09     47.16     50.56     53.02
vcpu1_util                3.07      6.04      7.98     10.54

%diff trans/sec           6.24      2.54     12.54      9.04
%diff tot_cpu_msec       10.83     14.60      9.43     10.47
%diff emul_msec           7.26     14.27      7.33      7.70
%diff cp_msec            18.90     15.70     17.19     19.13

Note: 2064-109; LPAR with 3 dedicated processors

Table 5. HiperSockets - Streaming 8K

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         hf0sj011  hf0sj103  hf0sj201  hf0sj501
MB/sec                  139.12    129.58    118.43    104.12
tot_cpu_msec/MB           7.57     10.17     10.77     11.90
emul_msec/MB              4.51      6.16      6.54      7.38
cp_msec/MB                3.06      4.01      4.23      4.52
tot_cpu_util             73.30     78.61     75.56     72.86
vcpu0_util               73.45     78.48     75.40     72.66
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         hf1sj012  hf1sj103  hf1sj201  hf1sj501
MB/sec                  163.03    160.08    138.46    127.96
tot_cpu_msec/MB           8.56      9.65     10.77     11.67
emul_msec/MB              5.23      5.62      6.33      6.96
cp_msec/MB                3.33      4.03      4.44      4.71
tot_cpu_util            104.44    115.13    109.49    106.15
vcpu0_util               75.74     89.86     87.14     86.04
vcpu1_util               28.26     25.04     21.90     20.20

%diff MB/sec             17.19     23.54     16.91     22.90
%diff tot_cpu_msec       13.05     -5.04      0.02     -1.88
%diff emul_msec          15.96     -8.71     -3.19     -5.60
%diff cp_msec             8.77      0.60      5.00      4.17

Note: 2064-109; LPAR with 3 dedicated processors

Table 6. HiperSockets - Streaming 56K

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         hf0s5012  hf0s5102  hf0s5201  hf0s5501
MB/sec                  135.67    139.91    131.64    112.84
tot_cpu_msec/MB           7.47      8.51      8.61      8.88
emul_msec/MB              4.40      4.93      4.99      5.21
cp_msec/MB                3.07      3.58      3.62      3.67
tot_cpu_util             68.06     75.90     73.08     66.15
vcpu0_util               68.02     75.88     73.08     66.08
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         hf1s5013  hf1s5101  hf1s5203  hf1s5503
MB/sec                  168.47    160.55    144.95    130.51
tot_cpu_msec/MB           7.90      9.21     10.03     11.27
emul_msec/MB              4.61      5.19      5.75      6.51
cp_msec/MB                3.29      4.02      4.28      4.76
tot_cpu_util             96.67    108.61    105.00    103.08
vcpu0_util               72.52     87.86     86.40     85.63
vcpu1_util               23.57     20.04     18.22     17.10

%diff MB/sec             24.18     14.75     10.11     15.66
%diff tot_cpu_msec        5.79      8.22     16.53     26.84
%diff emul_msec           4.81      5.33     15.28     24.84
%diff cp_msec             7.18     12.19     18.23     29.69

Note: 2064-109; LPAR with 3 dedicated processors

Table 7. HiperSockets - CRR 8K

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         hf0cj013  hf0cj103  hf0cj201  hf0cj501
trans/sec               175.93    457.63    546.08    704.48
tot_cpu_msec/trans        1.33      1.56      1.47      1.74
emul_msec/trans           0.94      1.17      1.08      1.32
cp_msec/trans             0.39      0.39      0.39      0.42
tot_cpu_util              8.97     31.94     30.83     48.72
vcpu0_util                8.62     31.60     30.62     44.30
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         hf1cj012  hf1cj101  hf1cj202  hf1cj501
trans/sec               185.14    486.05    601.41    789.53
tot_cpu_msec/trans        1.61      2.02      1.95      1.76
emul_msec/trans           1.12      1.56      1.51      1.30
cp_msec/trans             0.49      0.46      0.44      0.46
tot_cpu_util             14.44     53.85     62.22     61.11
vcpu0_util               10.78     47.28     53.28     47.14
vcpu1_util                3.26      5.74      7.26      9.96

%diff trans/sec           5.24      6.21     10.13     12.07
%diff tot_cpu_msec       20.61     29.76     32.81      1.50
%diff emul_msec          19.21     33.08     39.66     -1.53
%diff cp_msec            23.95     19.69     13.82     11.08

Note: 2064-109; LPAR with 3 dedicated processors

Table 8. HiperSockets - CRR 56K

Client/Server pairs          1        10        20        50
------------------------------------------------------------
CPU 0 - runid         hf0c5013  hf0c5101  hf0c5201  hf0c5502
trans/sec               174.87    460.10    544.76    706.31
tot_cpu_msec/trans        1.32      1.61      1.45      1.65
emul_msec/trans           0.94      1.22      1.05      1.24
cp_msec/trans             0.38      0.39      0.40      0.41
tot_cpu_util              9.17     34.17     28.80     44.62
vcpu0_util                8.62     32.16     28.12     43.97
vcpu1_util                0.00      0.00      0.00      0.00

CPU 1 - runid         hf1c5012  hf1c5101  hf1c5201  hf1c5503
trans/sec               185.71    485.49    594.50    787.43
tot_cpu_msec/trans        1.66      1.96      2.00      1.80
emul_msec/trans           1.18      1.52      1.54      1.33
cp_msec/trans             0.48      0.44      0.46      0.47
tot_cpu_util             15.28     52.78     62.56     63.59
vcpu0_util               11.38     47.88     46.60     52.46
vcpu1_util                3.56      5.89      7.80     10.20

%diff trans/sec           6.20      5.52      9.13     11.49
%diff tot_cpu_msec       25.96     21.63     38.50     10.94
%diff emul_msec          24.98     24.67     47.10      7.51
%diff cp_msec            28.40     12.14     15.81     13.74

Note: 2064-109; LPAR with 3 dedicated processors

Summary 

In general, the CPU cost per MB or per transaction is higher because of the overhead of implementing the virtual MP support. However, the throughput, as reported by MB/sec or trans/sec, is greater in almost all cases measured because the stack virtual machine can now use more than one processor. In addition, between 10% and 30% of the workload is moved from CPU 0 (the base processor) to CPU 1. The workload moved to CPU 1 represents the device-specific processing, which can now be done in parallel with the stack functions that must remain on the base processor.

The best case above is HiperSockets - Streaming with an 8K MTU size (Table 5). In that case the percentage of the workload moved from CPU 0 to CPU 1 ranged from 19% for 50 client-server pairs to 27% for one client-server pair (vcpu1_util as a fraction of tot_cpu_util: 20.20/106.15 and 28.26/104.44, respectively). In addition, the throughput increased by more than 16% in all cases, while CPU consumption per MB ranged from an increase of just over 13% with one client-server pair to a decrease of more than 5% for 10 client-server pairs.
