HiperSockets
Starting with the z/900 Model 2064, z/Architecture provides a new type
of I/O device called HiperSockets. As an extension to the Queued
Direct I/O Hardware Facility, HiperSockets provides a high-bandwidth
method for programs to communicate within the same logical partition
(LPAR) or across any logical partition within the same Central
Electronics Complex (CEC) using traditional TCP/IP socket connections.
VM Guest LAN support, also in z/VM 4.2.0, provides the capability
for a VM guest to define a virtual HiperSocket adapter and connect it
with other virtual network adapters on the same VM host system to
form an emulated LAN segment. While real HiperSockets support requires
a z/800 or z/900, VM Guest LAN support is available on G5 and G6
processors as well.
This section summarizes the results of a performance evaluation of
TCP/IP VM comparing the new HiperSockets and Guest LAN support with
existing QDIO, IUCV and VCTC support.
Methodology:
An internal tool was used to drive request-response (RR),
connect-request-response (CRR) and streaming (S) workloads.
The request-response workload consisted of the client sending 200
bytes to the server and the server responding with 1000 bytes. This
interaction lasted for 200 seconds.
The connect-request-response workload had the client connecting,
sending 64 bytes to the server, the server responding with 8K and the
client then disconnecting. This same sequence was repeated for
200 seconds.
The streaming workload consisted of the client sending 20 bytes to
the server and the server responding with 20MB. This sequence was
repeated for 400 seconds.
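The payload sizes above tie the transaction rates to the MB/sec figures reported later. The following is a minimal sketch of that conversion, not the internal driver's code. It assumes that "8K" means 8192 bytes and that 1 MB means 2**20 bytes (both assumptions, chosen because they reproduce the MB/sec values in the maximum-throughput tables); the names `PAYLOAD_BYTES` and `mb_per_sec` are hypothetical.

```python
# Assumption: 1 MB = 2**20 bytes and "8K" = 8192 bytes.
BYTES_PER_MB = 2 ** 20

# Application bytes moved per transaction, from the workload descriptions.
PAYLOAD_BYTES = {
    "RR": 200 + 1000,               # 200-byte request, 1000-byte response
    "CRR": 64 + 8 * 1024,           # 64-byte request, 8K response
    "S": 20 + 20 * BYTES_PER_MB,    # 20-byte request, 20MB response
}


def mb_per_sec(workload: str, trans_per_sec: float) -> float:
    """Convert a transaction rate into application data moved per second."""
    return PAYLOAD_BYTES[workload] * trans_per_sec / BYTES_PER_MB


if __name__ == "__main__":
    # Rates taken from the QDIO columns of Tables 6 and 7.
    print(f"RR  at 1631.10 trans/sec: {mb_per_sec('RR', 1631.10):.2f} MB/sec")
    print(f"CRR at 83.22 trans/sec:   {mb_per_sec('CRR', 83.22):.2f} MB/sec")
```

With these assumptions the two QDIO rates work out to 1.87 and 0.66 MB/sec, which matches the values reported for those runs.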
Each workload was run using IUCV, VCTC, HiperSockets, Guest LAN and
QDIO connectivity at various MTU sizes. For IUCV, VCTC and QDIO,
MTU sizes of 1500 and 8992 were chosen. For HiperSockets and Guest LAN,
MTU sizes of 8K, 16K, 32K and 56K were used.
For HiperSockets, the Maximum Frame Size (MFS) specified on the CHPID
definition is also important. The MFS values defined for the system
were 16K, 24K, 40K and 64K, which are associated with MTU sizes of
8K, 16K, 32K and 56K, respectively.
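The MFS-to-MTU association can be written as a small lookup table. This is only an illustrative sketch: the names `MFS_TO_MTU_KB` and `mtu_for_mfs` are hypothetical, and the table holds just the four pairs listed above.

```python
# MFS -> MTU pairs exactly as listed in the text, both in KB.
MFS_TO_MTU_KB = {16: 8, 24: 16, 40: 32, 64: 56}


def mtu_for_mfs(mfs_kb: int) -> int:
    """Return the MTU (KB) associated with a CHPID maximum frame size (KB)."""
    try:
        return MFS_TO_MTU_KB[mfs_kb]
    except KeyError:
        raise ValueError(f"no MTU listed for an MFS of {mfs_kb}K") from None


# In every listed pair the MTU is the frame size minus 8K.
assert all(mfs - mtu == 8 for mfs, mtu in MFS_TO_MTU_KB.items())
```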
All measurements included 1, 5, 10, 20 and 50 client-server pairs.
The clients and servers ran on the same VM system with a TCPIP stack
for the clients and a separate TCPIP stack for the servers.
The measurements were done on a 2064-109 in an LPAR with 2
dedicated processors. The LPAR had 1GB central storage and 2GB
expanded storage. APARs VM62938 and PQ51738, which enable HiperSockets
support, were applied. CP monitor data was captured during the
measurement run, and reduced using VMPRF.
Specifying DATABUFFERLIMITS 10 10 in the TCPIP configuration file
helped to increase the throughput. It was also necessary to specify
65536 for DATABUFFERPOOLSIZE and LARGEENVELOPEPOOLSIZE to support
the larger MTUs.
Results:
The following charts show, for each workload (RR, CRR and streaming),
throughput and CPU time. Each chart has a line for each
connectivity/MTU pair measured. In all cases, throughput follows
generally the same trend: it rises to a plateau and then trails off.
CPU time, in general, mirrors this pattern, decreasing until the
throughput plateau is reached. As throughput trails off, CPU time
increases, which shows that the optimum point has been passed.
Specific details are mentioned for each workload after the charts for
that workload.
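As one way to read the plateau out of the numbers rather than the charts, the sketch below picks the point of highest throughput and the point of lowest CPU time for a single connectivity/MTU pair. The data are the HiperSockets 56K MTU request-response values from Table 1; the function names are hypothetical and this is not how the charts themselves were produced.

```python
PAIRS = [1, 5, 10, 20, 50]
# HiperSockets, 56K MTU, RR workload: values copied from Table 1.
THROUGHPUT = [913.99, 1347.30, 1646.97, 1651.54, 1542.31]  # trans/sec
CPU_TIME = [1.82, 1.47, 1.15, 0.99, 1.08]                  # msec/trans


def peak_throughput(pairs, throughput):
    """Return (pairs, rate) for the highest measured throughput."""
    return max(zip(pairs, throughput), key=lambda point: point[1])


def lowest_cpu_time(pairs, cpu_time):
    """Return (pairs, msec) for the lowest measured CPU time."""
    return min(zip(pairs, cpu_time), key=lambda point: point[1])


if __name__ == "__main__":
    n, rate = peak_throughput(PAIRS, THROUGHPUT)
    print(f"throughput peaks at {n} pairs ({rate} trans/sec)")
    n, msec = lowest_cpu_time(PAIRS, CPU_TIME)
    print(f"CPU time is lowest at {n} pairs ({msec} msec/trans)")
```

For this data set both points fall at 20 client-server pairs, after which throughput drops and CPU time rises again.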
Figure 1. RR Throughput
Figure 2. RR CPU Time
The CPU time chart shows that the legacy connectivity types (IUCV
and VCTC) use roughly the same CPU time per transaction no matter how
many client-server pairs are running, whereas the others gain
efficiency as the number of connections increases, up to about 20
connections.
MTU size doesn't have much effect on throughput because of the
small amount of data sent in the RR workload.
IUCV and VCTC have better throughput since they have been
optimized for VM-to-VM communication whereas the other connectivity
types have to support more than just the VM environment and therefore
are not as optimized.
The throughput for all connectivity types plateaus at 10 users.
Figure 3. CRR Throughput
Figure 4. CRR CPU Time
With the CRR workload, IUCV and VCTC have lost their advantage
because they are not as efficient with connect and disconnect. The
optimization done for moving data does not help as much with this
workload.
The CPU times are greater for CRR than RR for the same reason
(connect and disconnect overhead).
The plateau is reached between 5 and 10 users, depending on the
connectivity type.
Guest LAN handles CRR more efficiently than the other types.
Figure 5. Streaming Throughput
Figure 6. Streaming CPU Time
The number of connections is not as significant for the streaming
workload as it was for RR and CRR.
MTU size, however, does make a difference with this workload
because of the large amounts of data being transferred.
In general, the larger the MTU size, the better the throughput.
An anomaly was noted for the single user Guest LAN case when running
with 56K MTU size. This is currently being investigated.
It is possible that IUCV and VCTC may do better than shown in this
chart if an MTU size larger than 8992 is chosen.
The throughput and CPU time results for all runs are summarized in
the following tables (by connectivity type).
Table 1. Throughput and CPU Time: HiperSockets

MTU Size |      8K |     16K |     32K |     56K

Throughput (trans/sec)
RR1      |  852.06 |  856.69 |  877.11 |  913.99
RR5      | 1352.97 | 1365.10 | 1364.84 | 1347.30
RR10     | 1508.20 | 1595.59 | 1656.22 | 1646.97
RR20     | 1535.70 | 1620.41 | 1648.21 | 1651.54
RR50     | 1351.26 | 1462.30 | 1486.11 | 1542.31

CRR1     |   70.10 |   71.78 |   65.27 |   64.04
CRR5     |   83.29 |   84.92 |   76.98 |   77.98
CRR10    |   80.99 |   84.32 |   82.60 |   80.54
CRR20    |   80.90 |   82.00 |   82.19 |   81.58
CRR50    |   88.70 |   79.99 |   87.89 |   80.36

Throughput (MB/sec)
S1       |   27.45 |   38.63 |   60.40 |   70.43
S5       |   30.10 |   46.46 |   69.33 |   72.86
S10      |   29.90 |   45.98 |   65.19 |   67.49
S20      |   28.27 |   44.17 |   57.36 |   62.72
S50      |   28.93 |   40.71 |   52.07 |   53.25

CPU time (msec/trans)
RR1      |    1.83 |    1.83 |    1.81 |    1.82
RR5      |    1.46 |    1.45 |    1.45 |    1.47
RR10     |    1.29 |    1.20 |    1.15 |    1.15
RR20     |    1.15 |    1.07 |    1.02 |    0.99
RR50     |    1.24 |    1.14 |    1.12 |    1.08

CRR1     |   18.97 |   18.56 |   20.71 |   21.14
CRR5     |   15.37 |   14.46 |   16.47 |   16.34
CRR10    |   15.73 |   14.44 |   14.33 |   15.30
CRR20    |   15.60 |   15.27 |   15.21 |   14.66
CRR50    |   15.13 |   16.08 |   14.47 |   16.05

CPU time (msec/MB)
S1       |   61.86 |   45.41 |   31.13 |   26.44
S5       |   53.89 |   36.50 |   27.23 |   24.43
S10      |   51.64 |   36.45 |   27.76 |   25.16
S20      |   52.92 |   37.49 |   30.65 |   26.75
S50      |   54.61 |   39.65 |   33.99 |   31.70

Note: 2064-109; z/VM 4.2.0 with 64-bit CP

Table 2. Throughput and CPU Time: Guest LAN

MTU Size |      8K |     16K |     32K |     56K

Throughput (trans/sec)
RR1      |  612.12 |  614.10 |  614.19 |  584.38
RR5      | 1352.50 | 1349.25 | 1359.39 | 1293.02
RR10     | 1507.93 | 1587.62 | 1655.89 | 1597.29
RR20     | 1530.21 | 1596.38 | 1654.49 | 1602.41
RR50     | 1374.24 | 1444.74 | 1514.85 | 1490.61

CRR1     |   68.67 |   67.94 |   68.15 |   72.35
CRR5     |   77.30 |   78.08 |   81.55 |   90.44
CRR10    |   77.70 |   83.21 |   84.43 |   91.54
CRR20    |   76.63 |   81.45 |   82.26 |   86.51
CRR50    |   75.67 |   80.30 |   80.74 |   84.42

Throughput (MB/sec)
S1       |   14.89 |   36.02 |   60.29 |    6.46
S5       |   15.54 |   40.95 |   65.36 |   62.39
S10      |   14.43 |   37.65 |   61.40 |   58.57
S20      |   12.94 |   28.37 |   55.57 |   58.30
S50      |   13.72 |   38.13 |   49.03 |   51.51

CPU time (msec/trans)
RR1      |    1.64 |    1.63 |    1.63 |    1.72
RR5      |    1.44 |    1.42 |    1.42 |    1.49
RR10     |    1.29 |    1.20 |    1.14 |    1.18
RR20     |    1.14 |    1.08 |    1.02 |    1.03
RR50     |    1.21 |    1.16 |    1.11 |    1.13

CRR1     |   16.75 |   19.52 |   19.52 |   18.38
CRR5     |   14.67 |   16.19 |   15.92 |   14.13
CRR10    |   14.62 |   14.49 |   14.43 |   13.63
CRR20    |   14.80 |   15.32 |   15.15 |   14.75
CRR50    |   15.09 |   16.21 |   15.88 |   15.28

CPU time (msec/MB)
S1       |  102.22 |   44.81 |   29.86 |   23.53
S5       |   98.07 |   38.34 |   28.43 |   26.00
S10      |  104.23 |   39.73 |   29.09 |   26.98
S20      |  112.98 |   45.19 |   31.71 |   28.99
S50      |  119.97 |   41.17 |   36.26 |   32.93

Note: 2064-109; z/VM 4.2.0 with 64-bit CP

Table 3. Throughput and CPU Time: QDIO

MTU Size |    1500 |    8992

Throughput (trans/sec)
RR1      |  594.29 |  596.85
RR5      | 1410.62 | 1406.49
RR10     | 1631.10 | 1625.78
RR20     | 1554.05 | 1547.36
RR50     | 1527.24 | 1527.62

CRR1     |   62.10 |   64.17
CRR5     |   75.30 |   75.96
CRR10    |   78.08 |   83.22
CRR20    |   77.81 |   81.68
CRR50    |   77.30 |   81.19

Throughput (MB/sec)
S1       |   40.76 |   64.32
S5       |   28.65 |   74.45
S10      |   29.54 |   72.62
S20      |   30.44 |   64.46
S50      |   28.62 |   59.22

CPU time (msec/trans)
RR1      |    1.68 |    1.68
RR5      |    1.32 |    1.33
RR10     |    1.09 |    1.10
RR20     |    0.94 |    0.94
RR50     |    1.05 |    1.05

CRR1     |   16.81 |   19.32
CRR5     |   14.98 |   16.06
CRR10    |   14.47 |   14.90
CRR20    |   14.27 |   15.18
CRR50    |   14.54 |   15.42

CPU time (msec/MB)
S1       |   43.23 |   28.64
S5       |   45.45 |   26.46
S10      |   48.75 |   26.47
S20      |   52.76 |   29.23
S50      |   59.89 |   31.17

Note: 2064-109; z/VM 4.2.0 with 64-bit CP

Table 4. Throughput and CPU Time: IUCV

MTU Size |    1500 |    8992

Throughput (trans/sec)
RR1      |         | 1109.99
RR5      |         | 1985.96
RR10     |         | 2022.08
RR20     |         | 2159.24
RR50     |         | 1942.32

CRR1     |         |   66.79
CRR5     |         |   78.60
CRR10    |         |   78.61
CRR20    |         |   81.81
CRR50    |         |   75.98

Throughput (MB/sec)
S1       |   34.39 |   45.21
S5       |   43.18 |   54.02
S10      |   40.01 |   52.32
S20      |   35.75 |   46.41
S50      |   30.30 |   41.38

CPU time (msec/trans)
RR1      |         |    0.94
RR5      |         |    0.98
RR10     |         |    0.98
RR20     |         |    0.92
RR50     |         |    0.99

CRR1     |         |   18.84
CRR5     |         |   17.02
CRR10    |         |   18.06
CRR20    |         |   18.68
CRR50    |         |   20.87

CPU time (msec/MB)
S1       |   33.91 |   23.45
S5       |   41.08 |   28.25
S10      |   45.69 |   29.09
S20      |   51.08 |   31.11
S50      |   57.62 |   36.06

Note: 2064-109; z/VM 4.2.0 with 64-bit CP

Table 5. Throughput and CPU Time: VCTC

MTU Size |    1500 |    8992

Throughput (trans/sec)
RR1      |         | 1034.59
RR5      |         | 1933.19
RR10     |         | 1964.40
RR20     |         | 1990.40
RR50     |         | 1799.93

CRR1     |         |   71.51
CRR5     |         |   84.05
CRR10    |         |   78.25
CRR20    |         |   81.12
CRR50    |         |   80.71

Throughput (MB/sec)
S1       |   40.58 |   48.21
S5       |   47.12 |   57.36
S10      |   46.19 |   56.14
S20      |   43.30 |   49.48
S50      |   36.35 |   44.94

CPU time (msec/trans)
RR1      |         |    0.99
RR5      |         |    1.03
RR10     |         |    1.01
RR20     |         |    0.99
RR50     |         |    1.04

CRR1     |         |   16.78
CRR5     |         |   15.92
CRR10    |         |   17.99
CRR20    |         |   18.22
CRR50    |         |   17.82

CPU time (msec/MB)
S1       |   37.11 |   27.01
S5       |   39.86 |   30.20
S10      |   41.44 |   29.96
S20      |   44.16 |   32.34
S50      |   52.27 |   36.63

Note: 2064-109; z/VM 4.2.0 with 64-bit CP

Maximum Throughput Results:
The maximum throughput for each workload is summarized in the
following tables.
MB/sec (megabytes per second), trans/sec (transactions per second)
and response time were supplied by the workload driver and show the
throughput rate. All other values are from CP monitor data or derived
from CP monitor data.
total_cpu_util
   This field was obtained from the SYSTEM_SUMMARY_BY_TIME VMPRF
   report that shows the average of both processors out of 100%.

tot_cpu_util
   This field is calculated from the USER_RESOURCE_UTILIZATION VMPRF
   report (CPU seconds, total) for the client stack (tcpip1), the
   server stack (tcpip2) and the driver clients and servers (nontcp),
   and gives the total system CPU utilization. 100% is the equivalent
   of one fully utilized processor.

virt_cpu_util
   This field is calculated from the USER_RESOURCE_UTILIZATION VMPRF
   report (CPU seconds, Virtual) for the client stack (tcpip1), the
   server stack (tcpip2) and the driver clients and servers (nontcp).
   This result is the total virtual CPU utilization.

run+wait
   This was calculated from the previous tot_cpu_util plus the percent
   of time waiting on CPU as reported in the USER_STATES VMPRF report
   for each stack (tcpip1 and tcpip2).

cpu_msec/trans
   This field was calculated from the previous tot_cpu_util divided by
   the number of transactions per second (or number of megabytes per
   second for the streaming workload) to show the number of
   milliseconds of CPU time per transaction.

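The cpu_msec/trans derivation can be sketched as follows. This is a hedged illustration rather than the actual VMPRF reduction: the function names are hypothetical, and the conversion from total_cpu_util (an average across both processors) to the 100%-per-processor scale by multiplying by the processor count is an assumption, made because it reproduces the cpu_msec values reported in the maximum-throughput tables.

```python
def cpu_msec_per_unit(tot_cpu_util: float, rate_per_sec: float) -> float:
    """CPU milliseconds per transaction (or per MB for streaming).

    tot_cpu_util: percent, where 100 means one fully utilized processor.
    rate_per_sec: trans/sec, or MB/sec for the streaming workload.
    """
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    cpu_msec_per_sec = tot_cpu_util / 100 * 1000
    return cpu_msec_per_sec / rate_per_sec


def system_tot_cpu_util(total_cpu_util: float, processors: int = 2) -> float:
    """Assumption: rescale an average-across-processors percentage so
    that 100 means one fully utilized processor."""
    return total_cpu_util * processors


if __name__ == "__main__":
    # QDIO request-response run: total_cpu_util 89.30, 1631.10 trans/sec.
    whole_system = cpu_msec_per_unit(system_tot_cpu_util(89.30), 1631.10)
    # Same run, client stack only: tcpip1_tot_cpu_util 58.70.
    client_stack = cpu_msec_per_unit(58.70, 1631.10)
    print(f"{whole_system:.2f} msec/trans system, {client_stack:.2f} for tcpip1")
```

For that QDIO run the sketch gives 1.09 msec/trans for the system and 0.36 msec/trans for tcpip1, in line with the reported cpu_msec/trans and tcpip1_cpu_msec/trans values.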
Table 6. Maximum Throughput: Request-Response

Protocol               |     QDIO |    HIPER |    GUEST |     vCTC |     IUCV
MTU size               |     1500 |      56K |      32K |     8992 |     8992
Number of clients      |       10 |       20 |       20 |       20 |       20
runid                  | qnxr1072 | h5xr2072 | g3xr2071 | vjxr2071 | ijxr2071

MB/sec                 |     1.87 |     1.89 |     1.89 |     2.28 |     2.42
trans/sec              |  1631.10 |  1651.54 |  1654.49 |  1990.40 |  2116.14
response time (msec)   |     6.13 |    12.10 |    12.08 |    10.04 |     9.44
elapsed time (sec)     |   150.00 |   150.00 |   150.00 |   150.00 |   150.00
total_cpu_util         |    89.30 |    81.90 |    84.00 |    99.00 |    99.20

tcpip1_tot_cpu_util    |    58.70 |    43.30 |    47.30 |    48.00 |    40.00
tcpip1_virt_cpu_util   |    45.30 |    30.70 |    32.00 |    22.70 |    20.70
tcpip1_run+wait        |    85.40 |    66.00 |    78.00 |    73.30 |    69.30

tcpip2_tot_cpu_util    |    35.30 |    32.70 |    34.70 |    47.30 |    40.70
tcpip2_virt_cpu_util   |    22.00 |    19.30 |    19.33 |    22.70 |    20.70
tcpip2_run+wait        |    48.60 |    47.40 |    66.70 |    76.60 |    59.40

nontcp_tot_cpu_util    |    80.00 |    80.00 |    80.00 |   106.67 |   106.67
nontcp_virt_cpu_util   |    67.33 |    80.00 |    80.00 |    80.00 |   106.67

cpu_msec/trans         |     1.09 |     0.99 |     1.02 |     0.99 |     0.94
emul_msec/trans        |     0.86 |     0.76 |     0.76 |     0.67 |     0.67
cp_msec/trans          |     0.24 |     0.23 |     0.25 |     0.33 |     0.26

tcpip1_cpu_msec/trans  |     0.36 |     0.26 |     0.14 |     0.24 |     0.19
tcpip1_vcpu_msec/trans |     0.28 |     0.19 |     0.10 |     0.11 |     0.10
tcpip1_ccpu_msec/trans |     0.08 |     0.08 |     0.05 |     0.13 |     0.09

tcpip2_cpu_msec/trans  |     0.22 |     0.20 |     0.10 |     0.24 |     0.19
tcpip2_vcpu_msec/trans |     0.13 |     0.12 |     0.06 |     0.11 |     0.10
tcpip2_ccpu_msec/trans |     0.08 |     0.08 |     0.05 |     0.12 |     0.09

nontcp_cpu_msec/trans  |     0.49 |     0.48 |     0.77 |     0.54 |     0.50
nontcp_vcpu_msec/trans |     0.41 |     0.48 |     0.61 |     0.40 |     0.50
nontcp_ccpu_msec/trans |     0.08 |     0.00 |     0.16 |     0.13 |     0.00

Note: 2064-109; z/VM 4.2.0 with 64-bit CP; TCP/IP 420

Both IUCV and vCTC attained 99% CPU utilization for this workload
and therefore are gated by the available processors.
IUCV had the best throughput with 2116.14 transactions per second.
The driver client virtual machines communicate with the
TCPIP1 stack virtual machine. The driver server virtual machines
communicate with the TCPIP2 stack virtual machine.
All cases show that the driver clients and servers used a large
portion of the resources available.
Table 7. Maximum Throughput: Connect-Request-Response

Protocol               |     QDIO |    HIPER |    GUEST |     vCTC |     IUCV
MTU size               |     8992 |       8K |      56K |     8992 |     8992
Number of clients      |       10 |       50 |       10 |        5 |       20
runid                  | qjxc1071 | h8xc5071 | g5xc1073 | vjxc0572 | ijxc2072

MB/sec                 |     0.66 |     0.70 |     0.72 |     0.66 |     0.64
trans/sec              |    83.22 |    88.74 |    91.54 |    84.05 |    81.81
response time (msec)   |   120.16 |   563.71 |   109.24 |    59.49 |   244.47
elapsed time (sec)     |   150.00 |   120.00 |   150.00 |   150.00 |   150.00
total_cpu_util         |    62.00 |    67.10 |    62.40 |    66.90 |    76.40

tcpip1_tot_cpu_util    |    88.70 |    85.80 |    86.70 |    84.70 |    92.70
tcpip1_virt_cpu_util   |    87.30 |    84.20 |    84.70 |    82.70 |    91.30
tcpip1_run+wait        |    98.00 |    92.50 |    89.40 |    99.40 |   100.70

tcpip2_tot_cpu_util    |    24.00 |    33.30 |    25.30 |    38.70 |    50.00
tcpip2_virt_cpu_util   |    22.70 |    31.70 |    23.30 |    37.30 |    48.70
tcpip2_run+wait        |    28.00 |    43.30 |    29.30 |    45.40 |    58.00

nontcp_tot_cpu_util    |    13.33 |     0.00 |    13.33 |     8.67 |     0.00
nontcp_virt_cpu_util   |    13.33 |     0.00 |    13.33 |     6.67 |     0.00

cpu_msec/trans         |    14.90 |    15.13 |    13.63 |    15.92 |    18.68
emul_msec/trans        |    14.28 |    14.27 |    12.93 |    15.18 |    18.07
cp_msec/trans          |     0.62 |     0.86 |     0.71 |     0.74 |     0.61

tcpip1_cpu_msec/trans  |    10.65 |     9.68 |     9.47 |    10.07 |    11.33
tcpip1_vcpu_msec/trans |    10.49 |     9.49 |     9.25 |     9.84 |    11.16
tcpip1_ccpu_msec/trans |     0.16 |     0.19 |     0.22 |     0.24 |     0.16

tcpip2_cpu_msec/trans  |     2.88 |     3.76 |     2.77 |     4.60 |     6.11
tcpip2_vcpu_msec/trans |     2.72 |     3.57 |     2.55 |     4.44 |     5.95
tcpip2_ccpu_msec/trans |     0.16 |     0.19 |     0.22 |     0.16 |     0.16

nontcp_cpu_msec/trans  |     1.60 |     0.00 |     1.46 |     1.03 |     0.00
nontcp_vcpu_msec/trans |     1.60 |     0.00 |     1.46 |     0.79 |     0.00
nontcp_ccpu_msec/trans |     0.00 |     0.00 |     0.00 |     0.24 |     0.00

Note: 2064-109; z/VM 4.2.0 with 64-bit CP; TCP/IP 420

CRR throughput is less than RR throughput due to the overhead
of connect/disconnect.
Guest LAN seemed to handle the CRR workload the best.
The client stack appeared to be the limiting factor in all cases.
In all cases, the client stack was either running or waiting for the
CPU roughly 90% of the time or more.
Since the stack design is based on a uni-processor model, it will
never be able to exceed 100% of one processor.
Table 8. Maximum Throughput: Streaming

Protocol               |     QDIO |    HIPER |    GUEST |     vCTC |     IUCV
MTU size               |     8992 |      56K |      32K |     8992 |     8992
Number of clients      |        5 |        5 |        5 |        5 |        5
runid                  | qjxs0573 | h5xs0573 | g3xs0573 | vjxs0573 | ijxs0572

MB/sec                 |    74.45 |    72.86 |    65.36 |    57.36 |    54.02
trans/sec              |     3.72 |     3.64 |     3.27 |     2.87 |     2.70
response time (msec)   |  1343.24 |  1372.57 |  1530.00 |  1743.30 |  1851.07
elapsed time (sec)     |   330.00 |   330.00 |   330.00 |   330.00 |   330.00
total_cpu_util         |    98.50 |    89.00 |    92.97 |    86.61 |    76.39

tcpip1_tot_cpu_util    |    93.00 |    92.10 |    91.50 |    60.30 |    48.50
tcpip1_virt_cpu_util   |    70.30 |    70.00 |    59.40 |    26.10 |    26.10
tcpip1_run+wait        |    96.00 |    99.40 |    98.20 |    77.90 |    60.00

tcpip2_tot_cpu_util    |    64.50 |    52.10 |    57.30 |    67.00 |    61.50
tcpip2_virt_cpu_util   |    40.30 |    31.80 |    34.50 |    26.40 |    25.20
tcpip2_run+wait        |    77.20 |    63.50 |    71.20 |    88.80 |    76.70

nontcp_tot_cpu_util    |    37.88 |    32.12 |    36.36 |    42.73 |    40.91
nontcp_virt_cpu_util   |    33.33 |    24.85 |    31.82 |    37.88 |    34.85

cpu_msec/MB            |    26.46 |    24.43 |    28.43 |    30.20 |    28.25
emul_msec/MB           |    19.23 |    17.90 |    19.25 |    15.73 |    16.11
cp_msec/MB             |     7.23 |     6.53 |     9.18 |    14.47 |    12.14

tcpip1_cpu_msec/MB     |    12.50 |    12.64 |    14.00 |    10.51 |     8.98
tcpip1_vcpu_msec/MB    |     9.44 |     9.61 |     9.09 |     4.54 |     4.82
tcpip1_ccpu_msec/MB    |     3.05 |     3.04 |     4.91 |     5.97 |     4.15

tcpip2_cpu_msec/MB     |     8.67 |     7.15 |     8.76 |    11.68 |    11.39
tcpip2_vcpu_msec/MB    |     5.41 |     4.37 |     5.29 |     4.60 |     4.66
tcpip2_ccpu_msec/MB    |     3.26 |     2.79 |     3.48 |     7.08 |     6.73

nontcp_cpu_msec/MB     |     5.09 |     4.41 |    18.18 |     7.45 |     7.57
nontcp_vcpu_msec/MB    |     4.48 |     3.41 |    15.91 |     6.60 |     6.45
nontcp_ccpu_msec/MB    |     0.61 |     1.00 |     4.59 |     0.85 |     1.12

Note: 2064-109; z/VM 4.2.0 with 64-bit CP; TCP/IP 420

QDIO was the winner for the streaming workload. For QDIO,
total system CPU utilization limited throughput.
For HiperSockets and Guest LAN, the client stack was the limiting
factor with more than 95% of the time either running or waiting on
the CPU.