
Linux Guest Crypto Support

z/VM supports the IBM PCICA (PCI Cryptographic Accelerator) and the IBM PCICC (PCI Cryptographic Coprocessor) for Linux guest virtual machines. This support, first provided on z/VM 4.2.0 with APAR VM62905, has been integrated into z/VM 4.3.0. It enables hardware SSL acceleration for Linux on zSeries and S/390 servers, resulting in greatly improved throughput relative to using software encryption/decryption.

The z/VM support allows an unlimited number of Linux guests to share the same PCI cryptographic facilities. Enqueue and dequeue requests issued by the Linux guests are intercepted by CP, which in turn submits real enqueue and dequeue requests to the hardware facilities on their behalf. If all of the PCI queues are in use when a virtual enqueue request arrives, CP adds that request to a CP-maintained queue of pending virtual enqueue requests and issues the real enqueue request later, when a PCI queue becomes available.
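
The following sketch models this queuing behavior. It is illustrative Python pseudocode only (CP is not, of course, implemented in Python), and all names, such as CryptoDispatcher and pending, are hypothetical:

```python
from collections import deque

class CryptoDispatcher:
    """Illustrative model of CP brokering guest crypto requests."""

    def __init__(self, num_pci_queues):
        self.free_queues = list(range(num_pci_queues))  # available real PCI queues
        self.pending = deque()                          # CP-maintained backlog

    def virtual_enqueue(self, request):
        """CP intercepts a guest's virtual enqueue request."""
        if self.free_queues:
            # A real PCI queue is free: submit the real enqueue now.
            self.real_enqueue(self.free_queues.pop(), request)
        else:
            # All PCI queues are in use: defer until one becomes available.
            self.pending.append(request)

    def real_dequeue_complete(self, queue_id, reply):
        """A real request finished; return the reply and reuse the queue."""
        self.deliver_reply_to_guest(reply)
        if self.pending:
            self.real_enqueue(queue_id, self.pending.popleft())
        else:
            self.free_queues.append(queue_id)

    def real_enqueue(self, queue_id, request):
        ...  # hand the request to the hardware facility (not modeled)

    def deliver_reply_to_guest(self, reply):
        ...  # satisfy the guest's virtual dequeue (not modeled)
```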

This section presents and discusses the results of a number of measurements that were designed to understand and quantify the performance characteristics of this support.

Methodology 

The workload consisted of SSL transactions. Each transaction consisted of a connect request, sending 2K of data, receiving 2K of data, and a disconnect request. Hardware encryption/decryption was done only for the data transferred during the initial SSL handshake, which occurs during the connect request. The RC4 MD5 US cipher and 1024-bit keys were used. There was no session ID caching.

The workload was generated using a locally-written tool called the SSL Exerciser, which includes both client and server application code. One server application was started for each client to be used. Each client sent SSL transactions to its assigned server. As soon as each transaction completed, the client started the next one (zero think time). The degree of total system loading was varied by changing the number of these client/server pairs that were started.
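
The actual SSL Exerciser is a local tool and is not reproduced here, but the following minimal Python sketch shows the shape of one client: the transaction sequence described above, driven in a closed loop with zero think time. The host name, port, and run duration are invented for illustration, and the sketch uses the TLS library's default cipher suite because modern toolkits no longer ship the RC4 MD5 cipher used in these measurements.

```python
import socket
import ssl
import time

HOST, PORT = "server.example.com", 4433  # hypothetical server address
PAYLOAD = b"x" * 2048                    # 2K each way, as in the workload
DURATION = 60.0                          # seconds to run this client

# One TLS context for the client. A Python client does not resume TLS
# sessions unless explicitly asked to, which matches the measured setup
# (no session ID caching: every transaction pays for a full handshake).
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE          # lab-only shortcut

def one_transaction():
    """Connect, send 2K of data, receive 2K of data, disconnect."""
    with socket.create_connection((HOST, PORT)) as raw:
        with ctx.wrap_socket(raw, server_hostname=HOST) as tls:
            tls.sendall(PAYLOAD)
            received = 0
            while received < len(PAYLOAD):
                chunk = tls.recv(4096)
                if not chunk:
                    break
                received += len(chunk)

# Zero think time: start the next transaction the moment one completes,
# then report throughput, as the exerciser clients did.
start = time.monotonic()
transactions = 0
while time.monotonic() - start < DURATION:
    one_transaction()
    transactions += 1
print(f"{transactions / DURATION:.2f} Tx/sec")
```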

Clients were distributed across one or more client systems so as not to overload any of those systems. Unless otherwise specified, these client systems were all RS/6000 workstations running AIX.

Servers were distributed across one or more V=V Linux guest virtual machines running in a z/VM 4.3.0 system. There was one SSL Exerciser server application per Linux guest. The z/VM system was run in a dedicated LPAR of a 2064-116 zSeries processor. This LPAR was configured with one or more PCICA cards, depending on the measurement. Gb Ethernet was used to connect the workstations and the zSeries system.

Unless otherwise stated, Linux 2.4.7 (internal driver 12) and the 11/01/01 level of the z90crypt Linux crypto driver were used.

For any given measurement, the server application was first started in the Linux guest(s). All the clients were then started. After a 5-minute stabilization period, hardware instrumentation and (for most of the measurements) CP monitor data were collected during a 20-minute measurement interval. The CP monitor data were reduced using VMPRF. Throughput results were provided by the client applications.

Comparison to Software Encryption 

Measurements were obtained to compare the performance of the SSL workload with and without the use of hardware encryption. For these measurements, there was one Linux guest running in an LPAR with 4 dedicated processors. The LPAR was configured with one domain of a PCICA card. The results are summarized in Table 1.

Table 1. The Benefits of Hardware Encryption


Hardware Encryption?           no          yes         yes
Clients                        20          20          180
Client Workstations            1           1           9
Run ID                         E2110BV1    E1B08BV1    E1C17BV1

Tx/sec                         74.80       157.38      561.92

Total Util/Proc (h)            98.31       32.21       95.89
CP Util/Proc (h)               1.86        8.04        17.98
Emul Util/Proc (h)             96.45       24.18       77.91
Percent CP (h)                 1.9         25.0        18.8

Total CPU/Tx (h)               52.570      8.187       6.825
CP CPU/Tx (h)                  0.994       2.042       1.280
Emul CPU/Tx (h)                51.575      6.144       5.546

Ratios (software encryption case = 1.000):

Tx/sec                         1.000       2.104       7.512

Total Util/Proc (h)            1.000       0.328       0.975
CP Util/Proc (h)               1.000       4.323       9.667
Emul Util/Proc (h)             1.000       0.251       0.808
Percent CP (h)                 1.000       13.158      9.895

Total CPU/Tx (h)               1.000       0.156       0.130
CP CPU/Tx (h)                  1.000       2.054       1.288
Emul CPU/Tx (h)                1.000       0.119       0.108

Note: 2064-116; 1 PCICA card; LPAR with: 1 PCICA domain, 4 dedicated processors, 2G central storage, no expanded storage; Gb Ethernet; RS/6000 workstations; Workload: SSL exerciser, 2048 bytes each way, 1024 bit keys, RC4 MD5 US cipher, no session id caching; One Linux guest with 4 virtual CPUs; z/VM 4.3.0; Linux 2.4.7 D12; 11/01/01 Linux crypto driver; CPU/Tx is in msec; (h) = hardware instrumentation

Without hardware encryption, throughput was limited by the LPAR's processor capacity, as shown by the very high processor utilization. Most of the processor utilization is in emulation and is primarily due to software encryption/decryption processing in the Linux guest.

When hardware encryption was enabled, this software encryption overhead was eliminated, reducing Total CPU/Tx by 84%. This resulted in a much higher throughput at a much lower processor utilization, allowing the load applied to the system to be increased (by starting more clients). The observed 562 Tx/sec is 7.5 times higher than the 74.8 Tx/sec that could be achieved when software encryption was used.

Percent CP is the percentage of all CPU usage that occurs in CP. It represents processing that would not have occurred had the Linux system been run directly on the LPAR. Percent CP increases in the hardware encryption cases because each crypto request now flows through CP.
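
As a consistency check, Percent CP can be recomputed from the utilization rows of Table 1, and the 84% Total CPU/Tx reduction cited above follows from the CPU/Tx rows. A small Python sketch using the table's own numbers:

```python
# Percent CP = 100 * (CP Util/Proc) / (Total Util/Proc), from Table 1
runs = {
    "E2110BV1": (98.31, 1.86),   # software encryption
    "E1B08BV1": (32.21, 8.04),   # hardware encryption, 20 clients
    "E1C17BV1": (95.89, 17.98),  # hardware encryption, 180 clients
}
for run_id, (total_util, cp_util) in runs.items():
    print(f"{run_id}: Percent CP = {100 * cp_util / total_util:.1f}")
# Prints 1.9, 25.0, and 18.8, matching the table.

# The 84% reduction in Total CPU/Tx: 1 - 8.187/52.570
print(f"reduction = {1 - 8.187 / 52.570:.1%}")  # -> 84.4%
```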

Horizontal Scaling Study 

The preceding results were for the case of one Linux guest. Additional measurements were obtained to see how performance is affected when the applied SSL transaction workload is distributed across multiple Linux guests. Those results are summarized in Table 2.

Table 2. Horizontal Scaling Study


Linux Guests                   1           1           24          118
Virtual CPUs/Guest             4           4           1           1
TCP/IP VM Router?              no          yes         yes         yes
Clients                        180         180         120         590
Real Storage                   2G          2G          2G          6G
Run ID                         E1C17BV1    E2128BV2    E2131BV1    E2204BV1

Tx/sec                         626.86      695.55      675.55      624.87

Total Util/Proc (h)            95.89       99.57       99.57       96.44
CP Util/Proc (h)               17.98       4.12        10.05       10.55
Emul Util/Proc (h)             77.91       95.46       89.52       85.89
Percent CP (h)                 18.8        4.1         10.1        10.9

Total CPU/Tx (h)               6.118       5.726       5.895       6.173

TCPIP Total CPU/Tx (v)         na          na          1.004       1.158
TCPIP Emul CPU/Tx (v)          na          na          0.786       0.958
TCPIP CP CPU/Tx (v)            na          na          0.218       0.200

Linux Total CPU/Tx (v)         5.865       na          4.831       4.914
Linux Emul CPU/Tx (v)          5.101       na          4.537       4.570
Linux CP CPU/Tx (v)            0.764       na          0.294       0.344
Percent Linux CP (v)           13.0        na          6.1         7.0

Ratios (relative to the first run unless otherwise noted):

Tx/sec (h)                     1.000       1.110       1.078       0.997

Total Util/Proc (h)            1.000       1.038       1.038       1.006
CP Util/Proc (h)               1.000       0.229       0.559       0.587
Emul Util/Proc (h)             1.000       1.225       1.149       1.102
Percent CP (h)                 1.000       0.218       0.537       0.580

Total CPU/Tx (h), rel. run 1   1.000       0.936       0.964       1.009
Total CPU/Tx (h), rel. run 2   1.068       1.000       1.030       1.078

Linux Total CPU/Tx (v)         1.000       na          0.824       0.838
Linux Emul CPU/Tx (v)          1.000       na          0.889       0.896
Linux CP CPU/Tx (v)            1.000       na          0.385       0.450
Percent Linux CP (v)           1.000       na          0.469       0.538

Note: 2064-116; 1 PCICA card; LPAR with: 1 PCICA card domain, 4 dedicated processors, 2G central storage, no expanded storage; Gb Ethernet; 9 RS/6000 client workstations; Workload: SSL exerciser, 2048 bytes each way, 1024 bit keys, RC4 MD5 US cipher, no session id caching; z/VM 4.3.0; Linux 2.4.7 D12; 11/01/01 Linux crypto driver; CPU/Tx is in msec; (h) = hardware instrumentation, (v) = VMPRF

The number of started clients varied across these measurements. In each case, however, the number of clients was more than sufficient to fully load the measured LPAR.

CP monitor records were not collected for run E2128BV2.

It seemed appropriate to make two configuration changes when switching from one to multiple Linux guests. First, we set up a TCP/IP VM stack virtual machine to own the Gb Ethernet adapter and serve as a router for the Linux guests. Virtual channel-to-channel was used to connect the Linux guests to the TCP/IP VM stack. 1 Second, we defined each Linux guest as a virtual uniprocessor, because it was no longer necessary to define a virtual 4-way to utilize all four processors and a virtual uniprocessor is a somewhat more efficient way to run Linux. 2

The first two measurements in Table 2 show the transition from Linux communicating directly with the Gb Ethernet adapter to Linux communicating indirectly through the TCP/IP VM router. The 7% drop in Total CPU/Tx is mostly due to the elimination of CP's QDIO shadow queues, which are unnecessary in the TCP/IP VM case because the TCP/IP stack machine fixes the QDIO queue pages in real storage. This improvement more than compensated for the additional processing arising from the more complex router configuration.

Percent Linux CP is the percentage of all CPU time consumed by the Linux guests that is in CP. This CP overhead is partly due to the CP crypto support and partly reflects the normal CP overhead required to support any V=V guest.
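
The same arithmetic applies here, using the per-guest VMPRF rows of Table 2; for example, for run E1C17BV1:

```python
# Percent Linux CP = 100 * (Linux CP CPU/Tx) / (Linux Total CPU/Tx)
print(f"{100 * 0.764 / 5.865:.1f}")  # -> 13.0, matching Table 2
```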

The results show that processing efficiency decreases slightly as the workload is distributed across more Linux guests. Relative to 1 Linux guest with TCP/IP VM router (column 2), Total CPU/Tx increased by 3% with 24 Linux guests and by 8% with 118 Linux guests. Analysis of the hardware instrumentation data revealed that most of these increases are in CP and that the increases are not related to the CP crypto support.

Effect of Improved Crypto Driver 

The performance of the Linux crypto driver has recently been improved substantially. The overall effects of this improvement in a multiple Linux guest environment are illustrated in Table 3.

For these measurements, the Linux guests were run in an LPAR with 8 dedicated processors. The LPAR was configured with access to multiple PCICA cards to prevent this resource from limiting throughput. The SSL workload was distributed across multiple TCP/IP VM stack virtual machines to prevent TCP/IP stack utilization from limiting throughput (each TCP/IP VM stack machine can only run on 1 real processor at a time).

For run E2308BV1, all clients were run on a G6 Enterprise Server running z/OS. For run E2415BV1, the clients were distributed across 11 RS/6000 AIX workstations.

RMF data showing PCICA card utilization were also collected for these measurements, from a z/OS system running in a different LPAR.


Table 3. Effect of Improved Crypto Driver


Linux Crypto Driver            11-01-01       03-28-02
Linux Guests                   118            116
Linux Level                    2.4.7 D12      2.4.17 D22.3
Gb Ethernet Adapters           1              2
TCP/IP VM Routers              4              6
PCICA Cards                    2              6
Domains per PCICA Card         15             1
Clients                        590            548
Run ID                         E2308BV1       E2415BV1       Ratios

Tx/sec                         1259.25        2409.33        1.913

Total Util/Proc (h)            97.75          99.37          1.017
CP Util/Proc (h)               14.32          30.35          2.119
Emul Util/Proc (h)             83.43          69.02          0.827
Percent CP (h)                 14.6           30.5           2.089

Avg PCICA Util (rmf)           57.5           37.9           0.659

Total CPU/Tx (h)               5.244          3.310          0.631
CP CPU/Tx (h)                  0.768          1.011          1.316
Emul CPU/Tx (h)                4.476          2.299          0.514

Total CPU/Tx (v)               6.207          3.301          0.532

TCPIP Total CPU/Tx (v)         0.655          0.750          1.145
TCPIP Emul CPU/Tx (v)          0.357          0.390          1.092
TCPIP CP CPU/Tx (v)            0.298          0.360          1.208

Linux Total CPU/Tx (v)         5.441          2.449          0.450
Linux Emul CPU/Tx (v)          4.977          1.938          0.389
Linux CP CPU/Tx (v)            0.464          0.511          1.101
Percent Linux CP (v)           8.5            20.9           2.459

Note: 2064-116; LPAR with: 8 dedicated processors, 14G central storage, no expanded storage; Gb Ethernet; RS/6000 workstations; Workload: SSL exerciser, 2048 bytes each way, 1024 bit keys, RC4 MD5 US cipher, no session id caching; z/VM 4.3.0; 1 virtual CPU per Linux guest; CPU/Tx is in msec; (h) = hardware instrumentation, (v) = VMPRF

Several aspects of the configuration were changed between these two runs, mostly to accommodate the higher throughput. For example, the number of PCICA cards and the number of TCP/IP stack machines were increased. A more recent level of the Linux kernel was also used for the second measurement.

Although these changes had some effect on the comparison results, nearly all of the observed performance changes are due to the crypto driver improvement, which reduced Linux Emul CPU/Tx by 61%. This allowed the throughput achieved by the measured configuration to be increased by 91%.

CP CPU/Tx increased by 32%, partly due to increased CP CPU usage by the TCP/IP stack machines and partly due to increased MP lock contention resulting from the much higher throughput. None of the CP CPU/Tx increases are related to the CP crypto support.

Percent Linux CP increased from 8.5% to 20.9%. Most of this increase is due to the large decrease in Linux Emul CPU/Tx caused by the Linux crypto driver improvement.


Footnotes:

1. These choices were rather arbitrary. We could instead, for example, have set up another Linux guest to serve as the router and used VM Guest LAN for communication between it and the other Linux guests; that would have worked quite well.

2. The same MP-capable Linux kernel was used. It is possible to build a Linux kernel that does not include MP locks, but we did not try this.
