NVMe EDEVs
Abstract
Starting with z/VM 7.3, when an IBM Adapter for NVMe is installed in a LinuxONE server and the adapter is fitted with SSD storage, z/VM can exploit that storage to create EDEVs. Throughput characteristics of NVMe EDEVs will vary from one SSD manufacturer to another. Control Program (CP) CPU time per I/O is competitive with that of FCP SCSI EDEVs. The NVMe SSD we tried proved suitable for use as a paging device in a heavy-paging workload.
Background
In z/VM 5.1 IBM introduced the notion of an emulated device or EDEV. An EDEV is a virtualized, persistent block storage device that obeys FBA channel programs but is not backed by real FBA hardware. Rather, it is backed by any hardware whose geometry is sufficient for CP to store 512-byte blocks of data on it. For its first EDEVs, CP used an FCP device and a SCSI LUN to create the appearance of an FBA DASD. Such EDEVs are in use today at many client sites.
With the appearance on IBM LinuxONE of the IBM Adapter for NVMe and of NVMe SSD storage cards from a variety of manufacturers, it became natural to ask whether z/VM could use that hardware as backing store for an EDEV. In z/VM 7.3 CP has been equipped to do this. EDEVs backed by NVMe SSD storage can be used for guest minidisks, guest-attached or guest-dedicated volumes, and for CP paging, spooling, T-disk, or directory extents.
Unlike FCP SCSI EDEVs, NVMe EDEVs implement the HyperPAV facility. CP causes the NVMe adapter and its installed SSD storage to appear to be a logical control unit (LCU) containing base EDEVs and HyperPAV alias EDEVs. Like a 3990 LCU, an NVMe EDEV LCU has an SSID. Further, like 3390 HyperPAV aliases, NVMe HyperPAV aliases can be attached to SYSTEM so CP can use them to parallelize guest I/O or CP I/O.
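The value of HyperPAV aliases can be seen with a simplified queueing sketch (this is an illustration of the parallelism idea, not z/VM's actual I/O scheduling): each alias adds one more subchannel that can drive I/O to the device concurrently, so by Little's law the ceiling on I/O rate is concurrency divided by service time. The function name and the 0.125 ms service time below are hypothetical.

```python
# Simplified model: with HyperPAV, up to 1 base + N alias subchannels can
# drive I/O to the same EDEV at once. The throughput ceiling is then
# concurrency / service_time (Little's law), assuming the device itself
# can absorb the parallel requests.

def max_iops(aliases: int, service_time_ms: float) -> float:
    """Upper bound on I/Os per second for one base EDEV plus its aliases."""
    concurrency = 1 + aliases              # base subchannel plus aliases
    return concurrency / (service_time_ms / 1000.0)

# With a 0.125 ms service time, one subchannel tops out near 8000 I/O/s;
# adding 7 aliases raises the ceiling eightfold.
print(max_iops(0, 0.125))
print(max_iops(7, 0.125))
```

This is why the scaling measurements below add aliases one at a time: each alias raises the concurrency limit, and throughput should rise until something else (the SSD, the adapter, or CPU) saturates.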
Method
NVMe EDEVs were evaluated as guest minidisk storage and as CP paging storage, according to the details below.
Guest Minidisk I/O Workload
A number of base NVMe EDEVs were created. The EDEVs were formatted entirely as PERM extents and then ATTACHed to SYSTEM. Guest minidisks were placed on these EDEVs, many minidisks per EDEV, one minidisk per guest. HyperPAV alias EDEVs were also created and ATTACHed to SYSTEM. CMS guests were created, one per minidisk, to run the IO3390 workload. When run against FBA DASD, IO3390 reads or writes eight FB-512 blocks per I/O. Each guest was configured to run IO3390 against its minidisk repeatedly, with no delay between I/Os. MONWRITE data was collected.
A corresponding configuration was run with FCP EDEVs. As before, the EDEVs were allocated as all PERM space and ATTACHed to SYSTEM. CMS guests were given minidisks on those EDEVs, and IO3390 was used to drive workload against the minidisks. No alias EDEVs were in play because there are no such devices for FCP EDEVs.
CP Paging I/O Workload
A number of base NVMe EDEVs were created. The EDEVs were formatted entirely as PAGE extents and then ATTACHed to SYSTEM. HyperPAV alias EDEVs were also created and ATTACHed to SYSTEM. A number of CMS guests were logged on, each configured to run the VIRSTOR workload. The size of central storage, the number and size of the guests, and the VIRSTOR configuration were chosen to make the system page. MONWRITE data was collected.
Results and Discussion
Minidisk I/O: CP CPU Time per Transfer
Early in the development of the enhancement, it was asked whether the performance of NVMe EDEVs was good enough to justify continuing the project. Recognizing that device service time was largely in the hands of the hardware vendors, it was decided that CP CPU time per transfer would be the success metric for continuing.
The guest minidisk workload described above was used to evaluate CP CPU time per transfer. A base case was run using FCP EDEVs. A comparison case was run with NVMe EDEVs. Table 1 shows the findings.
Table 1. Minidisk I/O: NVMe EDEV vs. FCP EDEV
Metric | SE43REC8 | NV020000 | SA43REC8 | NV020100 |
EDEV type | FCP | NVMe | FCP | NVMe |
I/O direction | writes | writes | reads | reads |
ETR (FCX239 transfers/sec) | 16675 | 35762 | 23300 | 5141 |
FCX225 %Busy | 12.4 | 12.5 | 17.1 | 2.1 |
CPUs | 2 | 4 | 2 | 4 |
Total CPU %Busy | 24.8 | 50.0 | 34.2 | 8.4 |
FCX225 T/V | 7.3 | 6.47 | 7.15 | 6.6 |
CP busy | 21.4 | 42.3 | 29.4 | 7.13 |
CP CPU-sec/sec | 0.214 | 0.423 | 0.294 | 0.071 |
CP CPU-sec/tx | 1.28E-05 | 1.18E-05 | 1.26E-05 | 1.39E-05 |
D6 R3 service time (msec) | 0.135 | 0.000121 | 0.128 | 0.125 |
Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, all cores VH. z/VM 7.2 with experimental NVMe EDEV code. FCP runs used 4 FCP EDEVs and 12 IO3390 workers. NVMe runs used one base EDEV and 20 IO3390 workers.
For NVMe EDEVs, CP CPU time per transfer was found to be in line with that of FCP EDEVs.
In these measurements we also found that our NVMe SSD could write much more quickly than it could read. We later learned that NVMe SSDs are built to be either write-optimized or read-optimized.
Minidisk I/O: Scaling
A second measurement examined whether there were serialization bottlenecks in the implementation. To evaluate this, an arc of 12 minidisk I/O measurements was collected, ranging from one base device with no aliases up to one base device with 11 aliases. A first arc was done for 100% writes and then a second arc was done for 100% reads. Table 2 shows the findings.
Table 2. Minidisk I/O: NVMe Scaling Results
Run name | NV020000 | NV120000 | NV220000 | NV320000 | NV420000 | NV520000 | NV620000 | NV720000 | NV820000 | NV920000 | NVA20000 | NVB20000 |
Direction | write | write | write | write | write | write | write | write | write | write | write | write |
Aliases | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Blks/sec | 240417 | 337046 | 704185 | 961712 | 859125 | 981859 | 906266 | 951017 | 965490 | 980604 | 989048 | 985269 |
CP CPU-sec per million blks | 0.18 | 1.37 | 1.49 | 1.60 | 1.78 | 1.92 | 2.18 | 2.14 | 1.93 | 1.99 | 1.94 | 1.93 |
Service time (msec) | 0.000 | 0.000 | 0.000 | 0.001 | 0.003 | 0.004 | 0.023 | 0.023 | 0.029 | 0.027 | 0.029 | 0.030 |
Run name | NV020100 | NV120100 | NV220100 | NV320100 | NV420100 | NV520100 | NV620100 | NV720100 | NV820100 | NV920100 | NVA20100 | NVB20100 |
Direction | read | read | read | read | read | read | read | read | read | read | read | read |
Aliases | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Blks/sec | 34367 | 70929 | 126555 | 156357 | 195652 | 239110 | 245133 | 233372 | 276430 | 299644 | 354203 | 413079 |
CP CPU-sec per million blks | 0.22 | 1.61 | 1.57 | 1.55 | 1.52 | 1.57 | 2.52 | 2.25 | 1.66 | 1.67 | 1.60 | 1.60 |
Service time (msec) | 0.126 | 0.119 | 0.081 | 0.095 | 0.095 | 0.090 | 0.113 | 0.119 | 0.213 | 0.218 | 0.203 | 0.189 |
Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, four cores, all VH. 3 GB central storage. z/VM 7.1 with experimental NVMe EDEV code. Twenty CMS guests running IO3390.
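One way to read Table 2 is as throughput relative to the zero-alias run in each direction; the block rates below are copied straight from the table, and the speedup calculation is ours:

```python
# Blks/sec from Table 2, zero through eleven aliases.
write_blks = [240417, 337046, 704185, 961712, 859125, 981859,
              906266, 951017, 965490, 980604, 989048, 985269]
read_blks  = [34367, 70929, 126555, 156357, 195652, 239110,
              245133, 233372, 276430, 299644, 354203, 413079]

def speedups(blks_per_sec):
    """Throughput of each run relative to the zero-alias base run."""
    base = blks_per_sec[0]
    return [round(b / base, 1) for b in blks_per_sec]

print(speedups(write_blks))   # writes plateau near 4x by three aliases
print(speedups(read_blks))    # reads keep scaling, roughly 12x at 11 aliases
```

Writes appear to saturate the SSD itself at around four concurrent operations, while reads continue to benefit from every added alias across the whole arc, consistent with the absence of any serialization bottleneck in CP.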
No scaling problem was found.
CP Paging I/O
A third measurement explored whether NVMe EDEVs could be used as paging storage. To evaluate this, the paging workload described above was used. Table 3 shows the findings.
Table 3. CP Paging: NVMe Results
Metric | Value |
Run name | PG05903C |
Guest busy, % | 21.4 |
Chargeable CP busy, % | 18.1 |
Nonchargeable CP busy, % | 199.2 |
Total CP busy, % | 217.3 |
Page reads, /sec | 146000 |
Page writes, /sec | 157000 |
Total paging, /sec | 303000 |
CP busy % / paging rate | 7.17E-04 |
Paging SSCH rate, reads, /sec | 20878 |
Pages read per read SSCH | 7 |
Paging SSCH rate, writes, /sec | 1960 |
Pages written per write SSCH | 80 |
Total paging SSCH rate, /sec | 22838 |
CP busy % / paging SSCH rate | 9.51E-03 |
Paging volume I/O rate, /vol, /sec | 5709 |
Paging volume service time, msec | 1.28 |
Alias pool tries, /sec | 17264 |
Alias pool fails, /sec | 0 |
Aliases in use, avg | 27.2 |
Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, four cores, all VH. 4 GB central storage. z/VM 7.3. Four NVMe base EDEVs, each with a PAGE extent of 16777211 4 KB pages (64 GB). Fifty-nine NVMe alias EDEVs attached to SYSTEM. Thirty CMS users, each 4096 MB, with VIRSTOR configured to thrash memory in the guest's upper 2 GB. CPU busy is expressed such that a value of 100% means one processor completely consumed.
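Table 3's derived rows can be reproduced from its measured rows: pages per SSCH is the page rate divided by the SSCH rate, the per-volume I/O rate spreads the total SSCH rate over the four base EDEVs, and the efficiency ratios divide CP busy by the paging rates.

```python
# Inputs copied from Table 3.
page_reads, page_writes = 146000, 157000    # pages moved per second
read_ssch, write_ssch   = 20878, 1960       # paging SSCHs per second
cp_busy_pct = 18.1 + 199.2                  # chargeable + nonchargeable CP busy
n_volumes = 4                               # four NVMe base EDEVs

total_paging = page_reads + page_writes     # 303000 pages/sec
total_ssch   = read_ssch + write_ssch       # 22838 SSCHs/sec

print(round(page_reads / read_ssch))        # ~7 pages per read SSCH
print(round(page_writes / write_ssch))      # ~80 pages per write SSCH
print(total_ssch / n_volumes)               # ~5709 I/Os per volume per second
print(cp_busy_pct / total_paging)           # ~7.17e-04 CP busy % per page moved
print(cp_busy_pct / total_ssch)             # ~9.51e-03 CP busy % per SSCH
```

The large pages-per-SSCH figure on writes reflects CP's page-out batching: far fewer, larger channel programs are needed to write pages than to fault them back in.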
NVMe EDEVs were found to be suitable for paging.
Summary
The IBM Adapter for NVMe and an NVMe SSD module form a viable backing store for an EDEV. NVMe EDEVs can be used successfully for guest minidisks or for CP paging. CP CPU time per transfer was found to be about the same as for an FCP EDEV. With the SSD module we used, NVMe EDEVs were found to be suitable as CP paging devices.