Contents | Previous | Next

NVMe EDEVs

Abstract

Starting with z/VM 7.3, when an IBM Adapter for NVMe is installed into a LinuxONE server and the adapter is fitted with SSD storage, z/VM can exploit that storage to create EDEVs. Throughput characteristics of NVMe EDEVs will vary from one SSD manufacturer to another. Control Program (CP) CPU time per I/O is competitive with FCP SCSI EDEVs. The NVMe SSD we tried was suitable for use as a paging device in a heavy-paging workload.

Background

In z/VM 5.1 IBM introduced the notion of an emulated device, or EDEV. An EDEV is a virtualized, persistent block storage device that obeys FBA channel programs but is not backed by real FBA hardware. Rather, it is backed by any hardware whose geometry is sufficient for CP to store 512-byte blocks of data on it. For its first EDEVs, CP used an FCP device and a SCSI LUN to create the appearance of an FBA DASD. Such EDEVs remain in use today at many client sites.

With the appearance on IBM LinuxONE of the IBM Adapter for NVMe and of NVMe SSD storage cards from a variety of manufacturers, the question arose whether z/VM could use that hardware as backing store for an EDEV. In z/VM 7.3 CP has been equipped to do this. EDEVs backed by NVMe SSD storage can be used for guest minidisks, for guest-attached or guest-dedicated volumes, and for CP paging, spooling, T-disk, or directory extents.

Unlike FCP SCSI EDEVs, NVMe EDEVs implement the HyperPAV facility. CP causes the NVMe adapter and its installed SSD storage to appear as a logical control unit (LCU) containing base EDEVs and HyperPAV alias EDEVs. Like a 3990 LCU, an NVMe EDEV LCU has an SSID. Further, like 3390 HyperPAV aliases, NVMe HyperPAV aliases can be attached to SYSTEM so that CP can use them to parallelize guest I/O or CP I/O.

Method

NVMe EDEVs were evaluated as guest minidisk storage and as CP paging storage, according to the details below.

Guest Minidisk I/O Workload

A number of base NVMe EDEVs were created. The EDEVs were allocated entirely as PERM extents and then ATTACHed to SYSTEM. Guest minidisks were placed on these EDEVs, many minidisks per EDEV, one minidisk per guest. HyperPAV alias EDEVs were also created and ATTACHed to SYSTEM. CMS guests were created, one per minidisk, to run the IO3390 workload. When run against FBA DASD, IO3390 reads or writes eight FB-512 blocks per I/O. Each guest ran IO3390 against its minidisk repeatedly, with no delay between I/Os. MONWRITE data was collected.
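As a point of reference, IO3390's eight FB-512 blocks per I/O amount to one 4 KB page per transfer. The small sketch below converts a transfer rate into the implied data rate; the sample ETR figure is purely illustrative.

```python
# IO3390 against FBA DASD moves eight 512-byte (FB-512) blocks per I/O.
BLOCK_SIZE = 512       # bytes per FB-512 block
BLOCKS_PER_IO = 8      # blocks IO3390 reads or writes per I/O

bytes_per_io = BLOCK_SIZE * BLOCKS_PER_IO
print(bytes_per_io)    # 4096: each transfer moves one 4 KB page of data

def mb_per_sec(ios_per_sec):
    """Data rate implied by a given I/O rate, in MB/sec."""
    return ios_per_sec * bytes_per_io / 1e6

# Illustration: a transfer rate of 35762 I/Os per second
print(round(mb_per_sec(35762)))   # about 146 MB/sec
```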

A corresponding configuration was run with FCP EDEVs. As before, the EDEVs were allocated as all PERM space and ATTACHed to SYSTEM. CMS guests were given minidisks on those EDEVs, and IO3390 was used to drive workload against the minidisks. No alias EDEVs were in play because HyperPAV aliases do not exist for FCP EDEVs.

CP Paging I/O Workload

A number of base NVMe EDEVs were created. The EDEVs were allocated entirely as PAGE extents and then ATTACHed to SYSTEM. HyperPAV alias EDEVs were also created and ATTACHed to SYSTEM. A number of CMS guests were logged on, each configured to run the VIRSTOR workload. The size of central storage, the number and size of the guests, and the VIRSTOR configuration were chosen to make the system page. MONWRITE data was collected.

Results and Discussion

Minidisk I/O: CP CPU Time per Transfer

Early in the development of the enhancement, the question was asked whether the performance of NVMe EDEVs was good enough to justify continuing the project. Recognizing that device service time was largely in the hands of the hardware vendors, CP CPU time per transfer was chosen as the success metric for continuing.

The guest minidisk workload described above was used to evaluate CP CPU time per transfer. A base case was run using FCP EDEVs. A comparison case was run with NVMe EDEVs. Table 1 shows the findings.

Table 1. Minidisk I/O: NVMe EDEV vs. FCP EDEV

Metric                       SE43REC8  NV020000  SA43REC8  NV020100
EDEV type                    FCP       NVMe      FCP       NVMe
I/O direction                writes    writes    reads     reads
ETR (FCX239 transfers/sec)   16675     35762     23300     5141
FCX225 %Busy                 12.4      12.5      17.1      2.1
CPUs                         2         4         2         4
Total CPU %Busy              24.8      50        34.2      8.4
FCX225 T/V                   7.3       6.47      7.15      6.6
CP busy                      21.4      42.3      29.4      7.13
CP CPU-sec/sec               0.214     0.423     0.294     0.071
CP CPU-sec/tx                1.28E-05  1.18E-05  1.26E-05  1.39E-05
D6 R3 service time (msec)    0.135     0.000121  0.128     0.125

Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, all cores VH. z/VM 7.2 with experimental NVMe EDEV code. FCP runs used 4 FCP EDEVs and 12 IO3390 workers. NVMe runs used one base EDEV and 20 IO3390 workers.

For NVMe EDEVs, CP CPU time per transfer was found to be competitive with that of FCP EDEVs.
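The CP CPU-sec/tx row of Table 1 can be rederived from the rows above it by dividing CP CPU-sec/sec by the transfer rate. A quick sketch using the Table 1 figures (for NV020100, the unrounded CP busy value, 7.13%, is used in place of the rounded 0.071):

```python
# Rederive Table 1's "CP CPU-sec/tx" row: CP CPU-sec/sec divided by ETR.
runs = {
    #  run name:  (CP CPU-sec/sec, ETR transfers/sec)
    "SE43REC8": (0.214,  16675),   # FCP,  writes
    "NV020000": (0.423,  35762),   # NVMe, writes
    "SA43REC8": (0.294,  23300),   # FCP,  reads
    "NV020100": (0.0713, 5141),    # NVMe, reads (CP busy 7.13%)
}
for name, (cp_cpu_sec, etr) in runs.items():
    print(f"{name}: {cp_cpu_sec / etr:.2e} CP CPU-sec/tx")
# SE43REC8: 1.28e-05, NV020000: 1.18e-05,
# SA43REC8: 1.26e-05, NV020100: 1.39e-05
```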

In these measurements we also found our NVMe SSD could write much more quickly than it could read. We later learned that NVMe SSDs are built either write-optimized or read-optimized.

Minidisk I/O: Scaling

A second measurement examined whether there were serialization bottlenecks in the implementation. To evaluate this, an arc of 12 minidisk I/O measurements was collected, ranging from one base device with no aliases up to one base device with 11 aliases. A first arc was done for 100% writes and then a second arc was done for 100% reads. Table 2 shows the findings.

Table 2. Minidisk I/O: NVMe Scaling Results

Run name  Direction  Aliases  Blks/sec  CP CPU-sec per million blks  Service time (msec)
NV020000  write      0        240417    0.18                         0.000
NV120000  write      1        337046    1.37                         0.000
NV220000  write      2        704185    1.49                         0.000
NV320000  write      3        961712    1.60                         0.001
NV420000  write      4        859125    1.78                         0.003
NV520000  write      5        981859    1.92                         0.004
NV620000  write      6        906266    2.18                         0.023
NV720000  write      7        951017    2.14                         0.023
NV820000  write      8        965490    1.93                         0.029
NV920000  write      9        980604    1.99                         0.027
NVA20000  write      10       989048    1.94                         0.029
NVB20000  write      11       985269    1.93                         0.030
NV020100  read       0        34367     0.22                         0.126
NV120100  read       1        70929     1.61                         0.119
NV220100  read       2        126555    1.57                         0.081
NV320100  read       3        156357    1.55                         0.095
NV420100  read       4        195652    1.52                         0.095
NV520100  read       5        239110    1.57                         0.090
NV620100  read       6        245133    2.52                         0.113
NV720100  read       7        233372    2.25                         0.119
NV820100  read       8        276430    1.66                         0.213
NV920100  read       9        299644    1.67                         0.218
NVA20100  read       10       354203    1.60                         0.203
NVB20100  read       11       413079    1.60                         0.189

Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, four cores, all VH. 3 GB central storage. z/VM 7.1 with experimental NVMe EDEV code. Twenty CMS guests running IO3390.

No scaling problem was found.
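Because IO3390 moves eight FB-512 blocks per I/O, the Blks/sec figures in Table 2 convert directly to I/O rates, and the benefit of the alias pool can be expressed as a speedup over the zero-alias case. A quick sketch using the endpoints of each arc:

```python
# Convert Table 2's Blks/sec figures to I/O rates and alias-pool speedups.
BLOCKS_PER_IO = 8                    # IO3390 moves 8 FB-512 blocks per I/O

# (0 aliases, 11 aliases) blocks/sec, from Table 2
write_0, write_11 = 240417, 985269   # NV020000 vs. NVB20000
read_0,  read_11  = 34367,  413079   # NV020100 vs. NVB20100

print(write_11 // BLOCKS_PER_IO)     # write I/O rate at 11 aliases: 123158/sec
print(round(write_11 / write_0, 1))  # write speedup over 0 aliases: 4.1x
print(round(read_11 / read_0, 1))    # read speedup over 0 aliases: 12.0x
```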

CP Paging I/O

A third measurement explored whether NVMe EDEVs could be used as paging storage. To evaluate this, the paging workload described above was used. Table 3 shows the findings.

Table 3. CP Paging: NVMe Results
Metric Value
Run name PG05903C
Guest busy, % 21.4
Chargeable CP busy, % 18.1
Nonchargeable CP busy, % 199.2
Total CP busy, % 217.3
Page reads, /sec 146000
Page writes, /sec 157000
Total paging, /sec 303000
CP busy % / paging rate 7.17E-04
Paging SSCH rate, reads, /sec 20878
Pages read per read SSCH 7
Paging SSCH rate, writes, /sec 1960
Pages written per write SSCH 80
Total paging SSCH rate, /sec 22838
CP busy % / paging SSCH rate 9.51E-03
Paging volume I/O rate, /vol, /sec 5709
Paging volume service time, msec 1.28
Alias pool tries, /sec 17264
Alias pool fails, /sec 0
Aliases in use, avg 27.2
Notes: 3906-M05. IBM Adapter for NVMe. Intel P4510 4 TB SSD module. Shared LPAR, four cores, all VH. 4 GB central storage. z/VM 7.3. Four NVMe base EDEVs, each with a PAGE extent of 16777211 4 KB pages (64 GB). Fifty-nine NVMe alias EDEVs attached to SYSTEM. Thirty CMS users, each 4096 MB, with VIRSTOR configured to thrash memory in the guest's upper 2 GB. CPU busy is expressed such that a value of 100% means one processor completely consumed.

NVMe EDEVs were found to be suitable for paging.
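The derived rows of Table 3 follow from its base figures; a quick sketch reproduces them (the per-volume rate of 5709 appears truncated rather than rounded):

```python
# Rederive the computed rows of Table 3 from its base figures.
page_reads, page_writes = 146000, 157000   # pages moved, /sec
ssch_reads, ssch_writes = 20878, 1960      # paging SSCHes, /sec
cp_busy = 18.1 + 199.2                     # chargeable + nonchargeable, %
n_volumes = 4                              # base paging EDEVs

total_paging = page_reads + page_writes    # 303000 pages/sec
total_ssch = ssch_reads + ssch_writes      # 22838 SSCH/sec

print(round(page_reads / ssch_reads))      # pages per read SSCH: 7
print(round(page_writes / ssch_writes))    # pages per write SSCH: 80
print(f"{cp_busy / total_paging:.2e}")     # CP busy % per page: 7.17e-04
print(f"{cp_busy / total_ssch:.2e}")       # CP busy % per SSCH: 9.51e-03
print(total_ssch // n_volumes)             # per-volume SSCH rate: 5709/sec
```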

Summary

The IBM Adapter for NVMe and an NVMe SSD module form a viable backing store for an EDEV. NVMe EDEVs can be used successfully for guest minidisks or for CP paging. CP CPU time per transfer was found to be about the same as for an FCP EDEV. With the SSD module we used, NVMe EDEVs were found to be suitable as CP paging devices.
