
Linux Disk I/O Alternatives

Introduction

With z/VM 5.2.0, customers have a number of choices for the technology they use to perform disk I/O with Linux guest systems. In this chapter, we compare and contrast the performance of several alternatives for Linux disk I/O as a guest of z/VM. The purpose is to provide insight into the z/VM alternatives that may yield the best performance for Linux guest application workloads that include heavy disk I/O. This study does not explore Linux-specific alternatives.

The evaluated disk I/O choices are:

  • Dedicated Extended Count Key Data (ECKD) on an ESS 2105-F20, via FICON channel
  • Minidisks on ECKD on ESS 2105-F20, via FICON channel
  • Dedicated emulated Fixed Block Architecture (FBA) on ESS 2105-F20, via Fibre Channel Protocol (FCP)
  • Minidisks on emulated FBA on ESS 2105-F20, via FCP
  • Linux-owned FCP subchannel to an ESS 2105-F20, via FCP
  • Linux Diagnose X'250' Block I/O to minidisks on ECKD on ESS 2105-F20, via FICON channel

The Diagnose X'250' evaluation was done with an internal Diagnose driver for 64-bit Linux. This driver is not yet available in any Linux distributions. It is expected to be available in distributions sometime during 2006.

Absent from this study is an evaluation of Diagnose X'250' with emulated FBA DASD. The version of the internal Diagnose driver that we used has a kernel-level dependency that is not met by SLES 9 SP 1. We expect to include this choice in future Linux disk I/O evaluations.

For this study, we used the IOzone disk exerciser to measure the disk performance a Linux guest experiences. We ran IOzone against each of the disk choices listed above. In the following sections we discuss the results of the experiments using these choices.

Method

To measure disk performance with Linux guests, we set up a single Linux guest running IOzone, a file system exerciser, against an 800 MB file. See our IOzone workload description for details about how we run IOzone.
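
For readers unfamiliar with the IOzone phases referenced later in this chapter, the following Python sketch illustrates the sequential write, rewrite, read, and reread pattern the workload exercises. It is not the IOzone tool itself; the file name and the 64 KB record size are illustrative assumptions, and only the 800 MB file size comes from our setup.

  # Minimal sketch of the four-phase access pattern measured in this chapter
  # (initial write, rewrite, initial read, reread). This is NOT IOzone itself;
  # the file name and 64 KB record size are illustrative assumptions. Only the
  # 800 MB ballast file size matches the measurement setup.
  import os

  FILE = "ballast.tmp"            # hypothetical file name
  FILE_SIZE = 800 * 1024 * 1024   # 800 MB ballast file, as in the measurements
  RECORD = 64 * 1024              # assumed record size

  def write_pass(path, mode):
      # Sequentially (re)write the whole file in RECORD-sized chunks.
      buf = b"\0" * RECORD
      with open(path, mode) as f:
          for _ in range(FILE_SIZE // RECORD):
              f.write(buf)

  def read_pass(path):
      # Sequentially read the whole file in RECORD-sized chunks.
      with open(path, "rb") as f:
          while f.read(RECORD):
              pass

  write_pass(FILE, "wb")    # initial write phase (creates the file)
  write_pass(FILE, "r+b")   # rewrite phase (writes over the existing file)
  read_pass(FILE)           # initial read phase
  read_pass(FILE)           # reread phase
  os.remove(FILE)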

For this experiment, we used the following configuration:

  • 2084-324, two-way dedicated partition, 2 GB central storage, 2 GB expanded storage
  • No activity in the partition being measured, except our Linux guest and the test case driver
  • 2105-F20, 16 GB of cache, FICON and FCP attached
  • z/VM 5.2.0 GA RSU
  • One VM virtual machine, 192 MB virtual uniprocessor, running 64-bit Linux SLES 9 SP 1, Linux ext2 file systems

Notes:

  1. A 31-bit Linux SLES 9 SP 1 system was also used with the ECKD Diagnose X'250' cases. The results were very similar to the 64-bit Linux cases presented here.
  2. A virtual machine size of 192 MB was used to ensure that a significant portion of the 800 MB ballast file did not fit into Linux page cache. If the file does fit into the page cache, few if any disk I/Os will be done.
  3. Native SCSI over the FCP subchannel allows more data to be moved in a single I/O request than traditional ECKD, FBA, or Diagnose X'250' I/O. With the large ballast file used in this study (800 MB), this gives native SCSI an advantage over the other disk I/O alternatives.
  4. The 2105-F20 had multiple FICON CHPIDs attaching it to the 2084. The Linux guest had only one FCP device attached, so it was not doing any nontrivial path selection. Note also that the IOzone workload we ran was serial and single-threaded.

This chapter compares disk I/O choices with Linux as a guest virtual machine on z/VM 5.2.0. To view performance comparisons from a regression perspective (z/VM 5.2.0 compared with z/VM 5.1.0), refer to the CP Disk I/O Performance chapter.

The disk configurations mentioned in this chapter (EDED, LNS0, and so on) are defined in detail in our IOzone workload description appendix. The configuration names used in the tables in this chapter include key indicators that let the reader decode a configuration without referring to the appendix; a small decoding sketch follows the list below:

E for ECKD with the Linux ECKD CCW driver, followed by:
  • DED for a dedicated volume.
  • MD0 for a minidisk with MDC OFF.
  • MD1 for a minidisk with MDC ON.
F for emulated FBA on the 2105-F20 with the Linux FBA CCW driver, followed by:
  • DED for a dedicated volume.
  • MD0 for a minidisk with MDC OFF.
  • MD1 for a minidisk with MDC ON.
D2 for an ECKD minidisk with the Linux Diagnose X'250' driver, followed by:
  • 10 for blocksize 1024, MDC OFF.
  • 11 for blocksize 1024, MDC ON.
  • 20 for blocksize 2048, MDC OFF.
  • 21 for blocksize 2048, MDC ON.
  • 40 for blocksize 4096, MDC OFF.
  • 41 for blocksize 4096, MDC ON.
LNS0 for Linux owning an FCP subchannel and using the zFCP driver (no suffix).
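
As a quick illustration of these naming conventions, here is a small Python helper (not part of the measurement tooling) that decodes the configuration names used in the tables:

  # Illustrative decoder for the configuration names used in the tables below,
  # following the prefix/suffix conventions described above.
  def decode_config(name: str) -> str:
      if name == "LNS0":
          return "Linux-owned FCP subchannel, zFCP driver"
      if name.startswith("D2"):
          blocksize = {"1": 1024, "2": 2048, "4": 4096}[name[2]]
          mdc = "ON" if name[3] == "1" else "OFF"
          return f"ECKD minidisk, Diagnose X'250' driver, blocksize {blocksize}, MDC {mdc}"
      driver = {"E": "ECKD, Linux ECKD CCW driver",
                "F": "emulated FBA on 2105-F20, Linux FBA CCW driver"}[name[0]]
      suffix = {"DED": "dedicated volume",
                "MD0": "minidisk, MDC OFF",
                "MD1": "minidisk, MDC ON"}[name[1:]]
      return f"{driver}, {suffix}"

  for cfg in ("EDED", "EMD1", "D241", "FMD0", "LNS0"):
      print(cfg, "->", decode_config(cfg))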

Summary of Results

While this performance study shows that native SCSI (the Linux-owned FCP subchannel) is the best-performing choice for Linux disk I/O, customers should also consider the challenges of managing the different disk I/O configurations as part of their system. This evaluation of Linux disk I/O alternatives as a z/VM guest considers performance characteristics only. It is also important to keep in mind that this evaluation was done with FCP and FICON channels, which have comparable bandwidth characteristics. If ESCON channels had been used for the ECKD configurations, the throughput results would be significantly different.

The results of this study show that, for both reads and writes, native SCSI outperforms all of the other choices evaluated in this experiment. It combines high throughput with efficient use of CPU capacity. That said, other I/O choices may also provide favorable throughput and efficient use of CPU capacity, depending on the I/O characteristics of customer application workloads.

For application workloads that are predominantly read I/O with many rereads (for example, where shared, read-only DASD is used), there are other attractive choices. While the Linux-owned FCP subchannel is a good choice, several alternatives perform well when minidisk cache is exploited. ECKD minidisk, Diagnose X'250' ECKD minidisk (with block sizes of 2K or 4K), and emulated FBA on the ESS 2105-F20 are all good choices with MDC ON. They all provide impressive throughput rates with efficient use of CPU capacity.

For application workloads that are predominantly write I/O, Linux-owned FCP subchannel is the best choice. It provides the best throughput rates with the most efficient use of CPU time. However, customers may want to consider other choices that yield improvements in throughput and use less CPU time when compared to the baseline dedicated ECKD case.

Customers should consider the characteristics of their environment when deciding which disk I/O configuration to use for Linux guest systems on z/VM. Characteristics such as systems management and disaster recovery should be weighed along with application workload characteristics.

Discussion of Results

For each configuration, the tables show the measured values as ratios scaled to the dedicated ECKD (EDED) case. Each table shows the KB per second ratio (KB/sec), the total CPU time per KB ratio (Total CPU/KB), the VM Control Program CPU time per KB ratio (CP CPU/KB), and the virtual CPU time per KB ratio (Virtual CPU/KB). For KB/sec, a ratio greater than 1.00 means higher throughput than the baseline; for the CPU-per-KB columns, a ratio less than 1.00 means lower CPU cost. This chapter includes five tables in all. The first four compare the data rates and CPU consumption for each of the four IOzone phases:

  • Initial write phase
  • Rewrite phase
  • Initial read phase
  • Reread phase

The last table is a summary table that shows the average of the ratios from the four IOzone phases. For customers whose applications produce a roughly even mixture of writes and reads, or whose read/write percentages have not been determined, this table is a useful summary of overall performance. For customers with applications that are heavily skewed toward read or write I/O operations, the other four tables provide valuable insight into the best choices and acceptable alternatives. The sketch below illustrates how the scaled ratios in these tables map to the percentage improvements and savings quoted in the discussion sections.
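
The following Python sketch shows how a raw measurement would be scaled to the EDED baseline and how the discussion sections express those ratios as percentage changes. The raw KB/sec and CPU-time numbers are invented placeholders; only the resulting ratios mirror the LNS0 row of the overall results table.

  # Hedged sketch of how the table values are derived and quoted in the text:
  # each cell is a ratio scaled to the dedicated ECKD (EDED) baseline, and the
  # discussion expresses ratios as percentage changes. The raw numbers below
  # are invented placeholders; only the resulting ratios mirror the LNS0 row
  # of the overall results table.
  def scaled(value, baseline):
      return value / baseline

  def pct_change(ratio):
      return (ratio - 1.0) * 100.0

  # Hypothetical raw measurements for the baseline (EDED) and one case (LNS0).
  eded_kb_per_sec, lns0_kb_per_sec = 10_000.0, 15_500.0
  eded_cpu_per_kb, lns0_cpu_per_kb = 1.000e-3, 0.923e-3

  kb_ratio = scaled(lns0_kb_per_sec, eded_kb_per_sec)     # 1.55
  cpu_ratio = scaled(lns0_cpu_per_kb, eded_cpu_per_kb)    # 0.923

  print(f"KB/sec ratio {kb_ratio:.2f} -> {pct_change(kb_ratio):+.0f}% throughput vs. EDED")
  print(f"Total CPU/KB ratio {cpu_ratio:.3f} -> {pct_change(cpu_ratio):+.1f}% CPU per KB vs. EDED")
  # Prints roughly "+55%" and "-7.7%", matching how the discussion sections
  # describe the LNS0 overall result.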

IOzone Initial Write Results

IOzone Initial Write Results
(scaled to EDED)
Configuration KB/sec Total CPU/KB CP CPU/KB Virtual CPU/KB
ECKD ccw        
EDED 1.00 1.00 1.00 1.00
EMD0 1.12 0.996 0.946 1.00
EMD1 1.12 0.989 0.961 0.993
ECKD Diag X'250' (64-bit)        
D210 0.245 1.27 2.89 1.06
D211 0.452 1.18 2.38 1.03
D220 0.354 1.10 1.68 1.02
D221 0.800 1.04 1.30 1.00
D240 0.467 1.03 1.16 1.02
D241 0.965 0.969 0.866 0.982
EFBA ccw (2105-F20)        
FDED 1.24 1.57 5.44 1.06
FMD0 1.24 1.58 5.46 1.07
FMD1 1.24 1.57 5.45 1.06
Dedicated FCP (2105-F20)        
LNS0 1.54 0.952 0.674 0.988
Notes: 2084-324. Two-way dedicated partition. 2 GB central. 2 GB XSTORE. 2105-F20 16 GB FICON/FCP. z/VM 5.2.0 GA RSU + VM63893. Linux SLES 9 SP 1, 192 MB virtual uniprocessor.

The IOzone initial write results show that the native SCSI case (Linux-owned FCP subchannel) is the best performer. It provides a 54% improvement in throughput over the baseline dedicated ECKD case and a savings of 4.8% in total CPU time per KB moved.

The ECKD Diagnose X'250' cases show that throughput is best at the 4K block size; CP CPU time per KB also decreases as the block size increases.

The emulated FBA cases on the 2105-F20 show much higher CPU time per transaction to achieve their throughput. Much of this can be attributed to the additional processing required in the VM Control Program to emulate FBA.

IOzone Rewrite Results

IOzone Rewrite Results
(scaled to EDED)
Configuration KB/sec Total CPU/KB CP CPU/KB Virtual CPU/KB
ECKD ccw        
EDED 1.00 1.00 1.00 1.00
EMD0 1.08 0.992 1.01 0.989
EMD1 1.09 0.986 0.960 0.991
ECKD Diag X'250' (64-bit)        
D210 0.454 1.27 2.37 1.05
D211 0.448 1.29 2.50 1.05
D220 0.787 1.07 1.29 1.03
D221 0.781 1.08 1.43 1.02
D240 1.10 0.946 0.799 0.975
D241 1.08 0.961 0.868 0.979
EFBA ccw (2105-F20)        
FDED 1.23 1.93 6.12 1.12
FMD0 1.23 1.95 6.11 1.14
FMD1 1.22 1.93 6.06 1.13
Dedicated FCP (2105-F20)        
LNS0 1.46 0.915 0.628 0.971
Notes: 2084-324. Two-way dedicated partition. 2 GB central. 2 GB XSTORE. 2105-F20 16 GB FICON/FCP. z/VM 5.2.0 GA RSU + VM63893. Linux SLES 9 SP 1, 192 MB virtual uniprocessor.

For the rewrite phase, the results are similar to those of the initial write phase.

IOzone Initial Read Results

IOzone Initial Read Results
(scaled to EDED)
Configuration KB/sec Total CPU/KB CP CPU/KB Virtual CPU/KB
ECKD ccw        
EDED 1.00 1.00 1.00 1.00
EMD0 1.00 0.998 0.997 0.998
EMD1 0.539 1.34 2.65 1.02
ECKD Diag X'250' (64-bit)        
D210 0.380 1.34 2.38 1.08
D211 0.206 1.72 4.85 0.946
D220 0.656 1.08 1.36 1.01
D221 0.357 1.49 4.09 0.846
D240 0.976 0.905 0.711 0.954
D241 0.546 1.36 2.76 1.01
EFBA ccw (2105-F20)        
FDED 0.950 2.31 6.73 1.21
FMD0 0.950 2.29 6.60 1.21
FMD1 0.771 2.64 9.17 1.01
Dedicated FCP (2105-F20)        
LNS0 1.64 0.852 0.515 0.936
Notes: 2084-324. Two-way dedicated partition. 2 GB central. 2 GB XSTORE. 2105-F20 16 GB FICON/FCP. z/VM 5.2.0 GA RSU + VM63893. Linux SLES 9 SP 1, 192 MB virtual uniprocessor.

For the initial read phase, the native SCSI case (Linux-owned FCP subchannel) is the best performer once again. It provides a 64% improvement in throughput over the baseline dedicated ECKD case, along with a 14.8% savings in total CPU time per KB moved.

The ECKD minidisk cases illustrate the cost in throughput and CPU time per transaction when MDC is ON. Comparing the EMD0 and EMD1 runs, there is a 46% loss in throughput and a 34% increase in total CPU time per KB moved with MDC ON. These costs are the result of populating the minidisk cache. When we look at the reread phase, we should find a significant benefit with MDC ON, because the reread is satisfied from the cache (that is, no I/O is performed to the disk).

The ECKD Diagnose X'250' cases show a similar trend to the write and rewrite phases in terms of block size. The 4K block size results in the best throughput. Comparing the 4K block size cases (D240 and D241), we find a similar trend to the ECKD minidisk runs related to MDC. The cost of MDC ON is paid in terms of throughput and CPU time per transaction. As mentioned above in the ECKD minidisk discussion, we should find that there is a significant benefit with MDC ON in the reread phase.

The emulated FBA cases on the 2105-F20 again show much higher CPU time per transaction, as in the write and rewrite phases. The difference in the read phase is that the throughput of the 2105-F20 cases is less than that of the dedicated ECKD baseline case. In the write and rewrite phases, there was a significant increase in throughput at the cost of high CPU time per transaction.

IOzone Reread Results

IOzone Reread Results
(scaled to EDED)
Configuration KB/sec Total CPU/KB CP CPU/KB Virtual CPU/KB
ECKD ccw        
EDED 1.00 1.00 1.00 1.00
EMD0 0.999 1.03 1.03 1.03
EMD1 10.9 0.783 1.09 0.706
ECKD Diag X'250' (64-bit)        
D210 0.380 1.35 2.37 1.09
D211 9.49 0.929 2.29 0.588
D220 0.656 1.10 1.29 1.05
D221 10.2 0.929 2.25 0.598
D240 0.975 0.973 0.785 1.02
D241 11.0 0.642 0.918 0.573
EFBA ccw (2105-F20)        
FDED 0.947 2.35 6.78 1.25
FMD0 0.948 2.35 6.78 1.24
FMD1 9.07 0.901 2.20 0.576
Dedicated FCP (2105-F20)        
LNS0 1.64 0.951 0.609 1.04
Notes: 2084-324. Two-way dedicated partition. 2 GB central. 2 GB XSTORE. 2105-F20 16 GB FICON/FCP. z/VM 5.2.0 GA RSU + VM63893. Linux SLES 9 SP 1, 192 MB virtual uniprocessor.

For the reread phase, the native SCSI case (Linux-owned FCP subchannel) is the best performer, as it is in the other three phases (write, rewrite, and read). It provides a 64% improvement in throughput over the baseline dedicated ECKD case, along with a 4.9% savings in total CPU time per KB moved.

As expected, the ECKD minidisk case with MDC ON yields a very large benefit in throughput and a 21.7% savings in CPU time per KB moved. As discussed in the read phase, this benefit is achieved because the reread is satisfied from the minidisk cache, so no I/O is performed with the disk. The benefit of MDC is even more substantial in z/VM environments with multiple Linux guest systems sharing read-only minidisks as part of their application workload. Note, however, that in cases where the Linux page cache is made large enough to achieve a high hit ratio, you should consider turning off MDC because it is redundant.

The ECKD Diagnose X'250' cases with MDC ON all show large improvements in throughput ratios, similar to what we see with the ECKD minidisk cases, along with significant savings in CPU time per transaction. As in the other three IOzone phases, ECKD Diagnose X'250' shows the most benefit using a block size of 4K. In this case, the throughput is improved by 1000%, and the total CPU time per KB moved is reduced by 35.8% over the baseline dedicated ECKD case. For the MDC OFF cases, the 4K block size case yields the best throughput.

The emulated FBA cases on the 2105-F20 show much higher CPU time per transaction, as in the write and rewrite phases, with one exception: the emulated FBA case with MDC ON (FMD1) shows a 9.9% reduction in total CPU time along with a more than 800% increase in throughput. All other cases have very high CPU time per transaction.

Overall IOzone Results

IOzone Overall Results
(scaled to EDED)
Configuration KB/sec Total CPU/KB CP CPU/KB Virtual CPU/KB
ECKD ccw        
EDED 1.00 1.00 1.00 1.00
EMD0 1.06 1.00 0.994 1.00
EMD1 1.07 1.02 1.40 0.944
ECKD Diag X'250' (64-bit)        
D210 0.337 1.30 2.51 1.07
D211 0.432 1.27 2.99 0.938
D220 0.535 1.09 1.41 1.03
D221 0.752 1.12 2.24 0.903
D240 0.739 0.975 0.871 0.995
D241 1.03 0.980 1.34 0.911
EFBA ccw (2105-F20)        
FDED 1.12 1.95 6.25 1.14
FMD0 1.12 1.95 6.22 1.14
FMD1 1.29 1.74 5.72 0.979
Dedicated FCP (2105-F20)        
LNS0 1.55 0.923 0.608 0.983
Notes: 2084-324. Two-way dedicated partition. 2 GB central. 2 GB XSTORE. 2105-F20 16 GB FICON/FCP. z/VM 5.2.0 GA RSU + VM63893. Linux SLES 9 SP 1, 192 MB virtual uniprocessor.

The overall IOzone results table summarizes the performance of the disk I/O choices across all four IOzone phases (initial write, rewrite, initial read, reread). This table characterizes the performance that can be expected from each choice for customers whose workloads are neither predominantly write nor predominantly read.

As in the four phase discussions, the native SCSI case (Linux-owned FCP subchannel) is the clear winner. It outperforms all other choices, with a 55% improvement in throughput and a 7.7% savings in total CPU time per KB moved compared with the dedicated ECKD baseline case.

The ECKD minidisk cases show an increase in throughput over the dedicated ECKD case with little change in CPU cost.

The ECKD Diagnose X'250' cases show that throughput is best at the 4K block size. MDC ON improves throughput over MDC OFF at every block size, with little change in total CPU time.

The emulated FBA cases on the 2105-F20 show very high CPU time per transaction to achieve their throughput.
