
Linux Guest DASD Performance

Purpose

The purpose of this experiment was to measure the disk I/O performance of an ESA/390 Linux 2.4 guest on z/VM 4.2.0. We sought to measure write performance, non-cached (minidisk cache (MDC) OFF) read performance, and cached (MDC ON) read performance.

Executive Summary of Results

We found that CP spends about 11% more CPU on writes when MDC is on. We also found that CP spends about 285% more CPU on a read when it has to do an MDC insertion as a result of the read.

Together, these results suggest that setting MDC ON for a Linux guest's DASD volumes is a good idea only when the I/O to the disk is known to be mostly reads.

Hardware

2064-109, LPAR with 2 dedicated CPUs, 1 GB of real storage, and 2 GB of XSTORE; the LPAR was dedicated to this experiment. DASD was RAMAC-1 behind a 3990-6 controller. [1]

Software

z/VM 4.2.0. An early internal development driver of 31-bit Linux 2.4.5. We configured the Linux virtual machine with 128 MB of main storage, no swap partition, and its root file system residing on a 3000-cylinder minidisk. Finally, we used a DASD I/O exercising tool which opens a new Linux file (the "ballast file"), writes it in 16 KB chunks until the desired file size is reached, closes it, then performs N (N>=0) open-read-close passes over the file, reading the file in 16 KB chunks during each pass.

We used a 512 MB ballast file for these experiments. We chose 512 MB because it was too large for the 128 MB Linux guest to satisfy reads from its own internal file cache, yet small enough to fit completely in our 2 GB of XSTORE (minidisk cache).
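
The exerciser itself was an internal program, but the following minimal Python sketch shows the shape of what it does. The file name and command-line handling are illustrative assumptions, and it reports a single aggregate read rate over all passes, which is how we interpret the tool's read-rate numbers in the Analysis section.

    import sys
    import time

    CHUNK = 16 * 1024                  # the tool reads and writes in 16 KB chunks
    FILE_SIZE = 512 * 1024 * 1024      # 512 MB ballast file
    BALLAST = "ballast.dat"            # illustrative file name
    buf = b"\0" * CHUNK

    def write_ballast():
        """Write the ballast file in 16 KB chunks; return the write rate in KB/sec."""
        start = time.time()
        with open(BALLAST, "wb") as f:
            for _ in range(FILE_SIZE // CHUNK):
                f.write(buf)
        return (FILE_SIZE / 1024) / (time.time() - start)

    def read_passes(n):
        """Do n open-read-close passes; return the aggregate read rate in KB/sec."""
        start = time.time()
        for _ in range(n):
            with open(BALLAST, "rb") as f:
                while f.read(CHUNK):
                    pass
        return (n * FILE_SIZE / 1024) / (time.time() - start)

    n_reads = int(sys.argv[1])         # N >= 0 read passes
    print("write rate: %.1f KB/sec" % write_ballast())
    if n_reads > 0:
        print("read rate:  %.1f KB/sec" % read_passes(n_reads))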

Experiment

We ran the disk exerciser in several different configurations, varying the setting of MDC and varying the number of read passes over the ballast file. The configuration used is encoded in the run name. Each run name is Mmnn, where the name decodes as follows:

Portion   Meaning
Mm        M0 for MDC OFF, and M1 for MDC ON.
nn        The number of read passes over the file: 0, 1, 2, or 10.
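
As a quick illustration of the decode (the helper below is ours, for illustration only, not part of any tool):

    def decode_run_name(name):
        """Decode an Mmnn run name, e.g. "M12" -> MDC ON, 2 read passes."""
        mdc_on = (name[1] == "1")
        read_passes = int(name[2:])
        return mdc_on, read_passes

    print(decode_run_name("M110"))     # (True, 10)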

We also synthesized some runs by subtracting actual runs' resource consumption from one another. We did this to isolate the resource consumption incurred during one read pass of the ballast file under various conditions. These are our "synthetic" runs:

  • Run NCR0 is the non-cached (i.e., first-pass) read performance for MDC OFF. Its resource consumption is the difference between what M01 consumed and what M00 consumed.

  • Run NCR1 is the non-cached (i.e., first-pass) read performance for MDC ON. Its resource consumption is the difference between what M11 consumed and what M10 consumed.

  • Run CR is the cached (i.e., second-pass) read performance. Its resource consumption is the difference between what M12 consumed and what M11 consumed.
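
To make the subtraction concrete, here is a minimal sketch of how one synthetic run's samples could be formed, assuming the i-th sample of one run is paired with the i-th sample of the other; the values shown are placeholders, not our measurements.

    # CP CPU time samples, in hundredths of seconds (placeholder values only)
    m00_cpcpu = [38.2, 39.5, 38.9]
    m01_cpcpu = [74.8, 75.9, 74.9]

    # Synthetic run NCR0 is M01 minus M00, taken sample by sample
    ncr0_cpcpu = [a - b for a, b in zip(m01_cpcpu, m00_cpcpu)]
    print(ncr0_cpcpu)                  # per-sample cost of the first read pass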

We used CP QUERY TIME to record virtual CPU time, CP CPU time, and elapsed time for each run. We also used CP INDICATE USER * EXP to record virtual I/O count for each run. Also, the disk exerciser tool prints its observed write data rate (KB/sec) and observed read data rate (KB/sec) when it finishes its run.

Finally, for each configuration, we ran the exerciser 10 times. We computed the mean and standard deviation of the samples for each configuration, both to get a measure of natural run variability and to let us reliably compare runs using a two-tailed t-test.
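
The following sketch shows the statistics we describe, using Python and SciPy; scipy.stats.ttest_ind is one standard way to run an independent two-tailed t-test, and the sample lists here are placeholders rather than our measured data.

    from statistics import mean, stdev
    from scipy import stats

    # Ten placeholder CP CPU time samples (hundredths of seconds) per configuration
    mdc_off = [38.9, 39.1, 38.5, 40.2, 37.8, 39.0, 38.7, 39.3, 38.4, 38.6]
    mdc_on  = [43.0, 43.4, 42.8, 44.1, 42.5, 43.2, 42.9, 43.6, 42.7, 43.1]

    print("MDC OFF: mean %.2f, sd %.3f" % (mean(mdc_off), stdev(mdc_off)))
    print("MDC ON:  mean %.2f, sd %.3f" % (mean(mdc_on), stdev(mdc_on)))

    # Two-tailed t-test: p < 0.05 corresponds to the 95% confidence cutoff,
    # p < 0.01 to the 99% level cited in the Analysis section
    t_stat, p_value = stats.ttest_ind(mdc_off, mdc_on)
    print("t = %.2f, p = %.4f" % (t_stat, p_value))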

Observations


Run ID  Virtual CPU time         CP CPU time              Tool-reported read           Tool-reported write
        (hundredths of seconds)  (hundredths of seconds)  rate (KB/sec)                rate (KB/sec)
M00     10/16499.3/1426          10/38.9/1.221            n/a                          10/3029.5/10.93
M01     10/17382.4/23.62         10/75.2/1.887            10/5376.8/36.14              10/3037.5/3.557
M10     10/17184.1/21.76         10/43.1/1.578            n/a                          10/3036.8/1.939
M11     10/17386.5/30.94         10/183/1.789             10/4360.3/16.86              10/3037.2/3.709
M12     10/17564/39.51           10/268.1/2.211           10/8531/48.23                10/3035.9/5.87
M110    10/18943.4/40.52         10/955.8/4.167           10/36439.9/149.7             10/3037.9/7.203
NCR0    10/883.1/1434            10/36.3/2.410            see M01                      n/a
NCR1    10/202.4/42.82           10/139.9/2.773           see M11                      n/a
CR      10/177.5/35.40           10/85.1/2.548            10/217950/79475 (see below)  n/a
Note: 2064-109, LPAR with 2 dedicated CPUs, 1 GB real, 2 GB XSTORE, LPAR dedicated to these runs. RAMAC-1 behind 3990-6. z/VM 4.2.0. Linux 2.4, 31-bit, internal lab driver. 128 MB Linux virtual machine, no swap partition, Linux DASD is 3000-cylinder minidisk, not CMS formatted. Values in this table are recorded as N/m/sd, where N is the number of samples taken, m is the mean of the samples, and sd is the standard deviation of the samples.

Analysis

In the analysis below, all comparisons of runs were done using a two-tailed t-test with a 95% confidence level cutoff.

  1. The disk exerciser tool reports a write rate of about 3030 KB/sec, or 2.96 MB/sec. Note that the write-only runs' CPU time is almost entirely virtual time: it takes the Linux guest much more CPU time to figure out what to write than it takes CP to actually write it. This is not unexpected.

  2. Comparing write performance for MDC OFF (M00) to MDC ON (M10), we see that there is no statistically significant difference in any performance measure except CP CPU time, where the t-test yields a 99% confidence level:
    M00.CPcpu vs. M10.CPcpu:  cl=99%, delta=10.79%
    
    This shows that for MDC ON, writes cost about 11% more CP CPU than for MDC OFF (43.1 vs. 38.9 hundredths of a second in the table above). We believe this CP CPU time increase is due to MDC maintenance (e.g., invalidation).

  3. The tool-reported read rate for MDC OFF is 5376 KB/sec. With MDC ON, the tool reports the first read (run M11) took place at 4360 KB/sec, which is slower than the MDC OFF read rate. This makes sense, because when MDC is ON, CP takes CPU time to do MDC insertions, and so the read rate drops. We can get a look at how much these insertions cost by comparing the CP CPU time for the synthesized NCR0 and NCR1 runs:
    NCR0.CPcpu vs. NCR1.CPcpu:  cl=99%, delta=285.4%
    
    This tells us that an MDC insertion is expensive: CP CPU time rises by about 285% (139.9 vs. 36.3 hundredths of a second) when an insertion happens.

  4. With MDC ON, when we read the file twice (run M12) instead of once (run M11), the average tool-reported read rate rises dramatically, from 4360 KB/sec to 8531 KB/sec. This is expected because the second read is satisfied from MDC.

    With further analysis we can compute what data rate M12 experienced on its second read -- in other words, what synthesized run CR's tool-reported read rate would have been. The following sample calculation, using the average read rates experienced by M11 and M12 as inputs, illustrates how the analysis goes:

    Let t12 = elapsed time for M12's two reads of the file
            = (2 * 524288 KB / (8531 KB/sec))
            = 122.91 sec
     
    Let t11 = elapsed time for M11's one read of the file
            = 524288 KB / (4360.3 KB/sec)
            = 120.24 sec
     
    Let tr  = time M12 spent in its second read of the file
            = t12 - t11
            = 2.67 seconds
     
    Let dr  = data rate experienced by M12 during second read
            = 524288 KB / 2.67 sec
            = 196000 KB/sec
            = data rate run "CR" would have reported
    

    In spite of this particular calculation's result, the reader is strongly cautioned that the average read rate experienced by our synthesized run CR is not 196000 KB/sec. It is a fallacy to calculate CR's average read rate from the average read rates of M11 and M12 in these formulas. Instead, we must do the above calculation for every pair of (M11, M12) read rate samples and then average the 10 results; that is how we obtained the table's value for the read rate experienced by hypothetical run CR. (A sketch of this per-pair calculation appears just after this list.) The table's value, roughly 213 MB/sec, dramatically illustrates the effectiveness of MDC in speeding up the second read of a file.

  5. Comparing runs NCR1 and CR tells us how much CP CPU time it costs to incur a minidisk cache miss:
    NCR1.CPcpu vs. CR.CPcpu:  cl=99%, delta=-39.17%
    
    When a read hits in MDC, CP spends about 39% less CPU than when it misses (85.1 vs. 139.9 hundredths of a second). Put the other way, a miss costs CP about 64% more CPU than a hit (139.9 / 85.1 = 1.64).
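
Referring back to item 4, the per-pair calculation we describe for deriving CR's read rate can be sketched as follows; the two sample lists are placeholders, not our measured read rates.

    FILE_KB = 524288                        # 512 MB ballast file, in KB

    # Paired tool-reported read rates (KB/sec) from runs M11 and M12 (placeholders)
    m11_rates = [4358.0, 4362.5, 4360.1]
    m12_rates = [8529.0, 8534.2, 8530.5]

    cr_rates = []
    for r11, r12 in zip(m11_rates, m12_rates):
        t12 = 2 * FILE_KB / r12             # elapsed time for M12's two reads
        t11 = FILE_KB / r11                 # elapsed time for M11's single read
        cr_rates.append(FILE_KB / (t12 - t11))   # rate of M12's second read alone

    print("CR mean read rate: %.0f KB/sec" % (sum(cr_rates) / len(cr_rates)))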

Recommendations

For a Linux disk that is write-mostly, one will definitely want to set MDC OFF. This is because CP spends about 11% more CPU per write when MDC is ON, doing MDC management.

For a Linux disk whose I/O is an even mix of reads and writes, one will still want to set MDC OFF. This is because of the high price of MDC insertions on reads.

The case where MDC is really helpful is the read-mostly case, where the data rate rises dramatically and where CP CPU time per read is at a minimum.


Footnotes:

[1] While the RAMAC-1 is not the most current technology, the purpose of these experiments was to evaluate certain aspects of VM configuration.
