z/VM PAV Exploitation
Starting in z/VM 5.2.0 with
APAR VM63855,
z/VM exploits IBM's Parallel Access Volumes (PAV) technology
to expedite guest minidisk I/O.
In this article we
give some background on PAV, describe z/VM's exploitation
of PAV, and show the results of measurements we ran
to assess the exploitation's
impact on minidisk I/O performance.
2007-06-15:
With z/VM 5.3 comes HyperPAV support. For performance
information about HyperPAV, and for performance management
advice about the use of both PAV and HyperPAV, see
our HyperPAV chapter.
Introduction
A zSeries data processing machine
lets software perform
only one I/O
to a given device at a time.
For DASD, this means zSeries lets software perform only
one I/O to a given disk at a time.
In some environments, this can have limiting effects.
For example, think of a
real 3390 volume on z/VM, carved up into N user minidisks,
each minidisk being a CMS user's 191 disk.
There is
no reason we couldn't have N I/Os in progress at once,
one to each minidisk. There would be no data integrity exposure,
because the minidisks are disjoint. As long as there was demand,
and as long as the DASD subsystem could keep up, we might experience
increased I/O rates to the volume, and thereby increase performance.
Since 1999,
zSeries DASD subsystems (such as the IBM TotalStorage
Enterprise
Storage Server 800) have supported a technology
called Parallel Access Volumes, or PAV. With PAV, the
DASD subsystem can offer the host processor more than one device
number per disk volume. For a given volume, the first device number
is called the "base" and the rest are called "aliases". If there are
N-1 aliases, the host can have N I/Os in progress to the volume
concurrently, one to each device number.
DASD subsystems offering PAV do so in a static
fashion. The IBM CE or other support professional uses a DASD
subsystem configuration utility program to
equip selected volumes with
selected fixed PAV alias device numbers.
The host can sense the aliases' presence when it varies the devices
online. In this way, the host operating system
can form a representation
of the base-alias relationships present in the DASD subsystem
and exploit those relationships if it chooses.
z/VM's first support for PAV,
shipped as APAR VM62295 on VM/ESA 2.4.0,
was to let guests exploit PAV.
When real volumes had PAV, and when said volumes were DEDICATEd or
ATTACHed to a guest, z/VM could pass its PAV knowledge to the
guest, so the guest could exploit it. But z/VM itself did not
exploit PAV at all.
With APAR VM63855 to z/VM 5.2.0,
z/VM can now
exploit PAV for I/O to PERM extents (user minidisks)
on volumes attached to SYSTEM.
This support lets
z/VM exploit a real volume's PAV configuration
on behalf of guests
doing virtual I/Os to minidisks defined on the real volume.
For example,
if 20 users have minidisks on a volume, and
if the volume has a few PAV aliases associated with it,
and if those users generate
sufficient I/O demand for the volume,
the Control Program will use the aliases to
drive more than one I/O to the volume concurrently.
This support is not limited to driving one I/O per
minidisk. If 20 users are all linked to the same
minidisk, and I/O workload to that one minidisk demands it,
z/VM will use the real volume's PAV aliases to drive more
than one I/O to the single minidisk concurrently.
To measure the effect of z/VM's PAV exploitation,
we crafted an
I/O-intensive workload whose
concurrency level and read-write mix we could control.
We shut off minidisk cache and then ran
the workload repeatedly,
varying its concurrency level,
its read-write mix,
the PAV configuration of the real volumes,
and the kind of DASD subsystem.
We looked for changes in three I/O performance metrics -- I/O response
time, I/O rate, and I/O service time -- as a function of these variables.
This article documents our findings.
Executive Summary
Adding PAV aliases helps improve a real DASD volume's performance
only if I/O requests are queueing at the volume.
We can tell whether this is happening by comparing the
volume's I/O response time to its I/O service time. As long
as response time equals service time, adding PAV aliases
will not change the volume's performance. However, if I/O
response time is greater than I/O service time, queueing
is happening and adding some PAV capability for the volume
might be helpful.
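The response-time-versus-service-time check above can be expressed as a small helper. This is an illustrative Python sketch, not part of any z/VM tooling; the function name and tolerance are our own assumptions.

```python
def pav_candidate(response_time_msec, service_time_msec, tolerance_msec=0.1):
    """Return True if I/Os appear to be queueing at the volume, that is,
    I/O response time exceeds I/O service time.  The small tolerance
    absorbs measurement rounding (an illustrative assumption)."""
    return (response_time_msec - service_time_msec) > tolerance_msec

# Response time equal to service time: no queueing, aliases won't help.
print(pav_candidate(5.0, 5.0))   # False
# Response time well above service time: a wait queue is forming.
print(pav_candidate(8.0, 5.0))   # True
```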
Results when using PAV will
depend on the amount of I/O concurrency
in the workload, the fraction of the I/Os that are reads, and the kind of
DASD subsystem in use. In our scenarios,
workloads with a very low percentage of reads
or a very high I/O concurrency level
tended not to improve as much as workloads
where the concurrency level exactly matched the
number of aliases available or the read
percentage was high. Also, modern storage subsystems, such as the
IBM DS8100, tended to do better with PAV than IBM's older offerings.
Measurement Environment
IO3390 Workload
Our exerciser IO3390 is a CMS application that
uses Start Subchannel (SSCH)
to perform random one-block I/Os to an 83-cylinder minidisk
formatted at 4 KB block size.
The random block numbers are drawn from a uniform
distribution [0..size_of_minidisk-1].
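The block-selection behavior just described can be sketched in Python. IO3390 itself is a CMS application; the function names and the 12000-block minidisk size here are illustrative assumptions, not its actual code.

```python
import random

def random_block(size_of_minidisk, rng):
    # Uniform draw over [0 .. size_of_minidisk-1], matching IO3390's
    # random one-block I/O pattern.
    return rng.randrange(size_of_minidisk)

def is_read(read_percent, rng):
    # Choose read vs. write according to the workload's read mix
    # (0, 33, 66, or 100 percent reads in our runs).
    return rng.randrange(100) < read_percent

rng = random.Random(1)                       # seeded for repeatability
ops = [(random_block(12000, rng), is_read(33, rng)) for _ in range(1000)]
print(all(0 <= b < 12000 for b, _ in ops))   # True
```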
We organized
the IO3390 machines' minidisks onto real volumes
so that as we
logged on additional virtual machines, we added load
to the real volumes equally.
For example, with eight virtual
machines running, we had one IO3390 instance assigned
to each real volume. With sixteen virtual machines
we had two IO3390s per real volume. Using this scheme,
we ran 1, 2, 3, 4, 5, 10, and 20 IO3390s per volume.
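The even spreading of workers across volumes amounts to a round-robin assignment, sketched below in Python as an illustration of the scheme (not our actual setup scripts).

```python
def assign_workers(num_workers, num_volumes=8):
    # Round-robin: each additional worker lands on the next volume,
    # so load grows on all real volumes equally.
    loads = [0] * num_volumes
    for w in range(num_workers):
        loads[w % num_volumes] += 1
    return loads

print(assign_workers(8))    # [1, 1, 1, 1, 1, 1, 1, 1]
print(assign_workers(16))   # [2, 2, 2, 2, 2, 2, 2, 2]
print(assign_workers(160))  # 20 workers on each of the eight volumes
```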
For each number of concurrent
IO3390 instances
per volume, we varied the
aliases per volume in the range [0..4].
For each combination of number of IO3390s and number
of aliases, we tried four different I/O mixes:
0% reads,
33% reads,
66% reads,
and 100% reads.
Each IO3390 agent is a 24 MB uniprocessor CMS virtual machine.
System Configuration
Processor:
2084-C24, model-capacity indicator 322,
2 GB central, 2 GB XSTORE, 2 dedicated processors.
Two 3390-3 paging volumes.
IBM TotalStorage ESS F20 (2105-F20) DASD:
2105-F20, 16 GB cache.
Two 1 Gb FICON chpids leading to a FICON switch, then
two 1 Gb FICON chpids from the switch to the 2105.
Four 3390-3 volumes in one LSS and
four 3390-3 volumes in a second LSS.
Four aliases defined for each volume.
IBM TotalStorage DS8100 (2107-921) DASD:
2107-921, 32 GB cache.
Four 1 Gb FICON chpids leading to a FICON switch, then
four 1 Gb FICON chpids from the switch to the 2107.
Eight 3390-3 volumes in a single LSS.
Four aliases defined for each volume.
IBM TotalStorage DS6800 (1750-511) DASD:
1750-511, 4 GB cache.
Two 1 Gb FICON chpids leading to a FICON switch, then
two 1 Gb FICON chpids from the switch
to the 1750.
Eight 3390-3 volumes in a single LSS.
Four aliases defined for each volume.
With these configurations,
each of our eight real volumes has up to
four aliases
that the z/VM Control Program
can use to parallelize I/O. By using
CP VARY OFF to shut off some of the aliases, we can
control the amount of parallelism available for each volume.
We ran all measurements with z/VM 5.2.0 plus
APAR VM63855, with CP SET MDCACHE SYSTEM OFF
in effect.
Metrics
For each experiment, we measured I/O rate, I/O service time, and I/O response time.
I/O rate is the rate at which I/Os are completing at a volume.
For example, a
volume might experience an I/O rate of 20 I/Os per second.
As long as the size of the I/Os remains constant, using
PAV to achieve a higher I/O rate for a volume is
a performance improvement, because we move more data each second.
For a PAV volume, we compute the volume's I/O rate by summing the I/O
rates for the device numbers mapping the volume. For example, if the base
device number experiences 50/sec and each of three alias devices experiences
15/sec, the volume experiences 95/sec. Such summing is how we measure the effect
of PAV on I/O rate.
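In code terms, the volume rate calculation is just a sum over the exposures (a Python sketch; the function name is ours):

```python
def volume_io_rate(rates_per_exposure):
    # A PAV volume's I/O rate is the sum of the rates observed at the
    # base device number and at each alias device number.
    return sum(rates_per_exposure)

# The example above: a base doing 50/sec plus three aliases at 15/sec each.
print(volume_io_rate([50, 15, 15, 15]))  # 95
```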
I/O service time is the amount of time it takes for the DASD subsystem to perform
the requested operation, once the host system starts the I/O. Factors
influencing I/O service time include channel speed, load on the DASD subsystem,
amount of data being moved in the I/O, whether the I/O is a read or a write,
and the presence or availability of cache memory in the controller, just to name
a few.
For a PAV volume, we measure the I/O service time for the volume by computing the
average I/O service time for the device numbers mapping the volume. The
calculation takes into account the I/O rate at each device number and the I/O service
time incurred at each device number, so as to form an estimate (aka
expected value) of the I/O service time a hypothetical I/O to the volume would incur.
For example, if the base device is doing 100/sec with service time 5
msec, and the lone alias is doing 50/sec with service time 7 msec, the I/O service
time for the volume is calculated to be (100*5 + 50*7) / 150, or 5.7 msec.
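The rate-weighted average above can be written out as follows (a Python sketch; the pair representation is our own choice):

```python
def volume_service_time(exposures):
    # exposures: list of (io_rate_per_sec, service_time_msec) pairs,
    # one per device number (base and aliases) mapping the volume.
    total_rate = sum(rate for rate, _ in exposures)
    weighted = sum(rate * st for rate, st in exposures)
    # Expected service time of a hypothetical I/O to the volume.
    return weighted / total_rate

# The example above: base at 100/sec, 5 msec; one alias at 50/sec, 7 msec.
print(round(volume_service_time([(100, 5.0), (50, 7.0)]), 1))  # 5.7
```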
I/O response time is the total amount of time a guest virtual machine
perceives
it takes to do an I/O to its minidisk.
This comprises I/O service time, explained previously, plus wait time. As a
real
device becomes busy, guest
I/O operations destined for that real volume
wait a little while in the real volume's I/O wait queue
before they start. Time spent in the wait queue, called I/O wait time,
is added to
the I/O service time to produce the value called I/O response time.
For a PAV volume owned by SYSTEM,
I/Os queued to a volume spend their waiting time queued on
the base device number. When the I/O gets to the front of the line, it is
pulled off the queue by the first device (base or one of its aliases) that becomes
free. For a PAV volume, then, I/O response time is equal to the wait time
spent in the base device queue plus the expected value of the I/O service time
for the volume, the calculation of which was explained previously.
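Putting the pieces together, response time for a SYSTEM-owned PAV volume decomposes as below. This is a Python sketch; the 2.3 msec wait figure is hypothetical, not a measurement from our runs.

```python
def volume_response_time(base_queue_wait_msec, expected_service_msec):
    # Response time = wait time accrued in the base device's wait queue
    #               + expected (rate-weighted) service time for the volume.
    return base_queue_wait_msec + expected_service_msec

# Hypothetical 2.3 msec of queue wait on top of the 5.7 msec expected
# service time from the earlier example:
print(volume_response_time(2.3, 5.7))  # 8.0
```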
Of these three metrics, the most interesting ones from an application
performance perspective are I/O rate and I/O response time. Changes in
I/O service time, while indicative of storage server performance, are not
too important to the application as long as they do not cause
increases in I/O response time.
We ran each configuration for ten minutes, with CP Monitor set to emit
sample records at one-minute intervals.
To calculate average performance of
a volume over the ten-minute interval, we threw away the first minute's
and the last minute's values (so as to discard samples possibly affected
by the run's startup and shutdown behaviors) and then
averaged the remaining eight minutes'
worth of samples. We used Performance Toolkit's interim FCX168 reports
as the raw input for our calculations.
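The averaging procedure can be sketched as follows (a Python illustration of the method, not our actual reduction scripts):

```python
def run_average(one_minute_samples):
    # Discard the first and last samples (startup and shutdown effects),
    # then average what remains -- eight values for a ten-minute run.
    if len(one_minute_samples) < 3:
        raise ValueError("need at least three one-minute samples")
    middle = one_minute_samples[1:-1]
    return sum(middle) / len(middle)

# Ten one-minute I/O-rate samples; the depressed first and last minutes
# are dropped before averaging.
samples = [12.0, 20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 20.0, 9.0]
print(run_average(samples))  # 20.0
```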
Tabulated Results
The cells in the tables below state the average values of the three
I/O metrics over the eight volumes being exercised.
IBM TotalStorage ESS F20 (2105)
IO3390 4K 0% reads, Suite K411.
The table gives ior (I/O rate, /sec), iost (I/O service time, msec), and
iort (I/O response time, msec) for 1, 2, 3, 4, 5, 10, and 20 workers per
volume (rows) by 0 through 4 aliases per volume (columns).
Note:
2084-324, 2 dedicated processors, 2 GB central, 2 GB XSTORE.
2105-F20, 16 GB cache, 2 FICON chpids.
z/VM 5.2.0 + PAV SPE.
[Data cells empty in this copy.]
IO3390 4K 33% reads, Suite L411.
Same layout and configuration notes as Suite K411.
[Data cells empty in this copy.]
IO3390 4K 66% reads, Suite M411.
Same layout and configuration notes as Suite K411.
[Data cells empty in this copy.]
IO3390 4K 100% reads, Suite N411.
Same layout and configuration notes as Suite K411.
[Data cells empty in this copy.]
IBM TotalStorage DS8100 (2107)
IO3390 4K 0% reads, Suite O411.
Same layout as Suite K411: ior, iost, and iort by workers per volume
(1-5, 10, 20) and aliases per volume (0-4).
Note:
2084-324, 2 dedicated processors, 2 GB central, 2 GB XSTORE.
2107-921, 32 GB cache, 4 FICON chpids.
z/VM 5.2.0 + PAV SPE.
[Data cells empty in this copy.]
IO3390 4K 33% reads, Suite P411.
Same layout and configuration notes as Suite O411.
[Data cells empty in this copy.]
IO3390 4K 66% reads, Suite Q411.
Same layout and configuration notes as Suite O411.
[Data cells empty in this copy.]
IO3390 4K 100% reads, Suite R411.
Same layout and configuration notes as Suite O411.
[Data cells empty in this copy.]
IBM TotalStorage DS6800 (1750)
IO3390 4K 0% reads, Suite S411.
Same layout as Suite K411: ior, iost, and iort by workers per volume
(1-5, 10, 20) and aliases per volume (0-4).
Note:
2084-324, 2 dedicated processors, 2 GB central, 2 GB XSTORE.
1750-511, 4 GB cache, 2 FICON chpids.
z/VM 5.2.0 + PAV SPE.
[Data cells empty in this copy.]
IO3390 4K 33% reads, Suite T411.
Same layout and configuration notes as Suite S411.
[Data cells empty in this copy.]
IO3390 4K 66% reads, Suite U411.
Same layout and configuration notes as Suite S411.
[Data cells empty in this copy.]
IO3390 4K 100% reads, Suite V411.
Same layout and configuration notes as Suite S411.
[Data cells empty in this copy.]
Discussion
Expectations
In general, we expected that as we added aliases to a configuration,
we would experience improvement in one or more I/O metrics, provided
enough workload existed to exploit the aliases,
and provided no other bottleneck limited the workload.
For example, with only
one IO3390 worker per volume, we would not expect adding aliases to
help anything. However, as we increase IO3390 workers per volume,
we would expect adding aliases to help matters, if the configuration
is not otherwise limited.
We also expected that adding
aliases would help only up to the workload's ability to drive I/Os
concurrently. For example, with only three workers per volume, we
would not expect four exposures (one base plus three aliases) to
perform better than three exposures.
We also expected that when the number of exposures was greater than
or equal to the concurrency level, I/O response time would equal
I/O service time. In other words, in such configurations, we
expected device wait queues to disappear.
IBM ESS F20 (2105)
For the 2105, we saw that
adding aliases did not appreciably change I/O rate or
I/O response time.
The 100%-read workloads were the exception to this.
For those runs, we did
notice that adding aliases did improve I/O rate and
I/O response time. However, there was little
improvement beyond adding one alias, that is, two or
more aliases offered about the same performance as one
alias.
We also noticed that as we added aliases to a workload,
I/O service time increased.
However, we almost always saw
offsetting reductions
in wait time, so I/O response time remained
about flat. We believe this suggests that this workload
drives the 2105 intensely enough that some bottleneck
within it comes into play.
Because adding aliases
did not increase I/O rate or decrease I/O
response time, we believe that by adding aliases, all
we did was move the I/O queueing from z/VM to inside
the 2105.
To investigate our suspicion,
we spot-checked the
components of I/O service time (pending time, disconnect
time, connect time) for some configurations.
Generally we found that increases in I/O
service time were due to increases in disconnect time.
We believe this suggests
queueing inside the 2105.
We did not check every case, nor did we tabulate our
findings.
IBM DS8100 (2107)
For the 2107, we saw that adding
aliases definitely caused improvements in I/O rate and
I/O response time. In some cases, the improvements were
dramatic.
As with the 2105, adding aliases to a 2107 workload tended to
increase I/O service time. However,
for the 2107, the increase in service time was more than offset
by a decrease in wait time, so I/O response time
decreased. This was true in all but the most extreme
workloads (100% writes or large numbers of users).
In those extreme cases, we believe we hit a 2107 limit,
just as we did in most of the 2105 runs.
IBM DS6800 (1750)
The 1750, like the 2107, showed improvements in many
workloads as we added aliases. However, the 1750 struggled
with the 0%-read workload, and it did not do well with
small numbers of users per volume. As workers per volume
increased and as the fraction of reads increased, the
effect of PAV became noticeable and positive.
Conclusions
For the DS8100 and the DS6800, we can recommend PAV when
the workload contains enough concurrency, especially for
workloads that are not 100% writes. We expect customers
to see decreases in I/O response time and increases in
I/O rate per volume. Exact results will depend heavily
on the customer's workload.
For the ESS F20, we can recommend PAV only when the customer's
workload has a high read percentage. For low and moderate
read percentages, neither I/O rate nor I/O response time
improves as we add aliases.
Workloads that might benefit from adding PAV aliases are
characterized by I/O response time being greater than
I/O service time -- in other words, a wait queue forming.
Customers considering adding PAV aliases can add an alias
or two to volumes showing this trait. A second measurement
will confirm whether I/O rate or I/O response time
improved.
We do not recommend adding PAV aliases past the
point where the wait queue disappears.
A guest that does its own I/O scheduling, such as Linux or
z/OS, might be maintaining
device wait queues on its own. Such queues would be invisible
to z/VM and to performance management products that consider only
CP real device I/O. If your analysis of what's happening inside
your guest shows you that wait queues are forming inside your
guest, you might consider exploring whether your guest can
exploit
PAV (sometimes we call this being PAV-aware).
If it can, you can use the new z/VM minidisk PAV support
to give your guest more than one virtual I/O device number for the
minidisk on which the guest is doing its own queueing. We did
not do any measurements of such configurations, but we would
expect to see some queueing relief like what we observed in
the configurations we measured.