Understanding and Tuning z/VM Paging
(Last revised: 2010-06-17, BKW)
A key element of the z/VM value proposition is that we often
get the best TCO result when we overcommit various physical
resources, such as CPUs or memory.
When it's memory (aka storage) we wish to overcommit, the
inevitable consequence is that the z/VM system will page.
Thus it's in our best interest
to understand z/VM's paging behaviors very
well, including understanding
how to measure z/VM paging,
how to measure the influence of paging on workload behavior, and
how to configure and tune the paging system for
best performance.
In this article we will explore z/VM paging, including
how it works,
how to measure the paging system's behavior,
how to assess the measurements for suitability, and
how to repair or tune the paging system when its
behavior seems unsuitable.
We assume here that the reader has a basic understanding
of the notions of virtual memory and of paging
as a means to overcommit storage.
A Few Reminders about Performance Measurements
Whenever we talk about performance, and especially when we
discuss whether a
given system's performance is suitable, remember that we must
enter the discussion with well-formed ideas about what the
word performance means to us, and what the words
good or suitable mean.
Throughout the rest of this article, our tacit assumption is
that for your workload and environment, you have settled on
what specific phenomena you wish to measure
to come to your practical, workable assessment of
"performance".
We also assume that you have developed some success or
suitability thresholds
for the phenomena you've chosen to
measure, so as to differentiate between
"meets criteria" and "needs improvement".
Moreover, to apply this article's tips and techniques, you will
need to understand
the relationship between your chosen success metrics and
the z/VM system's tendency to page.
If you've accomplished all of that, great, read on.
If not, you will want to do some work along those lines
before proceeding.
A Basic Look at z/VM Paging
On a z/VM system, pages are kept in three different places:
central storage, expanded storage (aka XSTORE), and on
disk.
Of course, z/VM's overall objective is to keep
the "hot" pages -- that is, the ones that are often used --
in central, and
the "warm" pages in XSTORE, and
the "cool" pages on DASD.
In this way, we help minimize the effect of paging on
application performance.
To accomplish all of this, z/VM brings
together all of the following algorithmic elements:
-
z/VM has a well-defined page circulation or page
motion scheme. In other words, pages move among the three
residency points according to specific flows.
-
Further,
z/VM also has a well-defined page marking and aging
scheme.
In other words, for each page, z/VM tracks how the page is
touched, and in some cases,
how long ago the touches occurred.
-
Finally, z/VM has a well-defined page selection scheme.
When pages must circulate, z/VM inspects the pages' markings
to decide which pages to shuffle.
Page Motion
First let's talk about page motion. In general, pages move among
central, XSTORE, and DASD in a definite prescribed motion pattern,
like this:
-
When central becomes constrained, the least important page in
central has to leave. The chosen ejected page always goes to XSTORE.
This is called a PGOUT operation.
-
When XSTORE becomes constrained, the least important page in
XSTORE has to leave. The chosen ejected page always goes to DASD.
This is called a migration operation.
-
When a page is needed in central, z/VM finds the page and brings
it into central:
- If the needed page is in XSTORE, it is moved back to central and
the XSTORE slot is freed. This is called a PGIN operation.
- If the needed page is on DASD, it is read in, but the DASD slot
isn't necessarily freed right away. This is called a page read
operation.
-
Regarding least-important pages leaving XSTORE, it is important to
note that z/VM tries to evacuate XSTORE in groups of pages, instead of
individually. This technique, called
block paging, helps us to move
many pages onto DASD without having to do so many I/O operations
(Start Subchannel, or SSCH).
-
Regarding needed pages coming in from DASD, it is also important
to note that when z/VM must read DASD, it doesn't read just one page.
Instead, it tries to read blocks of pages. Again, this technique is
called block paging.
-
Regarding pages moving from central to XSTORE (PGOUT), and XSTORE
to central (PGIN), these pages move one at a time.
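The flows above can be illustrated with a toy model. The Python below is only a sketch and not how the Control Program is implemented: the class name ToyPager, the capacities, and the use of simple least-recently-touched ordering to stand in for "least important" are all assumptions made for illustration, and block paging is not modeled.

```python
from collections import OrderedDict


class ToyPager:
    """Toy model of page motion among central, XSTORE, and DASD."""

    def __init__(self, central_frames, xstore_slots):
        if central_frames < 1 or xstore_slots < 1:
            raise ValueError("the toy model needs at least one frame and one slot")
        self.central_frames = central_frames
        self.xstore_slots = xstore_slots
        self.central = OrderedDict()   # least recently touched page first
        self.xstore = OrderedDict()    # oldest arrival first
        self.dasd = set()              # pages that have a copy on DASD
        self.ops = []                  # log of (operation, page) tuples

    def touch(self, page):
        """Reference a page, bringing it into central if necessary."""
        if page in self.central:
            self.central.move_to_end(page)
            return
        if page in self.xstore:
            del self.xstore[page]              # PGIN frees the XSTORE slot
            self.ops.append(("PGIN", page))
        elif page in self.dasd:
            # Page read: the DASD slot is not necessarily freed, so keep it.
            self.ops.append(("READ", page))
        self._make_room_in_central()
        self.central[page] = True

    def _make_room_in_central(self):
        while len(self.central) >= self.central_frames:
            victim, _ = self.central.popitem(last=False)
            self._make_room_in_xstore()
            self.xstore[victim] = True         # central always ejects to XSTORE
            self.ops.append(("PGOUT", victim))

    def _make_room_in_xstore(self):
        while len(self.xstore) >= self.xstore_slots:
            victim, _ = self.xstore.popitem(last=False)
            self.dasd.add(victim)              # XSTORE always ejects to DASD
            self.ops.append(("MIGRATE", victim))
```

Touching more distinct pages than the model has frames and slots shows PGOUT, migration, and page read operations appearing in the ops log in the order the text describes.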
Page Marking
Now let's talk about page marking. Here are
some descriptions of the marking techniques z/VM uses.
-
For pages in central, the System z hardware marks a page as
referenced whenever an instruction references a location on the page.
-
Also for pages in central, the System z hardware marks a page as
changed whenever an instruction modifies the content of the page.
-
Every once in a while, for each user, z/VM sorts
the user's
list of resident pages according to these markings,
so that it's easy for z/VM to find the user's
unreferenced pages. This occasional sorting is called reorder.
-
Whenever a page is placed into XSTORE, z/VM updates its XSTORE map
with the TOD value of when the page entered XSTORE.
Page Selection
When central storage
is constrained and a frame is needed, something has to leave. The
choosing process is called demand scan.
Here is the prioritized list of
places z/VM looks, to find something to move out. z/VM works its
way down through these stages of the selection algorithm until it
has chosen and ejected enough pages to satisfy the request.
-
In a first pass called pass 1, z/VM tries to be friendly to
dispatched users. So it looks for these kinds of pages to move out:
- Unreferenced pages belonging to shared address spaces.
- Pages belonging to users on the long-term-dormant list. In
other words, users that aren't running.
- Pages belonging to users on the eligible list. These users
want to run, but they
can't run right now because resource is too constrained.
- Pages belonging to dispatched users, but z/VM will take from a
dispatched user only
down to something called working-set size (WSS). A user's WSS is
z/VM's perception of how many pages the user needs to have in central
to run without incurring a page fault.
-
In a second pass called pass 2, z/VM gets a little more
aggressive:
- It skips shared address spaces.
- It takes from dispatched users, but only down to their SET
RESERVED value.
-
In a third pass called emergency scan, z/VM will take whatever
it can find to move out.
When XSTORE is constrained, z/VM moves out its oldest pages,
using the pages' TOD stamps.
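To make the ordering of the passes concrete, here is a toy sketch of demand scan in Python. It is an illustration under stated assumptions, not the Control Program's algorithm: the User record, the state names, and the idea of counting pages rather than inspecting individual page markings are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    state: str          # "dormant", "eligible", or "dispatched" (toy states)
    resident: int       # pages currently resident in central
    wss: int = 0        # working-set size, in pages
    reserved: int = 0   # reserved-frames setting, in pages


def demand_scan(users, shared_unreferenced, needed):
    """Return a list of (pass_name, source, pages_taken); updates users in place."""
    steals = []

    def take(pass_name, source, available):
        nonlocal needed
        n = min(max(available, 0), needed)
        if n > 0:
            steals.append((pass_name, source, n))
            needed -= n
        return n

    # Pass 1: shared spaces, then dormant, then eligible users,
    # then dispatched users down to their working-set size.
    take("pass 1", "shared", shared_unreferenced)
    for state in ("dormant", "eligible"):
        for u in users:
            if u.state == state:
                u.resident -= take("pass 1", u.name, u.resident)
    for u in users:
        if u.state == "dispatched":
            u.resident -= take("pass 1", u.name, u.resident - u.wss)

    # Pass 2: dispatched users again, down to their reserved value.
    for u in users:
        if u.state == "dispatched":
            u.resident -= take("pass 2", u.name, u.resident - u.reserved)

    # Emergency scan: take whatever can be found.
    for u in users:
        u.resident -= take("emergency", u.name, u.resident)
    return steals
```

The sketch stops taking pages as soon as the request is satisfied, so on a lightly constrained system only the pass 1 sources are ever touched.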
Determining Storage Consumption
Tuning and measuring the z/VM paging system is relevant only after
we understand the storage demands of the workload.
These z/VM Performance Toolkit (herein, Perfkit)
reports help you calculate
how much storage your workload
is using.
FCX113 UPAGE tells us how many pages each guest seems
to need, and where those pages are residing. The information we seek
is on the right side of the report. Because the UPAGE
report is so wide, in the
excerpt below we've cropped out many of the middle columns. We've
indicated the cropping by '...' in the excerpt.
1FCX113 Ru ... Page 34
... ty and Storage Utilization
From 2009/ ... xxxxxxx
To 2009/ ... CPU 2097-700 SN xxxxx
For 3540 ... Run z/VM V.5.4.0 SLU 0902
__________ ... __________________________________________________________________________
...
. ... . . . . . . . . .
... <----------------- Number of Pages ----------------->
... <-Resident-> <--Locked--> Stor Nr of
Userid ... WSS Resrvd R<2GB R>2GB L<2GB L>2GB XSTOR DASD Size Users
>System< ... 1423k 0 15804 1407k 4 147 0 0 5602M 30
DTCVSW1 ... 2649 0 136 2609 14 82 0 0 32M
DTCVSW2 ... 2600 0 173 2555 38 90 0 0 32M
FTPSERVE ... 1407 0 73 1335 0 1 0 0 32M
LINWRK ... 94513 0 3050 91490 0 13 0 0 512M
MAINT ... 1473 0 55 1418 0 0 0 1 128M
MONWRITE ... 187 0 37 150 0 0 0 0 4M
R<2GB is pages resident below the 2 GB line in real storage.
R>2GB is pages resident above the 2 GB line in real storage.
XSTOR is pages in XSTORE.
DASD is pages on paging DASD.
The
>System<
row gives us the residency statistics for the
average user. If we multiply its values by the Nr of Users value at the right,
we can calculate how many user pages are in these various places on average.
Remember that all of these values
are average values over the time interval of the report.
The time interval of the report is shown in the report's upper left-hand corner.
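As a small worked example of that multiplication, the sketch below uses the R<2GB average (15804) and the Nr of Users value (30) from the excerpt above. The helper names are illustrative only, and the conversion assumes 4 KB pages.

```python
PAGE_BYTES = 4096  # one page is 4 KB


def total_pages(avg_pages_per_user, nr_of_users):
    """Scale a >System< row average up to a system-wide page count."""
    return avg_pages_per_user * nr_of_users


def pages_to_mib(pages):
    """Convert a page count to mebibytes."""
    return pages * PAGE_BYTES / 2**20


below_2gb = total_pages(15804, 30)
print(below_2gb, "pages resident below 2 GB, or about",
      round(pages_to_mib(below_2gb)), "MiB")
```

The same two steps apply to the other columns of the >System< row, and to the >System< row of FCX147 VDISKS once you have counted its rows.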
FCX147 VDISKS is very much like FCX113 UPAGE, except the
VDISKS report describes the residency distribution for virtual disks
in storage, aka VDISKs.
Look at
the columns on the right side of the report,
under the Nr of Pages heading. These
describe resident pages, XSTORE pages, and DASD pages.
Also like UPAGE, the
>System<
row tells us the distribution for the average VDISK. To know how many
VDISKs were involved in the calculation of the average, count the
individual rows in the report. Use the count and the >System< values
to calculate the total storage your VDISKs are using.
FCX134 DSPACESH is very much like FCX147 VDISKS, except it
describes the residency distribution for shared data spaces
in general. Look on the right side of the report to find the residency
counts. Some notes on using this report correctly:
-
Remember that VDISKs are themselves shared data spaces, so if
you counted their consumption by using FCX147 VDISKS, don't double-count
them by picking up their numbers again from FCX134.
-
Minidisk Cache (MDC) is implemented as shared data spaces too, but MDC's
storage use is NOT expressed in FCX134. We have to look at FCX178 MDCSTOR
to see that.
FCX178 MDCSTOR tells us how much storage
Minidisk Cache is using,
both in central and in XSTORE. The columns labelled Actual are
the ones we need to inspect.
Notice there is a >>Mean>> row that is
the average value over the time interval of the report, and then there
is a separate row for each reported-on time interval.
FCX253 STORLOG
gives general information about how CP believes it
is using storage.
FCX254 AVAILLOG
tells how many storage frames are available in
various places.
The Times empty columns give us information about
whether CP is routinely finding itself out of storage
on various free-storage lists. Excessive values here
indicate that the system is generally short on storage.
Using Less Storage
If after examining your storage consumption
you decide to try to use less,
here are some things you can try.
For Linux users, try:
- Size the heap correctly for the workload. Consult your application
provider for guidance.
- Trim each guest's real storage size until the guest just barely
starts swapping. Each guest should be provided a hierarchy
of swap devices with a VDISK as the first device.
You can see the VDISK I/O rates
on the FCX147 VDISKS report.
- Put the Linux kernel into a segment. This lets all Linux
guests share a single copy of the kernel.
- Use the XIP file system. The storage benefits of using
XIP are documented separately.
- Use VM Resource Manager's Cooperative Memory Management
feature to encourage Linux guests
to give up storage they might be using unnecessarily.
For CMS users, try:
- IPL from a segment.
- Put applications into segments, and use the CMS SEGMENT support.
- If you are using SFS, try using DIRCONTROL directories in data spaces.
For minidisk cache, try:
- Use CP SET MDCACHE to change the amount of storage MDC will use
in central or in XSTORE.
- Use CP SET MDCACHE or MINIOPT NOMDC
to control which real volumes or minidisks are cached in MDC.
Turn off MDC for disks whose I/O is not almost entirely reads.
For CP storage, try:
-
Using dedicated OSAs for Linux leads to excess guest pages
being locked into real storage. Consider VSWITCH devices
instead.
-
Having excess real devices
in the configuration leads to excessive
consumption of CP free storage. In SYSTEM CONFIG, mark unused
devices as offline at IPL.
Determining Paging Rates
Several Perfkit reports comment
on paging rates.
FCX225 SYSSUMLG contains two columns that tell us
the system's basic paging rates.
The PGIN+PGOUT column
tells the paging rate to or from XSTORE.
The Read+Write column tells the paging rate
to or from DASD.
FCX225 SYSSUMLG is a handy report because it comments on many diverse
performance metrics all at once.
Here's an excerpt.
1FCX225 Run 2009/11/04 09:12:16 SYSSUMLG
System Performance Summary by Time
From 2009/10/29 14:26:05
To 2009/10/29 15:25:05
For 3540 Secs 00:59:00 Result of xxxxxxx Run
__________________________________________________________________________________
<------- CPU --------> <--Users--> <---I/O---> <-Paging-->
<--Ratio--> SSCH DASD Users <-Rate/s-->
Interval Pct Cap- On- Pct Log- +RSCH Resp in PGIN+ Read+
End Time Busy T/V ture line Busy ged Activ /s msec Elist PGOUT Write
>>Mean>> 21.6 1.12 .9281 20.0 .... 30 23 6.3 .3 .0 .0 .0
14:27:05 20.7 1.09 .9485 20.0 .... 30 23 5.9 .2 .0 .0 .0
14:28:05 17.5 1.10 .9402 20.0 .... 30 23 6.0 .3 .0 .0 .0
14:29:05 23.6 1.08 .9505 20.0 .... 30 23 6.0 .3 .0 .0 .0
FCX143 PAGELOG is probably one of the more handy reports
for systematic studies of paging behavior. Interval by interval, PAGELOG
comments on PGINs, PGOUTs, migrations, reads, and writes.
It also alerts us to single-page reads and writes.
Unfortunately, PAGELOG is so
wide that it is difficult to post an excerpt in this article.
Let's try looking at it
in two excerpts: the left side, which discusses XSTORE,
and then the right side, which discusses DASD.
Here's the left side, which discusses XSTORE. Notice the
report tabulates PGINs, PGOUTs, and migrations, all
separately.
1FCX143 Run 2009/11/04 09:12:16 PAGELOG
Total Paging
From 2009/10/29 14:26:05
To 2009/10/29 15:25:05
For 3540 Secs 00:59:00 Result of xxx
_____________________________________________________
<----------- Expanded Storage ----------->
Fast- Est. Page
Interval Paging PGIN Path PGOUT Total Life Migr
End Time Blocks /s % /s /s sec /s
>>Mean>> 1049k .0 .0 .0 .0 .... .0
14:27:05 1049k .0 .0 .0 .0 .... .0
14:28:05 1049k .0 .0 .0 .0 .... .0
14:29:05 1049k .0 .0 .0 .0 .... .0
Here's the right side, which discusses DASD. The '...'
placeholders mark where we deleted columns from the report.
Notice the report tabulates reads and writes, and also tabulates
single-page ops, which can be expensive.
1FCX143 Run ...
...
From 2009/1 ...
To 2009/1 ...
For 3540 ...
___________ ... _________________________________________
...
... <----------- Paging to DASD ------------>
... <-Single Reads-->
Interval ... Reads Write Total Shrd Guest Systm Total
End Time ... /s /s /s /s /s /s /s
>>Mean>> ... .0 .0 .0 3.0 .0 .0 .0
14:27:05 ... .0 .0 .0 3.0 .0 .0 .0
14:28:05 ... .0 .0 .0 3.0 .0 .0 .0
14:29:05 ... .0 .0 .0 3.0 .0 .0 .0
FCX113 UPAGE gives a lot of information about
paging, broken out by user.
The left half of UPAGE comments on users' paging activity.
The values are averages over the time interval of the report.
The time interval of the report is located in the report's
upper left-hand corner.
Here's an excerpt.
1FCX113 Run 2009/11/04 09:12:16 UPAGE
User Paging Activi
From 2009/10/29 14:26:05
To 2009/10/29 15:25:05
For 3540 Secs 00:59:00 Result of xxxxxxx
__________________________________________________________
. . _____ . . . . . .
Data <--------- Paging Activity/s ---------->
Spaces Page <--Page Migration-->
Userid Owned Reads Write Steals >2GB> X>MS MS>X X>DS
>System< .0 .0 .0 .0 .0 .0 .0 .0
DTCVSW1 .0 .0 .0 .0 .0 .0 .0 .0
DTCVSW2 .0 .0 .0 .0 .0 .0 .0 .0
FTPSERVE .0 .0 .0 .0 .0 .0 .0 .0
LINWRK .0 .0 .0 .0 .0 .0 .0 .0
MAINT .0 .0 .0 .0 .0 .0 .0 .0
MONWRITE .0 .0 .0 .0 .0 .0 .0 .0
X>MS is PGINs.
MS>X is PGOUTs.
X>DS is migrations.
For interval-by-interval studies, FCX113 INTERIM UPAGE is a handy
tool. To study a specific user's paging experience over time,
consider using FCX163 UPAGELOG.
FCX147 VDISKS gives paging rates, by VDISK.
Again, the values are averages over the report's time interval.
FCX134 DSPACESH gives paging rates, by data space.
Again, these are averages over the report's time interval.
Remember that only shared data spaces appear in this report.
Effect of Paging on the Workload
FCX114 USTAT
reveals the effect of paging on users' ability to
run. For each user, %PGW is the percent of user state
samples revealing the user
is in a page-fault wait. %PGA is the percent of
samples revealing the
user has loaded a wait-state PSW while waiting for a page-fault
operation to complete. Generally, excessive values in these
percentages will correlate to high per-user paging rates as
reported in FCX113 UPAGE.
FCX163 UPAGELOG, mentioned earlier, chronicles
one user's paging experience, interval by interval. This report
is useful in studying the paging behavior of a specific user.
FCX145 SCHEDLOG
reports on the lengths of the scheduler queues, by time.
If the z/VM
scheduler is holding back users because of storage constraints,
we will see nonzero lengths for the eligible lists.
You can
use the CP SET SRM command to adjust the z/VM scheduler's
resource protection thresholds and therefore its propensity
for forming an eligible list. If you relax the protection
thresholds,
you must
first prepare
for the corresponding increase
in resource pressure.
This will probably mean beefing
up your paging system; to do so, follow
the best practices named in the next section
of this article.
If it is necessary to afford storage preference to certain users,
the following methods can be used:
- Use the CP SET RESERVED command to guarantee a given user
a certain minimum number of real storage frames, unless the user
has fallen inactive.
- Use the OPTION QUICKDSP statement or SET QUICKDSP ON command
to let a user enter the dispatch list even though the
z/VM scheduler
might otherwise have held the user back because of a
CPU, storage, or paging constraint.
Configuring a z/VM System to Page Well
Here are some guidelines for how to set up a z/VM
paging system
to give it every advantage in supporting your workload.
-
Define about 25% of the partition's total storage as XSTORE, up
to a maximum of 2 to 4 GB. This is a general rule of thumb. We
will give you some XSTORE tuning tips later on.
-
Use enough paging packs so that the packs run no more than
about 50% full.
As a rough start for this, calculate the total size of your logged-on
guests, plus their VDISKs, plus their shared data spaces. Subtract
the total storage for the partition. Then double that number.
Roughly, the answer is how much page space you will need in
the worst case.
You will find this calculation sizes your paging space quite generously.
After you run for a while, you might decide to add or remove some paging
volumes. That's fine.
-
Remember that paging well is all about being able to run more
than one paging I/O at a time. This means you should spread your
paging space over as many volumes as possible. Get yourself lots of
little paging volumes, instead of one or two big ones. The more
paging volumes you provide, the more paging I/Os z/VM can run
concurrently.
-
Make all of your volumes the same size. Use all
3390-3s, or 3390-9s, or whatever.
When the volumes are unequally sized, the smaller ones
fill and thereby become ineligible as targets for
page-outs, thus restricting z/VM's opportunity for
paging I/O concurrency.
-
A disk volume should be either all paging (cylinders 1 to END)
or no paging at all. Never allocate paging space on a volume
that also holds other kinds of data, such as spool space or
user minidisks.
-
Think carefully about which of your
DASD subsystems you choose for paging.
Maybe you have DASD controllers of vastly different speeds, or
cache sizes, or existing loads. When you decide where to place
paging volumes, take the DASD subsystems'
capabilities and existing loads into account.
-
Within a given DASD controller,
volume performance
is generally sensitive to how the
volumes are placed. Work with your DASD people to avoid poor
volume placement, such as putting all of your paging volumes
into one rank.
-
If you can avoid ESCON chpids for paging, do it. An ESCON chpid
can carry only one I/O at a time. FICON chpids can run multiple
I/Os concurrently: 32 or 64, depending on the generation of the
FICON card.
-
If you can, run multiple chpids to each DASD controller that holds
paging volumes. Consider two, or four, or eight chpids per controller.
Do this even if you are using FICON.
-
If you have FCP chpids and SCSI DASD controllers, you might consider
exploiting them for paging. A SCSI LUN defined to the z/VM system as
an EDEV and ATTACHed to SYSTEM for paging has the very nice property
that the z/VM
Control Program can overlap I/Os to it. This lets you achieve
paging I/O concurrency without needing multiple volumes. However,
don't run this configuration if you are CPU-constrained. It takes
more CPU cycles per I/O to do EDEV I/O than it does to do classic
ECKD I/O.
-
Make sure you run with a few reserved slots in the CP-owned list,
so you can add paging volumes without an IPL if the need arises.
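The page-space sizing rule of thumb from the list above can be sketched as a short calculation. The Python below is a rough illustration, not an official sizing tool: the function names are hypothetical, "GB" is assumed to mean 2**30 bytes, and the figure of 180 4 KB slots per cylinder is the one this article uses when reading FCX109.

```python
import math

SLOTS_PER_CYLINDER = 180   # 4 KB slots per cylinder, as used for FCX109
SLOT_BYTES = 4096
GB = 2**30                 # assumption: GB means 2**30 bytes here


def worst_case_page_space_gb(guest_gb, vdisk_gb, dataspace_gb, partition_gb):
    """Double the amount by which total demand exceeds partition storage."""
    demand = sum(guest_gb) + sum(vdisk_gb) + sum(dataspace_gb)
    return 2 * max(demand - partition_gb, 0)


def paging_volumes_needed(page_space_gb, paging_cylinders_per_volume):
    """Count equally sized volumes needed to hold the given page space."""
    slots_needed = page_space_gb * GB / SLOT_BYTES
    slots_per_volume = paging_cylinders_per_volume * SLOTS_PER_CYLINDER
    return math.ceil(slots_needed / slots_per_volume)
```

For example, three guests of 4, 4, and 8 GB plus a 0.5 GB VDISK in an 8 GB partition give 17 GB of page space by this rule. With 3338 paging cylinders per volume (an example value for a 3390-3 allocated from cylinder 1 to END), that works out to 8 volumes. Treat the result as a generous starting point and adjust after you have run for a while.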
Inspecting and Tuning Paging Health
To determine whether the paging system per se
is configured and
operating OK, examine the following Perfkit reports and
fields.
FCX225 SYSSUMLG and
FCX143 PAGELOG
tell us the balance between XSTORE
paging and DASD paging. We usually consider the system to have enough
XSTORE assigned to paging if the PGIN+PGOUT rate is greater than or
equal to the DASD read+write rate. If the XSTORE rate is too low,
then add XSTORE to the partition, or use the CP SET MDCACHE command
so that Minidisk Cache uses less XSTORE. You can examine FCX103
STORAGE to see how much XSTORE Minidisk Cache is using. This XSTORE
tuning tip is more precise than the "25%, up to 2 to 4 GB" XSTORE
rule of thumb we gave earlier.
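Both XSTORE guidelines are simple enough to write down as arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the cap is passed in because the article gives it as a range of 2 to 4 GB rather than a single value.

```python
def initial_xstore_gb(partition_storage_gb, cap_gb):
    """Starting point: about 25% of partition storage, up to the chosen cap."""
    return min(0.25 * partition_storage_gb, cap_gb)


def xstore_carrying_enough(pgin_pgout_per_sec, dasd_read_write_per_sec):
    """True if the XSTORE paging rate is at least the DASD paging rate."""
    return pgin_pgout_per_sec >= dasd_read_write_per_sec
```

The first function gives a place to start; the second expresses the more precise check against the FCX225 SYSSUMLG or FCX143 PAGELOG rates.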
FCX109 DEVICE CPOWNED
tells us lots of things about the health of the
paging system. Things FCX109 reveals are:
-
In the report's
upper text, the caption Page slot utilization tells us how full
the paging system is altogether. We want this number to be 50%
or less. If it's too large, add paging volumes or reduce the
workload's memory requirement.
-
In the Area Extent column, the report tells us
how much paging space is allocated on each volume. The entry
is either a cylinder start and end, or it is a number of 4 KB
slots. To convert cylinder start and end to a number of slots,
calculate s = (end - start + 1) * 180. We want all volumes to
be the same size.
-
In the Used % column, the report tells us
how full each volume is separately. We want each volume to be about
the same percent full. If you have sized each volume the same,
this will take care of itself.
-
In the Serv Time /Page column, the report tells us
how long on average it takes to move a page on or off the volume,
once the transfer actually begins.
We want this to be less than 1.0 msec. If the value is higher,
we need to do DASD tuning,
which we'll describe shortly.
-
In the MLOAD Resp Time column, the report tells us
how long on average it takes to move a page on or off the volume,
including time the paging request waits in line to get access to
the volume.
We want this also to be less than 1.0 msec. If the value is higher
but Serv Time /Page is OK, spread the work across more volumes.
Otherwise do DASD tuning,
which we'll describe shortly.
-
In the Queue Lngth column, the report tells us
whether paging operations are queueing at the volume.
Queue formation at paging volumes is a very bad thing.
If we see this value nonzero,
we need either to add volumes or to do DASD tuning, both of
which we'll describe shortly.
Note that nonzero queue lengths are, of course, the cause of
elevated MLOAD Resp Time.
Keep in mind that
the values in FCX109
are averages over the time interval of the report.
The time interval of the report is located in the report's
upper left-hand corner.
Use FCX109 INTERIM DEVICE CPOWNED
for interval-by-interval studies of paging health.
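The FCX109 guidelines above can be collected into a small checklist. The Python below is a sketch under assumptions: the function and parameter names are invented, and the values are assumed to have been read off the report by hand, since nothing here parses Perfkit output.

```python
SLOTS_PER_CYLINDER = 180  # 4 KB slots per cylinder, from the Area Extent formula


def extent_slots(start_cyl, end_cyl):
    """Convert an Area Extent cylinder range to a number of 4 KB slots."""
    return (end_cyl - start_cyl + 1) * SLOTS_PER_CYLINDER


def check_slot_utilization(page_slot_utilization_pct):
    """Check the overall Page slot utilization figure against the 50% guideline."""
    if page_slot_utilization_pct > 50.0:
        return ["page space is over 50% full: add paging volumes "
                "or reduce the workload's memory requirement"]
    return []


def check_paging_volume(serv_ms_per_page, mload_resp_ms, queue_length):
    """Apply the per-volume guidelines and return a list of findings."""
    findings = []
    serv_ok = serv_ms_per_page < 1.0
    if not serv_ok:
        findings.append("service time per page is 1.0 msec or more: do DASD tuning")
    if mload_resp_ms >= 1.0:
        if serv_ok:
            findings.append("MLOAD response time is high but service time is OK: "
                            "spread paging over more volumes")
        else:
            findings.append("MLOAD response time is high: do DASD tuning")
    if queue_length > 0:
        findings.append("paging I/O is queueing at the volume: "
                        "add volumes or do DASD tuning")
    return findings
```

An empty list from both checks means the volume meets the guidelines given here; it says nothing about metrics the article does not cover.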
FCX103 STORAGE tells us
the system's overall block-paging factor for
reads and for writes. Generally we want these to be greater than
or equal to 10. If it is too small, the usual cause is that
paging space is too full. Refer to FCX109 DEVICE CPOWNED to
see how full the paging space is altogether. You probably need
to add volumes.
Inspecting and Tuning DASD, In General
The z/VM system will be able to page well only if its paging
volumes are performing correctly. Here are some Perfkit
reports you can examine and some tuning actions you can
take.
FCX131 DEVCONF tells us
which chpids are servicing your paging volumes. Determining
whether your paging system has enough channels and
appropriate channel technology
starts by knowing
what the paging channels are.
Use the information in this report to figure out
how many chpids lead to each DASD subsystem, and what the chpid
numbers are. If you don't already have one, sketch yourself
a diagram of your paging DASD configuration, and keep the
diagram handy.
FCX161 LCHANNEL tells us two interesting
things relative to paging I/O:
-
The Descr column tells us what technology is in use.
If you are using ESCON for paging, consider changing to FICON.
-
The Channel %Busy Distribution histogram
tells us,
by chpid, how CPU-busy the microprocessor is on the chpid's
adapter card. Notice the report shows us a histogram of
the distribution of each adapter's CPU-busy. The column
headings are percent-busy range
bands, and the entries in the
columns show us what fraction of samples showed the card
in said band. This lets us see not only the average busy
but also how variable the CPU-busy value is. Here's
an excerpt, showing the column headings.
1FCX161 Run 2009/11/04 09:12:16 LCHANNEL
Channel Load and Channel Busy Distribution
From 2009/10/29 14:26:05
To 2009/10/29 15:25:05
For 3540 Secs 00:59:00 Result of xxxxxxx Run
_____________________________________________________________________________________________
CHPID Chan-Group <%Busy> <----- Channel %Busy Distribution 14:26:05-15:25:05 ------>
(Hex) Descr Qual Shrd Cur Ave 0-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
The adapter cards can usually carry heavy paging
data rates without experiencing unduly high CPU-busy values.
What tends to drive up adapter CPU-busy
is high rates of very small I/Os,
which usually come from applications' I/O, not paging I/O.
If you see high CPU-busy on paging chpids, determine whether
application or guest I/O habits are the cause. Consider separating
application I/O and paging I/O from one another.
FCX215 FCHANNEL tells us what the data rates are
on each chpid. The Write/s and Read/s columns tell the story.
What we are looking for here is whether the fiber's data rate
is approaching the fiber speed. To determine whether this is happening,
first determine what kind of channel card is involved.
An ESCON chpid moves about 17 MB/sec, and FICON cards move 1, 2, 4,
or 8 Gb/sec, depending on their generation. To compare the FCX215
byte rates to the channel's fiber speed,
use a conversion factor of about
10 bits per byte. While this is not exact, it will give you a rough
estimate of whether the fiber is full.
Remember the values are averages over the time interval of the report.
The time interval of the report is located in the report's
upper left-hand corner.
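The comparison of a byte rate to a fiber speed can be sketched as follows. This is a rough estimate only: the function name is hypothetical, the input is assumed to be a rate in bytes per second taken from FCX215, the fiber speed is treated as exactly that many billion bits per second, and the bits-per-byte factor is passed in so that you can apply the approximate conversion factor discussed above.

```python
def fiber_utilization(bytes_per_sec, fiber_gbit_per_sec, bits_per_byte):
    """Estimate the fraction of a fiber's speed consumed by a byte rate."""
    wire_bits_per_sec = bytes_per_sec * bits_per_byte
    return wire_bits_per_sec / (fiber_gbit_per_sec * 1e9)
```

A result approaching 1.0 suggests the fiber is nearly full. Because the inputs are interval averages, short bursts can saturate the fiber even when the estimate looks comfortable.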
FCX232 IOPROCLG tells us whether your System z's
I/O subsystem (aka channel subsystem) is encountering busy
conditions as it tries to start I/Os.
The values are reported as busy conditions encountered
per SSCH attempted, so of course "zero" is the optimal
answer for these values. The specific kinds of busy conditions
reported are these:
- Channel busies generally mean you have not provided enough
channels to the DASD subsystem. Add some.
- Switch busies mean there is a congestion problem in your
switch. Work with your switch provider.
- Control unit busies generally mean the CPU in the DASD
subsystem is too busy. Work with your DASD provider or spread
activity across more DASD controllers.
- Device busies generally mean that for a given volume, I/O
from one partition is blocking I/O from another partition.
We would never expect to see this for paging volumes.
Older versions of Perfkit have a significant defect in the FCX232
IOPROCLG report that is worth mentioning.
If your column headings
look like this, you are running a defective Perfkit:
Interval Proc <-Activity/Sec--> Proc <-- I/O Path Percent Busy --->
End Time Number Beg_SSCH I/O_Int %Busy Channel Switch CU Device
The numbers in the last four columns are in fact
busies encountered per SSCH performed,
but the displayed values are a factor of 100 too large. Divide them all by 100 and
then proceed.
If your column headings
look like this, you are running a corrected Perfkit:
Interval Proc <-Activity/Sec--> Proc <- Busy conditions per SSCH ->
End Time Number Beg_SSCH I/O_Int %Busy Channel Switch CU Device
If this is your situation, the last four columns' numbers are correct.
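If you post-process FCX232 values yourself, the correction can be applied mechanically. The sketch below assumes you have already extracted the heading line and the four busy values as numbers; the function names are hypothetical, and the heading test relies only on the two heading variants shown above.

```python
def is_defective_heading(heading_line):
    """True if the heading matches the older, defective report layout."""
    return "I/O Path Percent Busy" in heading_line


def busies_per_ssch(channel, switch, cu, device, defective):
    """Return the four values as busy conditions per SSCH."""
    scale = 100.0 if defective else 1.0
    return tuple(value / scale for value in (channel, switch, cu, device))
```

With the corrected heading the values pass through unchanged, so the same code path can be used for reports from either Perfkit level.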
FCX108 DEVICE reports three principal
measures of performance for individual disk volumes.
The measures
appear in the report on a per-volume basis.
The measures,
their explanations, and
their remediation strategies are:
-
Pending time per I/O is the time it takes the System z
channel subsystem to find a chpid to use for the I/O, plus the
amount of time it takes the DASD subsystem to return an
initial response indicating that it has received the channel
program.
If you see
more than 0.1 msec of pending time, you might have one or more of
these problems:
-
The channels leading to the DASD subsystem are too busy.
FCX232 IOPROCLG should confirm this in channel busies.
Add chpids or spread the
paging volumes among more controllers.
-
The CPU in the DASD subsystem is too busy.
Unfortunately we have no way to measure controller CPU-busy, but
FCX232 IOPROCLG should confirm this in control-unit busies.
Spread the paging volumes among more controllers.
-
Disconnect time per I/O is a measure of controller
duress. The controller disconnects when it cannot immediately
satisfy the I/O by using its cache.
If you see more than
1 msec of disconnect time, you might have one or more of these problems:
- Controller volatile cache is overloaded or inadvertently off or
otherwise ineffective.
- Controller NVS is overloaded or inadvertently off or otherwise
ineffective.
Perfkit's FCX176 CTLUNIT and FCX177 CACHEXT reports
can help narrow these down. These two reports are especially
good for seeing the read-write distribution of the I/Os and
the controller cache hit rates, both by-volume and by-I/O-type.
If you have
problems with controller cache,
your mitigations will be to spread volumes out, or to add
controller cache, or to contact your DASD vendor for help.
-
Connect time per I/O is the amount of time an I/O
actually uses the fiber.
It is measured from the beginning
of the first FICON packet to the end of the last FICON packet.
In usual situations, one has very little influence over connect
time. However, in very highly loaded FICON situations, connect
time can elongate because of excessive I/O concurrency
(aka excessive concurrent
open exchanges). Because Perfkit doesn't report open
exchange level, there is not much we can really do to measure it.
However, if there is an excessive concurrent
open exchange problem, there's
probably also a pending-time or an IOPROCLG channel-busies problem,
so apply those remediations.
Note that service time per I/O is just the sum of
pending time, disconnect time, and connect time.
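The sum and the two thresholds named above can be expressed as a short check. The Python below is a sketch; the function name is hypothetical, and the inputs are assumed to be the per-I/O times in milliseconds as read from FCX108 DEVICE.

```python
def assess_device(pending_ms, disconnect_ms, connect_ms):
    """Return (service time in msec, list of findings) for one volume."""
    service_ms = pending_ms + disconnect_ms + connect_ms
    findings = []
    if pending_ms > 0.1:
        findings.append("pending time is high: check FCX232 IOPROCLG for "
                        "channel or control-unit busies")
    if disconnect_ms > 1.0:
        findings.append("disconnect time is high: check controller cache "
                        "and NVS using FCX176 CTLUNIT and FCX177 CACHEXT")
    return service_ms, findings
```

Connect time is not checked against a threshold because the article gives none; it is included only so that the service time sum is complete.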
FCX108 also reports several additional measures that are of
little to no
utility for studying
paging but which are very important for understanding
DASD performance in general. Though this article is really about
paging, for completeness we'll go ahead here and describe these
additional important measures:
Keep in mind that
all of the values in FCX108
are averages over the time interval of the report.
The time interval of the report is located in the report's
upper left-hand corner.
Use FCX108 INTERIM DEVICE
for interval-by-interval studies of the system's DASD health.
To study the interval-by-interval behavior of a specific DASD volume,
use FCX168 DEVLOG.
FCX176 CTLUNIT and
FCX177 CACHEXT
report on DASD I/O performance data harvested from the DASD
controller. These reports can provide useful information if
you are looking for causes of poor storage subsystem performance,
such as insufficient cache or insufficient NVS. These reports
are best interpreted in consultation with your storage
subsystem provider.
Summary
Because the z/VM value proposition pays off substantially
only in the face of successful
resource overcommitment, it's important for us to understand how
z/VM paging works, how to measure it, and how to tune it.
The z/VM paging system does its work by implementing well-defined
page motion, page marking, and page selection schemes.
Pages move among central storage, expanded storage, and DASD
according to demands for central and according to their use
patterns and ages. Pages to be ejected from central are chosen
in a way that tries to reduce impact on running users.
z/VM Performance Toolkit reports on storage consumption by guests,
by VDISKs, by shared data spaces, by minidisk cache, and by the
Control Program itself. After we understand where storage is
being used, we can tune storage use, by adjusting guests or by
adjusting the Control Program.
Perfkit also reports on paging rates, either overall, or by users,
by VDISKs, and so on. It also reports on whether users are
being significantly held back by paging I/O or by storage constraints.
There are many steps one can take to configure a z/VM system so
that it will page well. Key steps are to make sure there is
enough paging space, and that it is spread over enough volumes,
and that the volumes are used for only paging,
and that the volumes are placed intelligently in DASD subsystems,
and that the DASD subsystems have enough chpids.
Perfkit tells us whether the paging system is healthy:
whether XSTORE is bearing enough of the load,
whether the paging DASD is too full, or
whether the Control Program is experiencing undue
delays moving pages on and off of paging volumes.
Perfkit also tells us whether the paging DASD themselves are healthy.
Key metrics here are pending time, disconnect time, and connect time.
When one or more of these metrics is out of bounds, reallocating
or redistributing paging volumes or paging chpids
can usually solve the problem.