Understanding and Tuning z/VM Paging

(Last revised: 2010-06-17, BKW)

A key element of the z/VM value proposition is that we often get the best TCO result when we overcommit various physical resources, such as CPUs or memory.

When it's memory (aka storage) we wish to overcommit, the inevitable condition is that the z/VM system will page.

Thus it's in our best interest to understand z/VM's paging behaviors very well, including understanding how to measure z/VM paging, how to measure the influence of paging on workload behavior, and how to configure and tune the paging system for best performance.

In this article we will explore z/VM paging, including how it works, how to measure the paging system's behavior, how to assess the measurements for suitability, and how to repair or tune the paging system when its behavior seems unsuitable.

We assume here that the reader has a basic understanding of the notions of virtual memory and of paging as a means to overcommit storage.

A Few Reminders about Performance Measurements

Whenever we talk about performance, and especially when we discuss whether a given system's performance is suitable, remember that we must enter the discussion with well-formed ideas about what the word performance means to us, and what the words good or suitable mean.

Throughout the rest of this article, our tacit assumption is that for your workload and environment, you have settled on what specific phenomena you wish to measure to come to your practical, workable assessment of "performance".

We also assume that you have developed some success or suitability thresholds for the phenomena you've chosen to measure, so as to differentiate between "meets criteria" and "needs improvement".

Moreover, to apply this article's tips and techniques, you will need to understand the relationship between your chosen success metrics and the z/VM system's tendency to page.

If you've accomplished all of that, great, read on. If not, you will want to do some work along those lines before proceeding.

A Basic Look at z/VM Paging

On a z/VM system, pages are kept in three different places: central storage, expanded storage (aka XSTORE), and on disk.

Of course, z/VM's overall objective is to keep the "hot" pages -- that is, the ones that are often used -- in central, and the "warm" pages in XSTORE, and the "cool" pages on DASD. In this way, we help minimize the effect of paging on application performance.

To do this, z/VM brings together the following algorithmic elements:

  • z/VM has a well-defined page circulation or page motion scheme. In other words, pages move among the three residency points according to specific flows.

  • Further, z/VM also has a well-defined page marking and aging scheme. In other words, for each page, z/VM tracks how the page is touched, and in some cases, how long ago the touches occurred.

  • Finally, z/VM has a well-defined page selection scheme. When pages must circulate, z/VM inspects the pages' markings to decide which pages to shuffle.

Page Motion

First let's talk about page motion. In general, pages move among central, XSTORE, and DASD in a definite prescribed motion pattern, like this:

  • When central becomes constrained, the least important page in central has to leave. The chosen ejected page always goes to XSTORE. This is called a PGOUT operation.

  • When XSTORE becomes constrained, the least important page in XSTORE has to leave. The chosen ejected page always goes to DASD. This is called a migration operation.

  • When a page is needed in central, z/VM finds the page and brings it into central:
    • If the needed page is in XSTORE, it is moved back to central and the XSTORE slot is freed. This is called a PGIN operation.
    • If the needed page is on DASD, it is read in, but the DASD slot isn't necessarily freed right away. This is called a page read operation.

  • Regarding least-important pages leaving XSTORE, it is important to note that z/VM tries to evacuate XSTORE in groups of pages, instead of individually. This technique, called block paging, helps us to move many pages onto DASD without having to do so many I/O operations (Start Subchannel, or SSCH).

  • Regarding needed pages coming in from DASD, it is also important to note that when z/VM must read DASD, it doesn't read just one page. Instead, it tries to read blocks of pages. Again, this technique is called block paging.

  • Regarding pages moving from central to XSTORE (PGOUT), and XSTORE to central (PGIN), these pages move one at a time.
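The flows above can be sketched as a toy model. This is only an illustration under loud assumptions: "least important" is approximated by least-recently-touched in central and by oldest arrival in XSTORE, pages move one at a time (no block paging), XSTORE holds at least one page, and the class and method names are invented for this sketch. It is not how CP is implemented.

```python
from collections import OrderedDict


class PagingModel:
    """Toy three-tier model: central -> XSTORE -> DASD (hypothetical names)."""

    def __init__(self, central_frames, xstore_blocks):
        self.central_frames = central_frames
        self.xstore_blocks = xstore_blocks   # assumed to be at least 1
        self.central = OrderedDict()  # least recently touched page first
        self.xstore = OrderedDict()   # oldest XSTORE arrival first
        self.dasd = set()             # pages that have a copy on DASD
        self.counts = {"PGOUT": 0, "migration": 0, "PGIN": 0, "page read": 0}

    def touch(self, page):
        """Reference a page, bringing it into central if necessary."""
        if page in self.central:
            self.central.move_to_end(page)
            return
        if page in self.xstore:
            del self.xstore[page]            # PGIN frees the XSTORE slot
            self.counts["PGIN"] += 1
        elif page in self.dasd:
            self.counts["page read"] += 1    # DASD slot is not freed here
        if len(self.central) >= self.central_frames:
            self._pgout()
        self.central[page] = True

    def _pgout(self):
        """Eject the least important central page; it always goes to XSTORE."""
        victim, _ = self.central.popitem(last=False)
        if len(self.xstore) >= self.xstore_blocks:
            self._migrate()
        self.xstore[victim] = True
        self.counts["PGOUT"] += 1

    def _migrate(self):
        """Eject the oldest XSTORE page; it always goes to DASD."""
        victim, _ = self.xstore.popitem(last=False)
        self.dasd.add(victim)
        self.counts["migration"] += 1


model = PagingModel(central_frames=2, xstore_blocks=1)
for page in ["a", "b", "c", "d", "a", "c"]:
    model.touch(page)
print(model.counts)
```

Running a short reference string through the model shows each of the four operations being counted: PGOUT, migration, PGIN, and page read.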

Page Marking

Now let's talk about page marking. Here are some descriptions of the marking techniques z/VM uses.

  • For pages in central, the System z hardware marks a page as referenced whenever an instruction references a location on the page.

  • Also for pages in central, the System z hardware marks a page as changed whenever an instruction modifies the content of the page.

  • Every once in a while, for each user, z/VM sorts the user's list of resident pages according to these markings, so that it's easy for z/VM to find the user's unreferenced pages. This occasional sorting is called reorder.

  • Whenever a page is placed into XSTORE, z/VM updates its XSTORE map with the TOD value of when the page entered XSTORE.

Page Selection

When central storage is constrained and a frame is needed, something has to leave. The choosing process is called demand scan. Here is the prioritized list of places z/VM looks, to find something to move out. z/VM works its way down through these stages of the selection algorithm until it has chosen and ejected enough pages to satisfy the request.

  1. In a first pass called pass 1, z/VM tries to be friendly to dispatched users. So it looks for these kinds of pages to move out:
    1. Unreferenced pages belonging to shared address spaces.
    2. Pages belonging to users on the long-term-dormant list. In other words, users that aren't running.
    3. Pages belonging to users on the eligible list. These users want to run, but they can't run right now because resource is too constrained.
    4. Pages belonging to dispatched users, but z/VM will take from a dispatched user only down to something called working-set size (WSS). A user's WSS is z/VM's perception of how many pages the user needs to have in central to run without incurring a page fault.

  2. In a second pass called pass 2, z/VM gets a little more aggressive:
    1. It skips shared address spaces.
    2. It takes from dispatched users, but only down to their SET RESERVED value.

  3. In a third pass called emergency scan, z/VM will take whatever it can find to move out.

When XSTORE is constrained, z/VM moves out its oldest pages, using the pages' TOD stamps.
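The staged order of demand scan can be sketched as follows. This is a simplified illustration, not CP's algorithm: the function name, the argument shapes, and the stage labels are assumptions made for the sketch, each pool is reduced to a simple page count, and the emergency scan is modeled as taking only the remaining pages of dispatched users.

```python
def demand_scan(needed, shared_unreferenced, dormant, eligible, dispatched):
    """Return ([(stage, pages_taken), ...], pages_still_needed).

    dispatched is a list of dicts with 'resident', 'wss', and 'reserved'
    page counts for each dispatched user. Inputs are not modified.
    """
    taken = []
    users = [dict(u) for u in dispatched]

    def take(stage, available):
        # Take as many pages as this stage can offer, up to what is still needed.
        nonlocal needed
        n = min(needed, max(available, 0))
        if n > 0:
            taken.append((stage, n))
            needed -= n
        return n

    # Pass 1: be friendly to dispatched users.
    take("pass 1: shared, unreferenced", shared_unreferenced)
    take("pass 1: long-term dormant", dormant)
    take("pass 1: eligible list", eligible)
    for u in users:
        u["resident"] -= take("pass 1: dispatched, down to WSS",
                              u["resident"] - u["wss"])
    # Pass 2: take from dispatched users down to their reserved value.
    for u in users:
        u["resident"] -= take("pass 2: dispatched, down to reserved",
                              u["resident"] - u["reserved"])
    # Emergency scan: take whatever is left.
    for u in users:
        u["resident"] -= take("emergency scan", u["resident"])
    return taken, needed


stages, shortfall = demand_scan(
    needed=100, shared_unreferenced=10, dormant=20, eligible=5,
    dispatched=[{"resident": 200, "wss": 180, "reserved": 50}],
)
for stage, pages in stages:
    print(stage, pages)
```

In this made-up example the request is satisfied partway through pass 2, so the emergency scan is never reached.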

Determining Storage Consumption

Tuning and measuring the z/VM paging system is relevant only after we understand the storage demands of the workload. These z/VM Performance Toolkit (herein, Perfkit) reports help you calculate how much storage your workload is using.

FCX113 UPAGE tells us how many pages each guest seems to need, and where those pages are residing. The information we seek is on the right side of the report. Because the UPAGE report is so wide, in the excerpt below we've cropped out many of the middle columns. We've indicated the cropping by '...' in the excerpt.

1FCX113  Ru ...  Page 34
            ... ty and Storage Utilization
From 2009/  ...  xxxxxxx
To   2009/  ...  CPU 2097-700  SN xxxxx
For   3540  ...  Run  z/VM   V.5.4.0  SLU 0902
__________  ...  ______________________________________________________________
         ...
.        ...      .      .      .      .      .      .      .     .     .
         ...  <----------------- Number of Pages ----------------->
         ...                <-Resident->  <--Locked-->               Stor Nr of
Userid   ...    WSS Resrvd  R<2GB  R>2GB  L<2GB  L>2GB  XSTOR  DASD  Size Users
>System< ...  1423k      0  15804  1407k      4    147      0     0 5602M    30
DTCVSW1  ...   2649      0    136   2609     14     82      0     0   32M
DTCVSW2  ...   2600      0    173   2555     38     90      0     0   32M
FTPSERVE ...   1407      0     73   1335      0      1      0     0   32M
LINWRK   ...  94513      0   3050  91490      0     13      0     0  512M
MAINT    ...   1473      0     55   1418      0      0      0     1  128M
MONWRITE ...    187      0     37    150      0      0      0     0    4M

R<2GB is pages resident below the 2 GB line in real storage. R>2GB is pages resident above the 2 GB line in real storage. XSTOR is pages in XSTORE. DASD is pages on paging DASD.

The >System< row gives us the residency statistics for the average user. If we multiply its values by the Nr of Users value at the right, we can calculate how many user pages are in these various places on average.

Remember that all of these values are average values over the time interval of the report. The time interval of the report is shown in the report's upper left-hand corner.
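As a worked example of the multiplication described above, here is a small sketch that scales the >System< row from the excerpt by Nr of Users. The helper names are invented for illustration, and we assume that a "k" suffix in the report means thousands and that a page is 4 KB.

```python
PAGE_BYTES = 4096  # assumption: 4 KB pages


def parse_count(text):
    """Convert a report count such as '1407k' to an integer.

    Assumption: the 'k' suffix means thousands.
    """
    text = text.strip()
    if text.lower().endswith("k"):
        return int(float(text[:-1]) * 1000)
    return int(text)


def total_pages(system_row, nr_users):
    """Scale each per-user average in the >System< row by the user count."""
    return {col: parse_count(value) * nr_users for col, value in system_row.items()}


# Values copied from the >System< row of the UPAGE excerpt.
system_row = {"R<2GB": "15804", "R>2GB": "1407k", "XSTOR": "0", "DASD": "0"}
totals = total_pages(system_row, nr_users=30)
resident_gib = (totals["R<2GB"] + totals["R>2GB"]) * PAGE_BYTES / 2**30
print(totals)
print(round(resident_gib, 1), "GiB resident on average")
```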

FCX147 VDISKS is very much like FCX113 UPAGE, except the VDISKS report describes the residency distribution for virtual disks in storage, aka VDISKs. Look at the columns on the right side of the report, under the Nr of Pages heading. These describe resident pages, XSTORE pages, and DASD pages. Also like UPAGE, the >System< row tells us the distribution for the average VDISK. To know how many VDISKs were involved in the calculation of the average, count the individual rows in the report. Use the count and the >System< values to calculate the total storage your VDISKs are using.

FCX134 DSPACESH is very much like FCX147 VDISKS, except it describes the residency distribution for shared data spaces in general. Look on the right side of the report to find the residency counts. Some notes on using this report correctly:

  • Remember that VDISKs are themselves shared data spaces, so if you counted their consumption by using FCX147 VDISKS, don't double-count them by picking up their numbers again from FCX134.
  • Minidisk Cache (MDC) is implemented as shared data spaces too, but MDC's storage use is NOT expressed in FCX134. We have to look at FCX178 MDCSTOR to see that.

FCX178 MDCSTOR tells us how much storage Minidisk Cache is using, both in central and in XSTORE. The columns labelled Actual are the ones we need to inspect. Notice there is a >>Mean>> row that is the average value over the time interval of the report, and then there is a separate row for each reported-on time interval.

FCX253 STORLOG gives general information about how CP believes it is using storage.

FCX254 AVAILLOG tells how many storage frames are available in various places. The Times empty columns give us information about whether CP is routinely finding itself out of storage on various free-storage lists. Excessive values here indicate that the system is generally short on storage.

Using Less Storage

If after examining your storage consumption you decide to try to use less, here are some things you can try.

For Linux users, try:

  • Size the heap correctly for the workload. Consult your application provider for guidance.
  • Trim each guest's real storage size until the guest starts barely swapping. Each guest should be provided a hierarchy of swap devices with a VDISK as the first device. You can see the VDISK I/O rates on the FCX147 VDISKS report.
  • Put the Linux kernel into a segment. This lets all Linux guests share a single copy of the kernel.
  • Use the XIP file system. The storage benefits of using XIP are documented here.
  • Use VM Resource Manager's Cooperative Memory Management feature to encourage Linux guests to give up storage they might be using unnecessarily.

For CMS users, try:

  • IPL from a segment.
  • Put applications into segments, and use the CMS SEGMENT support.
  • If you are using SFS, try using DIRCONTROL directories in data spaces.

For minidisk cache, try:

  • Use CP SET MDCACHE to change the amount of storage MDC will use in central or in XSTORE.
  • Use CP SET MDCACHE or MINIOPT NOMDC to control which real volumes or minidisks are cached in MDC. Turn off MDC for disks that aren't almost all reads.

For CP storage, try:

  • Using dedicated OSAs for Linux leads to excess guest pages being locked into real storage. Consider VSWITCH devices instead.
  • Having excess real devices in the configuration leads to excessive consumption of CP free storage. In SYSTEM CONFIG, mark unused devices as offline at IPL.

Determining Paging Rates

Several Perfkit reports comment on paging rates.

FCX225 SYSSUMLG contains two columns that tell us the system's basic paging rates. The PGIN+PGOUT column tells the paging rate to or from XSTORE. The Read+Write column tells the paging rate to or from DASD. FCX225 SYSSUMLG is a handy report because it comments on many diverse performance metrics all at once. Here's an excerpt.

1FCX225  Run 2009/11/04 09:12:16         SYSSUMLG
                                         System Performance Summary by Time
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of xxxxxxx Run
__________________________________________________________________________________

        <------- CPU --------> <Vec> <--Users--> <---I/O---> <Stg> <-Paging-->
              <--Ratio-->                         SSCH  DASD Users <-Rate/s-->
Interval   Pct       Cap-  On-   Pct  Log-       +RSCH  Resp    in PGIN+ Read+
End Time  Busy  T/V  ture line  Busy   ged Activ    /s  msec Elist PGOUT Write
>>Mean>>  21.6 1.12 .9281 20.0  ....    30    23   6.3    .3    .0    .0    .0
14:27:05  20.7 1.09 .9485 20.0  ....    30    23   5.9    .2    .0    .0    .0
14:28:05  17.5 1.10 .9402 20.0  ....    30    23   6.0    .3    .0    .0    .0
14:29:05  23.6 1.08 .9505 20.0  ....    30    23   6.0    .3    .0    .0    .0

FCX143 PAGELOG is probably one of the more handy reports for systematic studies of paging behavior. Interval by interval, PAGELOG comments on PGINs, PGOUTs, migrations, reads, and writes. It also alerts us to single-page reads and writes.

Unfortunately, PAGELOG is so wide that it is difficult to post an excerpt in this article. Let's try looking at it in two excerpts: the left side, which discusses XSTORE, and then the right side, which discusses DASD.

Here's the left side, which discusses XSTORE. Notice the report tabulates PGINs, PGOUTs, and migrations, all separately.

1FCX143  Run 2009/11/04 09:12:16         PAGELOG
                                         Total Paging
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of xxx
_____________________________________________________

               <----------- Expanded Storage ----------->
                        Fast-                 Est.   Page
Interval Paging   PGIN   Path  PGOUT  Total   Life   Migr
End Time Blocks     /s      %     /s     /s    sec     /s
>>Mean>>  1049k     .0     .0     .0     .0   ....     .0
14:27:05  1049k     .0     .0     .0     .0   ....     .0
14:28:05  1049k     .0     .0     .0     .0   ....     .0
14:29:05  1049k     .0     .0     .0     .0   ....     .0

Here's the right side, which discusses DASD. The '...' placeholders mark where we deleted columns from the report. Notice the report tabulates reads and writes, and also tabulates single-page ops, which can be expensive.

1FCX143  Run ...
             ...
From 2009/1  ...
To   2009/1  ...
For   3540   ...
___________  ...  _________________________________________
         ...
         ... <----------- Paging to DASD ------------>
         ...                         <-Single Reads-->
Interval ... Reads Write Total  Shrd Guest Systm Total
End Time ...    /s    /s    /s    /s    /s    /s    /s
>>Mean>> ...    .0    .0    .0   3.0    .0    .0    .0
14:27:05 ...    .0    .0    .0   3.0    .0    .0    .0
14:28:05 ...    .0    .0    .0   3.0    .0    .0    .0
14:29:05 ...    .0    .0    .0   3.0    .0    .0    .0

FCX113 UPAGE gives a lot of information about paging, broken out by user. The left half of UPAGE comments on users' paging activity. The values are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Here's an excerpt.

1FCX113  Run 2009/11/04 09:12:16         UPAGE
                                         User Paging Activi
From 2009/10/29 14:26:05
To   2009/10/29 15:25:05
For   3540 Secs 00:59:00                 Result of xxxxxxx
__________________________________________________________

.             . _____     .      .     .     .     .     .
           Data   <--------- Paging Activity/s ---------->
         Spaces <Page Rate>   Page    <--Page Migration-->
Userid    Owned Reads Write Steals >2GB>  X>MS  MS>X  X>DS
>System<     .0    .0    .0     .0    .0    .0    .0    .0
DTCVSW1      .0    .0    .0     .0    .0    .0    .0    .0
DTCVSW2      .0    .0    .0     .0    .0    .0    .0    .0
FTPSERVE     .0    .0    .0     .0    .0    .0    .0    .0
LINWRK       .0    .0    .0     .0    .0    .0    .0    .0
MAINT        .0    .0    .0     .0    .0    .0    .0    .0
MONWRITE     .0    .0    .0     .0    .0    .0    .0    .0

X>MS is PGINs. MS>X is PGOUTs. X>DS is migrations.

For interval-by-interval studies, FCX113 INTERIM UPAGE is a handy tool. To study a specific user's paging experience over time, consider using FCX163 UPAGELOG.

FCX147 VDISKS gives paging rates, by VDISK. Again, the values are averages over the report's time interval.

FCX134 DSPACESH gives paging rates, by data space. Again, these are averages over the report's time interval. Remember that only shared data spaces appear in this report.

Effect of Paging on The Workload

FCX114 USTAT reveals the effect of paging on users' ability to run. For each user, %PGW is the percent of user state samples revealing the user is in a page-fault wait. %PGA is the percent of samples revealing the user has loaded a wait-state PSW while waiting for a page-fault operation to complete. Generally, excessive values in these percentages will correlate to high per-user paging rates as reported in FCX113 UPAGE.

FCX163 UPAGELOG, mentioned earlier, chronicles one user's paging experience, interval by interval. This report is useful in studying the paging behavior of a specific user.

FCX145 SCHEDLOG reports on the lengths of the scheduler queues, by time. If the z/VM scheduler is holding back users because of storage constraints, we will see nonzero lengths for the eligible lists. You can use the CP SET SRM command to adjust the z/VM scheduler's resource protection thresholds and therefore its propensity for forming an eligible list. If you relax the protection thresholds, you must first prepare for the corresponding increase in resource pressure. This will probably mean beefing up your paging system; to do so, follow the best practices named in the next section of this article.

If it is necessary to afford storage preference to certain users, the following methods can be used:

  • Use the CP SET RESERVED command to guarantee a given user a certain minimum number of real storage frames, unless the user has fallen inactive.
  • Use the OPTION QUICKDSP statement or SET QUICKDSP ON command to let a user enter the dispatch list even though the z/VM scheduler might otherwise have held the user back because of a CPU, storage, or paging constraint.

Configuring a z/VM System to Page Well

Here are some guidelines for how to set up a z/VM paging system to give it every advantage in supporting your workload.

  1. Define about 25% of the partition's total storage as XSTORE, up to a maximum of 2 to 4 GB. This is a general rule of thumb. We will give you some XSTORE tuning tips later on.

  2. Use enough paging packs so that the packs run no more than about 50% full. As a rough start for this, calculate the total size of your logged-on guests, plus their VDISKs, plus their shared data spaces. Subtract the total storage for the partition. Then double that number. Roughly, the answer is how much page space you will need in the worst case. You will find this calculation sizes your paging space quite generously. After you run for a while, you might decide to add or remove some paging volumes. That's fine.

  3. Remember that paging well is all about being able to run more than one paging I/O at a time. This means you should spread your paging space over as many volumes as possible. Get yourself lots of little paging volumes, instead of one or two big ones. The more paging volumes you provide, the more paging I/Os z/VM can run concurrently.

  4. Make all of your volumes the same size. Use all 3390-3s, or 3390-9s, or whatever. When the volumes are unequally sized, the smaller ones fill and thereby become ineligible as targets for page-outs, thus restricting z/VM's opportunity for paging I/O concurrency.

  5. A disk volume should be either all paging (cylinders 1 to END) or no paging at all. Never allocate paging space on a volume that also holds other kinds of data, such as spool space or user minidisks.

  6. Think carefully about which of your DASD subsystems you choose for paging. Maybe you have DASD controllers of vastly different speeds, or cache sizes, or existing loads. When you decide where to place paging volumes, take the DASD subsystems' capabilities and existing loads into account.

  7. Within a given DASD controller, volume performance is generally sensitive to how the volumes are placed. Work with your DASD people to avoid poor volume placement, such as putting all of your paging volumes into one rank.

  8. If you can avoid ESCON chpids for paging, do it. An ESCON chpid can carry only one I/O at a time. FICON chpids can run multiple I/Os concurrently: 32 or 64, depending on the generation of the FICON card.

  9. If you can, run multiple chpids to each DASD controller that holds paging volumes. Consider two, or four, or eight chpids per controller. Do this even if you are using FICON.

  10. If you have FCP chpids and SCSI DASD controllers, you might consider exploiting them for paging. A SCSI LUN defined to the z/VM system as an EDEV and ATTACHed to SYSTEM for paging has the very nice property that the z/VM Control Program can overlap I/Os to it. This lets you achieve paging I/O concurrency without needing multiple volumes. However, don't run this configuration if you are CPU-constrained. It takes more CPU cycles per I/O to do EDEV I/O than it does to do classic ECKD I/O.

  11. Make sure you run with a few reserved slots in the CP-owned list, so you can add paging volumes without an IPL if the need arises.
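The sizing arithmetic in guideline 2 can be sketched as below. The function names and the sample workload numbers are invented for illustration. We assume 180 4 KB slots per 3390 cylinder, that cylinder 0 is not used for paging, and that a 3390-3 has 3339 cylinders; treat the result as a rough starting point only.

```python
import math

SLOTS_PER_CYLINDER = 180   # assumption: 4 KB slots per 3390 cylinder
PAGE_BYTES = 4096


def page_space_needed_gb(guests_gb, vdisks_gb, dataspaces_gb, partition_gb):
    """Worst-case page space: overcommitted storage, doubled (50% full rule)."""
    overcommit = guests_gb + vdisks_gb + dataspaces_gb - partition_gb
    return max(overcommit, 0) * 2


def volumes_needed(space_gb, cylinders_per_volume):
    """Number of equal-sized paging volumes needed to hold space_gb."""
    usable_cylinders = cylinders_per_volume - 1   # paging uses cylinders 1 to END
    volume_gb = usable_cylinders * SLOTS_PER_CYLINDER * PAGE_BYTES / 2**30
    return math.ceil(space_gb / volume_gb)


# Hypothetical workload: 160 GB of guests, 8 GB of VDISKs,
# 2 GB of shared data spaces, in a 64 GB partition.
space = page_space_needed_gb(160, 8, 2, 64)
print(space, "GB of page space")
print(volumes_needed(space, cylinders_per_volume=3339), "volumes of 3390-3 size")
```

Many small volumes, as this example produces, also fit guideline 3: more volumes give z/VM more opportunity for concurrent paging I/O.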

Inspecting and Tuning Paging Health

To determine whether the paging system per se is configured and operating OK, examine the following Perfkit reports and fields.

FCX225 SYSSUMLG and FCX143 PAGELOG tell us the balance between XSTORE paging and DASD paging. We usually consider the system to have enough XSTORE assigned to paging if the PGIN+PGOUT rate is greater than or equal to the DASD read+write rate. If the XSTORE rate is too low, then add XSTORE to the partition, or use the CP SET MDCACHE command so that Minidisk Cache uses less XSTORE. You can examine FCX103 STORAGE to see how much XSTORE Minidisk Cache is using. This XSTORE tuning tip is more precise than the "25%, up to 2 to 4 GB" XSTORE rule of thumb we gave earlier.
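One way to apply this rule interval by interval is sketched below. The function name and the sample rates are made up for illustration; in practice the two rates would come from the PGIN+PGOUT and Read+Write columns of SYSSUMLG.

```python
def xstore_shortfalls(intervals):
    """Return the end times at which DASD paging outran XSTORE paging.

    intervals: list of (end_time, pgin_pgout_per_sec, read_write_per_sec).
    """
    return [end_time
            for end_time, xstore_rate, dasd_rate in intervals
            if xstore_rate < dasd_rate]


# Hypothetical rates, not taken from the excerpts in this article.
sample = [
    ("14:27:05", 120.0, 40.0),
    ("14:28:05", 35.0, 90.0),
    ("14:29:05", 0.0, 0.0),
]
flagged = xstore_shortfalls(sample)
print(flagged)
```

Intervals that show up in the result are candidates for adding XSTORE or for shrinking Minidisk Cache's share of it.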

FCX109 DEVICE CPOWNED tells us lots of things about the health of the paging system. Things FCX109 reveals are:

  • In the report's upper text, the caption Page slot utilization tells us how full the paging system is altogether. We want this number to be 50% or less. If it's too large, add paging volumes or reduce the workload's memory requirement.

  • In the Area Extent column, the report tells us how much paging space is allocated on each volume. The entry is either a cylinder start and end, or it is a number of 4 KB slots. To convert cylinder start and end to a number of slots, calculate s = (end - start + 1) * 180. We want all volumes to be the same size.

  • In the Used % column, the report tells us how full each volume is separately. We want each volume to be about the same percent full. If you have sized each volume the same, this will take care of itself.

  • In the Serv Time /Page column, the report tells us how long on average it takes to move a page on or off the volume, once the transfer actually begins. We want this to be less than 1.0 msec. If the value is higher, we need to do DASD tuning, which we'll describe shortly.

  • In the MLOAD Resp Time column, the report tells us how long on average it takes to move a page on or off the volume, including time the paging request waits in line to get access to the volume. We want this also to be less than 1.0 msec. If the value is higher but Serv Time /Page is OK, spread the work across more volumes. Otherwise do DASD tuning, which we'll describe shortly.

  • In the Queue Lngth column, the report tells us whether paging operations are queueing at the volume. Queue formation at paging volumes is a very bad thing. If we see this value nonzero, we need either to add volumes or to do DASD tuning, both of which we'll describe shortly. Note nonzero queue lengths are of course the cause of elevated MLOAD.

Keep in mind that the values in FCX109 are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Use FCX109 INTERIM DEVICE CPOWNED for interval-by-interval studies of paging health.
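The per-volume checks above can be collected into a short sketch. The function names and message wording are invented for illustration; the thresholds (1.0 msec, zero queue length) and the slot formula are the ones given in the list.

```python
SLOTS_PER_CYLINDER = 180


def extent_to_slots(start_cyl, end_cyl):
    """Convert an Area Extent cylinder range to a number of 4 KB slots."""
    return (end_cyl - start_cyl + 1) * SLOTS_PER_CYLINDER


def check_paging_volume(serv_ms, mload_ms, queue_len):
    """Return a list of findings for one paging volume."""
    findings = []
    if serv_ms >= 1.0:
        findings.append("service time high: do DASD tuning")
    elif mload_ms >= 1.0:
        findings.append("MLOAD high, service time OK: spread work across more volumes")
    if queue_len > 0:
        findings.append("queue forming: add volumes or do DASD tuning")
    return findings


print(extent_to_slots(1, 3338))
print(check_paging_volume(serv_ms=0.6, mload_ms=1.8, queue_len=0.4))
```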

FCX103 STORAGE tells us the system's overall block-paging factor for reads and for writes. Generally we want both factors to be greater than or equal to 10. If either is too small, the usual cause is that paging space is too full. Refer to FCX109 DEVICE CPOWNED to see how full the paging space is altogether. You probably need to add volumes.

Inspecting and Tuning DASD, In General

The z/VM system will be able to page well only if its paging volumes are performing correctly. Here are some Perfkit reports you can examine and some tuning actions you can take.

FCX131 DEVCONF tells us which chpids are servicing your paging volumes. Determining whether your paging system has enough channels and appropriate channel technology starts by knowing what the paging channels are. Use the information in this report to figure out how many chpids lead to each DASD subsystem, and what the chpid numbers are. If you don't already have one, sketch yourself a diagram of your paging DASD configuration, and keep the diagram handy.

FCX161 LCHANNEL tells us two interesting things relative to paging I/O:

  • The Descr column tells us what technology is in use. If you are using ESCON for paging, consider changing to FICON.

  • The Channel %Busy Distribution histogram tells us, by chpid, how CPU-busy the microprocessor is on the chpid's adapter card. Notice the report shows us a histogram of the distribution of each adapter's CPU-busy. The column headings are percent-busy range bands, and the entries in the columns show us what fraction of samples showed the card in said band. This lets us see not only the average busy but also how variable the CPU-busy value is. Here's an excerpt, showing the column headings.

    1FCX161  Run 2009/11/04 09:12:16         LCHANNEL
                                             Channel Load and Channel Busy Distribution
    From 2009/10/29 14:26:05
    To   2009/10/29 15:25:05
    For   3540 Secs 00:59:00                 Result of xxxxxxx Run
    _____________________________________________________________________________________________

    CHPID Chan-Group       <%Busy>  <----- Channel %Busy Distribution 14:26:05-15:25:05 ------>
    (Hex) Descr Qual Shrd  Cur Ave  0-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100

    The adapter cards can usually carry heavy paging data rates without experiencing unduly high CPU-busy values. What tends to drive up adapter CPU-busy is high rates of very small I/Os, which usually comes from applications' I/O, not paging I/O. If you see high CPU-busy on paging chpids, determine whether application or guest I/O habits are the cause. Consider separating application I/O and paging I/O from one another.

FCX215 FCHANNEL tells us what the data rates are on each chpid. The Write/s and Read/s columns tell the story. What we are looking for here is whether the fiber's data rate is approaching the fiber speed. To determine whether this is happening, first determine what kind of channel card is involved. An ESCON chpid moves about 17 MB/sec, and FICON cards move 1, 2, 4, or 8 Gb/sec, depending on their generation. To compare the FCX215 byte rates to a fiber speed expressed in bits per second, use a conversion factor of about 9 bits per byte. While this is not exact, it will give you a rough estimate of whether the fiber is full. Remember the values are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner.
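Here is the rough conversion as a small sketch. The function name and the sample data rate are invented for illustration, and we assume a link speed quoted in Gb/sec means 10^9 bits per second. The factor of 9 bits per byte is the approximate one suggested above, so the result is an estimate only.

```python
BITS_PER_BYTE_ON_FIBER = 9  # rough factor from the text, not an exact figure


def fiber_utilization(bytes_per_sec, link_gbit_per_sec):
    """Estimate the fraction of fiber speed consumed by a byte rate."""
    link_bits_per_sec = link_gbit_per_sec * 1e9
    return bytes_per_sec * BITS_PER_BYTE_ON_FIBER / link_bits_per_sec


# Hypothetical example: 150 million bytes/sec on a 2 Gb/sec FICON link.
util = fiber_utilization(150e6, 2)
print(f"{util:.1%} of fiber speed")
```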

FCX232 IOPROCLG tells us whether your System z's I/O subsystem (aka channel subsystem) is encountering busy conditions as it tries to start I/Os. The values are reported as busy conditions encountered per SSCH attempted, so of course "zero" is the optimal answer for these values. The specific kinds of busy conditions reported are these:

  • Channel busies generally mean you have not provided enough channels to the DASD subsystem. Add some.
  • Switch busies mean there is a congestion problem in your switch. Work with your switch provider.
  • Control unit busies generally mean the CPU in the DASD subsystem is too busy. Work with your DASD provider or spread activity across more DASD controllers.
  • Device busies generally mean that for a given volume, I/O from one partition is blocking I/O from another partition. We would never expect to see this for paging volumes.

Older versions of Perfkit have a significant defect in the FCX232 IOPROCLG report that is worth mentioning. If your column headings look like this, you are running a defective Perfkit:

Interval   Proc <-Activity/Sec-->  Proc  <-- I/O Path Percent Busy --->
End Time Number Beg_SSCH  I/O_Int %Busy Channel  Switch      CU  Device

The numbers in the last four columns are in fact busies encountered per SSCH performed, but the displayed values are a factor of 100 too large. Divide them all by 100 and then proceed.

If your column headings look like this, you are running a corrected Perfkit:

Interval   Proc <-Activity/Sec-->  Proc  <- Busy conditions per SSCH ->
End Time Number Beg_SSCH  I/O_Int %Busy Channel  Switch      CU  Device

If this is your situation, the last four columns' numbers are correct.
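If you post-process IOPROCLG output with a script, the correction can be applied automatically by looking at the column heading. This is only a sketch: the function name is invented, and it assumes the heading text appears in your report exactly as shown above.

```python
DEFECTIVE_HEADING = "I/O Path Percent Busy"


def busies_per_ssch(heading_line, channel, switch, cu, device):
    """Return the four busy values as busies per SSCH.

    Values under the defective heading are 100 times too large,
    so they are divided by 100; otherwise they are returned unchanged.
    """
    scale = 100.0 if DEFECTIVE_HEADING in heading_line else 1.0
    return tuple(value / scale for value in (channel, switch, cu, device))


old_heading = "Interval Proc <-Activity/Sec--> Proc <-- I/O Path Percent Busy --->"
new_heading = "Interval Proc <-Activity/Sec--> Proc <- Busy conditions per SSCH ->"
print(busies_per_ssch(old_heading, 12.0, 0.0, 3.0, 0.0))
print(busies_per_ssch(new_heading, 0.12, 0.0, 0.03, 0.0))
```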

FCX108 DEVICE reports three principal measures of performance for individual disk volumes. The measures appear in the report on a per-volume basis. The measures, their explanations, and their remediation strategies are:

  • Pending time per I/O is the time it takes the System z channel subsystem to find a chpid to use for the I/O, plus the amount of time it takes the DASD subsystem to return an initial response indicating that it has received the channel program. If you see more than 0.1 msec of pending time, you might have one or more of these problems:
    • The channels leading to the DASD subsystem are too busy. FCX232 IOPROCLG should confirm this in channel busies. Add chpids or spread the paging volumes among more controllers.
    • The CPU in the DASD subsystem is too busy. Unfortunately we have no way to measure controller CPU-busy, but FCX232 IOPROCLG should confirm this in control-unit busies. Spread the paging volumes among more controllers.

  • Disconnect time per I/O is a measure of controller duress. The controller disconnects when it cannot immediately satisfy the I/O by using its cache. If you see more than 1 msec of disconnect time, you might have one or more of these problems:
    • Controller volatile cache is overloaded or inadvertently off or otherwise ineffective.
    • Controller NVS is overloaded or inadvertently off or otherwise ineffective.
    Perfkit's FCX176 CTLUNIT and FCX177 CACHEXT reports can help narrow these down. These two reports are especially good for seeing the read-write distribution of the I/Os and the controller cache hit rates, both by-volume and by-I/O-type. If you have problems with controller cache, your mitigations will be to spread volumes out, or to add controller cache, or to contact your DASD vendor for help.

  • Connect time per I/O is the amount of time an I/O actually uses the fiber. It is measured from the beginning of the first FICON packet to the end of the last FICON packet. In usual situations, one has very little influence over connect time. However, in very highly loaded FICON situations, connect time can elongate because of excessive I/O concurrency (aka excessive concurrent open exchanges). Because Perfkit doesn't report open exchange level, there is not much we can really do to measure it. However, if there is an excessive concurrent open exchange problem, there's probably also a pending-time or an IOPROCLG channel-busies problem, so apply those remediations.

Note service time per I/O is just the sum of pending time, disconnect time, and connect time.
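The sum and the two thresholds named above can be sketched like this. The function name and message wording are invented for illustration, and the sample times are made up.

```python
def assess_io(pend_ms, disc_ms, conn_ms):
    """Return (service_time_ms, findings) for one volume's per-I/O times."""
    service_ms = pend_ms + disc_ms + conn_ms
    findings = []
    if pend_ms > 0.1:
        findings.append("pending time high: check channel and control-unit busies")
    if disc_ms > 1.0:
        findings.append("disconnect time high: check controller cache and NVS")
    return service_ms, findings


# Hypothetical per-I/O times in milliseconds.
service, findings = assess_io(pend_ms=0.3, disc_ms=0.2, conn_ms=0.5)
print(round(service, 3), findings)
```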

FCX108 also reports several additional measures that are of little to no utility for studying paging but which are very important for understanding DASD performance in general. Though this article is really about paging, for completeness we'll go ahead here and describe these additional important measures:

  • Avoid is the number of I/Os per second that were avoided for the volume because z/VM got a hit on MDC. To determine whether the value is "correct" for your situation, consider these factors:
    • If you have enabled MDC for the volume but the hit rate is low, check FCX177 CACHEXT to see whether the volume's I/Os are mostly reads, and also check FCX138 MDCACHE and FCX178 MDCSTOR to see whether MDC is operating as intended.
    • If you have disabled MDC for the volume, check FCX177 CACHEXT to see whether the volume's operations are mostly reads. If so, and if the volume is not a dedicated volume, consider enabling MDC for it.

  • Req. Qued reports the depth of the I/O wait queue at the real DASD volume. In general, if we see wait queues forming at a real DASD volume, it means the device's service time is excessively large or the volume is just plain overworked. If the components of service time seem out of range, remediate as described above. Otherwise, reorganize data to relieve stress on the volume. It is also worth checking whether MDC is functioning as intended for the volume; moreover, if FCX177 CACHEXT reports the volume's I/Os are mostly reads but you have disabled MDC for the volume, reconsider your configuration.

    For a paging volume, we will never see queued requests in FCX108 because z/VM queues paging I/O differently than it queues general-purpose DASD I/O. Keep in mind that FCX109 DEVICE CPOWNED reports the depth of the paging operation queue for each paging volume.

  • Resp reports the average response time for I/O requests to the volume. Response time is service time plus time spent waiting in the volume's I/O queue. Because z/VM does not queue paging I/O in the manner it queues general-purpose I/O, for paging volumes we will always see response time equal to service time. For general-purpose DASD volumes, if we see response time exceeding service time, it means there is an I/O wait queue on average, and we'll see the queue in the Req. Qued column. When we see a queue, remediate as just described above.
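The relationship between response time and service time can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a Perfkit interface: the function names are hypothetical, the inputs are the Resp and service time values (in msec) read from an FCX108 row, and the tolerance parameter is an assumed allowance for rounding in the printed report.

```python
def queue_wait_msec(resp, serv):
    """Average time per I/O spent in the volume's wait queue.

    Response time is service time plus queue wait, so the wait is the
    difference. Clamp at zero in case rounding makes resp slightly smaller.
    """
    return max(resp - serv, 0.0)


def has_wait_queue(resp, serv, tolerance=0.05):
    """True when response time exceeds service time by more than the tolerance.

    The tolerance is an assumption to absorb rounding in report values.
    """
    return queue_wait_msec(resp, serv) > tolerance


# Hypothetical general-purpose volume: response time exceeds service time.
print(queue_wait_msec(6.5, 4.0), has_wait_queue(6.5, 4.0))

# Hypothetical paging volume: response time equals service time.
print(queue_wait_msec(4.0, 4.0), has_wait_queue(4.0, 4.0))
```

For a paging volume this check should always come back false, because paging I/O is not queued in the way FCX108 measures; the paging queue depth is found in FCX109 DEVICE CPOWNED instead.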

Keep in mind that all of the values in FCX108 are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Use FCX108 INTERIM DEVICE for interval-by-interval studies of the system's DASD health. To study the interval-by-interval behavior of a specific DASD volume, use FCX168 DEVLOG.

FCX176 CTLUNIT and FCX177 CACHEXT report on DASD I/O performance data harvested from the DASD controller. These reports can provide useful information if you are looking for causes of poor storage subsystem performance, such as insufficient cache or insufficient NVS. These reports are best interpreted in consultation with your storage subsystem provider.

Summary

Because the z/VM value proposition pays off substantially only in the face of successful resource overcommitment, it's important for us to understand how z/VM paging works, how to measure it, and how to tune it.

The z/VM paging system does its work by implementing well-defined page motion, page marking, and page selection schemes. Pages move among central storage, expanded storage, and DASD according to demands for central and according to their use patterns and ages. Pages to be ejected from central are chosen in a way that tries to reduce impact on running users.

z/VM Performance Toolkit reports on storage consumption by guests, by VDISKs, by shared data spaces, by minidisk cache, and by the Control Program itself. After we understand where storage is being used, we can tune storage use, by adjusting guests or by adjusting the Control Program.

Perfkit also reports on paging rates, both overall and broken down by user, by VDISK, and so on. It also reports on whether users are being significantly held back by paging I/O or by storage constraints.

There are many steps one can take to configure a z/VM system so that it will page well. Key steps are to make sure that there is enough paging space, that it is spread over enough volumes, that the volumes are used only for paging, that the volumes are placed intelligently in DASD subsystems, and that the DASD subsystems have enough chpids.

Perfkit tells us whether the paging system is healthy: whether XSTORE is bearing enough of the load, whether the paging DASD is too full, or whether the Control Program is experiencing undue delays moving pages on and off of paging volumes.

Perfkit also tells us whether the paging DASD themselves are healthy. Key metrics here are pending time, disconnect time, and connect time. When one or more of these metrics is out of bounds, reallocating or redistributing paging volumes or paging chpids can usually solve the problem.