Understanding and Tuning z/VM Paging
(Last revised: 2022-04-11, BKW)
A key element of the z/VM value proposition is that we often get the best TCO result when we overcommit various physical resources, such as CPUs or memory.
When it's memory (aka storage) we wish to overcommit, the inevitable condition is that the z/VM system will page.
Thus it's in our best interest to understand z/VM's paging behaviors very well, including understanding how to measure z/VM paging, how to measure the influence of paging on workload behavior, and how to configure and tune the paging system for best performance.
In this article we will explore z/VM paging, including how it works, how to measure the paging system's behavior, how to assess the measurements for suitability, and how to repair or tune the paging system when its behavior seems unsuitable.
We assume here that the reader has a basic understanding of the notions of virtual memory and of paging as a means to overcommit storage.
A Few Reminders about Performance Measurements
Whenever we talk about performance, and especially when we discuss whether a given system's performance is suitable, remember that we must enter the discussion with well-formed ideas about what the word performance means to us, and what the words good or suitable mean.
Throughout the rest of this article, our tacit assumption is that for your workload and environment, you have settled on what specific phenomena you wish to measure to come to your practical, workable assessment of "performance".
We also assume that you have developed some success or suitability thresholds for the phenomena you've chosen to measure, so as to differentiate between "meets criteria" and "needs improvement".
Moreover, to apply this article's tips and techniques, you will need to understand the relationship between your chosen success metrics and the z/VM system's tendency to page.
If you've accomplished all of that, great, read on. If not, you will want to do some work along those lines before proceeding.
A Basic Look at z/VM Paging
On a z/VM system, pages are kept in two different places: central storage and disk.
Of course, z/VM's overall objective is to keep the "hot" pages -- that is, the ones that are often used -- in central and the "cool" pages on DASD. In this way, we help minimize the effect of paging on application performance.
To accomplish this, z/VM brings together the following algorithmic elements:
- z/VM has a well-defined page circulation or page motion scheme. In other words, pages move between the two residency points according to specific flows.
- Further, z/VM has a well-defined page marking and aging scheme. In other words, z/VM tracks how each page is touched and, in a large sense, how long ago the touches occurred.
- Finally, z/VM has a well-defined page selection scheme. When pages must circulate, z/VM inspects the pages' markings to decide which pages to shuffle.
First let's talk about page motion. In general, pages move between central and DASD in a definite prescribed motion pattern, like this:
- Pages belonging to users and VDISKs reside in lists called user-frame-owned (UFO) lists. Very roughly speaking, the pages on UFOs are kept in touch order, most recently touched at the top.
- Every once in a while, the UFOs get visited, and the page table entries (PTEs) for some of the pages on the bottoms of the UFOs get marked as invalid. This is called trial invalidation. The longer a PTE stays marked invalid, the more chance that page has of getting paged out.
- Also every once in a while, invalid pages get moved from the bottoms of UFOs to the top of a system-wide list called the global aging list (GAL).
- The GAL is limited in size, so you can imagine that on some stimulus, pages get taken from the bottom of the GAL, written if needed, and their frames reclaimed by putting them onto free-frame lists.
- As pages sink toward the bottom of the GAL, they encounter the prewrite line. When they arrive at the prewrite line, they get written, but their frame is not necessarily reclaimed yet.
- Any frame on the GAL below the prewrite line can be reclaimed, because the page it is holding has already been written.
- At any time while the page is still in memory, whether invalid on a UFO or invalid on the GAL, if the guest touches the page, that marks it as valid and the page goes back to the UFO. It will eventually get put onto the top of the UFO, but not instantly.
- If a guest touches a page that is not in memory, a frame is allocated, the page is read from DASD, and the page is inserted into the UFO to which it belongs.
For more information about the above, refer to Storage Management Scaling Improvements.
Page Marking and Aging
Now let's talk about page marking. This is all done by marking PTEs invalid. In that sense, the mark itself does not contain an age stamp. Rather, the position of the page in its list (UFO, GAL) expresses in a large sense how long ago the page was last made valid. By the time a page sinks to the bottom of the GAL, it is the best candidate available for reclaim.
The selecting of pages for reclaim is entirely based on the notion of the GAL. Frames are reclaimed from the bottom of the GAL.
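To make the motion pattern concrete, here is a toy model of the UFO-to-GAL-to-reclaim flow. It is only a sketch under loud assumptions: the class and method names are ours, there is a single UFO, the prewrite line is left out, and a touched page jumps to the top of the UFO immediately, whereas the real system gets it there eventually. It is not how CP is implemented.

```python
from collections import deque


class ToyPager:
    """Toy model of the UFO -> GAL -> reclaim flow; not CP's real algorithm."""

    def __init__(self, gal_limit):
        self.ufo = deque()      # left end = top (most recently touched)
        self.gal = deque()      # left end = top, right end = bottom
        self.invalid = set()    # pages whose PTE is marked invalid
        self.on_dasd = set()    # pages that no longer hold a frame
        self.gal_limit = gal_limit

    def touch(self, page):
        """The guest references a page."""
        if page in self.gal:        # still in memory on the GAL: revalidate
            self.gal.remove(page)
        elif page in self.ufo:      # still in memory on the UFO
            self.ufo.remove(page)
        else:                       # first touch, or page-in from DASD
            self.on_dasd.discard(page)
        self.invalid.discard(page)
        self.ufo.appendleft(page)

    def trial_invalidate(self, count):
        """Mark the PTEs of the bottom `count` UFO pages invalid."""
        if count <= 0:
            return
        for page in list(self.ufo)[-count:]:
            self.invalid.add(page)

    def demote(self):
        """Move invalid pages from the UFO bottom to the GAL top, then trim."""
        while self.ufo and self.ufo[-1] in self.invalid:
            self.gal.appendleft(self.ufo.pop())
        while len(self.gal) > self.gal_limit:
            page = self.gal.pop()   # reclaim from the bottom of the GAL
            self.invalid.discard(page)
            self.on_dasd.add(page)
```

Driving the model with a few touches, a trial invalidation, and a demotion shows the key property: a page that is touched again while still invalid on the UFO or the GAL escapes page-out, while an untouched page sinks until its frame is reclaimed.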
Determining Storage Consumption
Tuning and measuring the z/VM paging system is relevant only after we understand the storage demands of the workload. These z/VM Performance Toolkit (herein, Perfkit) reports help you calculate how much storage your workload is using.
FCX113 UPAGE tells us how many pages each guest seems to need, and where those pages are residing. The information we seek is on the right side of the report. Because the UPAGE report is so wide, in the excerpt below we've cropped out many of the middle columns. We've indicated the cropping by '...' in the excerpt.
The >System< row gives us the residency statistics for the average user. If we multiply its values by the Nr of Users value at the right, we can calculate how many user pages are in these various places on average.
Remember that all of these values are average values over the time interval of the report. The time interval of the report is shown in the report's upper left-hand corner.
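As a small worked example of the multiplication just described, the sketch below scales a >System< per-user average up to a system total and converts it to GiB. The function names and the sample numbers are ours, chosen only for illustration; the 4 KB page size is the standard z/VM page size.

```python
PAGE_BYTES = 4096  # z/VM pages are 4 KB


def total_pages(avg_pages_per_user, nr_of_users):
    """Scale the >System< per-user average up to a system-wide total."""
    return avg_pages_per_user * nr_of_users


def pages_to_gib(pages):
    """Convert a page count to GiB."""
    return pages * PAGE_BYTES / 2**30


# Hypothetical values: 262144 resident pages per average user, 40 users.
resident = total_pages(262144, 40)
print(resident, pages_to_gib(resident))  # 10485760 40.0
```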
FCX292 UPGUTL gives information similar to that found on FCX113 UPAGE.
FCX147 VDISKS is very much like FCX113 UPAGE, except the VDISKS report describes the residency distribution for virtual disks in storage, aka VDISKs. Look at the columns on the right side of the report, under the Nr of Pages heading. These describe resident pages, XSTORE pages (now always zero), and DASD pages. Also like UPAGE, the >System< row tells us the distribution for the average VDISK. To know how many VDISKs were involved in the calculation of the average, count the individual rows in the report. Use the count and the >System< values to calculate the total storage your VDISKs are using.
FCX134 DSPACESH is very much like FCX147 VDISKS, except it describes the residency distribution for shared data spaces in general. Look on the right side of the report to find the residency counts. Some notes on using this report correctly:
- Remember that VDISKs are themselves shared data spaces, so if you counted their consumption by using FCX147 VDISKS, don't double-count them by picking up their numbers again from FCX134.
- Minidisk Cache (MDC) is implemented as shared data spaces too, but MDC's storage use is NOT expressed in FCX134. We have to look at FCX178 MDCSTOR to see that.
FCX178 MDCSTOR tells us how much storage Minidisk Cache is using, both in central and in XSTORE (now deprecated). The columns labelled Actual are the ones we need to inspect. Notice there is a >>Mean>> row that is the average value over the time interval of the report, and then there is a separate row for each reported-on time interval.
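Putting the reports together, a tally along the following lines keeps VDISKs from being counted twice. This is a sketch only: the function name is ours, the is_vdisk flag is our own annotation on the FCX134 rows rather than a Perfkit column, and we assume the MDC figure has already been converted to pages.

```python
def storage_tally(user_pages, vdisk_avg_pages, vdisk_count, dspace_rows, mdc_pages):
    """Add up average page counts gathered from several Perfkit reports.

    dspace_rows is a list of (pages, is_vdisk) tuples taken from FCX134.
    Rows flagged is_vdisk are skipped, because FCX147 already counted them.
    """
    vdisk_pages = vdisk_avg_pages * vdisk_count
    other_pages = sum(pages for pages, is_vdisk in dspace_rows if not is_vdisk)
    return {
        "users": user_pages,
        "vdisks": vdisk_pages,
        "other_data_spaces": other_pages,
        "mdc": mdc_pages,
        "total": user_pages + vdisk_pages + other_pages + mdc_pages,
    }
```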
FCX253 STORLOG gives general information about how CP believes it is using storage.
FCX254 AVAILLOG tells how many storage frames are available in various places. The Times empty columns give us information about whether CP is routinely finding itself out of storage on various free-storage lists. Excessive values here indicate that the system is generally short on storage.
FCX294 AVLB2GLG and FCX295 AVLA2GLG give available-frames information for the below-2 GB and above-2 GB real memory respectively. Be careful here, for the units on the amounts are bytes (e.g., 17M represents 17 MB).
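Because these two reports speak in bytes while most of the others speak in pages or frames, a small converter can help keep the arithmetic straight. The sketch below assumes that the K, M, and G suffixes are binary multiples and that a frame is 4 KB; the function name is ours.

```python
FRAME_BYTES = 4096
_MULTIPLIERS = {"K": 2**10, "M": 2**20, "G": 2**30}  # assumed binary multiples


def amount_to_frames(text):
    """Convert a byte amount such as '17M' to a count of 4 KB frames."""
    text = text.strip().upper()
    if text and text[-1] in _MULTIPLIERS:
        size = int(float(text[:-1]) * _MULTIPLIERS[text[-1]])
    else:
        size = int(text)
    return size // FRAME_BYTES


print(amount_to_frames("17M"))  # 4352
```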
Using Less Storage
If after examining your storage consumption you decide to try to use less, here are some things you can try.
For Linux users, try:
- Size the heap correctly for the workload. Consult your application provider for guidance.
- Trim each guest's real storage size until the guest starts barely swapping. Each guest should be provided a hierarchy of swap devices with a VDISK as the first device. You can see the VDISK I/O rates on the FCX147 VDISKS report.
- Put the Linux kernel into a segment. This lets all Linux guests share a single copy of the kernel.
- Use the XIP file system. The storage benefits of using XIP are documented here.
- Use VM Resource Manager's Cooperative Memory Management feature to encourage Linux guests to give up storage they might be using unnecessarily.
For CMS users, try:
- IPL from a segment.
- Put applications into segments, and use the CMS SEGMENT support.
- If you are using SFS, try using DIRCONTROL directories in data spaces.
For minidisk cache, try:
- Use CP SET MDCACHE to change the amount of storage MDC will use in central or in XSTORE (now deprecated).
- Use CP SET MDCACHE or MINIOPT NOMDC to control which real volumes or minidisks are cached in MDC. Turn off MDC for disks that aren't almost all reads.
For CP storage, try:
- Using dedicated OSAs for Linux leads to excess guest pages being locked into real storage. Consider VSWITCH devices instead.
- Having excess real devices in the configuration leads to excessive consumption of CP free storage. In SYSTEM CONFIG, mark unused devices as offline at IPL.
Determining Paging Rates
Several Perfkit reports comment on paging rates.
FCX225 SYSSUMLG contains two columns that tell us the system's basic paging rates. The PGIN+PGOUT column tells the paging rate to or from XSTORE (deprecated). The Read+Write column tells the paging rate to or from DASD. FCX225 SYSSUMLG is a handy report because it comments on many diverse performance metrics all at once. Here's an excerpt.
FCX143 PAGELOG is probably one of the more handy reports for systematic studies of paging behavior. Interval by interval, PAGELOG comments on PGINs, PGOUTs, migrations, reads, and writes. It also alerts us to single-page reads and writes. The left half is all about XSTORE, long deprecated. Let's focus on the right half, which discusses DASD. The '...' placeholders mark where we deleted columns from the report. Notice the report tabulates reads and writes, and also tabulates single-page ops, which can be expensive.
FCX113 UPAGE gives a lot of information about paging, broken out by user. The left half of UPAGE comments on users' paging activity. The values are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Here's an excerpt.
For interval-by-interval studies, FCX113 INTERIM UPAGE is a handy tool. To study a specific user's paging experience over time, consider using FCX163 UPAGELOG.
FCX290 UPGACT is similar to FCX113 but gives more details. UPGACT discusses instantiations and releases, invalidations and revalidations, page reads and writes, and other stats.
FCX147 VDISKS gives paging rates, by VDISK. Again, the values are averages over the report's time interval.
FCX134 DSPACESH gives paging rates, by data space. Again, these are averages over the report's time interval. Remember that only shared data spaces appear in this report.
Effect of Paging on The Workload
FCX114 USTAT reveals the effect of paging on users' ability to run. For each user, %PGW is the percent of user state samples revealing the user is in a page-fault wait. %PGA is the percent of samples revealing the user has loaded a wait-state PSW while waiting for a page-fault operation to complete. Generally, excessive values in these percentages will correlate to high per-user paging rates as reported in FCX113 UPAGE.
FCX163 UPAGELOG, mentioned earlier, chronicles one user's paging experience, interval by interval. This report is useful in studying the paging behavior of a specific user.
If it is necessary to afford storage preference to certain users, the following method can be used:
- Use the CP SET RESERVED command to guarantee a given user a certain minimum number of real storage frames, unless the user has fallen inactive.
Here are some guidelines for how to set up a z/VM paging system to give it every advantage in supporting your workload.
- Use enough paging volumes so that the percent of slots used is at a comfortable value for you. The objective here is to avoid abending for running out of paging space without grossly overconfiguring. If your paging load doesn't vary much, you might be comfortable at some higher percent-used value, maybe 75%. But if your paging load has a high variance you might need to run at some lower average to leave room for peaks. Keep an eye on FCX109 DEVICE DASD. That report's prologue tells us how full paging is. Be prepared to add paging volumes as needed.
- Remember that paging well is all about being able to run more than one paging I/O at a time. This means you should spread your paging space over as many volumes as possible. Get yourself lots of little paging volumes, instead of one or two big ones. The more paging volumes you provide, the more paging I/Os z/VM can run concurrently.
- Make all of your volumes the same size. Use all 3390-3s, or 3390-9s, or whatever. When the volumes are unequally sized, the smaller ones fill and thereby become ineligible as targets for page-outs, thus restricting z/VM's opportunity for paging I/O concurrency.
- A disk volume should be either all paging (cylinders 1 to END) or no paging at all. Never allocate paging space on a volume that also holds other kinds of data, such as spool space or user minidisks.
- Think carefully about which of your DASD subsystems you choose for paging. Maybe you have DASD controllers of vastly different speeds, or cache sizes, or existing loads. When you decide where to place paging volumes, take the DASD subsystems' capabilities and existing loads into account.
- Within a given DASD controller, volume performance is generally sensitive to how the volumes are placed. Work with your DASD people to avoid poor volume placement, such as putting all of your paging volumes into one rank.
- If you are paging to ECKD, use the best FICON adapters you have. With each generation comes advances in speed and ability to overlap I/Os. This advice extends to the GBICs you use. A recent FICON card does no good with a slow GBIC.
- If you can, run multiple chpids to each DASD controller that holds paging volumes. Consider two, or four, or eight chpids per controller.
- Consider using z/HPF (High Performance FICON) I/O for paging volumes. For more information, see z/VM Paging Improvements.
- Consider using HyperPAV aliases for paging volumes. For more information, see z/VM Paging Improvements.
- If you have FCP chpids and SCSI DASD controllers, you might consider exploiting them for paging. A SCSI LUN defined to the z/VM system as an EDEV and ATTACHed to SYSTEM for paging has the very nice property that the z/VM Control Program can overlap paging I/Os to it. This lets you achieve paging I/O concurrency without needing multiple volumes. However, don't run this configuration if you are CPU-constrained. It takes more CPU cycles per I/O to do EDEV I/O than it does to do classic ECKD I/O.
- Make sure you run with a few reserved slots in the CP-owned list, so you can add paging volumes without an IPL if the need arises.
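The first guideline lends itself to a quick calculation. The sketch below estimates how many equally sized paging volumes are needed to hold a peak paging load at a chosen percent-full target. The function name and sample numbers are ours. We assume 180 4 KB slots per 3390 cylinder, that cylinder 0 is not used for paging (cylinders 1 to END), and that a 3390-3 has 3339 cylinders.

```python
import math

SLOTS_PER_CYL = 180  # 4 KB slots per 3390 cylinder


def volumes_needed(peak_slots_in_use, cylinders_per_volume, target_fraction):
    """Volumes required so the peak load fills at most target_fraction of slots."""
    usable_slots = (cylinders_per_volume - 1) * SLOTS_PER_CYL  # skip cylinder 0
    return math.ceil(peak_slots_in_use / (usable_slots * target_fraction))


# Hypothetical peak of 3,000,000 slots on 3390-3 volumes, aiming for 50% full.
print(volumes_needed(3_000_000, 3339, 0.50))  # 10
```

Capacity is only half of the story. The concurrency guidelines above may well call for more volumes than this calculation does.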
Inspecting and Tuning Paging Health
To determine whether the paging system per se is configured and operating OK, examine the following Perfkit reports and fields.
FCX109 DEVICE CPOWNED tells us lots of things about the health of the paging system. Things FCX109 reveals are:
- In the report's upper text, the caption Page slot utilization tells us how full the paging system is altogether. Again, run at a percent-full that is comfortable for you and leaves room for the variation you observe in your workload.
- In the Area Extent column, the report tells us how much paging space is allocated on each volume. The entry is either a cylinder start and end, or it is a number of 4 KB slots. To convert cylinder start and end to a number of slots, calculate s = (end - start + 1) * 180. We want all volumes to be the same size.
- In the Used % column, the report tells us how full each volume is separately. We want each volume to be about the same percent full. If you have sized each volume the same, this will take care of itself.
- In the Serv Time /Page column, the report tells us how long on average it takes to move a page on or off the volume, once the transfer actually begins. We want this to be less than 1.0 msec. If the value is higher, we need to do DASD tuning, which we'll describe shortly.
- In the MLOAD Resp Time column, the report tells us how long on average it takes to move a page on or off the volume, including time the paging request waits in line to get access to the volume. We want this also to be less than 1.0 msec. If the value is higher but Serv Time /Page is OK, spread the work across more volumes. Otherwise do DASD tuning, which we'll describe shortly.
- In the Queue Lngth column, the report tells us whether paging operations are queueing at the volume. Queue formation at paging volumes is a very bad thing. If we see this value nonzero, we need either to add volumes or to do DASD tuning, both of which we'll describe shortly. Note nonzero queue lengths are of course the cause of elevated MLOAD.
Keep in mind that the values in FCX109 are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Use FCX109 INTERIM DEVICE CPOWNED for interval-by-interval studies of paging health.
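The checks above can be sketched as a small triage routine. The record layout and names here are ours, not Perfkit's. The thresholds (1.0 msec for both times, zero for queue length) and the 180 slots-per-cylinder factor are the ones given in the list above.

```python
from dataclasses import dataclass


@dataclass
class PagingVolume:
    volser: str
    start_cyl: int
    end_cyl: int
    serv_ms: float    # Serv Time /Page
    mload_ms: float   # MLOAD Resp Time
    queue_len: float  # Queue Lngth


def slots(vol):
    """Convert a cylinder extent to a number of 4 KB slots."""
    return (vol.end_cyl - vol.start_cyl + 1) * 180


def triage(vol, limit_ms=1.0):
    """Return a list of findings for one paging volume; empty means healthy."""
    findings = []
    if vol.serv_ms >= limit_ms:
        findings.append("service time high: tune DASD")
    elif vol.mload_ms >= limit_ms:
        findings.append("response time high but service time OK: add volumes")
    if vol.queue_len > 0:
        findings.append("queue forming: add volumes or tune DASD")
    return findings
```

Comparing slots() across volumes is also a quick way to confirm that all of the paging volumes are the same size.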
FCX103 STORAGE tells us the system's overall block-paging factor for reads and for writes. Generally we want these to be greater than or equal to 10. If either is too small, the usual cause is that paging space is too full. Refer to FCX109 DEVICE CPOWNED to see how full the paging space is altogether. You probably need to add volumes.
Inspecting and Tuning DASD, In General
The z/VM system will be able to page well only if its paging volumes are performing correctly. Here are some Perfkit reports you can examine and some tuning actions you can take.
FCX131 DEVCONF tells us which chpids are servicing your paging volumes. Determining whether your paging system has enough channels and appropriate channel technology starts by knowing what the paging channels are. Use the information in this report to figure out how many chpids lead to each DASD subsystem, and what the chpid numbers are. If you don't already have one, sketch yourself a diagram of your paging DASD configuration, and keep the diagram handy.
FCX161 LCHANNEL tells us two interesting things relative to paging I/O:
- The Descr column tells us what technology is in use. You should see FICON here always.
- The %Busy Distribution histogram tells us, by chpid, how CPU-busy the microprocessor is on the chpid's adapter card. Notice the report shows us a histogram of the distribution of each adapter's CPU-busy. The column headings are percent-busy range bands, and the entries in the columns show us what fraction of samples showed the card in said band. This lets us see not only the average busy but also how variable the CPU-busy value is. Here's an excerpt, showing the column headings.
1FCX161 Run 2009/11/04 09:12:16 LCHANNEL
Channel Load and Channel Busy Distribution
From 2009/10/29 14:26:05
To 2009/10/29 15:25:05
For 3540 Secs 00:59:00 Result of xxxxxxx Run
_____________________________________________________________________________________________
CHPID Chan-Group <%Busy> <----- Channel %Busy Distribution 14:26:05-15:25:05 ------>
(Hex) Descr Qual Shrd Cur Ave 0-10 11-20 21-30 31-40 41-50 51-60 61-70 71-80 81-90 91-100
The adapter cards can usually carry heavy paging data rates without experiencing unduly high CPU-busy values. What tends to drive up adapter CPU-busy is high rates of very small I/Os, which usually comes from applications' I/O, not paging I/O. If you see high CPU-busy on paging chpids, determine whether application or guest I/O habits are the cause. Consider separating application I/O and paging I/O from one another.
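To see why the histogram says more than the average does, consider the sketch below. It assumes each band's entry is the share of samples that fell in that band, and it estimates the mean from band midpoints; the function name and the sample row are ours.

```python
BANDS = [(0, 10), (11, 20), (21, 30), (31, 40), (41, 50),
         (51, 60), (61, 70), (71, 80), (81, 90), (91, 100)]


def summarize_busy(shares):
    """Estimate mean busy and the share of samples above 50% busy.

    shares holds one entry per band, in the order of the report's columns.
    """
    total = sum(shares)
    mean = sum(s * (lo + hi) / 2 for s, (lo, hi) in zip(shares, BANDS)) / total
    high = sum(shares[5:]) / total  # bands 51-60 through 91-100
    return mean, high


# Hypothetical adapter: idle half the time, nearly saturated the other half.
print(summarize_busy([50, 0, 0, 0, 0, 0, 0, 0, 0, 50]))  # (50.25, 0.5)
```

An average near 50% looks comfortable, yet half of the samples found the adapter in the top band. That is the kind of variability the histogram exposes.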
FCX215 FCHANNEL tells us what the data rates are on each chpid. The Write/s and Read/s columns tell the story. What we are looking for here is whether the fiber's data rate is approaching the fiber speed. The fiber speed is on the right side of the report.
FCX232 IOPROCLG tells us whether your System z's I/O subsystem (aka channel subsystem) is encountering busy conditions as it tries to start I/Os. The values are reported as busy conditions encountered per SSCH attempted, so of course "zero" is the optimal answer for these values. The specific kinds of busy conditions reported are these:
- Channel busies generally mean you have not provided enough channels to the DASD subsystem. Add some.
- Switch busies mean there is a congestion problem in your switch. Work with your switch provider.
- Control unit busies generally mean the CPU in the DASD subsystem is too busy. Work with your DASD provider or spread activity across more DASD controllers.
- Device busies generally mean that for a given volume, I/O from one partition is blocking I/O from another partition. We would never expect to see this for paging volumes.
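The four busy conditions and their remedies can be kept in a small lookup table, as in this sketch. The dictionary keys and the function name are ours; the remedies are the ones listed above.

```python
REMEDIES = {
    "channel": "add channels to the DASD subsystem",
    "switch": "work with the switch provider on congestion",
    "control_unit": "work with the DASD provider or spread activity across more controllers",
    "device": "look for I/O from another partition (not expected for paging volumes)",
}


def busy_findings(busies_per_ssch):
    """Return (kind, remedy) pairs for every busy condition with a nonzero rate."""
    return [(kind, REMEDIES[kind])
            for kind, rate in busies_per_ssch.items()
            if rate > 0]
```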
FCX108 DEVICE reports three principal measures of performance for individual disk volumes. The measures appear in the report on a per-volume basis. The measures, their explanations, and their remediation strategies are:
- Pending time per I/O is the time it takes the System z channel subsystem to find a chpid to use for the I/O, plus the amount of time it takes the DASD subsystem to return an initial response indicating that it has received the channel program. If you see more than 0.1 msec of pending time, you might have one or more of these problems:
  - The channels leading to the DASD subsystem are too busy. FCX232 IOPROCLG should confirm this in channel busies. Add chpids or spread the paging volumes among more controllers.
  - The CPU in the DASD subsystem is too busy. Unfortunately we have no way to measure controller CPU-busy, but FCX232 IOPROCLG should confirm this in control-unit busies. Spread the paging volumes among more controllers.
- Disconnect time per I/O is a measure of controller duress. The controller disconnects when it cannot immediately satisfy the I/O by using its cache. If you see more than 1 msec of disconnect time, you might have one or more of these problems:
  - Controller volatile cache is overloaded or inadvertently off or otherwise ineffective.
  - Controller NVS is overloaded or inadvertently off or otherwise ineffective.
- Connect time per I/O is the amount of time an I/O actually uses the fiber. It is measured from the beginning of the first FICON packet to the end of the last FICON packet. One way to help decrease connect time is to make sure the system is using z/HPF (High Performance FICON, aka transport-mode I/O) for paging. For more information, refer to z/VM Paging Improvements.
Note service time per I/O is just the sum of pending time, disconnect time, and connect time.
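The sum and the two thresholds mentioned above can be sketched as follows. The function name and the wording of the findings are ours; the 0.1 msec and 1 msec limits are the ones given for pending time and disconnect time.

```python
def assess_service_time(pend_ms, disc_ms, conn_ms):
    """Return (service time, findings) for one volume's per-I/O times."""
    findings = []
    if pend_ms > 0.1:
        findings.append("pending time high: check channel and control-unit busies")
    if disc_ms > 1.0:
        findings.append("disconnect time high: check controller cache and NVS")
    return pend_ms + disc_ms + conn_ms, findings


# Hypothetical volume: 0.25 msec pending, 0.5 msec disconnect, 0.25 msec connect.
print(assess_service_time(0.25, 0.5, 0.25))
```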
FCX108 also reports several additional measures that are of little to no utility for studying paging but which are very important for understanding DASD performance in general. Though this article is really about paging, for completeness we'll go ahead here and describe these additional important measures:
- Avoid is the number of I/Os per second that were avoided for the volume because z/VM got a hit on MDC. To determine whether the value is "correct" for your situation, consider these points:
  - If you have enabled MDC for the volume but the hit rate is low, check FCX177 CACHEXT to see whether the volume's I/Os are mostly reads, and also check FCX138 MDCACHE and FCX178 MDCSTOR to see whether MDC is operating as intended.
  - If you have disabled MDC for the volume, check FCX177 CACHEXT to see whether the volume's operations are mostly reads. If so, and if the volume is not a dedicated volume, consider enabling MDC for it.
- Req. Qued reports the depth of the I/O wait queue at the real DASD volume. In general, if we see wait queues forming at a real DASD volume, it means the device's service time is excessively large or the volume is just plain overworked. If the components of service time seem out of range, remediate as described above. Otherwise consider using PAV or HyperPAV to let CP start more than one real I/O to the volume concurrently. For more information about this, refer to z/VM Paging Improvements. Last, reorganize data to relieve stress on the volume. It is also worth checking whether MDC is functioning as intended for the volume; moreover, if FCX177 CACHEXT reports the volume's I/Os are mostly reads but you have disabled MDC for the volume, reconsider your configuration.
- Resp reports the average response time for I/O requests to the volume. Response time is service time plus time spent waiting in the volume's I/O queue. If we see response time exceeding service time, it means there is an I/O wait queue on average, and we'll see the queue in the Req. Qued column. When we see a queue, remediate as just described above.
Keep in mind that all of the values in FCX108 are averages over the time interval of the report. The time interval of the report is located in the report's upper left-hand corner. Use FCX108 INTERIM DEVICE for interval-by-interval studies of the system's DASD health. To study the interval-by-interval behavior of a specific DASD volume, use FCX168 DEVLOG.
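Since response time is service time plus time spent in the queue, the average queue wait can be backed out by subtraction. Here is a minimal sketch; the function name and the sample values are ours.

```python
def queue_wait(resp_ms, serv_ms):
    """Return (average queue wait in msec, share of response time spent queued)."""
    wait_ms = max(0.0, resp_ms - serv_ms)
    share = wait_ms / resp_ms if resp_ms > 0 else 0.0
    return wait_ms, share


# Hypothetical volume: 2.0 msec response time, 1.5 msec service time.
print(queue_wait(2.0, 1.5))  # (0.5, 0.25)
```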
FCX176 CTLUNIT and FCX177 CACHEXT report on DASD I/O performance data harvested from the DASD controller. These reports can provide useful information if you are looking for causes of poor storage subsystem performance, such as insufficient cache or insufficient NVS. These reports are best interpreted in consultation with your storage subsystem provider.
Because the z/VM value proposition pays off substantially only in the face of successful resource overcommitment, it's important for us to understand how z/VM paging works, how to measure it, and how to tune it.
The z/VM paging system does its work by implementing well-defined page motion, page marking, and page selection schemes. Pages move between central storage and DASD according to demands for central and according to their use patterns and ages. Pages to be ejected from central are chosen in a way that tries to reduce impact on running users.
z/VM Performance Toolkit reports on storage consumption by guests, by VDISKs, by shared data spaces, by minidisk cache, and by the Control Program itself. After we understand where storage is being used, we can tune storage use, by adjusting guests or by adjusting the Control Program.
Perfkit also reports on paging rates, either overall, or by users, by VDISKs, and so on. It also reports on whether users are being significantly held back by paging I/O or by storage constraints.
There are many steps one can take to configure a z/VM system so that it will page well. Key steps are to make sure there is enough paging space, that it is spread over enough volumes, that the volumes are used only for paging, that the volumes are placed intelligently in DASD subsystems, and that the DASD subsystems have enough chpids.
Perfkit tells us whether the paging system is healthy: whether the paging DASD is too full or whether the Control Program is experiencing undue delays moving pages on and off of paging volumes.
Perfkit also tells us whether the paging DASD themselves are healthy. Key metrics here are pending time, disconnect time, and connect time. When one or more of these metrics is out of bounds, reallocating or redistributing paging volumes or paging chpids can usually solve the problem.