The VM/ESA Scheduler Made Simple

VM/ESA and VSE/ESA Technical Conference - October 1996
Bill Bitner
VM Performance Evaluation
IBM Corp.
1701 North St.
Endicott, NY 13760
(607) 752-6022
Internet: bitner@vnet.ibm.com
Expedite: USIB1E29 at IBMMAIL
Bitnet: BITNER at VNET
Home Page: http://www.vm.ibm.com/devpages/bitner/

DISCLAIMER
Trademarks
Introduction
The Main Loops
Class Structure
Routines Involved
Entry to Eligible List (HCPSCHEP)
Exit Eligible List (HCPSCHSE)
Entry to Dispatch List (HCPSCIAD)
End of minor timeslice (HCPSCIDP)
Drop from Dispatch List (HCPSCKDD)
Pre-empt User (HCPSCKPR)
Overview of details not discussed
QUERY Commands
INDICATE QUEUES EXP Command
RTM Data
Monitor Data
QUICKDSP Tuning
Fitting in Storage
SRM XSTORE Tuning
SRM STORBUF Tuning
STORBUF Tuning Example
SRM LDUBUF Tuning
SRM DSPBUF Tuning
Share Tuning
Absolute Share Tuning
Relative Share Tuning
Share Tuning Example
Maximum Share Setting
Other Things to Know
Undesirable Features (Known Requirements)
Summary
References
Acronyms...
WSC Flash

Background
Deadline Scheduler
Competing Share Value
Low System Utilization and Limit Shares
Limiting Shares in a Low Utilization Environment
Conclusion

DISCLAIMER

The information contained in this document has not been submitted to any formal IBM test and is distributed on an "As is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environment do so at their own risk.

In this document, any references made to an IBM licensed program are not intended to state or imply that only IBM's licensed program may be used; any functionally equivalent program may be used instead.

Any performance data contained in this document was determined in a controlled environment and, therefore, the results which may be obtained in other operating environments may vary significantly.

Users of this document should verify the applicable data for their specific environment.

It is possible that this material may contain reference to, or information about, IBM products (machines and programs), programming, or services that are not announced in your country or not yet announced by IBM. Such references or information must not be construed to mean that IBM intends to announce such IBM products, programming, or services.

Should the speaker start getting too silly, IBM will deny any knowledge of his association with the corporation.

Trademarks

The following are trademarks of the IBM Corporation:
- IBM
- VM/ESA
- VTAM

Introduction

Objective
- Provide useful information on how the VM/ESA Scheduler works (or does not work).
- Explore some tuning methodologies.
- This presentation will not make you an expert.
Steps
- Background on Scheduler
- How things work in a nutshell
- Tuning discussion
All based on VM/ESA 2.1.0 (except where noted)

Speaker Notes

I will attempt to keep the material in this presentation current, by updating it periodically. Please let me know if you have any comments, suggestions, or corrections.
I personally don't consider myself a scheduler expert. My background is performance analysis for VM since 1985. The first couple years, I ran benchmarks on VM/SP. Over time I gained responsibility for analysis of the benchmarks and our lab took on VM/ESA. My "great leap" started around 1990 when I started dealing with customers and real systems. I gained a great appreciation for the difference between theory and application.
Not being an expert is good, because in performance, theory is nice but making it work is more important. This presentation attempts to explain the scheduler and what it can do. Some description of the algorithms and structure is necessary, but it should be treated as background information. A summary of how things work is presented, followed by some information on how you can affect the scheduler (tune it). Some areas of the scheduler are not discussed. That is because I either find them unimportant, too specialized, or I just do not understand them well enough to include them here.
This material is based on VM/ESA 2.1.0. Over time some tuning knobs have been added and other changes made.

I would like to say that everything in this presentation was based on my own experimentation, analysis, and discovery; but that would be a lie. Credit to much of what I've learned goes to various customers, IBMers and former IBMers: Paul VanLeer, John Harris, Gerry Davison, Bob Guenther, Virg Meredith, and Geoff Blandy; and Barton Robinson of Velocity Software; plus many others. Thank you.

The Main Loops

Each user is in one of the following lists

Dispatch List - (D-list) users ready (near-ready) to run.
Eligible List - (E-list) "Time Out" chair for users that want to be in Dispatch List.
Dormant List - users that are idle (as far as we know).

+----------+                        +----------+
| Dispatch |                        | Eligible |
|   List   |<-----------------------|   List   |
|          |                        |          |
|          |                        |          |
|          |-----+                  |          |
|          |     | Minor            |          |
|          |     | Dispatch         |          |
|          |<----+ Slice            |          |
|          |                        |          |
|          |----------------------->|          |
|          |----------+             |          |
+----------+          |             +----------+
                      | goes          |    A
                      | idle          |    | User
                      V               V    |Runnable
+----------------------------------------------+
|                                              |
|                 Dormant List                 |
|                                              |
+----------------------------------------------+

Speaker Notes

This is the classic VM/ESA scheduler picture. I've yet to see a scheduler presentation that does not include it. The three major lists are dispatch, eligible, and dormant. On most systems only the dispatch and dormant lists are of interest (because a true eligible list does not exist). The rest of the presentation discusses how VMDBKs (virtual machine definition blocks, that is, users) move around in the lists and how they get positioned. At some point it's probably good to draw a line between scheduler and dispatcher functions. For this presentation, consider the movement between the lists and prioritization within the lists to be the scheduler part. The dispatcher, not covered here, is concerned with pulling users off the dispatch list, running them, and returning them.

VM/ESA 1.2.2 added the maximum or limit share concept which is associated with the limit list. While called a list, it is really a subset of the dispatch list.

Class Structure

User belongs to one Transaction Class
- 1 - users with "short-running" transactions
- 2 - users with "medium-running" transactions
- 3 - users with "long-running" transactions
- 0 - special users or users with special transactions (QUICKDSP ON, hotshot, lockshot users).
Elapsed Time Slice for each class
- Class 1 ETS is dynamic with goal of keeping n% of users in class 1.
- Class 2 ETS is 8 times C1ETS
- Class 3 ETS is Max(48 times C1ETS,time to read in WSS)
When entering E-list from Dormant list, start as class 1.
Bump up class when ETS expires, but transaction is not complete.

Speaker Notes

The concept of classes is important to understanding the VM/ESA scheduler. There are four classes. Classes 1, 2, and 3 map to short-running, medium-running, and long-running jobs. The exact definition is dynamically determined by CP. A user belongs to only one class at a time, though the class may vary over time. When first entering the eligible list from the dormant list, a user starts as a class 1 transaction (unless it is special class 0). If elapsed time slices are consumed while still completing a given transaction, the user is promoted to a higher class. The dynamic part is that the class 1 ETS (elapsed time slice) is varied to keep a certain percentage of transactions in class 1. The class 1 ETS value is reported as C1ETS in the RTM/ESA product. We arrive at the other ETS values by applying a multiplier to the class 1 ETS. The multiplier for class 2 was changed in VM/ESA 1.2.2; prior to that release it was 6.

The remaining class is class 0, and is for special uses. This is meant for high priority work. Key virtual machines given the QUICKDSP option (discussed in more detail later) run as class 0 users. Virtual machines holding key system resources also fall into this class for the time over which they hold the resource. This is called "lockshot". "Hotshot" is another example of class 0 work.
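
The elapsed time slice rules on the foil reduce to a little arithmetic. The sketch below is illustrative Python, not CP code. The function names, the millisecond units, and the sample values are assumptions made for the example; only the multipliers (8 and 48) and the "bump up on expiry" rule come from the foil.

```python
def elapsed_time_slices(c1ets_ms, wss_read_time_ms):
    """Return the elapsed time slice (ETS) for classes 1, 2 and 3.

    c1ets_ms: class 1 ETS, which CP adjusts dynamically (taken as given here).
    wss_read_time_ms: time to read in the working set (taken as given here).
    """
    return {
        1: c1ets_ms,
        2: 8 * c1ets_ms,  # the multiplier was 6 before VM/ESA 1.2.2
        3: max(48 * c1ets_ms, wss_read_time_ms),
    }


def next_class(current_class):
    """Class after an ETS expires with the transaction still incomplete.

    Covers classes 1 to 3 only; class 0 is handled separately by CP.
    """
    return min(current_class + 1, 3)


# Hypothetical values: C1ETS of 250 ms, 20 seconds to read in the working set.
ets = elapsed_time_slices(c1ets_ms=250, wss_read_time_ms=20000)
print(ets)
```

With a small working set, the class 3 slice falls back to 48 times C1ETS; with a large one, the read-in time dominates.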

Routines Involved

HCPSCHEP:: E-list priority at entry to E-list
HCPSCHSE:: Select E-list user based on priority and available resources.
HCPSCHTE:: Misc transaction end processing.
HCPSCIAD:: Initial D-list priority when user added to D-list.
HCPSCIDP:: Re-prioritize after minor timeslice exceeded
HCPSCJGL:: Notification of growth in working set size.
HCPSCKDD:: D-list processing when user is dropped.
HCPSCKPR:: Choose a D-list user to pre-empt.
HCPSTK:: Routines to manage status and actually move VMDBKs from list to list.

Speaker Notes

Despite not wanting to get into a lot of the details, it is worthwhile looking at some of the scheduler functions. Here we break it down by CP routines. I did this for two reasons: 1) I wasn't sure I could come up with an easier method of breaking down the process, and 2) it gives good references for additional information. The comments in the source code can be very helpful in understanding the scheduler workings.
VM/ESA 1.2.2 added considerable code to the scheduler, which resulted in some module splits. For the routines above, there is only one change: HCPSCJGL used to be HCPSCIGL, because part of HCPSCI was split off into HCPSCJ. In addition, release 2.2 split HCPSTK into HCPSTK, HCPSTL, and HCPSTM. Several other new routines are not listed above.

All but three of the routines will be discussed in more detail in this presentation. HCPSCHTE handles transaction end processing. Transaction definition in VM/ESA is complex and unfortunately controversial at times. For fear of being sidetracked and for lack of time, I have omitted discussion of it. HCPSCJGL (formerly HCPSCIGL) handles notification of growth in working set size. I do not feel comfortable enough with this area to discuss it at this time. HCPSTK does a lot of the management work of the scheduler. It is responsible for calling many of the other routines listed and for actually moving VMDBKs between lists.

Entry to Eligible List (HCPSCHEP)

Determine new transaction or reason for coming to E-list.
Determine E-list class
Determine Elapsed time slice (VMDESLIC)
Determine E-list priority (VMDEPRTY) based on:
- User class
- User resource consumption
- System resource contention
- Share setting
Determine Core working set size (VMDCWSS)

Speaker Notes

Since most systems do not run with an eligible list, this step is often moot. However, we do determine some key things in the process, such as class and working set information.

The working set size calculation is a side note which should come up at some point in this discussion. Currently, the biggest gotcha is that shared spaces are not considered part of an individual user's working set. This makes sense in that if a space is used by all users, it shouldn't be associated with any individual user. However, CP does not have a good method of distinguishing between highly shared and just shareable. Simply making a space shareable is the same as having it shared by thousands of users.

Exit Eligible List (HCPSCHSE)

Compute available storage (DPA) and largest user that can fit for each class.
Scan used to select users to move to dispatch list
Class 0 always selected
Not selected if user class blocked.

If user fits in storage then
- Select if also meets LDUBUF and DSPBUF limits
If user does not fit in storage then
- Select if user meets LDUBUF and DSPBUF and is only user in class
- Select if E1 and behind by certain delay factor (pre-empt another).
- Block class if not E1, but behind.

Speaker Notes

Okay, so far we have moved a user from the dormant list into the eligible list. Now, the scheduler needs to determine who is important enough to move on to the dispatch list. Note there are two foils for this (try as I might, I couldn't get it narrowed down to one foil).
Some upfront work includes determining the available storage (DPA) that the scheduler has to work with and the largest user that can fit for each class. SET SRM STORBUF tuning is discussed later, but this is where it becomes a factor.
From here we do a scan of the eligible list to select users to move to the dispatch list. Parts of the selection process are simple, like all class 0 users are selected. Also, it turns out that if a user is the only one in its class, it is selected. The foils show the basic selection process.

Two concepts introduced here are "class blockage" and "pre-emption". Both involve the case where we determine that a user is too far behind schedule. If this occurs for a class 1 user, the scheduler will attempt to pre-empt another user (discussed later in the presentation). If the user is class 2 or 3, the scheduler blocks the class. Blocking a class means the scheduler will not select any other user in this class or a higher class.
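
The selection rules on the two foils can be read as a small decision function. The following Python sketch is one interpretation, not the HCPSCHSE logic itself. The function name, the boolean inputs, the order of the checks, and the result strings are all assumptions for illustration.

```python
def elist_decision(user_class, fits_in_storage, meets_ldubuf_dspbuf,
                   only_user_in_class, behind_schedule, class_blocked):
    """Return 'select', 'preempt', 'block', or 'wait' for one E-list user."""
    if user_class == 0:
        return "select"          # class 0 is always selected
    if class_blocked:
        return "wait"            # not selected while the class is blocked
    if fits_in_storage:
        return "select" if meets_ldubuf_dspbuf else "wait"
    # From here on the user does not fit in storage.
    if meets_ldubuf_dspbuf and only_user_in_class:
        return "select"
    if behind_schedule:
        # An E1 user triggers a pre-emption attempt; E2 and E3 block the class.
        return "preempt" if user_class == 1 else "block"
    return "wait"


# An E1 user that does not fit but is behind schedule by the delay factor.
print(elist_decision(1, False, False, False, True, False))
```

Here "preempt" stands for "select this user after pre-empting another", and "block" means no other user in this class or a higher class is selected.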

Entry to Dispatch List (HCPSCIAD)

Calculate Dispatch List priority with following factors:

ATOD
Running timer in TOD form for user time.
OFFSET
Used to initially offset ATOD. Based on a number of factors.
Paging Bias
Priority boost given when the transaction is being continued and pages have been stolen. One time boost.
IABIAS
Interactive Bias - priority improvement based on SRM IABIAS settings.
Hotshot
Priority boost given to user with transaction in progress, but interacts with terminal.

OFFSET =
   Minor Timeslice + Previous Timeslice Overrun
   --------------------------------------------
      Dispatch List Share * Number of CPUs

Minor Timeslice:: Minor dispatch time slice (SRM DSPSLICE) in TOD units.
Previous Timeslice Overrun:: amount of time user exceeded previous minor timeslice.
Dispatch List Share:: Factor of user's share and time spent in eligible list.

Speaker Notes

This is another two foil section and the routine that is most likely to send us off on a long theory and algorithm discussion. At this point, the scheduler has selected a user to move to the dispatch list. The biggest factor here is determining dispatch list priority. Listed on the first foil are the major factors. The second one (OFFSET) is discussed in more detail on second foil.
I believe ATOD stands for artificial TOD or adjusted TOD, depending on your mood. In any case, since the VM/ESA scheduler is a deadline scheduler, a time line is needed. ATOD is used for this purpose; it runs only for CPU usage that is user time.
Paging Bias is a one-time priority boost.
IABIAS comes from the SET SRM IABIAS settings.
HOTSHOT has been discussed earlier. It involves giving a user a boost for a trivial terminal interaction in the middle of a longer transaction.
The OFFSET formula has additional terms to define. Number of CPUs is pretty obvious and should be constant (if not, you probably have other problems). As shown, Previous Timeslice Overrun is the amount of time user exceeded previous minor timeslice. I personally have never seen a non-zero number here. The minor timeslice is also a constant either set by system default or by SET SRM command. The final variable, Dispatch List Share, is a factor of user's normalized share and time spent in the eligible list. As mentioned earlier, most systems do not have an eligible list. Therefore, time spent there is null.

Looking at all the variables, most are constants or negligible with the exception of normalized share. It is a key part of the formula and something we have control over.
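
To see why share is the term that matters, the OFFSET formula can be evaluated with everything else held constant. This is a minimal Python sketch. The function name is made up, milliseconds are used instead of TOD units for readability, and the share is assumed to be a fraction between 0 and 1.

```python
def dispatch_offset(minor_timeslice, overrun, dlist_share, num_cpus):
    """OFFSET = (minor timeslice + previous overrun) / (D-list share * CPUs)."""
    return (minor_timeslice + overrun) / (dlist_share * num_cpus)


# Hypothetical system: 5 ms minor timeslice, no overrun, two processors.
for share in (0.05, 0.10, 0.20):
    print(share, dispatch_offset(5.0, 0.0, share, 2))
```

Doubling the normalized share halves the offset, which places the user's deadline closer to ATOD.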

End of minor timeslice (HCPSCIDP)

Calculate Dispatch list priority much like first time.
Update some statistics.
If user got into dispatch list by hotshot, then set flag to drop user.
Do we need to limit the user? Set flags as appropriate.

Speaker Notes

This is the small inner loop between users running and the scheduler getting control to check some things and re-prioritize.

This may be a good time to discuss maxfall, limithard, and limitsoft. Think of the time line that ATOD moves along as a vertical line, with 0 at the top and infinity at the bottom. If a user starts to get increasingly ahead of ATOD (starts falling), it means it has been getting plenty of CPU cycles and the scheduler has moved it way ahead of ATOD. It can reach a point referred to as maxfall or the maximum amount we will let a VMDBK get ahead of ATOD. The maxfall point is relative to the offset of the user (ATOD + 4 * offset). Once reaching maxfall, unlimited and limitsoft users are not permitted below this point. Limithard users are marked to be moved to the limit list.
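
The maxfall rule can be sketched as a clamp on the time line. This Python fragment is an illustration only. The function name and the returned tuple are assumptions, larger values are taken to mean "further down the line" (further ahead of ATOD), and what CP does with a limithard user's priority once it is marked is not covered by the text, so it is left unchanged here.

```python
def apply_maxfall(priority, atod, offset, limit_type="unlimited"):
    """Return (new_priority, move_to_limit_list) after the maxfall check.

    limit_type is 'unlimited', 'limitsoft', or 'limithard'.
    """
    maxfall = atod + 4 * offset      # furthest a VMDBK may get ahead of ATOD
    if priority <= maxfall:
        return priority, False       # still within range, nothing to do
    if limit_type == "limithard":
        return priority, True        # marked to be moved to the limit list
    return maxfall, False            # unlimited and limitsoft stop at maxfall


print(apply_maxfall(100, atod=50, offset=10))
print(apply_maxfall(100, atod=50, offset=10, limit_type="limithard"))
```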

Drop from Dispatch List (HCPSCKDD)

Update statistics based on why dropping
Keep track of status to help determine monitor transaction end
Calculate resource consumption values for later
- Working Set Size (WSS)
- Paging Rate
Record other resource stats (user and system).
Optionally cut monitor record.

Speaker Notes

This routine does a lot of the bookkeeping required for scheduling work. If we were to discuss the definition of transactions more, this would probably be the place. The resource consumption calculations get used in various places, including the feedback algorithms in entry to eligible list priority. At this point, we may be going back to the eligible list or the dormant list. The main loop of VM/ESA has been completed.

I haven't pointed out all the places where monitor data is involved, but here is one place.

Pre-empt User (HCPSCKPR)

Find user in dispatch list to pre-empt to make room for a class 1 user to run.
Pre-empt user if
- Not class 0 or class 1
- Not under the influence of a bias
- Not last user in dispatch list of their class (except if bigger than STORBUF limit).

Speaker Notes

This subject was broached back on the Exit Eligible List foils. The scheduler will attempt preemption to make room for a class 1 user which has fallen behind schedule by a certain delay factor. All of the criteria must be met for preemption to occur.
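
Since all of the criteria must hold together, the test for a pre-emption candidate is a simple conjunction. The Python sketch below restates the foil. The DlistUser record and its field names are hypothetical and do not correspond to VMDBK fields.

```python
from dataclasses import dataclass


@dataclass
class DlistUser:
    userid: str
    user_class: int        # 0 to 3
    has_bias: bool         # under the influence of a bias
    last_in_class: bool    # last D-list user of its class
    exceeds_storbuf: bool  # bigger than the STORBUF limit


def can_preempt(user):
    """True if the user meets every pre-emption criterion on the foil."""
    if user.user_class in (0, 1):
        return False
    if user.has_bias:
        return False
    if user.last_in_class and not user.exceeds_storbuf:
        return False
    return True


candidates = [
    DlistUser("BATCH1", 3, False, False, False),
    DlistUser("SERVER", 0, False, False, False),
    DlistUser("LONELY", 2, False, True, False),
    DlistUser("BIGGUY", 3, False, True, True),
]
print([u.userid for u in candidates if can_preempt(u)])
```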

Overview of details not discussed

TOD Tied concept
MAX Fall concept
Scheduling of Virtual MPs
Scheduling of V=R/F virtual machines
Dedicated Processors
Growth Limit

Speaker Notes

The items on this list are other things (footnotes) that are involved with the VM/ESA scheduler. I mention them as potential off-line discussions or as warnings that these items may put a different slant on the scheduling.
The first two deal with trying to keep the VMDBKs within a manageable range on the timeline compared to ATOD. We mentioned maxfall on the foil about re-prioritizing after the minor timeslice expires. Maxfall keeps a user from falling too far below (in front of) ATOD, and TOD tied deals with keeping a user from rising too far above ATOD. These two approaches help us avoid a race condition when the VMDBK starts to need more processor resources or processor resources suddenly become constrained.
With scheduling of virtual MPs, two things to remember are that the base VMDBK is always equal to or ahead of the other VMDBKs, and that the resources get split evenly across the virtual processors.
V=R/F virtual machines do not have to worry about the scheduling impacts of storage. Dedicated processors means not worrying about processor resources.

WSS Growth limit. I have nothing intelligent to say at this point.

QUERY Commands

QUERY SRM
 
IABIAS : INTENSITY=90%; DURATION=2
LDUBUF : Q1=100% Q2=75%  Q3=60%
STORBUF: Q1=100% Q2=85%  Q3=75%
DSPBUF : Q1=32767 Q2=32767 Q3=32767
DISPATCHING MINOR TIMESLICE = 5 MS
MAXWSS : LIMIT=9999%
XSTORE : 0%

QUERY QUICKDSP VTAM
 
USER VTAM    :  QUICKDSP = ON

QUERY SHARE BARTMAN
USER BARTMAN :  RELATIVE SHARE = 705
                 MAXIMUM SHARE = LIMITHARD RELATIVE 705

Speaker Notes

Here are example outputs of commands that can be useful in gathering scheduler related information. The QUERY SRM command shows values for different options which have corresponding SET SRM command settings. Most of these are discussed further.
QUERY QUICKDSP userid just gives setting for a user.

QUERY SHARE userid gives share setting.

INDICATE QUEUES EXP Command

Userid      List Stat Resid/  WSS  Flag  xPRTY Affinity
RSCS          Q0 PS  000489/000486 .... -2.752 A02
LIBMAINT      Q1 PS  000156/000134 .I.. -2.435 A01
BITNER        Q1 R00 000106/000081 .I.. -2.419 A03
HELLEND7      Q1 R02 000712/000680 .I.. -2.411 A02
TCPIP         Q0 PS  000902/000891 .... -.8154 A02
RUPRIGHT      Q2 IO  000724/000565 ....  2.371 A05
ESCHENBR      Q2 PG  000214/000206 ....  2.471 A05
DUGUID        Q3 -   000419/000198 ....  2.549 A00
LIBTOOLS      Q3 IO  000518/000488 ..D.  2.565 A04
BERMANES      Q3 EX  001719/001714 ....  2.604 A05
SAFE          Q2 R   000094/000085 ....  2.610 A02
DITOMMAD      Q2 IO  000424/000352 ....  2.621 A02
CRAST         Q3 R01 001198/001523 ....  2.625 A01
WILKINSS      Q3 R04 000861/000943 ....  2.653 A04
MOYW          Q2 IO  000949/000942 ....  3.445 A03
NLSLIB        Q3 IO  000116/000111 ....  3.487 A05
EDLWRK11      Q3 IO  000361/000330 ....  59.47 A00
EDLWRK14      Q3 IO  000404/000373 ....  59.50 A05
RACFVM        Q0 PS  000733/000731 ....  99999 A03
GIANTS1       Q1 PS  000510/000471 .I..  99999 A03
DISKACNT      Q2 PS  000060/000058 .I..  99999 A02
EDLSFS1       Q0 PS  000934/000913 .I..  99999 A04
LUNSFORD      Q1 PS  000291/000266 .I..  99999 A05
NPM           Q3 PS  000185/000184 ....  99999 A05
PLAVCHAN      Q1 PS  000352/000309 ....  99999 A00
VTAM          Q0 PS  001327/001326 ....  99999 A05

Speaker Notes

More sample command output, this time for the INDICATE QUEUES EXPanded CP command. Note that I added the header line. The EXPanded option for the command went out as an APAR at one time and was then rolled into VM/ESA. The first two fields are obvious. The Stat field gives what the user is doing or waiting for (e.g. IO = waiting for I/O, PG = page wait, R00 = running on CPU 00). The next two fields are the count of pages resident in central storage and the projected working set size. The Flag field is a series of flags. In this example, the "I"s for LIBMAINT, BITNER, etc. mean the user has IABIAS in effect. The "D" means that the last time in the E-list the user was behind deadline (e.g. LIBTOOLS). The xPRTY is either the dispatch or the eligible list priority. A negative value means the user is behind schedule. Values of 99999 are used for test idle users or for values greater than 5 digits.
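
If you capture this output to a file, the fields described above are easy to pull apart. The Python sketch below assumes the seven whitespace-separated columns shown in the sample and only knows about the "I" and "D" flags discussed here. The function name and dictionary keys are made up for the example.

```python
def parse_indq_line(line):
    """Split one INDICATE QUEUES EXP line into named fields."""
    userid, lst, stat, pages, flags, prty, affinity = line.split()
    resident, wss = (int(n) for n in pages.split("/"))
    return {
        "userid": userid,
        "list": lst,                 # Q0 to Q3 in the sample output
        "stat": stat,
        "resident": resident,        # pages resident in central storage
        "wss": wss,                  # projected working set size
        "iabias": "I" in flags,      # IABIAS in effect
        "behind_in_elist": "D" in flags,
        "priority": float(prty),     # negative means behind schedule
        "affinity": affinity,
    }


row = parse_indq_line("LIBTOOLS      Q3 IO  000518/000488 ..D.  2.565 A04")
print(row["userid"], row["resident"], row["wss"], row["behind_in_elist"])
```

Keep in mind that a priority of 99999 is a placeholder rather than a real value, so filter those rows out before doing any arithmetic on the priority column.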

RTM Data

DISPLAY SCLOG - shows number of users in various lists/classes, plus loading user count.
DISPLAY XSLOG
- EL - number users in eligible list
- DL - number users in dispatch list
- RLSH - sum of relative shares in dispatch list
- ABSH - sum of absolute shares in dispatch list
SHARE field on GENERAL, USER, and ULOG displays

Speaker Notes

There are a few key pieces of scheduler related information available from RTM/ESA. The scheduler log display (SCLOG) shows the number of users in the various lists by class over time, plus counts of loading users over time. The extended system log (XSLOG) gives total users in eligible and dispatch lists over time. In addition the sum of the absolute shares and relative shares for dispatch users are given. The value of this information is seen when we talk about the SET SHARE tuning command. The share settings appear on various displays.

Monitor Data

System Domain Records:
- 10: MRSYTSCG - Scheduler Activity (Global)
- 13: MRSYTSCP - Scheduler Activity (Per Processor)
Scheduler Domain records:
- 1: MRSCLRDB - Begin Read
- 2: MRSCLRDC - Read Complete
- 3: MRSCLWRR - Write Response
- 4: MRSCLADL - Add User to Dispatch List
- 5: MRSCLDDL - Drop User From Dispatch List
- 6: MRSCLAEL - Add User to Eligible List
- 7: MRSCLSRM - Set SRM Changes
- 8: MRSCLSTP - System Timer Pop
- 9: MRSCLSHR - Set SHARE Changes
- 10: MRSCLSQD - Set QUICKDSP Changes
Some are exposed by VMPRF and/or VMPAF

Speaker Notes

The records listed are particular to the scheduler. In addition, information on QUICKDSP and SHARE settings is included in the USER domain. Some of the scheduler domain records can be interesting but enabling this domain for all users can result in excessive overhead. You might want to enable for a userid that does not ever log on such as $SPOOL$.

QUICKDSP Tuning

command- SET QUICKDSP userid ON
directory- OPTION QUICKDSP

Userid becomes transaction class 0 and never waits in the eligible list.
Impacts decision to move from E-list to D-list on userid basis.
Used for key server virtual machine (process synchronous requests from end users) and virtual machines that demand the best performance.

Speaker Notes

The biggest fence to keep users under control is the eligible list. However, there are certain virtual machines that should not be subject to this control. These machines include key server machines such as VTAM which must get good service in order for end users to get good performance. It is also a good idea to give this privilege to some special users such as OPERATOR or MAINT so that if the system becomes overloaded and hits a problem, one has a better chance of getting commands to fix the system through.

The QUICKDSP setting impacts a single userid. It should be used for any server virtual machine that is critical to end user response time or holds key system resources. Overuse of this control for other users can take away the effectiveness of the scheduler.

Fitting in Storage

Total Available =
- Total DPA page frames
- - non-pageable frames
- - system owned resident shared frames
- - system owned locked frames
Roughly AVAIL field from QUERY FRAMES
SCLADL_SRMTOTST monitor field
Add in bonus if applicable for XSTORE

Speaker Notes

Before discussing the next couple of tuning commands, it's worth understanding the scheduler's concept of fitting users into storage. You may want to refer back to the "Exit Eligible List" foils. The scheduler determines how much storage it has available to commit to users. The formula is as shown and is approximated by AVAIL field from the QUERY FRAMES CP command. The listed monitor field is the actual value. It may vary over time. If users suddenly stop fitting in storage, first determine if there's less storage available or more storage required.
There can be a bonus for expanded storage (see next foil).

One misconception to avoid (I've confused people in the past with this) is that we don't actually move all of a user's pages into storage when moving them from the eligible list to the dispatch list. Many of their pages may already be there. It's more a commitment to allow them to stay and keep using them.

SRM XSTORE Tuning

command- SET SRM XSTORE percentage

Impacts decision to move from E-list to D-list on system basis.
Determines how much expanded storage to be viewed as real storage for purpose of fitting user in STORBUF limitation.
Percentage of existing expanded storage to add to available storage.
- 100% of 0 is still 0.

Speaker Notes

This command comes into effect during selection process of users to add to dispatch list from eligible list. The percentage value is a percentage of expanded storage that will be added to what the scheduler considers available. Note that if no expanded storage is available, turning this knob will have no effect.
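
Putting the "Fitting in Storage" formula together with the XSTORE bonus gives the following arithmetic. This is a Python sketch with invented names and frame counts; the actual value is the one CP records in the monitor field mentioned earlier.

```python
def scheduler_available_frames(dpa, nonpageable, shared_resident, locked,
                               xstore_pages=0, srm_xstore_pct=0):
    """Frames the scheduler considers available for fitting users."""
    base = dpa - nonpageable - shared_resident - locked
    bonus = xstore_pages * srm_xstore_pct / 100   # share of XSTORE counted as real
    return base + bonus


# Hypothetical system with 30000 DPA frames and 10000 pages of expanded storage.
print(scheduler_available_frames(30000, 2000, 1500, 500))
print(scheduler_available_frames(30000, 2000, 1500, 500, 10000, 50))
print(scheduler_available_frames(30000, 2000, 1500, 500, 0, 100))
```

The last call shows the point made on the foil: 100% of no expanded storage adds nothing.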

SRM STORBUF Tuning

command- SET SRM STORBUF p1 p2 p3

Impacts decision to move from E-list to D-list on class basis.

p1
percentage of storage available for classes 1, 2, and 3.
p2
percentage of storage available for classes 2 and 3.
p3
percentage of storage available for class 3.
The default is too low for most guest environments.

STORBUF Tuning Example

When guests show up in the E-list, get their WSS and class, and increase STORBUF appropriately.
Available storage is 96MB
2 Guests with WSS of 32MB each constantly appear as E3 users.
64 / 96 = 67%
Increase STORBUF by 70 to 170 155 145

Speaker Notes

Here you see theory and application, hence two foils.
Prior to VM/ESA 1.2.2, the defaults for SET SRM STORBUF were 100 85 75. In VM/ESA 1.2.2, they were changed to 125 105 95. This command also affects the determination of whether a user fits in storage during the selection process for leaving the eligible list. Key differences from SRM XSTORE command are that this is on a class basis and does not require expanded storage.

In the example, we see where default STORBUF settings are not appropriate for this system. The storage required to keep both guests in storage would be 64MB (32MB times 2 guests). Since available storage is 96MB, 64MB is 67% of available storage. Round up to 70% so no one thinks we are too scientific. Add 70 to the defaults for all classes (100 85 75 in this example), since we do not want to penalize other classes.
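
The arithmetic of the example is short enough to script. The Python sketch below reproduces it. The function name and the round-to-the-next-10 rule are assumptions that mirror the "round up to 70%" step, and the starting values are the 100 85 75 settings used in the example.

```python
import math


def storbuf_for_guests(available_mb, guest_wss_mb, current=(100, 85, 75),
                       round_to=10):
    """Return (increase, new STORBUF values) needed to keep the guests in."""
    needed_pct = 100 * sum(guest_wss_mb) / available_mb
    bump = math.ceil(needed_pct / round_to) * round_to
    return bump, tuple(p + bump for p in current)


bump, new_values = storbuf_for_guests(96, [32, 32])
print(bump, new_values)
print("SET SRM STORBUF %d %d %d" % new_values)
```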

SRM LDUBUF Tuning

command- SET SRM LDUBUF p1 p2 p3

Impacts decision to move from E-list to D-list on class basis.

p1
percentage of paging exposures available for classes 1, 2, and 3.
p2
percentage of paging exposures available for classes 2 and 3.
p3
percentage of paging exposures available for class 3.
Count of loading users in each class given in RTM SCLOG Display.
Loading capacity in monitor field MTRSCH_SRMLDGCP.

Speaker Notes

The SET SRM LDUBUF tuning command also affects users on a class basis. It is meant to control demands for DASD paging. The system determines the loading capacity (how much paging it can tolerate). This is recorded in monitor and shown in the VMPRF SYSTEM_CONFIGURATION report. Basically it's how many loading users the system can tolerate. What counts as a loading user is also defined by the system, and loading users are reported by VMPRF. The parameters of SET SRM LDUBUF then give the percentage of exposures available for the given classes. IND QUEUES EXP, RTM, and VMPRF all provide information on loading users.

SRM DSPBUF Tuning

command- SET SRM DSPBUF n1 n2 n3

Impacts decision to move from E-list to D-list on class basis.

n1
number of class 1 users permitted in dispatch list
n2
number of class 2 users permitted in dispatch list
n3
number of class 3 users permitted in dispatch list
Practically set off by default.
See RTM SCLOG display for count of users in various classes.

Speaker Notes

The command sets how many users are permitted in the dispatch list for each class. It varies from STORBUF and LDUBUF in that their parameters were for cumulative classes (i.e. p1 was percentage of storage for classes 1, 2 and 3), while DSPBUF is absolute number for class (n1 is just class 1 users).
The default is 32767 for each class, so it is effectively off. (If you have more than 32767 users in any class, call me cause I want to see that system.)

This tuning command is for the very brave, the very smart, or the very silly. I've seen it used effectively a few times, but they were in unique situations.
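
The difference between the cumulative limits (STORBUF, LDUBUF) and the per-class DSPBUF limit can be made concrete. The Python sketch below is one reading of the descriptions above, not the CP algorithm. The function names, the dictionaries keyed by class, and the idea that a candidate is checked only against the limit for its own class are assumptions, and class 0 users are ignored.

```python
def within_cumulative_limit(in_dlist, user_class, extra, limits_pct, capacity):
    """STORBUF/LDUBUF style: limit N covers classes N through 3 together.

    in_dlist maps class to the amount already in the dispatch list
    (pages for STORBUF, loading users for LDUBUF); extra is the candidate.
    """
    used = sum(amt for cls, amt in in_dlist.items() if cls >= user_class)
    return used + extra <= capacity * limits_pct[user_class - 1] / 100


def within_dspbuf(count_by_class, user_class, dspbuf):
    """DSPBUF style: an absolute user count for the one class only."""
    return count_by_class.get(user_class, 0) + 1 <= dspbuf[user_class - 1]


# Hypothetical: loading capacity of 10 users, LDUBUF 100 75 60.
loading = {1: 2, 2: 3, 3: 4}
for cls in (1, 2, 3):
    print(cls, within_cumulative_limit(loading, cls, 1, (100, 75, 60), 10))
print(within_dspbuf({1: 5}, 1, (32767, 32767, 32767)))
```

In this made-up case a class 2 candidate is held back (8 loading users against a limit of 7.5) while class 1 and class 3 candidates pass.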

Share Tuning

Two flavors
- Absolute
- Relative
Impacts calculation of Dispatch list priority on userid basis (directly) and system basis (indirectly).
Shares may be normalized before being used.
- If sum of absolutes is > 99%, then normalized to 99%.
- Relatives normalized to absolute leftovers.
There is a minimum (regular) share and a limit share

Speaker Notes

Several foils will be spent on discussing this topic. Share is meant as a way to adjust a virtual machine's share of system resources. There are two types of share: relative and absolute. Details on each will follow in other foils. The default share is RELATIVE 100.
As shown earlier it impacts the calculation of the dispatch list priority for the given userid. In addition, it may also affect other users due to normalization.

When used in the calculations, shares may be normalized to other users in the dispatch and eligible lists. Absolute shares are percentages of the system. So on a two-way processor, ABSOLUTE 50% is equivalent to one processor. If the sum of absolute shares is greater than 99%, they get normalized to 99%. Relative shares are normalized to what's left over after absolute shares are determined.

Absolute Share Tuning

command- SET SHARE userid ABSOLUTE ppp%
directory- SHARE ABSOLUTE ppp%

Percentage of 0.1 to 100 of system resources for user.
If sum of absolutes is > 99%, then normalize to 99%.
As long as sum is not greater than 99%, the absolute share worth is constant.
For total of absolute shares in dispatch list see
- ABSH field in RTM
- SYTSCG_SRMABSDL or SCLADL_SRMABSDL in monitor.
For total of absolute shares in D-list and E-list see SCLAEL_SRMABSDE monitor field.

Relative Share Tuning

command- SET SHARE userid RELATIVE nnnnn
directory- SHARE RELATIVE nnnnn

value from 1 to 10000, 10000 being highest priority
Relatives normalized to absolute leftovers.
As system becomes busier (more users in dispatch list), a relative share is worth less after normalization.
For total of relative shares in dispatch list see
- RLSH field in RTM
- SYTSCG_SRMRELDL or SCLADL_SRMRELDL in monitor.
For total of relative shares in D-list and E-list see SCLAEL_SRMRELDE monitor field.

Speaker Notes

ABSOLUTE SHARE is expressed as a percentage from 0.1 to 100. It is meant as a percentage of system resources. In the absence of an eligible list, it is basically a share of CPU resources. If the sum of absolute shares for non-dormant users is greater than 99%, the scheduler normalizes to 99%. As long as this doesn't occur, the absolute share is a constant. That is, it doesn't vary as system load changes. The current sum of absolute shares for dispatch list users can be found in the RTM ABSH field or in monitor data.

RELATIVE SHARE is a value ranging from 1 to 10000 where 10000 is the highest priority. These values are normalized amongst relative users for the absolute share leftovers. Therefore, as the system becomes busier, a relative share is worth less after normalization. For example, assume 10 Relative 100 users in the dispatch list. This results in each getting a normalized share of 10%. Now put 20 Relative 100 users in the dispatch list. This results in normalized shares of 5% each. RTM and monitor provide sums of relative shares.
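The normalization described in these notes can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the scheduler's actual code; the function name, the dictionaries and the userids are made up for the example.

```python
def normalize_shares(absolute, relative, cap=99.0):
    """Return (normalized absolutes, normalized relatives) in percent.

    absolute: dict of userid -> absolute share in percent
    relative: dict of userid -> relative share value (1..10000)
    Illustrative only: the names and structure are assumptions.
    """
    abs_total = sum(absolute.values())
    # If the absolutes sum to more than the cap, scale them down to the cap.
    scale = cap / abs_total if abs_total > cap else 1.0
    norm_abs = {u: s * scale for u, s in absolute.items()}

    # Relative users split whatever the absolute users leave over.
    leftover = 100.0 - sum(norm_abs.values())
    rel_total = sum(relative.values())
    norm_rel = {u: leftover * s / rel_total for u, s in relative.items()}
    return norm_abs, norm_rel


ten = {f"USER{i:02d}": 100 for i in range(10)}
twenty = {f"USER{i:02d}": 100 for i in range(20)}
print(normalize_shares({}, ten)[1]["USER00"])     # 10.0
print(normalize_shares({}, twenty)[1]["USER00"])  # 5.0
```

The two printed values match the 10-user and 20-user cases in the text: doubling the number of Relative 100 users in the dispatch list halves what each one is worth.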

Share Tuning Example

VTAM originally set to Relative 10000
RLSH value from RTM is 13556
ABSH value from RTM is 3
VTAM normalized share is 72%
- (100-3) x (10000/13556)
34 other relative users in D-list each have normalized share of less than 1%.
This was a 2-way processor so anything over 50% is surplus.
36 users in dispatch list, no eligible list.
VTAM CPU consumption was about 20 to 25%.
- Convert to absolute 25% or possibly relative 1200.

Speaker Notes

This example is meant to drive home the point that Relative 10000 is the highest priority, but one has to ask if a virtual machine is really 100 times more important than another dispatched virtual machine.
This was a 9121-480 (frame 2-way). RTM lists RLSH and ABSH as shown. So, relative shares are normalized to 97% after subtracting absolutes. Since 10000/13556 is 74%, VTAM gets 74% of 97% or 72%. There were also 34 other relative share users in the dispatch list. Each of those had the default share, which when normalized is (100/13556) * 97%, or about 0.7%.

When we look at actual CPU consumption, VTAM is only using 20 to 25%. Therefore a more appropriate share would be ABSOLUTE 25% or RELATIVE 1200.
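The arithmetic behind this example, including where a value near RELATIVE 1200 comes from, can be checked with a short Python sketch. The function names are made up for illustration, and the second function assumes the other users' relative shares stay exactly as RTM reported them.

```python
def normalized_relative(share, other_relative_total, absolute_total):
    """Normalized share (percent) of one relative user in the dispatch list."""
    leftover = 100.0 - absolute_total
    return leftover * share / (other_relative_total + share)


def relative_for_target(target_pct, other_relative_total, absolute_total):
    """Relative value that normalizes to target_pct if the others are unchanged."""
    leftover = 100.0 - absolute_total
    return target_pct * other_relative_total / (leftover - target_pct)


others = 13556 - 10000  # relative shares of everyone except VTAM
print(round(normalized_relative(10000, others, 3), 1))     # 71.6
print(round(normalized_relative(100, 13556 - 100, 3), 2))  # 0.72
print(round(relative_for_target(25, others, 3)))           # 1235
print(round(normalized_relative(1200, others, 3), 1))      # 24.5
```

Solving for a 25% normalized share gives a relative value of roughly 1235, which rounds to the 1200 suggested above. With RELATIVE 1200 the normalized share works out to about 24.5%.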

Maximum Share Setting

    SET SHARE MYUSERID REL 2000 REL 4000 LIMITHARD

New parameters available on SET SHARE and Directory options.
Second share value listed is maximum share.
Maximum share can be relative or absolute.
Two types of maximum share limits
- LIMITSOFT means only let the virtual machine use more than this amount of resources if no other virtual machine needs the resources.
- LIMITHARD means do not give the virtual machine more than this amount of resources even if they are available.

Speaker Notes

    >>--SET--SHARE--userid--+--INITial--------------------------------+--><
                            +--+-ABSolute nnn%--+--+----------------+-+
                            |  '-RELative nnnnn-'  +-| limit |------+ |
                            |                      '-| max share |--' |
                            '--| limit |------------------------------'

    limit:
        |--+-NOLimit---+--|
           +-LIMITSoft-+
           '-LIMITHard-'

    max share:
        |--+-+----------+--mmm%---+--+-----------+--|
           | '-ABSolute-'         |  +-LIMITSoft-+
           '-+----------+--mmmmm--'  '-LIMITHard-'
             '-RELative-'

Above you see the syntax for the SET SHARE command. The most recent addition was the maximum share value, added in VM/ESA 1.2.2. Just like the minimum share, both absolute and relative values are allowed. The new maximum share provides a method to hold back users from getting extra processor resources. It is different from the minimum share in that it deals only with processor resources. There are two types of limits: limithard and limitsoft. Limithard keeps a user from ever getting more than the limit (maximum) share. Limitsoft keeps a user from getting extra processor resources unless no other unlimited user on the system can use the resources.

Other Things to Know

The Scheduler is not a consumption scheduler (we really do not measure processor usage in the scheduler algorithms).
Priority is based off of minimum share (not maximum).
Side effects of tuning do exist.
Absolute does not mean Precisely
Absolute is in terms of entire system, not a processor.
ATOD runs at near CPU time, ATOD2 runs at near wall clock.
Dedicating processors requires changing how you look at things.

Speaker Notes

Some areas of the scheduler can be confusing. Let's talk about a few of these. The first item is that the scheduler is not a consumption scheduler. We do not track actual processor consumption and use that to determine when to run a user. A user's priority is based on the minimum share, so increasing the limit share will not necessarily provide it with more resources. Side effects will arise in tuning. If you hold users back with STORBUF, do not expect them to be able to consume as much resource as their share settings would suggest. Two misconceptions about absolute share are that it is a share of one processor (it is really a share of the whole system) and that it is an exact number (it is not).

ATOD, which guides the minimum share, and ATOD2, which guides the maximum share, run at different rates. ATOD runs closer to the rate of CPU time and ATOD2 closer to the rate of wall clock time. (So be careful in an LPAR, where the two can differ greatly.) Dedicating processors changes things, in that the resources available to the other guests are reduced.

Undesirable Features (Known Requirements)

Stuck in E-list: A user's E-list deadline is set much too far in the future (hours) in severe scenarios (this led to the SRM XSTORE knob), even when no one else wants to run. Also known as "no time off for good behavior".
Non-dormant Dormant: In highly constrained systems, users waiting on what should be a short wait (such as a page read) appear idle because the wait takes over 300 ms. The user ends up in the dormant list, making analysis more difficult and sometimes misleading. (This is scheduler "over hang", not "hang over".) Addressed to some degree in VM/ESA 1.2.2.
Additional Monitor Data: To debug complicated scheduler problems, additional data is required, such as ATOD, ATOD2, VMDLPRTY, SRMWTCPU, etc.
No control on C1ETS: There could be times when being able to bound the class 1 ETS would be helpful. For example, playing with DSPBUF settings can have a whiplash effect on C1ETS.
Virtual MP Scheduling: The share value is split evenly across the virtual processors even if they are idle. Some people would like to see this changed.

Speaker Notes

What I like about this foil are the things you do not see. The last time I presented the weak points of the scheduler, I had several items on it you do not see here. Of course, I've added a few since then as well. :-)
Stuck in E-list is a problem where some users can have their E-list deadline set much too far in the future (hours), and the scheduler does not revisit the user.
Non-dormant Dormant user occurs in very constrained environments. It results in users being put in the dormant list because the short-term wait becomes long-term. This hides "user" stats and confuses others. Addressed in VM/ESA 1.2.2, but may not be completely solved.
Having tried to understand some scheduler problems that have come in since we shipped VM/ESA 1.2.2, I've concluded that additional monitor data would be helpful. Various scheduling priorities are often listed in the monitor data, but they are meaningful only relative to ATOD, without which the data is less valuable.
Using tuning knobs like SRM DSPBUF or running your system past the point of contention can cause the class 1 elapsed time slice to dynamically adjust to an undesirable value. Being able to bound the range for C1ETS could be helpful.

As CMS Multitasking and POSIX become more common in our workloads, the desire to run with virtual MP configurations will increase. One of the drawbacks to virtual MPs is that the share is divided amongst all the virtual processors. At peak times, your POSIX and multitasking applications might require several virtual processors to run optimally. However, afterwards they could be sitting idle and reducing your effective share.

Summary

All users are looping users, the loops are just pretty big.
Methods exist to see where users are in the ride.
Some tuning knobs/switches exist to change the ride.
IBM knows there is always room for improvement in the design.
I learned a lot putting this presentation together; you can expect changes in the future.

References

VM/ESA Performance (SC24-5782)
VM/ESA 2.1.0 Performance Report (GC24-5801-00)
VM/ESA Release 2.2 Performance Report (GC24-5673-01)
VM/ESA CP Planning and Administration (SC24-5750)
VM/ESA CP Command and Utility Reference (SC24-5773)
MONITOR LIST1403 file on MAINT 194 disk.

Acronyms...

VM/ESA: Virtual Machine / Enterprise Systems Architecture
VM/XA: Virtual Machine / Extended Architecture
RTM: Realtime Monitor
VMPRF: VM Performance Reporting Facility
VTAM: Virtual Telecommunications Access Method
VSCS: VM/SNA Console Services/Support
CSL: Callable Services Library

WSC Flash

Additional Considerations for using SET SHARE LIMITHARD

The purpose of this Flash is to document information gained from customer experiences and additional testing of the limit shares capability added to the VM/ESA CP scheduler.

Background

The capability to specify limit shares was introduced with VM/ESA V1 R2.2. Since the introduction of this capability, customers as well as the IBM VM development organization have compiled information that can assist in making the most effective use of this function. Most of the information and guidelines discussed in the remainder of this Flash pertain to the SET SHARE command with the limithard option.

Deadline Scheduler

The VM/ESA scheduler is not a consumption scheduler. Rather, it is a deadline scheduler. This means that the scheduler does not track the amount of processor resources consumed in order to provide a certain amount of processor resource. Instead, the scheduler works to control access to processor resources by managing the priority associated with a virtual machine (strictly, with a virtual processor, since a virtual machine can have multiple virtual processors).

Competing Share Value

As mentioned earlier, the scheduler controls access to resources by setting priorities. Users compete against each other for priority. The target minimum share value (not the limit share) is used in computing these competing share values. For example, consider the following three users and their share settings:

      Userid    Target Min Share     Limit Share
      USERA          10%  absolute     25%  absolute limithard
      USERB          10%  absolute     30%  absolute limithard
      USERC          15%  absolute     20%  absolute limithard

Assume all three users have the same resource requirements and none of them are approaching their limits. USERC will be given a better priority because it has the best target minimum share. Users USERB and USERA will have equal priorities because of their equivalent target minimum shares, despite USERA having a lower limit share. The limit share values are not used in normal prioritization. They are only a factor when a user reaches their designated limit.
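The ordering in this example can be restated as a small Python sketch. It is not how the scheduler computes priorities; it only shows that the ranking follows the target minimum share and ignores the limit share. The function name and data layout are made up for illustration.

```python
users = {
    # userid: (target minimum share %, limit share %), all absolute limithard
    "USERA": (10, 25),
    "USERB": (10, 30),
    "USERC": (15, 20),
}


def rank_by_minimum_share(users):
    """Group userids by target minimum share, best priority first."""
    ranks = {}
    for userid, (min_share, _limit) in users.items():
        # The limit share is deliberately ignored here.
        ranks.setdefault(min_share, []).append(userid)
    return [sorted(ranks[s]) for s in sorted(ranks, reverse=True)]


print(rank_by_minimum_share(users))  # [['USERC'], ['USERA', 'USERB']]
```

USERC ranks first, and USERA and USERB tie even though their limit shares differ.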

It is also worth noting that the target minimum share is a share of system resources. This includes not only processor, but paging and storage resources as well. However, in describing the behavior of the scheduler, it is sometimes easier to focus on one resource such as the processor.

If a user is getting less than their target minimum share of processor resources, it could be because of contention for other resources. However, the other resources managed by the scheduler are not considered when applying limits. The limit share value was designed to limit consumption of processor resources only.

Low System Utilization and Limit Shares

Customer experience and additional testing have revealed that specifying limithard does not always produce the desired result when system utilization is low. This experience has shown that when system utilization is low, a user with a limithard value might be limited more than it should be. For example, on a 4-way processor, you would expect a user with a share setting of 20% ABS 20% ABS LIMITHARD to be able to get 80% of a single processor (20% of 400% = 80%). However, if there is little else going on in the system, the particular virtual machine may only receive 60% of a single processor.

The key factor in this is the low system utilization. The current limithard design is not perfect at compensating for idle processor cycles and errs to the conservative side in order to avoid allowing a user to get more than their limit.
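Because an absolute share is a share of the whole system, it helps to convert it into single-processor terms when comparing it with what a virtual machine actually receives. A minimal sketch of that conversion, with a made-up function name:

```python
def share_in_processors(absolute_pct, n_processors):
    """Absolute share of the whole system, as a percentage of one processor."""
    return absolute_pct * n_processors


print(share_in_processors(20, 4))  # 80
print(share_in_processors(50, 2))  # 100
```

The first line is the 4-way case above: a 20% limit corresponds to 80% of one processor, so a machine observed at 60% of a processor is being held about 20 points below its limit.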

Limiting Shares in a Low Utilization Environment

If your VM/ESA system environment is characterized by low utilization, and you want to limit the CPU consumption of certain virtual machines, the following guidelines should be helpful to you.

One approach to effectively using the limithard feature of the scheduler in this environment is to run background jobs at low share settings to "soak up" cycles. In this case you would set up a virtual machine to run an EXEC that loops continuously. By giving this virtual machine a very low priority, you will ensure that it does not interfere with useful work on the system. It will however provide the necessary system utilization to allow the current limithard scheduler design to function as expected.

Another alternative would be to create a virtual machine that can function as a "second level" scheduler. The code running in this virtual machine would need to constantly monitor system CPU utilization. When utilization exceeds a threshold, the code in this virtual machine could issue the SET SHARE command with limithard for virtual machines that should be limited. When utilization of the CPU drops below the threshold, the virtual machine could reissue the SET SHARE command for the particular virtual machines to remove the limit.
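The decision logic of such a "second level" scheduler might look like the Python sketch below. It only builds the SET SHARE command strings for one utilization sample. How utilization is sampled and how the commands are issued to CP are left out, and the userid, share values and threshold are hypothetical.

```python
def share_commands(utilization_pct, threshold_pct, limited_users):
    """Build SET SHARE commands for one CPU utilization sample.

    limited_users: dict of userid -> (minimum share, limit share), each a
    string such as "ABSOLUTE 20%". All names here are illustrative.
    """
    commands = []
    for userid, (min_share, limit_share) in limited_users.items():
        if utilization_pct > threshold_pct:
            # Busy system: apply the hard limit.
            commands.append(
                f"SET SHARE {userid} {min_share} {limit_share} LIMITHARD")
        else:
            # At or below the threshold: remove the limit.
            commands.append(f"SET SHARE {userid} {min_share} NOLIMIT")
    return commands


limited = {"BATCH1": ("ABSOLUTE 20%", "ABSOLUTE 20%")}  # hypothetical user
print(share_commands(85, 50, limited))
print(share_commands(30, 50, limited))
```

A practical version would probably remember the last state for each user and issue a command only when the state changes, rather than on every sample.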

A final alternative would be to use LIMITSOFT. The algorithms associated with LIMITSOFT are implemented in such a way that they are not affected by low system CPU utilization.

Conclusion

In general, limithard has been effective for most customers that have implemented it. Most VM/ESA systems do not run in a low utilization environment, and thus the characteristics noted above have not been observed. However, the behavior of limithard would be of concern for service bureau and similar environments where processor resources are being managed to control contracted services or performance expectations. VM/ESA Development is aware of the desire to extend the usefulness of limithard to include low utilization environments. To this end, a requirement has been opened to enhance the scheduler in this area. However, a change to the scheduler of this nature will not be made in the service stream.