SFS Performance Management Part II: Mission Possible

SHARE Winter 1995
Bill Bitner
VM Performance Evaluation
IBM Corp.
1701 North St.
Endicott, NY 13760
(607) 752-6022
Internet: bitner@vnet.ibm.com
Expedite: USIB1E29 at IBMMAIL
Bitnet: BITNER

(C) Copyright IBM Corporation 1995, 1997 - All Rights Reserved


Table of Contents

DISCLAIMER
Trademarks
Introduction
A Quick Review
SFS Strengths
SFS vs. Minidisk
Estimated Processor Requirements
Estimated Real Storage Requirements
Access Performance - File Control
VM data spaces
Access and Storage
Agents
Counter Refresher
Monitoring Agents
Detailed Agent Information
File Pool Requests
Server Utilization
Checkpoint Processing
Catalogs
Data Space Usage
Additional Pearls
SFS Progress
References
Acronyms...


DISCLAIMER

The information contained in this document has not been submitted to any formal IBM test and is distributed on an "As is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environment do so at their own risk.

In this document, any references made to an IBM licensed program are not intended to state or imply that only IBM's licensed program may be used; any functionally equivalent program may be used instead.

Any performance data contained in this document was determined in a controlled environment and, therefore, the results which may be obtained in other operating environments may vary significantly.

Users of this document should verify the applicable data for their specific environment.

It is possible that this material may contain reference to, or information about, IBM products (machines and programs), programming, or services that are not announced in your country or not yet announced by IBM. Such references or information must not be construed to mean that IBM intends to announce such IBM products, programming, or services.

Permission is hereby granted to SHARE to publish an exact copy of this paper in the SHARE proceedings. IBM retains the title to the copyright in this paper, as well as the copyright in all underlying works. IBM retains the right to make derivative works and to republish and distribute this paper to whomever it chooses in any way it chooses.

Should the speaker start getting too silly, IBM will deny any knowledge of his association with the corporation.


Trademarks

  • The following are trademarks of the IBM corporation:

    • DFSMS

    • IBM

    • VM/ESA

Introduction

  • Follow-on to the Introduction to SFS Performance Management presentation.

  • Topics

    • Quick Review

    • SFS comparison with minidisk

    • Access performance

    • SFS and VM data spaces

    • Agents

    • Counters and File Pool Requests

Speaker Notes

This presentation assumes the attendee has seen the Introduction to SFS Performance Management presentation or has equivalent experience. A little time will be spent in review, but knowledge of the basics is assumed.

Last update February 3, 1997

The speaker notes were added to enhance this presentation; however, they were not created by a professional writer, so please excuse any grammar and typos. Any suggestions or corrections are appreciated.

I'd also like to acknowledge several people who helped pull this presentation together. Many sections in this presentation are based on questions that have come up on various on-line forums. The experts that have helped answer those questions are the following:

  • Ed Bendert
  • Wes Ernsberger
  • Sue Farrell
  • Scott Nettleship
  • Butch Terry

A Quick Review

  • Good planning is a must.

  • Preventative Tuning

    • CP tuning considerations

    • CMS tuning considerations

    • DASD placement

    • VM data spaces

    • Multiple file pools

  • Monitor Performance to establish baseline

    • Monitor Data

    • QUERY FILEPOOL REPORT

  • Check out the documentation

    • SFS and CRR Planning, Administration, and Operation

    • Performance

    • CMS Application Development Guide

Speaker Notes

While many of us resist this approach, it really is very important to read all the directions before putting your SFS file pool together. In few areas of VM/ESA have I seen preventative tuning pay off as well as it does with SFS. In particular, DASD placement is key. CP tuning is the traditional tuning that is done for server virtual machines, such as SET SHARE, SET QUICKDSP, SET RESERVE, and so forth. We do not often think of tuning CMS, but there are key start-up and configuration parameters that can be of interest. These include the SFS file cache buffer, the USERS parameter, and the use of shared segments.

Two other areas of particular interest to investigate are whether multiple file pools are required for performance or other reasons and whether the use of VM data spaces is feasible.

As was stressed in the SFS introduction presentation, a lot of good information exists in the VM/ESA library. The key books are listed above; for a complete listing, see the reference list at the end of this presentation.


SFS Strengths

  • file-level sharing

  • DASD space management

  • improved file referencing

    • hierarchical directories

    • direct file referencing

    • file aliases

  • file-level security

  • distributed (remote) file access

  • high-level language callable API (CSL)

  • data integrity through workunit concepts

Speaker Notes

Before entering the debate of minidisk versus SFS performance, I thought it would be of value to take a look at some of the SFS strengths. Most are functional strengths, but performance is definitely a part of many of these items. "File-level sharing" is what SFS provides (hence the name). The ability to manage space in pools makes for more efficient and easier DASD space management. When SFS is used in conjunction with DFSMS, you have a very powerful facility. As will be discussed later, 'access' of a directory is performance sensitive. The various features of SFS file referencing can be used to minimize the number of files on a directory that are accessed, or to avoid accessing them altogether. Just as there is file-level sharing, security also exists at that same level. The callable services library (CSL) provides various services to work with SFS, including the ability to make most of the calls asynchronously. The concept of workunits will be covered in more detail later. Logical units of work allow one to process work and have it committed or rolled back as appropriate. SFS participates in CRR (coordinated resource recovery), which provides synchronization of committing work over multiple resources.


SFS vs. Minidisk

  • Processor Requirements

    • Increase with SFS in proportion to file operations.

    • For the IBM FS8F workload processor time per command increases 16%.

  • Real Storage Requirements

    • There is a base per-user increase in storage requirements for using SFS

    • Increase can be minimized or reversed by exploitation of SFS file referencing capabilities and VM data spaces.

  • I/O Requirements

    • Similar

Speaker Notes

One has to be careful in making comparisons between minidisk and SFS. They each have strengths. With that said, let's look at the key system resources. Because of the added function and structure of SFS, there is an increase in processor requirements for doing file functions with SFS instead of minidisk. As we'll see later on, data spaces can help minimize this and other resource requirements. The processor requirements increase is proportional to the file activity. For our CMS interactive workload the processor time per command increase is 16% when going from a minidisk to an SFS environment. The SFS environment contains both file control and directory control directories.

There can also be an increase in real storage requirements. Storage is required to run the server machines and for data areas in the end-user virtual machines. Unlike processor requirements, the storage increase is related to the number of SFS users and not as sensitive to file activity.

Overall I/O requirements are similar. One beneficial characteristic of SFS is that control data and content data are kept separate. That can provide benefits in terms of caching by CP.


Estimated Processor Requirements

  • Proportional to file I/O

    • Minidisk I/O is mostly diagnose x'A4' and x'A8'.

    • SFS server counter for I/O

  • FS8F workload does approximately 13.0 mdisk I/Os per million instructions for minidisk configuration.

  • Moving 3.7 of those I/Os to SFS filecontrol directories results in the processor usage increasing about 20% per command.

  • When I/O is moved to dircontrol directory exploiting data spaces, the processor usage increase is close to 0% for those I/Os.

CPU/CMD increase = 6% * IO/MI moved to filecontrol

Speaker Notes

Using the fact that the processor requirements are proportional to file I/O activity and using existing data, you can estimate the impact. For the IBM CMS interactive workload FS8F, there are approximately 13.0 mdisk I/Os per million instructions executed on the system. When 3.7 of those I/Os are moved to SFS file control directories, the processor usage increased about 20%. The mdisk I/O can be determined by looking at the diagnose A4 and A8 rate from RTM/ESA or from the virtual DASD I/O rate from monitor data. Processor requirements for dircontrol directories using data spaces are similar to minidisks.

The rule of thumb comes down to a 6% increase for each minidisk I/O per million instructions moved to a filecontrol directory. Recall that rules of thumb are usually just starting points and seldom accurate enough to write performance guarantees against.

Estimated Processor Requirements ...

  • Example -

    • 10 MIP processor at 75% utilization does 50 A4+A8s /second.

    • 50 / (10*.75) = 6.7 I/Os / million instructions

    • What will 2 I/Os moving to filecontrol SFS cost?

    • CPU/CMD increase = 6% * 2 = 12%

  • Use DASD I/Os (plus MDC) to a volume to compute how much I/O is moving.

  • Even when all data is put into SFS, most I/O is still to the S and Y disks.

  • Don't forget I/O to temporary disk or virtual disk in storage.

  • Exploitation of SFS features can further reduce the overhead.

Speaker Notes

For an example, assume a 10 MIP processor is running at 75% utilization and does a total of 50 x'A4' and x'A8' diagnoses a second. (I know we hate talking about MIPS, but this is just a rough estimate, okay?) This example system would do roughly 6.7 mdisk I/Os per million system instructions.

   50 I/Os per second       50
   -------------------  =  ----  =  6.7  I/Os per Million Instructions
   10 MIPS  * .75           7.5
So in this example, the processor increase per command would be about 2 * 6% = 12%.

It is interesting to note that even when we move as much as we can to SFS, the FS8F workload still does a great deal of minidisk I/O. This remaining minidisk I/O is from S-disk, Y-disk, temporary disk, and virtual disk in storage. To determine how much I/O is moved, look at the I/O counts to volumes being moved. Do not forget to include I/Os satisfied from minidisk cache.
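To apply the rule of thumb without reaching for a calculator, the short REXX sketch below does the same arithmetic as the example above. The three input values are hypothetical placeholders; substitute your own measurements (the I/O rate being moved would come from the DASD I/O counts, plus minidisk cache hits, as described above).

/* SFSCPU   Rule-of-thumb estimate of the CPU/CMD increase when       */
/* minidisk I/O is moved to SFS filecontrol directories.              */
/* The three input values are hypothetical; use your own numbers.     */
mips     = 10          /* processor speed in MIPS                     */
util     = 0.75        /* processor utilization (0 to 1)              */
moverate = 15          /* mdisk I/Os per second being moved to SFS    */

iopermi  = moverate / (mips * util)  /* I/Os per million instructions */
increase = 6 * iopermi               /* 6% per I/O per MI moved       */
say 'I/Os per million instructions moved:' FORMAT(iopermi,,1)
say 'Estimated CPU/CMD increase:         ' FORMAT(increase,,1)'%'
exit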


Estimated Real Storage Requirements

  • User Independent - Approximately 1800 pages

    • file pool servers

    • CRR server

  • Per SFS User - 4 pages

  • It Depends

    • Start-up parameters and other tuning

    • number of concurrent active workunits

    • use of two phase commit

    • use of data spaces

    • The biggest factor in storage can be FSTs (file status control blocks) and how they are handled.
1800 pages + 4 pages per user

Speaker Notes

The rule of thumb is 1800 pages plus 4 pages per user connected to a file pool. There are many factors that come into play in looking at storage requirements. The biggest may be how FSTs are handled. We will look at FSTs in detail on a later foil.

Storage for the file pool servers really depends on file activity, the number of users, start-up parameters, and other factors. The CRR recovery server is fairly passive in environments that do not involve two-phase syncpoints (multiple file pools being updated inside a single work unit).

For each SFS user, plan on an additional 4 pages. This is made up of storage for the work unit structures, additional APPC/VM usage, the SFS file cache, and additional server control blocks. This number can be impacted by the size of the SFS file cache and the number of active workunits per user.

Tuning can affect this for better or worse. Consider things like the SFS file cache, saved segment for VMLIB, CMSFILES segment for servers, or xxxBUFFERS setting in the server machine.
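For example, consider a hypothetical file pool with 500 connected users. The rule of thumb works out to:

   1800 pages + (4 pages per user * 500 users)  =  3800 pages  (about 15MB)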


Access Performance - File Control

  • First access by any user - slower because file pool server has not cached any of the needed information yet.

  • First Access by user - build structure for all files on the directory requires getting all the information from the server.

  • Proportional to the number of files on the directory.

  • Proportional to the number of files in the directory that require authorization checking (files you do not own and that are not public).

  • Directory information in the end user is updated when required so that changes are reflected.

Speaker Notes

Performance of the ACCESS command can be important. The first access of a given directory by any user on the system is normally the slowest one. That access causes the applicable catalog data to be cached in the server's catalog buffer pool (as much as will fit - the size is governed by the CATBUFFERS start-up parameter). The presence of minidisk caching can also contribute to faster access of that directory by subsequent users. Access is faster when the files on the directory are ones that you own or that are public. For others, the access time is related to the number and types of authorizations present.

Only on first access by a given user is there a trip to the server to get the file information. Once that first access is done, the results are cached in the accessing user's virtual address space. This is retained even if that directory is subsequently released and reaccessed. The cached information is given up if the user virtual machine gets short on storage or if it is reset (e.g. IPL CMS).

Access time is proportional to the number of files on the directory being accessed. This is due to having to obtain the data from the file pool server to build the in-storage information (FSTs) for each file. The file referencing features of SFS can avoid the access altogether (direct referencing) or minimize the number of files accessed (aliases and hierarchical directories).
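As an illustration of direct referencing, many CMS commands accept a fully qualified SFS directory name (dirid) in place of a filemode letter, so individual files can be referenced without issuing ACCESS at all. The file pool and directory names below are hypothetical:

     listfile * exec vmsysu:bitner.projects.tools
     xedit    budget data  vmsysu:bitner.projects
     copyfile budget data  vmsysu:bitner.projects  =  =  a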

Requests for updates to directory information occur when (1) CMS explicitly sends a request to the server asking for updates, which it does when it believes it needs them, or (2) the server sends updates along with the response to some action that caused a trip to the server. Starting with CMS 7, support was added to CMS to have the file pool server asynchronously notify CMS when there are changes to an accessed directory. Certain places in CMS then notice that there has been a change and go ask the server for the updates before processing the current request.

Note that the above is for file control directories. Directory control directories have a consistent view from access to release, and therefore the directory update process does not apply. Following foils will describe the difference for directory control (dircontrol) directories where VM data spaces are exploited.


VM data spaces

  • Concept

               SFS            Directory               End User
              Server
          +----------+      +------------+          +----------+
          |          |      | Data Space |          |          |
          |   CMS    |------|            |----------|   CMS    |
          |          |      +------------+          |          |
          |          |                              |          |
          +----------------------------------------------------+
          |                         CP                         |
          +----------------------------------------------------+

  • Usage considerations

    • Most benefit from highly used shared R/O or read-mostly data

    • Group updates to minimize multiple versions

    • Users should run in XC mode for most benefit

    • Separate R/O from R/W directories in different file pools.

VM data spaces...

Performance advantages

  • Relative to SFS without data spaces

    • CMS retrieves data from shared virtual storage (more efficient than server reading from DASD for each user)

    • Communication overhead with server eliminated

    • XC mode users:

      • get data directly from data space

      • FSTs in data space (shared). This can help:
        • reduce real storage requirements
        • CMS initialization
    • 370 and XA mode users:
      • get data from data space by asking CP
      • FSTs in user storage (not shared)
  • Relative to minidisk

    • Performance similar to minidisk with minidisk caching

    • Shared FSTs (without manual management or 16M limitations)

Speaker Notes

The server (logically) puts the directory into a VM data space, and the user virtual machine takes the data from the VM data space.

The benefit of data spaces is based on the degree of sharing. They provide a great benefit in user virtual storage, since the FSTs are shared among accessing users, and in I/O, since the data is moved from the data space without a trip to the server.

Grouping updates will minimize the likelihood of having multiple versions in data spaces. (discuss ACCESS to RELEASE consistency here). Having users run in XC mode is how the previously stated benefits are achieved. Remote users obviously do not have access to the data spaces. The file pool server can use the data spaces on behalf of the remote user, but network performance tends to be the significant player in remote performance.

Separate servers are suggested for R/O data because 1) there is less scheduled down time for R/O and 2) the multiple-user rules (discussed later) do not apply.

The benefit of data spaces is based on the degree of sharing. Not only will exploitation of VM data spaces minimize expensive server requests, but it will allow a single copy of data to be shared among several users. This can be a significant boost for storage constrained systems.

Performance is similar to that of read-mostly minidisks with minidisk caching. There are measurements that show both ends of the spectrum; it is dependent on workload and storage constraint.
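As a sketch of what exploitation looks like administratively (the file pool and directory names are hypothetical, and command options are abbreviated), the directory is created as dircontrol and then made eligible for data spaces; the server's CP directory entry must also contain an XCONFIG ADDRSPACE statement, as shown later in this presentation:

     create directory vmsysu:projdata.tools (dircontrol
     dataspace assign vmsysu:projdata.tools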


Access and Storage

   Style              Storage    less 16MB
                      Impact     limit?

   Mdisk w/o SAVEFD   Yes        Yes
   SFS filecontrol    Yes        Yes
   Mdisk w/ SAVEFD    No         Yes
   SFS data space     No         No

  • SFS FSTs slightly larger than minidisk

  • Benefits of managing SFS dircontrol directories over SAVEFD

  • SFS FSTs in a data space must still be under 16MB line, but are not in base address space

  • 2000 files for 1000 users on minidisk, SAVEFD saves almost 122MB

Speaker Notes

When a minidisk is accessed, the FSTs require 64 bytes per file. SAVEFD can be used to put these in a saved segment, which can be shared. For the example on the foil, 2000 files at 64 bytes each is 128KB per user; for 1000 users that is roughly 122MB of storage saved. For minidisks that are always accessed by every user on the system, it is simple to see that SAVEFD should be used for the read-only disks.

An SFS FST is slightly larger than a minidisk FST; how much larger depends on whether it is a dircontrol or filecontrol directory. For dircontrol directories, using a data space makes a lot of sense. The management of it is simple, and you can save storage much as SAVEFD does for minidisks. In addition, the SFS FSTs in this case are in a separate data space instead of the primary address space of the end-user virtual machine. This leaves more virtual storage below the 16 meg line for user applications or other saved segments.


Agents

  • dispatchable tasks in file pool server

  • typically associated with user work, mapping logical units of work in progress.

  • Number of agents determined by USERS value start-up parameter
    • number_of_agents = 4 + truncate(USERS/8)
  • A single user can be associated with multiple agents if multiple workunits are active.
  • If insufficient number of agents, work gets queued up. This is bad.
  • If too many agents exist, storage may be wasted.
  • Better to make number too big than too small!
  • Monitor the Active Agents Highest Value counter from QUERY FILEPOOL or the corresponding value from monitor data.

Speaker Notes

The first time I ever mentioned the term "agents" in describing SFS performance, I got this funny look like I was talking about something mysterious or even sinister. Well, these agents are agents of good. Requests made of the file pool server become associated with an agent, and the dispatcher within the file pool server dispatches these agents. The number of existing agents is determined by the USERS start-up parameter, which is in the DMSPARMS file. The chief use of the USERS value is for computing the number of agents, but it is also used to determine other values (such as CATBUFFERS) that are not explicitly set. The formula is:

     agents =  4 + TRUNCATE(USERS / 8 )
The USERS value should be the number of logged-on SFS users expected during peak system activity. For example, USERS 240 yields 4 + TRUNCATE(240/8) = 34 agents. Peak activity can change over time, so this should be monitored.

Counter Refresher

  • Available through Monitor or QUERY FILEPOOL commands

  • Snapshot of running counters

  • VMPRF, FCON, VMPAF, and other vendors support monitor data

  • Execs
/* QREFRESH Query SFS Refresh directory req rate */
/* note Refresh Directory Request is 61st line   */
'PIPE cms q filepool counter | STEM counter1.'
JUNK = TIME('R')
'CP SLEEP 60 SEC'
'PIPE cms q filepool counter | STEM counter2.'
elapsed = TIME('R')
rrate  = (WORD(counter2.61,1) - WORD(counter1.61,1) ) / elapsed
say rrate  'dir refresh requests per second'
exit

Speaker Notes

The SFS counter information is available either from QUERY FILEPOOL commands or from Domain 10 (APPLDATA) monitor data. The majority of this data consists of accumulating counters of requests, time, or I/Os. Looking at the counters at a single point in time is not very meaningful. Therefore, we typically take two snapshots, compute the delta between the two times for the various counters, and analyze that data. Various performance products do this work for you to differing degrees. We will refer to this method several times throughout the rest of this presentation.


Monitoring Agents

  • Good indication of file pool performance

  • QUERY FILEPOOL or Monitor gives:

    • Agent Holding Time

    • File Pool Request Service Time

  • Held agents

    • Agents associated with a logical unit of work.

    • Agent Holding Time / Elapsed Time

  • Active agents

    • Agents currently doing work

    • File Pool Request Service Time / Elapsed Time

  • Typical ratio of held to active is less than 10.

  • Watch for applications remaining in an LUW without doing work.

Speaker Notes

The SFS server has these agents to get the work done. When a logical unit of work is started, an agent becomes associated with that unit of work. It is "held" by the unit of work until the logical unit of work is either committed or rolled back. An active agent is an agent busy doing work in the server on behalf of a file pool request. Typically an increase in agents in use (held) or busy agents (active) is an indication that SFS work is increasing or that it is taking longer to complete the work. Note that these are two different cause/effect pairs.

An analogy can be made to VM systems in general. For a given VM system, you have a number of users logged on and a number of users active. If you add users (and work) to a system, you tend to increase the logged-on and active user counts. The same effect can occur if you change the system in a different manner. If you move the system to a slower processor, it will take longer to process commands and transactions; therefore, you are likely to increase the active and logged-on user counts this way as well.

While agents are typically associated with user work, there are internal functions in the file pool server that run as agents as well.
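The held and active agent averages described above can be estimated with the same two-snapshot technique used by the QREFRESH exec shown earlier. The sketch below matches the counter lines by name rather than by line number; the counter labels and the assumption that the time counters are in milliseconds should be verified against your own QUERY FILEPOOL output.

/* AGENTUSE  Estimate average held and active agents over an interval */
/* Sketch only: counter labels and millisecond units are assumptions; */
/* check them against QUERY FILEPOOL COUNTER output on your system.   */
'PIPE cms q filepool counter | STEM before.'
junk = TIME('R')
'CP SLEEP 300 SEC'
'PIPE cms q filepool counter | STEM after.'
elapsed = TIME('R')

held   = delta('AGENT HOLDING TIME')             / (elapsed * 1000)
active = delta('FILE POOL REQUEST SERVICE TIME') / (elapsed * 1000)
say 'Average held agents:  ' FORMAT(held,,2)
say 'Average active agents:' FORMAT(active,,2)
exit

delta: procedure expose before. after.    /* change in a named counter */
  arg label
  do i = 1 to after.0
    if POS(label, TRANSLATE(after.i)) > 0 then
      return WORD(after.i,1) - WORD(before.i,1)
  end
  return 0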


Detailed Agent Information

  • QUERY FILEPOOL AGENT

  • Agents are either "User" or some system type

  • Status for User is typically "Read", maybe "Write", for log information.

  • Various "wait" values exist.
                           SERVER8  File Pool Agents
 
Start-up Date 02/18/95                            Query Date 02/18/95
Start-up Time 06:17:34                              Query Time 13:52:21
========================================================================
AGENT INFORMATION
 
         66  Total Number of Agents
         11  Active Agents Highest Value
          4  Current Number of Agents
 
Userid   Type        Status   Agent Number Wait           Uncommitted Blks
CHECKPT  Chkpt       Inact           2     I/O                        0
BITNER   User        Read            4     None                       0
DEVO1    User        Read           10     Communication              0
BITMAN   User        Read           14     I/O                        0

Speaker Notes

The detailed information on agents can be useful when doing problem determination, but it is seldom reviewed in normal monitoring activity. The Type column will indicate 'User' unless the agent is in use for some file pool server specific task. For 'User' agents, the Status is usually 'Read' or 'Write', which indicates whether any log information has been written. In the 'Wait' column, the most common values are 'I/O', 'Communication', and 'None'. 'Communication' could mean the agent is held and the server is waiting for the next request from the user. Large numbers of agents in less common wait values (such as ESM_Wait) should be investigated.

The 'Uncommitted Blks' column is the number of SFS file blocks used by an agent that have not yet been committed or rolled back. Very high numbers here are an indication of a runaway application.


File Pool Requests

  • File Pool Request = basic unit of processing in server.

  • Normalize usage to file pool requests

    • Elapsed time (File Pool Request Service Time)

    • CPU time (available from other monitor data)

    • Lock time (Lock Wait Time)

    • ESM time (Security Manager Exit Time)

    • Block I/O time (BIO Request Time)
 PRF083  Run 02/17/95 18:43:14         SFS_BY_TIME
                                         SFS Activity by time
 From 02/17/95 08:02:08
 To   02/17/95 16:57:08
 For  32100 Secs 08:54:59                Bill Bitner looking at GDLVM7
 
                                    <-----Time Per File Pool Request---->
 
 
 From  To                FPR    FPR                     Block
 Time  Time  Userid    Count   Rate  Total    CPU  Lock   I/O   ESM Other
 08:02 16:57 CALSERV   93863  2.924  0.044  0.011 0.000 0.035     0 0.001
 08:02 16:57 EDLSFS   516041 16.076  0.012  0.002 0.000 0.010     0 0.000
 08:02 16:57 EDLSFS1  354268 11.036  0.009  0.002 0.000 0.006     0 0.001
 08:02 16:57 EDLSFS2  261827  8.157  0.076  0.042 0.031 0.006     0 0.003

Speaker Notes

File pool requests are a good unit to use for SFS throughput or as a transaction rate. File pool requests use various resources in the server and incur different delays. By normalizing the time for various functions to the file pool requests, you can get a breakdown of the components making up the service time of an SFS request. Note that the 'Other' bucket is what remains of the 'Total' column after the other components are subtracted. The VMPRF PRF083 SFS_BY_TIME report shown above illustrates how this data can be viewed.

FCON/ESA also provides this type of breakdown in its 'Shared File System Server Screen'.
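A rough version of this breakdown can also be produced without VMPRF or FCON/ESA by dividing the cumulative time counters by the cumulative request count from a single QUERY FILEPOOL snapshot, giving averages since server start-up rather than for a chosen interval. As before, the counter labels and the assumption that the times are in milliseconds should be verified against your own output.

/* FPRTIME  Average time per file pool request since server start-up  */
/* Sketch only: counter labels and millisecond units are assumptions. */
'PIPE cms q filepool counter | STEM ctr.'

fpr = counter('TOTAL FILE POOL REQUESTS')
if fpr = 0 then do; say 'No file pool requests yet'; exit; end
total = counter('FILE POOL REQUEST SERVICE TIME') / fpr
lock  = counter('LOCK WAIT TIME')                 / fpr
esm   = counter('SECURITY MANAGER EXIT TIME')     / fpr
bio   = counter('BIO REQUEST TIME')               / fpr
say 'Msec per file pool request:  total' FORMAT(total,,1),
    ' lock' FORMAT(lock,,1) ' ESM' FORMAT(esm,,1) ' block I/O' FORMAT(bio,,1)
exit

counter: procedure expose ctr.            /* value of a named counter */
  arg label
  do i = 1 to ctr.0
    if POS(label, TRANSLATE(ctr.i)) > 0 then return WORD(ctr.i,1)
  end
  return 0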


Server Utilization

  • Server Serialization:

    • Running (Processor time from other monitor data)

    • Checkpoint (Checkpoint Time)

    • Page fault resolution (other monitor data - user state)

    • Control data back up (QSAM Time)

  • Asynchronous I/O except for above
 PRF083  Run 02/17/95 18:43:14         SFS_BY_TIME
                                         SFS Activity by time
 From 02/17/95 08:02:08
 To   02/17/95 16:57:08
 For  32100 Secs 08:54:59                Bill Bitner looking at GDLVM7
 
                                     <-------Server Utilization------->
 
 
 From  To                FPR    FPR                  Page Check-
 Time  Time  Userid    Count   Rate   Total    CPU   Read  point   QSAM
 
 08:02 16:57 CALSERV   93863  2.924     3.4    3.3    0.0    0.0      0
 08:02 16:57 EDLSFS   516041 16.076     3.2    2.9    0.2    0.1      0
 08:02 16:57 EDLSFS1  354268 11.036     3.5    2.6    0.2    0.1    0.5
 08:02 16:57 EDLSFS2  261827  8.157    34.4   34.3    0.1    0.0      0

Speaker Notes

Previously we looked at the breakdown of file pool request service time. The file pool server can be processing several file pool requests at any given time because of its exploitation of asynchronous I/O and communication. However, there are some tasks that serialize the file pool server, and it is important to understand these. By utilizing SFS counters and other monitor data, we can create a breakdown of the server's time.

Like any other resource, SFS server utilization cannot exceed 100%. The higher the utilization, the more contention there will be between different SFS agents. Therefore, it can be valuable to monitor the SFS server utilization.

At any given time, the server can be running on a processor for file pool requests, waiting for page fault resolution, performing checkpoint processing, or waiting for QSAM (backup I/O). Checkpoint and control data backup (except when the backup is done to another SFS file pool) are functions that serialize the server. SFS is also serialized by page fault resolution.

As a guideline, server utilization of less than 50% should not be a concern. When attacking utilization problems, the largest component is often where the most improvement can be found. The Performance manual and the SFS Introduction presentation show techniques for this approach. It is possible that high file pool server utilization is an indication that the file pool should be split into two file pools.

The VMPRF PRF083 SFS_BY_TIME report does this with its Server Utilization section. FCON/ESA also provides this type of information in the SFS Server Details screen.


Checkpoint Processing

  • Should normally be less than 4 seconds.

    • Checkpoint Time / Checkpoints Taken

  • Long checkpoint time affects mostly response time, not resource consumption

  • Longer checkpoint time from

    • Too few control buffers

      • Control Minidisk Blocks Read / Total File Pool Requests should be less than 0.005

    • Poor I/O performance

    • Too many changed catalog buffers

Speaker Notes

Checkpoint processing is an internal SFS file pool server operation during which the changes recorded on the log minidisks are permanently made to the filepool. I think of it like balancing the checkbook. By doing checkpoint processing, if SFS is asked to recover changes it only has to go back to the last checkpoint on the log. Checkpoint processing is started after a certain number of log blocks have been written.

Checkpoint processing serializes the server and can impact response time. Since the resources used during checkpoint processing are relatively low, most checkpoint problems affect response time instead of resource usage. From the QUERY FILEPOOL or monitor data, one can calculate the checkpoint processing time.

Factors in checkpoint processing time include the number of control data buffers, I/O performance, and the number of changed catalog buffers. Having sufficient control data buffers helps checkpoint processing; insufficient control data buffers is the most common reason for long checkpoint times. Since checkpoint processing involves significant I/O to the control minidisks and the storage group 1 (catalog) minidisk, a poorly performing I/O configuration will affect checkpoint time. The more catalog buffers that have been changed, the more information needs to be written to disk. After changes in VM/ESA R1.1, the number of modified catalog buffers is not as significant due to pre-flushing.
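As a worked example of the two indicators above (the counter values are hypothetical, and the Checkpoint Time counter is assumed to be in milliseconds): suppose that over a measurement interval Checkpoint Time grows by 27,000 while Checkpoints Taken grows by 9, and Control Minidisk Blocks Read grows by 200 against 100,000 Total File Pool Requests. Then:

   27,000 ms / 9 checkpoints  =  3.0 seconds per checkpoint   (under the 4 second guideline)
   200 / 100,000              =  0.002 blocks per request     (under the 0.005 guideline)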


Catalogs

  • Fragmented Catalogs

    • Watch for increase in catalog blocks read per file pool request

    • Reorganize the catalogs using FILESERV REORG command

  • Catalog buffers (CATBUFFERS)

    • Trade off between I/O and storage

    • Catalog Blocks Read / Total DASD Block transfers should be between 0.20 and 0.25.

Speaker Notes

Good catalog performance is necessary since this is where information on authorizations, directory structure, aliases, and so on is kept. The tuning knob you have available is the CATBUFFERS setting, which controls the number of catalog buffers. If allowed to default, the value is computed based on the USERS start-up parm. An increase in catalog I/Os can be caused by fragmentation of index information; a symptom is an increase in catalog blocks read per file pool request. If this occurs, one should plan to reorganize the catalogs using the FILESERV REORG command.

The catalog buffer setting (CATBUFFERS) presents a performance trade off. If set too low, more catalog I/O will be required which could result in high block I/O time. If set too high, paging could result from additional storage requirements. Therefore, this value should be set with consideration of system constraints. Note that for special processing, such as restoring control data, you might want to temporarily increase the CATBUFFERS value significantly.
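For example (hypothetical counts): 20,000 Catalog Blocks Read against 90,000 Total DASD Block Transfers over the same interval gives

   20,000 / 90,000  =  0.22

which is within the 0.20 to 0.25 guideline. A ratio well above this range suggests too few catalog buffers (extra catalog I/O), while a much lower ratio may mean catalog buffer storage could be given back to the system.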


Data Space Usage

  • A separate file pool server is recommended

  • File pool server should have very little activity

    • Only activity is for initial handshaking

    • Activity is a sign that something is wrong

  • Users not accessing as R/O

  • Number of data spaces available exhausted

    • CP directory XCONFIG statement determines amount of storage and number of data spaces.

    • Monitor data or CP IND USER server EXP gives number of data spaces

    • SFS QUERY ACCESSORS command with DATASPACE option

    • Check for multiple copies of directories

  • Storage resources used can be determined by monitor data or CP INDICATE USER userid EXP and CP INDICATE SPACE USER userid commands.

Speaker Notes

Since an R/O file pool will have less need for planned down time and less cause for unplanned down time, a separate server machine is recommended. Also, the capacity limits associated with R/W file pools do not apply here. An R/O file pool using data spaces should have little file pool activity; therefore, if a significant number of file pool requests are being made, something is wrong. Scenarios that would cause normal file pool requests to be made include: the directory being configured as filecontrol instead of dircontrol, users accessing the directory as R/W, use of CSL and direct file referencing without accessing the directory, access from remote users, or the server not being in XC mode.

The other main cause of not using data spaces is that the server has exhausted the number of available data spaces. This is set by the CP directory statement

XCONFIG ADDRSPACE MAXNUMBER nnnnn TOTSIZE nnnnG SHARE
Most people set the TOTSIZE to 8192G (the maximum) and control the usage with the MAXNUMBER value, which is the maximum number of data spaces the server can define.

Monitor data or the CP INDICATE USER EXP command will show the number of data spaces the server has. To determine what those spaces are and who has access to the various levels of the directories use the SFS QUERY ACCESSORS command with DATASPACE option.
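The checks described above come down to a few commands, illustrated here with a hypothetical server name and directory name; substitute your own:

     cp indicate user rosfs1 exp
     query accessors vmsysu:projdata.tools (dataspace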


Additional Pearls

  • Control data backups should be structured to avoid prime shift if possible.

    • Watch for spikes of high QSAM (serial I/O ) time

    • Doing backups to another filepool is an alternative

    • Check the size of the log disks; increasing their size lowers the frequency of control data backups.

  • Use DMSFILEC or the COPYFILE command to move data from one file to another file in the same file pool, instead of using a minidisk in between.

Speaker Notes

The overhead and serialization associated with control data backups can seriously impact performance. Therefore, scheduling control data backups during off-peak hours can be useful. If this is not possible, directing the backups to another file pool allows the I/O to be done asynchronously and minimizes the serialization. Very small log disks may result in control data backups being kicked off more frequently than is acceptable. While there is a guideline for the log disk size in the file pool administration book, your mileage may vary. There is really no major downside to having the log disks too big (other than wasted DASD space and perhaps slightly longer start-up times).

SFS is smart enough to recognize COPYFILE requests between two files in the same file pool. In this case, all the work is done on the server side instead of moving all the data from the server to the end user and then back again. This pearl was missing from the Application Development Guide in past releases; we will see that it gets into the VM/ESA 2.1.0 book.
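For example (hypothetical file and directory names), a copy such as the following stays entirely within the server because both the source and the target are in the same file pool:

     copyfile budget data vmsysu:bitner.projects = = vmsysu:bitner.archive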


SFS Progress

  • VM/ESA 1.1.0
    • introduced some asynchronous function calls
    • reduction in filepool requests per command
    • improvements in locking to minimize rollbacks due to deadlock
  • VM/ESA 1.1.1
    • VM data space support
    • checkpoint improvements
    • asynchronous function calls through CSL
  • VM/ESA 1.2.0
    • checkpoint improvements
    • improved catalog insert algorithm
  • VM/ESA 1.2.1
    • improved control data backup
    • SFS thread blocking I/O
  • VM/ESA 1.2.2
    • improved handling of released file blocks
    • revoke performance
  • General improvements to APPC/VM performance over various releases.

Speaker Notes

VM/ESA 1.1.0 - The key improvements in this release were ones that allow SFS to run on larger systems.

VM/ESA 1.1.1 - Checkpoint processing occurs less frequently, thus avoiding serialization. This improved response time but had little impact on resource utilization. The asynchronous file functions added in VM/ESA 1.1.0 were miscellaneous ones; this release added the real file functions such as read/write.

VM/ESA 1.2.0 - Pre-flushing buffers and exploiting multi-block I/O improved checkpoint performance. The new catalog insert algorithm improved catalog insert performance in terms of CPU and, in some cases, I/O. The log manager I/O was also changed to use multi-block I/O, and the SFS file cache default was changed to 20KB.

VM/ESA 1.2.1 - The improvement to SFS control data backup reduces backup time and the space required, and allows for more accurate planning. The SFS thread blocking I/O changes make it more practical for CMS multitasking applications to use the SFS functions asynchronously.

VM/ESA 1.2.2 - By improving the handling of released file blocks, a scenario that could cause instances of high processor utilization was corrected. The revoke performance changes addressed problems with the overhead of revoking authority from an SFS file or directory.

The performance of APPC/VM is key to the performance of SFS file pool requests. The various improvements here have helped SFS over the past several releases.

There were no direct, major performance enhancements for SFS in releases 2.1.0 or 2.2.0.


References

  • VM/ESA Performance (SC24-5782)

  • VM/ESA Release 2.2 Performance Report (GC24-5673-01)

  • VM/ESA Release 2.1 Performance Report (GC24-5673-00)

  • VM/ESA Release 2 Performance Report (GG66-3245)

  • VM/ESA Release 1.1 Performance Report (GG66-3236)

  • VM/ESA CMS Filepool Planning, Administration, and Operation (SC24-5751)

  • VM/ESA CMS Application Development Guide (SC24-5761)

  • VM/ESA Planning and Administration (SC24-5750)

  • VM/ESA CP Command and Utility Reference (SC24-5773)

  • VM Performance Reporting Facility User's Guide and Reference (SC23-0460)

Acronyms...

CSL
Callable Services Library
FST
File Status Table
LUW
Logical Unit of Work
SFS
Shared File System
VM/ESA
Virtual Machine / Enterprise Systems Architecture
VMPRF
VM Performance Reporting Facility