SFS Performance Management Part I
February 1997 - Version 6.1
(c) Copyright IBM Corporation 1991, 1997 - All Rights Reserved
Disclaimer
The information contained in this document has not been submitted to any formal IBM test and is distributed on an "As is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environment do so at their own risk. In this document, any references made to an IBM licensed program are not intended to state or imply that only IBM's licensed program may be used; any functionally equivalent program may be used instead. Any performance data contained in this document was determined in a controlled environment and, therefore, the results which may be obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment. It is possible that this material may contain references to, or information about, IBM products (machines and programs), programming, or services that are not announced in your country. Such references or information must not be construed to mean that IBM intends to announce such IBM products, programming, or services in your country. Should the speaker get too silly, IBM will deny existence or responsibility for the speaker.
This presentation has been made over a dozen times since 1991. Except for a few changes, the bulk of the content is the same. One should not read that as a sign of lack of commitment to SFS, but as a sign of performance management being a priority since day one. Last updated February 3, 1997 (Version 6.1)
Trademarks
The following are trademarks of the IBM Corporation
Acknowledgements
My thanks to various folks for helping pull this material together.
The speaker notes were never written with the intent of including them in handouts. So if you are reading this, please keep in mind that I never took the time to do a quality job with the speaker notes. Please excuse grammar and typos. However, any suggestions or corrections are appreciated.
Overview
This presentation is geared towards the VM/ESA ESA feature, not the 370 feature, VM/SP, or VM/HPO (although some things do apply). It covers the tasks related to the performance of SFS file pool servers and is meant to take the mystery out of them. In VM/ESA 1.1.1, this information is in Chapter 20 of the CMS Planning and Administration manual. Starting with VM/ESA 1.2.0, all performance material in the VM library was consolidated into a single manual, "VM/ESA Performance", and much of the material relevant to this presentation moved there. The presentation also includes some performance tips to consider for applications that use SFS data.
SFS Structure - The presentation is really meant for those who know and understand at least the basics of SFS. However, since some folks attend out of curiosity, a few foils are provided to cover the basics and structure of SFS.
SFS Performance Management - Most of the time will be spent on this topic.
SFS Concepts
Speaker Notes
The presentation is really meant for those who know and understand at least the basics of SFS. However, since a lot of folks attend out of curiosity, a few foils are provided to cover the basics and structure of SFS. SFS coexists with the current minidisk (EDF) file system. For our purposes SFS is made up of two chunks: the code in the end user virtual machine (CMS nucleus + CSL) and the code in the server virtual machine. It is important to note that communication between the two is performed via APPC/VM with a private protocol. The figure represents SFS (without data space exploitation).
SFS Structure - Server Data
To level set on terminology we split up the SFS Server structure into 3 parts:
SFS Performance Management
This presentation is broken down into three pieces. I often use the lawn mower analogy. It is best to read the instructions when putting it together. Periodically check the fluids and replace spark plugs as necessary. When it is performing poorly, check various items and make adjustments as necessary.
Preventative Tuning
The first task, preventing problems, is what we refer to as preventative tuning. It involves following a list of performance guidelines when you are defining a new file pool or modifying an existing one. The ones I'll discuss are CP tuning, CMS tuning, DASD placement, VM data spaces, recovery, and multiple file pools. If these guidelines are followed, you usually won't have any SFS performance problems.
CP Tuning Considerations
Setting QUICKDSP ON ensures that the server will not have to wait in the eligible list for system resources to become available. For more information, refer to the CP Command and Utility Reference manual. SET SHARE RELATIVE places the server machine in a more favorable position in the dispatch queue. Why 1500? The default setting for a user is 100, and the server does work on behalf of multiple users, say 15, so we recommend 1500. This should be set in line with other server settings such as VTAM.
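The arithmetic behind the 1500 figure can be sketched in a few lines. This is only an illustration: the function names are made up for this example, RWSERV1 is borrowed from the case study as a sample server userid, and the script just builds the CP command text rather than issuing anything.

```python
# Sketch only: builds CP command text; it does not issue any commands.
DEFAULT_USER_SHARE = 100  # default relative share for a user, per the notes


def relative_share(equivalent_users: int) -> int:
    """Scale the default user share by the number of users the server stands in for."""
    if equivalent_users < 1:
        raise ValueError("equivalent_users must be at least 1")
    return DEFAULT_USER_SHARE * equivalent_users


def cp_tuning_commands(server_id: str, equivalent_users: int = 15) -> list[str]:
    """Return the two CP tuning commands discussed in the notes for one server."""
    share = relative_share(equivalent_users)
    return [
        f"SET QUICKDSP {server_id} ON",
        f"SET SHARE {server_id} RELATIVE {share}",
    ]


for command in cp_tuning_commands("RWSERV1"):
    print(command)
```

The multiplier of 15 is the example value from the notes, not a rule; pick a value in line with how your other servers are set.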
CMS Tuning Considerations
USERS tells the server how much work it should configure itself to handle. If it is specified too large, you may experience a serial page fault problem in the server, increased checkpoint duration (a long blip), or both. If it is specified too small, the server will not configure enough agents (tasking objects) to handle the incoming requests, which can cause an undesirable queueing effect. It is better to overestimate a little.
The CMS cache controls the amount of read ahead and write behind. The cache size is specified once for all users, and the cache itself resides in each user's nonshared virtual storage. Some measurements indicate a value larger than 12K would benefit most environments. This cache is for read ahead and write behind of SFS file I/O and should not be confused with minidisk cache. To change the CMS file cache size, update the BUFFSIZE parameter in the DEFNUC macro, assemble DMSNGP ASSEMBLE, and rebuild the CMS nucleus. Refer to the Service Guide and CP Planning and Administration manuals for more details. Allowable range is 1 to 28K (96K in VM/ESA 1.1.0 and above).
A CRR recovery server should be active, or a significant degradation in performance will be experienced (40% increase). Use QUERY FILEPOOL STATUS recovery: to determine if users connected.
DASD Placement
VM Data Spaces
Performance advantages
The server (logically) puts the directory in a VM data space, and the user virtual machine takes it from the VM data space. The benefit of data spaces is based on the degree of sharing. They provide a great benefit in user virtual storage, since the FSTs are shared among the users who have accessed the directory, and in I/Os, since the data is moved from the data space without a trip to the server. Grouping updates will minimize the likelihood of having multiple versions in data spaces (discuss ACCESS to RELEASE consistency here). Users must run in XC mode to get the previously stated benefits. Use separate servers because 1) there is less scheduled down time for R/O and 2) the multiple user rules (discussed later) do not apply. Not only will exploitation of VM data spaces minimize expensive server requests, but it will allow a single copy of data to be shared among several users. This can be a significant boost for storage constrained systems. Performance is similar to that of read-mostly minidisks in minidisk cache. There are measurements that show both ends of the spectrum; the result depends on workload and storage constraint.
Recovery
The following suggestions should minimize the amount of time required to restore the control data of a file pool. "Too large" refers to the number of objects (files, aliases, directories, etc.) and is relative to the restore rate. Some measurements showed a restore rate of 22Mb/min or 49000 objects/min and a redo rate of 5.3 log blocks/min. The less file pool change activity there has been since the last backup, the less time it will take to reapply the changes. SFS can do double buffering on restore when the backup is from another file pool. For a 32MB machine, try setting CATBUFFERS 5000; this will reduce the time to reapply changes to the catalog.
Multiple File Pools
There is a practical upper limit to the rate at which a server can process requests. This has been expressed in the following formula. System defined users are the CP directory entries for your system. Active users is the average number of users during peak hours who have interacted with the system during a one minute interval. This can be found using monitor output such as that provided by VMPRF: the SYSTEM_SUMMARY_BY_TIME report, USERS ACTIV column. The gating factors for this calculation are 1) involuntary rollbacks and 2) checkpoint processing. Catalogs are shared, so even if the data is unique there are locks and the potential for deadlocks. Having multiple file pools doesn't mean duplicating data.
Monitoring Performance
Speaker Notes
Overall, monitoring the performance of your system is unchanged if you use SFS. Still check the overall system indicators, and collect the SFS data shown here. Use this data for performance problem determination. Data for history and trend analysis can come from VM Monitor data. VMPRF uses some of the SFS supplied statistics and combines them with other monitor data to produce 3 different reports. VMPAF will use VMPRF summary files, so one can access all the individual counters if need be. The QUERY FILEPOOL STATUS command (or the new ones in VM/ESA 1.2.0) can be used for an immediate snapshot of the SFS server. The same counters and timers are involved.
Solving Performance Problems
Most people understand the general performance analysis process, so this shouldn't be new. SFS fits right in here; there is no need to do anything drastically different.
Confirm and Isolate the problem
To determine whether it is an SFS problem or a general system problem, compare the percentage increase in average file pool request service time to the percentage increase in average response time. Average file pool request service time is displayed in the SFS_BY_TIME VMPRF report, or it can be calculated from the QUERY FILEPOOL STATUS output by dividing File Pool Request Service Time by Total File Pool Requests. If the percentage increase in file pool request service time is much greater, then the server is probably contributing to the problem. The Symptoms/Causes table was moved to the VM/ESA Performance Manual in Release 2. Prior to that it was in the CMS Planning and Administration Guide.
Take Corrective Action
The Symptoms/Causes table will point to the page with possible corrective actions: page 169 (VM/ESA Performance Manual).
Try ONE of the possible actions.
Evaluate for effectiveness
After reading possible corrective actions, choose one (and only one at a time) and implement it. An often skipped step is the validation that the fix really worked. Now on to the case study...
Case Study - VMPRF Report (PRF006)
Before
RESPONSE_ALL_BY_TIME
Transaction Response Time and Throughput for ALL Users
               <-----------Response Time---------------->
               <---Triv--->   <--Non-Triv-->
From   To                                       Quick
Time   Time       UP     MP       UP      MP     Disp    Mean
09:24  09:54   0.163  0.000   69.095   0.000    9.158  38.635
Case Study - VMPRF Report (PRF083)
Before
SFS_BY_TIME
SFS Activity by time
                                      <---Time Per File Pool Request--->
From   To                FPR     FPR                       Block
Time   Time   Userid   Count    Rate  Total    CPU   Lock    I/O  ESM  Other
09:24  09:54  RWSERV1  22545  12.540  3.443  0.004  0.140  1.740    0  1.559
09:24  09:54  RWSERV2  21470  11.942  4.205  0.004  0.190  1.986    0  2.027

<----Server Utilization------->  <----Agents----->
               Page  Check                       Deadlocks
Total    CPU   Read  point  QSAM  Active   Held      w/ RB
75.29   5.47  60.38   9.44  0.00    43.2  152.6          0
82.95   5.29  67.27  10.40  0.00    50.2  146.7          0

"BEFORE" here means before we get done fixing the system. Ideally we'd like a before-the-before picture where things are good, and then we move to "bad". In this case, things are so bad it is obvious that there is a problem. Response time is horrible. We assume it is SFS since all users with SFS show the problem. We can look further into the VMPRF reports at the SFS_BY_TIME report. It's worth spending some time here pointing things out. Notice that most of the categories from the symptoms and causes table map to the Time Per File Pool Request areas. We have 2 file pool servers. We mentioned "deadlocks w/ RB" before; point that out in the last column. Right off the bat we know something is wrong since FPR total time is several seconds! A large chunk of that is in Other. From there, we look at Server Utilization and see that Page Read time is out of sight.
Case Study - Use of S and C Table
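One way to decide where to enter the symptoms and causes table is to rank the components of the time per file pool request from the Before SFS_BY_TIME report. The sketch below does that for the two servers. The values are copied from the report; the dictionary layout and the function name are assumptions made for this illustration, not part of VMPRF.

```python
# Time per file pool request, copied from the Before SFS_BY_TIME report.
before = {
    "RWSERV1": {"Total": 3.443, "CPU": 0.004, "Lock": 0.140,
                "Block I/O": 1.740, "ESM": 0.0, "Other": 1.559},
    "RWSERV2": {"Total": 4.205, "CPU": 0.004, "Lock": 0.190,
                "Block I/O": 1.986, "ESM": 0.0, "Other": 2.027},
}


def component_shares(row: dict[str, float]) -> dict[str, float]:
    """Return each component as a percentage of the Total column."""
    total = row["Total"]
    return {name: 100.0 * value / total
            for name, value in row.items() if name != "Total"}


for server, row in before.items():
    shares = component_shares(row)
    # Largest contributors first, so the rows worth chasing stand out.
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    summary = ", ".join(f"{name} {pct:.1f}%" for name, pct in ranked)
    print(f"{server}: {summary}")
```

On both servers, Block I/O and Other together account for over 95% of the time per request, which is why the notes go straight to the Server Utilization columns next.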
Case Study - VMPRF Report (PRF008)
Before
USER_RESOURCE_UTIL
Resource Utilization by User
                   Est
Userid   ......    WSS  Resid  .....
RWSERV1           1163   1142
RWSERV2           1225   1217
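As a rough illustration of the corrective action for this case, the estimated WSS values in the report can be turned into SET RESERVED command text. This is a sketch under assumptions: the WSS values are taken to be page counts, the function name is made up for this example, and the script only prints the commands rather than issuing them.

```python
# Estimated working set size (WSS) per server, from USER_RESOURCE_UTIL.
# Assumption: the report expresses WSS in pages.
estimated_wss = {"RWSERV1": 1163, "RWSERV2": 1225}


def set_reserved_command(userid: str, pages: int) -> str:
    """Build the CP command text that reserves `pages` pages for `userid`."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return f"SET RESERVED {userid} {pages}"


commands = [set_reserved_command(userid, pages)
            for userid, pages in estimated_wss.items()]
for command in commands:
    print(command)
```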
We can go back to the symptom and cause table, then follow the pointer about "too much server paging". The action is SET RESERVED with the WSS. We can get the value for WSS from VMPRF or INDICATE USER, and then issue the above commands.
Case Study - VMPRF Report (PRF006)
Before
RESPONSE_ALL_BY_TIME
Transaction Response Time and Throughput for ALL Users
               <-----------Response Time---------------->
               <---Triv--->   <--Non-Triv-->
From   To                                       Quick
Time   Time       UP     MP       UP      MP     Disp    Mean
09:24  09:54   0.163  0.000   69.095   0.000    9.158  38.635
After
RESPONSE_ALL_BY_TIME
Transaction Response Time and Throughput for ALL Users
               <-----------Response Time---------------->
               <---Triv--->   <--Non-Triv-->
From   To                                       Quick
Time   Time       UP     MP       UP      MP     Disp    Mean
09:52  10:22   0.072  0.000    0.866   0.000    7.396   0.579
Case Study - VMPRF Report (PRF083)
Before
SFS_BY_TIME
SFS Activity by time
                                      <---Time Per File Pool Request--->
From   To                FPR     FPR                       Block
Time   Time   Userid   Count    Rate  Total    CPU   Lock    I/O  ESM  Other
09:24  09:54  RWSERV1  22545  12.540  3.443  0.004  0.140  1.740    0  1.559
09:24  09:54  RWSERV2  21470  11.942  4.205  0.004  0.190  1.986    0  2.027

<----Server Utilization------->  <----Agents----->
               Page  Check                       Deadlocks
Total    CPU   Read  point  QSAM  Active   Held      w/ RB
75.29   5.47  60.38   9.44  0.00    43.2  152.6          0
82.95   5.29  67.27  10.40  0.00    50.2  146.7          0
After
SFS_BY_TIME
SFS Activity by time
                                      <---Time Per File Pool Request--->
From   To                FPR     FPR                       Block
Time   Time   Userid   Count    Rate  Total    CPU   Lock    I/O  ESM  Other
09:52  10:22  RWSERV1  63617  35.343  0.158  0.003  0.002  0.051    0  0.103
09:52  10:22  RWSERV2  63479  35.266  0.158  0.003  0.002  0.050    0  0.103

<----Server Utilization------->  <----Agents----->
               Page  Check                       Deadlocks
Total    CPU   Read  point  QSAM  Active   Held      w/ RB
39.51  11.64  15.44  12.43  0.00     5.6    9.5          0
42.52  11.81  17.44  13.27  0.00     5.6    9.6          0
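To put numbers on the before and after reports, the same percentage comparison used in the confirm and isolate step can be applied to the case study figures. The sketch below is only an illustration: the values are copied from the RESPONSE_ALL_BY_TIME and SFS_BY_TIME reports, while the variable and function names are made up for this example.

```python
def percent_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means a decrease)."""
    return 100.0 * (after - before) / before


# Mean response time from RESPONSE_ALL_BY_TIME (before, after).
mean_response = (38.635, 0.579)

# Per server: total time per file pool request and FPR rate (before, after).
servers = {
    "RWSERV1": {"fpr_time": (3.443, 0.158), "fpr_rate": (12.540, 35.343)},
    "RWSERV2": {"fpr_time": (4.205, 0.158), "fpr_rate": (11.942, 35.266)},
}

print(f"Mean response time: {percent_change(*mean_response):+.1f}%")
for server, values in servers.items():
    time_change = percent_change(*values["fpr_time"])
    rate_change = percent_change(*values["fpr_rate"])
    print(f"{server}: FPR time {time_change:+.1f}%, FPR rate {rate_change:+.1f}%")
```

The point of the comparison is that the drop in time per file pool request is of the same order as the drop in mean response time, while the request rate rises, which supports the view that the server change is what fixed the response time.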
Being good little performance managers, we look at the after case. The response time is much more acceptable. We need to go a step further and see whether the change in response time is really from what we did. In the after picture, things are much better. We see FPR total time is subsecond, where it should be. Also notice that the FPR rate has increased: not only are we getting better response time, but better throughput as well. The Deadlocks w/ RB are still zero, which is good. You can see that the number of active agents and held agents also decreased. This is all part of the change to avoid serialization from page faults. This case study was a gross problem, but it is sufficient to show the methodology.
Some Application Performance Tips
See CMS Application Development Guide (SC24-5450)
As users become more comfortable with SFS they will write or use applications that exploit SFS. It is good to understand the performance impacts.
Understanding your application's performance
At times you want to evaluate an application of your own or one to be added to the system. The foil describes a method. Note that in this example, the SFS time (118 milliseconds) is a small part of the application time (1.3 seconds).
Summary
When performance is considered up front, there should be no performance problems. SFS performance doesn't need constant attention, but check it periodically. The bottom line is that VM tried to make SFS performance management as painless as possible, both by automating and by documentation. If you find this not to be the case, we need to know. We can't fix what we don't know about. Do you want to learn even more about SFS performance management? Then check out SFS Performance Management Part II: Mission Possible. You can get this by sending a request to Bill Bitner.
References
Primary Sources (VM/ESA 2.2.0)
Others:
Acronyms