Linux Performance when running under VM
The following are some things to keep in mind when running Linux as a
guest of VM.
Some of these factors are more important if you are running 100s of
guests as opposed to a single guest.
Many of the guidelines that exist for other guests
(VSE and OS/390)
apply here as well.
Why is my network response so slow for my Linux guest?
Several factors can affect response time. Be sure to check and
watch the following:
- Ensure /etc/resolv.conf points to the correct (valid) name servers.
- As much as possible, make sure that the MTU is the same size
along the entire path. Run traceroute to see what path is taken;
any negotiation that occurs for the MTU can affect performance.
(A brief example follows this list.)
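A quick sketch of checking both, using tools typical of Linux systems of
this era; the interface and host names are hypothetical:

  ifconfig eth0 | grep -i mtu      # show the MTU configured on the local interface
  traceroute linux1.example.com    # show which hops the traffic actually takes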
Why does my idle Linux Guest consume processor resources?
By default Linux "wakes up" 100 times per second. These timer pops
are part of the way that it determines if it has work to do. It also
maintains its "jiffies" through this mechanism. These timer pops have
three system impacts when Linux is run as a guest of VM:
- Processor time is consumed in this process. The majority of the
cycles are used by VM's control program in virtualizing the interrupt.
- Parts of the interrupt reflection must run on VM's master processor.
- Because the Linux guest wakes up so often, VM considers it to always
be active. Therefore, it will not be dropped from the dispatch list, and
the storage management routines will tend not to steal pages from these
guests.
A couple of approaches have been used to address this problem:
- Do not define more virtual CPUs for a guest than are needed. The
timer pops are on a per virtual processor basis (not per virtual
machine). See also Scheduler Basics.
- Some customers have changed the HZ value in Linux (default of
100) to a lower value. Caution must be used, since lowering the HZ
value could make Linux less responsive or stop it from functioning. I have
heard of some success at a value of 16, with lower responsiveness.
(A sketch of the change follows this list.)
- There has been work on a patch for the 2.4 kernel that avoids the
timer pops and jiffies. There are ongoing discussions about whether
this should be incorporated into the official Linux kernel. Some
zSeries distributions incorporate it already.
Our measurement results showed it lowered the total
processor time for an idle Linux guest by about 78%.
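To illustrate the HZ change mentioned above (a sketch only, not a
supported procedure; the header location is the usual one in a 2.4
source tree): HZ is a compile-time constant, so the kernel must be
rebuilt and carefully re-tested after editing it.

  /* include/asm-s390/param.h, in a 2.4 kernel source tree */
  #define HZ 16    /* lowered from the default of 100 */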
How Big of a Virtual Machine Should I Use?
In general, do not define the virtual machine larger
than you need.
Consider decreasing the size of
the virtual machine and watching the Linux swapping. If you decrease
the virtual machine and no swapping occurs, then the smaller machine
size may be acceptable.
Sometimes a good guess at virtual machine size
is the z/VM scheduler's
assessment of the Linux guest's working set size.
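One way to see that assessment (a sketch; the PAGES line of INDICATE USER
output reports the working set as WS=, and LINDV1 is the example guest
name used later on this page):

  #CP INDICATE USER          (from the Linux guest's own console)
  CP INDICATE USER LINDV1    (for another guest, from a suitably privileged user)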
Excessive virtual machine sizes negatively impact system performance.
As far as the Linux guest is concerned, the storage
it has access to is all real and it should use it. Therefore, whatever
storage it doesn't need for code and control blocks and virtual pages,
it will often use as part of its file buffer cache. So if you define
the storage for a guest to be 2 GB, it will often use all of that
storage, even though the applications the Linux guest is running
could fit effectively into a far smaller virtual machine.
Where should Linux swap?
Try to avoid swapping in Linux whenever possible.
It adds path length and a
significant hit to response time.
However, sometimes swapping is unavoidable. If you have
to swap, these are your choices (a combined setup sketch follows this list):
- Dedicated volume
- If the storage load on your Linux guest is large, the
guest might need a lot of room for swap. One way to
accomplish this is simply to ATTACH or DEDICATE an entire
volume to Linux for swapping. If you have the DASD to
spare, this can be a simple and effective approach.
- Traditional Minidisk
- Using a traditional minidisk on physical DASD requires some setup
and formatting the first time and whenever changes in size of swap
space are required. However, the storage burden on z/VM to support
minidisk I/O is small, the controllers are well-cached, and I/O
performance is generally very good.
If you use a traditional minidisk, you should
disable z/VM Minidisk Cache (MDC) for that minidisk
(use the MINIOPT NOMDC
statement in the user directory).
- VM T-disk
- A VM temporary disk (t-disk) could be used. This lets one define
disks of various sizes with less consideration for
placement (having to find 'x' contiguous cylinders by hand if you
don't have DIRMAINT or a similar product). However, t-disk space
does not persist across logoff,
so it needs to be configured (perhaps via PROFILE EXEC)
whenever the Linux virtual machine logs on.
Storage and performance benefits of traditional minidisk I/O apply.
If you use a t-disk, you should disable minidisk cache
for that minidisk.
- VM V-disk
- A VM virtual disk in storage (VDISK)
is transient like a t-disk is. However, VDISK is backed by
a memory address space instead of by real DASD.
While in use, VDISK blocks reside in central
storage (which makes it very fast). When not in use, VDISK
can be paged out to expanded storage or paging DASD. The use of
VDISK for swapping is sufficiently complex that we have
written a separate tips page for it.
- Attach expanded storage to the Linux guest and allow it to swap
to this media. This can give good performance if the Linux guest
makes good use of the memory, but it can waste valuable memory
if Linux uses it poorly or not at all. In general, this is not
recommended for use in a z/VM environment.
- Create an EW/EN DCSS and configure the Linux guest to swap
to the DCSS. This technique is useful for cases where the Linux
guest is storage-constrained but the z/VM system is not. The
technique lets the Linux guest dispose of the overhead associated
with building channel programs to talk to the swap device.
A separate page illustrates the use of swap-to-DCSS.
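Pulling a few of these choices together, here is a minimal sketch of
defining swap devices; the device numbers, sizes, and volume label are
hypothetical:

In the CP user directory (a traditional swap minidisk, MDC disabled):
  MDISK 0201 3390 100 200 LNXVOL MR
  MINIOPT NOMDC

CP commands, e.g. issued from PROFILE EXEC so the transient disks are
re-created at each logon:
  CP DEFINE T3390 AS 0202 CYL 200
  CP DEFINE VFB-512 AS 0203 BLK 256000

Inside Linux, once the device has been formatted and brought online
(device name hypothetical):
  mkswap /dev/dasdb1
  swapon /dev/dasdb1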
Linux assigns priorities to swap extents. So, for example,
you could set up a small VDISK with higher priority (higher
numeric value) and it would be selected for swap as long as
there is space on the VDISK
to contain the process being swapped.
Swap extents of equal priority are used in round-robin
fashion. Equal prioritization can be used to spread swap I/O
across chpids and controllers, but if you are doing this,
be careful not to put all
the swap extents on minidisks on
the same physical DASD volume, for if you do, you will not
be accomplishing any spreading.
Use swapon -p ... to set swap extent priorities.
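For example (device names and priority values hypothetical), to prefer a
small VDISK and fall back to a minidisk:

  swapon -p 10 /dev/dasdb1    # VDISK: preferred while it has free space
  swapon -p 5  /dev/dasdc1    # minidisk: used once the VDISK fills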
Should I set Quick Dispatch on for my Linux Guest?
Quick dispatch (QUICKDSP) can be set in the directory or via the CP
SET command. It makes a virtual machine exempt from being held back
in the eligible list during scheduling. Instead, the virtual
machine goes directly to the dispatch list.
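A sketch of both forms, using the example guest name from later on this
page:

  CP SET QUICKDSP LINDV1 ON    (dynamic, via the CP SET command)
  OPTION QUICKDSP              (static, in the guest's directory entry)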
In general, we recommend setting QUICKDSP on for
production guests and server virtual
machines that perform critical system functions.
However, you may not want to set it on for all your test or
development guests. Allowing the VM scheduler to create an eligible list lets it
avoid some thrashing situations that could occur from overcommitting storage.
There is also an
excellent synopsis of the situation on the Linux-390
listserver from Malcolm Beattie.
Eligible lists are forming on my system. Certain Linux
guests are remaining unscheduled for very long periods.
What do I do?
This is closely related to the above question.
When the sum of the virtual machine sizes of the logged-on
Linux guests approaches the size of central storage, eligible
lists will tend to form. This is because Linux guests tend
to want to touch all of their pages and because Linux guests
tend not to drop from the dispatch list.
You can solve this with one of two approaches: use of QUICKDSP or
changing the SRM STORBUF settings.
The choice depends on where you want the responsibility for
protecting the system from thrashing to lie. The more you use QUICKDSP, the
greater the responsibility you take on yourself. Setting appropriate
STORBUF values puts the responsibility on the VM scheduler.
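For example (values hypothetical, though in this range they were a common
starting point discussed for Linux guests): raising the STORBUF
percentages tells the scheduler how far it may overcommit storage before
forming an eligible list.

  CP SET SRM STORBUF 300% 250% 200%
  CP QUERY SRM                        (display the resulting settings)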
The next line of defense is to set up the Linux guests conservatively
as regards the virtual storage sizes and to set up the VM system well
for paging. Here are some guidelines:
- Set each Linux machine's virtual storage size only as large as it
needs to be to let the desired Linux application(s) run. This
will suppress the Linux guest's tendency to use its entire address
space for file cache. Make up for this with MDC if the Linux file
system is hit largely by reads. Otherwise turn MDC off, because
it induces about an 11% instruction path length penalty on writes,
consumes storage for the cached data, and pays off little because
the read fraction is not high enough.
- Use whole volumes for VM paging, instead of fractional volumes.
In other words, never mix paging I/O and non-paging I/O on the same
volume.
- Implement a one-to-one relationship between paging CHPIDs and
paging volumes.
- Spread the paging volumes over as many DASD control units as you can.
- If the paging control units support NVS or DASDFW, turn them on
(applies to RAID devices).
- Provide at least twice as much DASD paging space (CP QUERY ALLOC
PAGE) as the sum of the Linux guests' virtual storage sizes.
- Having at least one paging volume per Linux guest is a great thing.
If the Linux guest is using synchronous page faults, exactly one
volume per Linux guest will be enough. If the guest is using
asynchronous page faults, more than one per guest might be
appropriate; one per active Linux application would be more like it.
- It is best if the VM paging volumes are all of the same model and
characteristics. Undesirable effects can occur when mixing devices of
different sizes or speeds.
- In QDIO-intensive environments, plan that 1.25 MB per idling real
QDIO adapter will be consumed out of CP below-2GB free storage,
for CP control blocks (shadow queues). If the adapter is being
driven very hard, this number could rise to as much as 40 MB per
adapter. This tends to hit the below-2GB storage pretty hard.
CP prefers to resolve below-2GB contention by using XSTORE.
Consider configuring at least 2 GB to 3 GB of XSTORE so as to
back the below-2GB central storage, even if central storage is
plentiful.
- If you need to favor storage use toward certain Linux
guests, CP SET RESERVED might be something to try (a sketch
follows this list).
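A sketch of reserving pages for a particular guest (the page count is
hypothetical; the operand is a number of pages):

  CP SET RESERVED LINDV1 25600    (reserve 25600 pages, roughly 100 MB, for LINDV1)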
How should I configure Linux guests to communicate with each other?
For early 2.2 based Linux systems,
IUCV is faster than virtual CTC for communicating between
two Linux guests.
See the z/VM Performance Report
for some measurement information.
However, some recent 2.4 measurements show IUCV to be slightly slower
than virtual CTC, at least at small MTU sizes.
Not all non-Linux guests can use IUCV, so connecting to an OS/390 guest
may require use of virtual CTCs. One must also balance RAS considerations
when making a choice in communication methods.
For customers on z/VM 4.2.0 and with appropriate support in Linux,
the Guest LAN connectivity is a good choice.
Unlike point-to-point methods such as IUCV and vCTC, a Guest LAN is a
simulated LAN segment that many guests can share, and it can be much
simpler to configure and maintain.
And the performance is good.
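A minimal sketch of setting one up with CP commands (the LAN name and
device number are hypothetical; the simulated adapter type shown is
HiperSockets, the type available when Guest LAN support was introduced):

  CP DEFINE LAN LINLAN OWNERID SYSTEM    (create the Guest LAN)
  CP DEFINE NIC 0500 HIPERSOCKETS        (give a guest a simulated adapter)
  CP COUPLE 0500 TO SYSTEM LINLAN        (connect the adapter to the LAN)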
How should I configure processor memory for this environment?
On most S/390 and zSeries processors, one has the option of configuring
processor memory as either central or expanded storage. With 31-bit
addressing systems, 2 gigabytes of central storage was the limit that
could be used. With the introduction of z/VM and 64-bit support, that
limit has been lifted, which begs the question:
do I still need expanded storage?
Yes. Even with 64-bit support, we still recommend that some processor
storage be defined as expanded storage. See
Configuring Processor Storage for more
details. A good starting point may be to define 25% of the processor
storage as expanded storage.
On z/VM Version 4 systems, if there is contention for real memory below
2GB, some relief can be found by limiting use of minidisk cache (MDC)
in expanded storage, via the CP command SET MDC XSTORE.
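For example (sizes hypothetical), to cap MDC's use of expanded storage:

  CP SET MDCACHE XSTORE 0M 512M    (let MDC use at most 512 MB of expanded storage)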
Can I run a shared copy of Linux like I run a shared CMS?
Yes, to some extent. There has been work done to create an NSS (named
saved system) for the Linux kernel.
Think of an NSS as a snapshot of the Linux kernel at boot time. Instead
of IPLing (booting) the kernel off of disk and reading in all of the
kernel, you can IPL this snapshot which VM can keep in memory. Further,
the pages in memory making up this snapshot can be shared with many
virtual machines. This decreases memory requirements by having one
copy of the kernel in memory instead of one for each guest.
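A rough sketch of the mechanics (the NSS name, page range, and size are
illustrative only; the exact DEFSYS operands and the save step depend on
your kernel and distribution):

  CP DEFSYS LNXNSS 0-FFF EW MINSIZE=64M    (define the skeleton for the saved system)
  (IPL the kernel once; when it reaches the appropriate point, issue:)
  CP SAVESYS LNXNSS
  (thereafter each guest can boot the shared copy with:)
  IPL LNXNSS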
Do I need Paging Space on DASD?
Yes. One of the common mistakes new VM customers make is to
ignore paging space. The VM system as shipped contains enough page
space to get the system installed and running trivial work.
However, you should add DASD page space to do real work.
The Planning and Admin book has details on determining how much
space is required. Here are a few thoughts:
- If the system is not paging, you may not care where you put the
page space. However, it has been my experience that sooner or later
the system grows to a point where it pages and then you'll wish you
had thought about it.
- VM paging is most optimal when it has large contiguous available
space on volumes that are dedicated to paging. Therefore, do not mix
page space with other space (user, tdisk, spool, etc.).
- A rough starting point for page allocation is to add up the
virtual machine sizes of the virtual servers running and multiply by 2
(a worked example follows this list).
Keep an eye on the allocation percentage and the block read set size.
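For instance (guest counts and sizes hypothetical):

  20 guests x 256 MB =  5 GB
   5 guests x 1 GB   =  5 GB
  Total virtual storage = 10 GB; doubled, that suggests about 20 GB of DASD page space.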
See also: Understanding poor performance due to paging
Other Miscellaneous Tips
- After you shut down a Linux system, it is best to log off the virtual
machine. If you only
shut down a Linux system running in a guest virtual machine,
VM continues to back the memory used by Linux.
You can also use the CP SYSTEM CLEAR command to reset the virtual
machine and clear free backing storage.
- Be careful of cron jobs. Some distributions ship with cron
jobs configured to do processing automatically, such as file integrity
or security scans. These may run fine for one or two virtual servers, but
could cause significant problems for many servers, depending on the job.
For example, having 100 virtual servers all wake up at midnight to scan
and validate every file on the system would result in significant storage
and processor resource consumption. You might want to adjust or stagger
the cron schedules in those cases.
- Application performance can have an impact on system performance.
An application in an error loop or an application processing large
amounts of data in a byte-by-byte fashion are two examples where a
larger than expected impact can be placed on the system.
- You can get a sense of the system your Linux virtual server is
running on by issuing cat /proc/sysinfo.
In the example below, the virtual machine LINDV1, running Linux, is a
guest on a z/VM 4.3.0 system with 1 virtual processor. The z/VM
system runs in an LPAR with 3 logical processors dedicated to
that partition. The physical machine is a 9-way (2064-109).
Sequence Code: 0000000000051542
CPUs Total: 10
CPUs Configured: 9
CPUs Standby: 0
CPUs Reserved: 1
Adjustment 02-way: 94
Adjustment 03-way: 90
Adjustment 04-way: 87
Adjustment 05-way: 84
Adjustment 06-way: 81
Adjustment 07-way: 79
Adjustment 08-way: 76
Adjustment 09-way: 73
Adjustment 10-way: 70
LPAR Number: 3
LPAR Characteristics: Dedicated
LPAR Name: SPRF3
LPAR Adjustment: 333
LPAR CPUs Total: 3
LPAR CPUs Configured: 3
LPAR CPUs Standby: 0
LPAR CPUs Reserved: 0
LPAR CPUs Dedicated: 3
LPAR CPUs Shared: 0
VM00 Name: LINDV1
VM00 Control Program: z/VM 4.3.0
VM00 Adjustment: 333
VM00 CPUs Total: 1
VM00 CPUs Configured: 1
VM00 CPUs Standby: 0
VM00 CPUs Reserved: 0
Is there other information of interest?
The following links may be useful.
Back to the Performance Tips Page