Much Ado About SIIS
Last revised: 2021-04-13, BKW
In this article we discuss the programming situation called store into instruction stream, or SIIS. Topics will include a definition, what causes it, why on recent machines we need to avoid it, metrics that tell us how intensely it's happening, and remediations.
An IBM Z or LinuxONE machine has split L1 and L2 caches. By split we mean at each level the cache is split into two distinct sides. One side, called the instruction cache or I-cache, is for instructions. The I-cache holds memory lines recently touched because of fetching instructions to be executed. The other side is called the data cache or D-cache. The D-cache holds memory lines recently touched because of uses of data upon which instructions were operating.
When software stores into memory, the hardware keeps the caches, including both the I-caches and the D-caches, up to date. When the store is into a memory line that happens to be held in an I-cache, the event is called store into instruction stream, or SIIS. This name is used even though technically we perhaps should say store near instruction stream, or SNIS. In this article we use the term SIIS.
The usual cause of SIIS stores is that software placed data areas very near to, or perhaps right in line with, instructions. A good example of this is what happens when software places a parameter list macro right inline with the code that uses the parameter list. To use the parameter list data area, software must first store into that data area and then call the procedure that uses the parameter list. The stores into the parameter list are SIIS stores.
Sometimes a SIIS store requires extra mediation or intervention to assure cache coherency. The reason is usually temporal. On the zEC12 and several earlier families, all of which had a split L1 and a unified L2, the extra mediation was done in the L2, which is per-processor. On the z13 and later families, which have a split L1, a split L2, and a unified L3, the mediation job was moved to the L3, which is per-chip. It takes more time to reach the L3 than it does to reach the L2. For well-behaved software -- that is, for software that cleanly separates data from code -- the increased time needed for SIIS mediation is not a problem, because the situation rarely happens anyway. But poorly-written software, that is, software that freely mixes data and code, might well feel the difference.
So why in the z13 did IBM split the L2? Gaining performance over the zEC12 could not be achieved by just upping the clock rate. The clock rate had already reached the limit of what physics would allow. Rather, to perform better, the z13 had to be smarter. Splitting the L2 let the processor index the L1 and the L2 at the same time, instead of referring L1 misses to the L2. This improved the resolution time for the very common L1-miss, L2-hit scenario. The design change caused well-behaved software to run better, while poorly-behaved software -- software that stores into I-cache -- definitely felt the difference. Because that is the minority case, the design change was very much a purposeful move.
Readers might be interested to know that z13 was not the first time a cache was split. The z900 was the first machine that had a split L1. Just as we are seeing performance effects in the jump to z13 and later, we also saw them back then in the jump to z900 and later. What happened on the z900 prompted IBMer Kathy Walsh to write a paper about it. The original paper and updated information are both available here.
The IBM Z CPU MF host counters keep track of events related to SIIS. One counter -- for convenience, let's call it D -- counts I-cache stores. Another counter -- for convenience, let's call it N -- counts I-cache stores that required L3 intervention. A third counter -- for convenience, let's call it I -- counts instructions completed. The I counter is the same on all families, but the N and D counters are family-dependent.
One way to express the harm SIIS is causing is to look at the relationship between N and D. The percentage 100 * N/D shows us the percent of I-cache stores that required L3 intervention. The ratio is known by a number of names: SIIS proxy metric, SIIS indicator percent, or SIIS percent. Analysts who use SIIS percent as an action metric have certain thresholds they use to mark boundaries for actions of increasing urgency. Table 1 cites the percents those analysts tend to use.
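The SIIS percent arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `siis_percent` and the sample counts are made up for the example, and N and D stand for the family-dependent counters described above.

```python
def siis_percent(n, d):
    """Percent of I-cache stores (D) that required L3 intervention (N)."""
    if d == 0:
        return 0.0  # no I-cache stores in the interval, so nothing to report
    return 100.0 * n / d


# Hypothetical counts for one measurement interval.
print(siis_percent(n=1_200, d=40_000))  # 3.0
```

With these made-up counts the result is 3%, which would fall in the "Minimal impact" band of Table 1.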
Another approach to understanding the impact of SIIS is to compare our numerator N and our instruction count I. The ratio I/N tells us the number of completed instructions per aberrant I-cache store. This metric reminds me of watching a baby learn to walk. The baby takes two steps and then falls down. Next time the baby takes five steps and then falls down. In this metaphor, I is the count of successful steps and N is the number of falls. One analyst I know uses I/N as the SIIS trouble metric. Table 1 cites that analyst's action thresholds.
My personal favorite for a SIIS trouble metric is the reciprocal, N/I, aka the number of aberrant I-cache stores per instruction completed. To make the ratio more readable, I multiply the quotient by 1,000,000. I like this metric because it has several emotionally satisfying properties. First, if there is no trouble, the metric is zero. Second, as the trouble grows, so does the magnitude of the metric. Third, unlike SIIS percent, it will not yield a high value if N and D are both negligibly small yet similar to one another. Again, Table 1 cites action thresholds.
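A minimal Python sketch of the two instruction-based metrics follows. The function names and the sample counts are hypothetical. The sample was chosen to show the third property: N and D are tiny but similar, so SIIS percent looks alarming while the per-million metric stays near zero.

```python
def instructions_per_intervention(i, n):
    """I/N: completed instructions per I-cache store needing L3 intervention."""
    return float("inf") if n == 0 else i / n


def icwl3pmi(i, n):
    """N/I scaled to L3 interventions per million completed instructions."""
    return 0.0 if i == 0 else 1_000_000 * n / i


# Hypothetical interval: N and D are both negligibly small yet similar.
n, d, i = 9, 10, 50_000_000
print(100 * n / d)     # SIIS percent: 90.0
print(icwl3pmi(i, n))  # 0.18
```

Judged by SIIS percent alone, this interval would look like a high-priority problem. Judged by interventions per million instructions, it is noise.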
The CPUMF package from our download library prints the SIIS percent and also prints my favorite metric. The latter is tabulated under column ICWL3PMI, "I-cache stores with L3 intervention per million instructions completed". For more information, refer to our CPU MF article.
Table 1. SIIS Metrics, Thresholds, and Responses.
|Description|SIIS indicator % (SIIS)|Instructions per L3 intervention|L3 interventions per million instructions (ICWL3PMI)|Response|
|---|---|---|---|---|
|Noise|< 2%|n/a|< 10|None|
|Minimal impact|2% - 5%|Five digits|10 - 100|Low priority|
|Noteworthy impact|5% - 10%|Four digits|100 - 1000|Medium priority - investigate and remediate|
|Considerable impact|> 10%|Three digits|> 1000|High priority - investigate and remediate|
Notes: Response thresholds for various SIIS metrics. Headings in parentheses are the column names used in a z/VM CPU MF log report (filetype $CPUMFLG).
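The ICWL3PMI column of Table 1 can be turned into a small lookup, sketched here in Python. The function name is hypothetical. Table 1 lists 100 in two adjacent bands, so the choice to put a value of exactly 100 in the higher band is an assumption of this sketch, not something the table specifies.

```python
def classify_icwl3pmi(value):
    """Map L3 interventions per million instructions to a Table 1 row.

    Returns a (description, response) pair. Boundary handling is an
    assumption of this sketch: exactly 100 goes to the higher band, and
    exactly 1000 stays in "Noteworthy impact" because the top band is
    stated as "> 1000".
    """
    if value < 10:
        return ("Noise", "None")
    if value < 100:
        return ("Minimal impact", "Low priority")
    if value <= 1000:
        return ("Noteworthy impact",
                "Medium priority - investigate and remediate")
    return ("Considerable impact",
            "High priority - investigate and remediate")


# Hypothetical readings from successive intervals.
for reading in (0.18, 42, 650, 3100):
    print(reading, classify_icwl3pmi(reading))
```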
If a problem is found, the remediation is to reorganize the software so that data areas are not located near code. For example, put save areas, local data, and procedure parameter list macros at the bottom of the compilation unit instead of inline. Be sure to place the data portion of the compilation unit at the next 256-byte boundary after the spot where the code ends. Some analysts even suggest a "guard band" or "no-man's land" of three to four cache lines between code and data, so as to keep instruction prefetching from inadvertently fetching data. Further, after all data areas are generated, pad the data area out to the next 256-byte boundary, and if you like, leave guard space after the data too. The 256-byte number is magic because that is the width of our cache line.
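The boundary arithmetic can be sketched in Python. This is only an illustration of the offsets involved, not an assembler technique. The function names, the assumption that code begins at offset 0 of the compilation unit, and the sample lengths are all hypothetical.

```python
CACHE_LINE = 256  # bytes, the cache line width cited in the article


def align_up(offset, boundary=CACHE_LINE):
    """Round offset up to a multiple of boundary (unchanged if already aligned)."""
    return (offset + boundary - 1) // boundary * boundary


def plan_layout(code_len, data_len, guard_lines=0):
    """Return (data_start, data_end) offsets for a unit whose code starts at 0.

    data_start is the first 256-byte boundary at or after the end of the
    code, plus an optional guard band of whole cache lines. data_end is the
    data area padded out to the next 256-byte boundary.
    """
    data_start = align_up(code_len) + guard_lines * CACHE_LINE
    data_end = align_up(data_start + data_len)
    return data_start, data_end


# Hypothetical unit: 1000 bytes of code, 300 bytes of data.
print(plan_layout(code_len=1000, data_len=300))                 # (1024, 1536)
print(plan_layout(code_len=1000, data_len=300, guard_lines=3))  # (1792, 2304)
```

In the second call, the three-line guard band pushes the data start from offset 1024 to offset 1792, leaving 768 bytes that hold neither code nor data.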
It's easy to say what the remediation is. It's much more difficult to find the spots where remediation needs to be done. The CPU MF counters do not show us where the problem is. Generally, knowledge of source code and of specific programmers' tendencies and techniques is required. Compilers are well aware of good practices and so compiler-generated code will not induce SIIS problems. Hand-built assembler is usually the place to look. Moreover, look for very old hand-built assembler, because in earlier times, separation of code and data was not so important.
Modern processor families increasingly achieve their performance improvements via microarchitecture enhancements because we no longer can just dial up the clock rate. This means it is especially important to collect CPU MF counters both before and after a processor upgrade. The CPU MF counters record how the processor reacted to the specific instruction stream it was asked to run. Thus the before-counters and the after-counters are both crucial in investigating hardware performance questions that come up after the upgrade. For more information about what to collect as part of an upgrade, read our upgrade article.
Over the past few years IBM has shipped service for z/VSE to repair some spots in z/VSE where SIIS was a problem. z/VSE clients will want to be sure they have applied DY47824, DY47814, DY47815, and DY47847. For the latest information, consult IBM z/VSE Level 2.