Cycles per Instruction and its components
April 30th, 2006 admin
Cycles per Instruction is one way to evaluate the microarchitecture’s effect on a processor’s performance. Performance also depends on what constitutes a cycle (frequency of the machine) and what constitutes an instruction (the architecture of the machine). Interestingly, the ratio of these two, i.e. Cycles per Instruction, tells us about something other than those two factors. CPI tells us how many cycles, at the given frequency, the machine takes to chomp down instructions of the given architecture. In other words CPI tells us about the microarchitectural robustness of a design. It tells us, given a certain cycle period (frequency) and a certain group of instructions (architecture), how many cycles does it take to execute an “average” instruction. The “average” instruction depends on the workload (the program) being run through the processor. And therefore, when quoting CPI’s the workload to which that CPI applies is also quoted. The smaller the CPI the better performing a computer system is, all other factors such as the frequency and architecture remaining the same. Of course, other factors such as area, complexity and power must be considered, in addition to performance and the definition of “better” changes accordingly.
CPI is affected by several factors. Architects and designers often need to know where to start when trying to improve performance by making microarchtectural changes. The overall CPI has to be broken into components and the component that is contributing the largest is usually attacked first, since improvement in that design component could potentially reduce the CPI by the largest amount. I will come back to this point again with an example, shortly. For this write-up I will break the CPI into two main components. The “core” component and the “memory-nest” component. The core component deals with all the microarchitectural decision within the processing unit, which in modern microprocessors includes several pipelines (superscalar machines), each with several stages and associated logic – instruction fetch, decode, issue, register access, execute, write-back etc. The memory-nest component deals with getting appropriate instructions and data from the memory. Several levels of caching is provided to take advantage of certain properties seen in programs. These properties are spatial-locality (if a certain memory location is used, something nearby will be used soon) and temporal locality (if a certain memory location is used, it will likely be used again). To keep the data close at hand, and to avoid the potentially hundreds of processor cycles worth of time to go to the memory every time a new instruction or data is needed, instructions and data are stored in caches. There are some great resources on the web and in books which talk about caches in more detail. For this write-up it suffices to say that caches usually are arranged in a hierarchy. Level 1 is the smallest cache but is the fastest to access and is called the L1Cache. Level 2 cache is bigger than the Level 1 cache but because it is bigger and is usually only accessed after it it known that the instruction or data the core was looking for was not available in the L1Cache, it is slower. It is called the L2Cache. Usually the hierarchy has 2-3 levels of caching.
Therefore, if the CPI component is mostly affected not by the design of the core itself, but because of the sizes or geometries of the caches, the designers would be better served by the information that the memory-nest, or more accurately, a certain level cache should be the first place to look for potential improvements, rather than the core. This brings us to the two main topics of this article. First, what quantities in the cache must be measured to help understand its effect on the CPI. Secondly, from these statistics collected on the caches, how do we build the overall CPI number.
Cache Statistics – What they are, what they mean
When modeling caches there are 4 statistics that are often talked about. These are all not completely independent, and are, therefore, easy to get confused about. The 4 main performance-realted statistics collected for a cache are:
MissesPerReference
HitsPerReference
MissesPerInstruction
HitsPerInstruction
MissesPerReference indicates the ratio of how many times, of all the times the cache was accessed, was the cache unable to provide the data requested. HitsPerReference is the complement. It is the ratio of how many times, out of all the times the cache was accessed, was the cache able to provide the data requested. The complementary nature of these quantities makes it necessary to only quote one number.
MissesPerReference + HitsPerReference = 1
MissesPerInstruction is not that straightforward. It is the ratio of how many times the accesses to the cache missed, out of all the “instructions in the workload”! This is a strange measure at first, and takes some getting used to. It’s the ratio if a design-dependent variable quantity – cache misses, to an unchanging quantity, the number of instructions in the workload. In the earlier measure of missesPerReference, the denominator, the number of references to the cache, was variable depending on what happened before this cache was reached by the instruction flow. The number of reference to this cache depends on how well the previous levels of cache did. HitsPerInstruction, similarly is the ratio of the number of hits in the cache to the total number of instructions in the workload. MissesPerInstructions and HitsPerInstruction do NOT add up to 1! They add up to the number of references made to the cache per instruction.
These 4 quantities are depicted by an example in Figure1.

Why collect PerInstruction statistics
So the question is, “what is the reason we need these two ways of looking at a cache’s performance?”. Why doesn’t the simpler PerReference style of collecting stats suffice? The answer is that the PerReference numbers are great for designing the cache under study provided the number of references does not change, which implies, the design before the cache in study remains unchanged. The PerInstruction numbers are useful to get an idea of how much the cache under study is contributing to the overall CPI. We will shortly see how that overall CPI is calculated from the PerInstruction numbers. Before the PerReference numbers can be used to improve cache designs, the PerInstruction numbers must point out that it is worthwhile to go after a cache design because the cache is adding a big component to the CPI of the system. And we want to keep the CPI low as far as possible. That is why we want to keep track of the PerInstruction statistics.
How to breakdown overall CPI into components
There are two ways to break the overall CPI, or another way to look at it is, two way to build the overall CPI.
One way is to look at what fraction of instructions “pass through” each level (whether they hit or miss at that level is immaterial) in the hierarchy and add up the cycles they spend at each level. The idea is depicted in Figure 2.
The other way is to look at what fraction of instructions “finish at” each level (i.e. they hit) and add up the cycles those fractions take to finish their journey through the memory hierarchy. The idea is depicted in Figure 3.
Method 1: Breaking CPI into components for each level of the memory hierarchy an instruction passes through

This method is often used to represent a CPI stack, which assigns a portion of the CPI to each level. This is very useful in figuring out which level of the cache to go after for improvements.
CPI = CoreComponent + L1Component + L2Component + L3Component + MemComponent
Here the overall CPI is reached by adding CPI contributions at each level of the memory hierarchy. Simulation models for the core often generate “infinite” CPIs. Infinite CPIs are used to keep the effect of the memory hierarchy separate from the effect of the core design, so separate teams can work on both models in parallel. The effect of the memory hierarchy on the CPI has to be added in to the “infinite” CPI to get the overall CPI.
infL1CPI = CoreComponent + L1Component
infL2CPI = CoreComponent + L2Component + L2Component
and so on.
The calculation of infL1CPI assumes the L1 always hits, but has the normal hit latency. So it is an infinitely large L1, but miraculously the directory lookup and array access takes time equal to a realistic sized L1.
The calculation of infL2CPI assumes the L2 always hits, but has the normal hit latency. The L1 is realistically modeled, the L2 is the infinite L2 with correct hit latency modeled.
Going back to
CPI = CoreComponent + L1Component + L2Component + L3Component + MemComponent
Say we have cache models for the L1,L2, L2 caches and the memory. So how do we get the L1Component, L2Component, L3Component and MemComponent, such that we can get the overall CPI for the system. This is where the hitsPerInstruction and missesPerInstruction numbers come in very handy.
L1Component = L1ReferencePerInstruction * L1Latency –(A1)
L2Component = L2ReferencesPerInstruction * (L2Latency – L1Latency) –(A2)
L2Component = L3ReferencesPerInstruction * (L3Latency – L2Latency) –(A3)
MemComponent = MemReferencePerInstruction * (MemLatency – L3Latency) –(A4)
Notice that each component is a product of the fraction of instructions that reference that level and the amount of time those instructions which reference that level are guaranteed to spend at that lavel. Two observations here:
1. the ReferencesPerInstruction at each level is the sum of hitsPerInstruction and missesPerInstruction for that level
2. an easy way to determine the common amount of extra time introduced by a level into the overall execution of an instruction is to find the difference between the time when an instruction hits at that level and one level before.
Method 2: Breaking CPI into components for each level of the memory hierarcy an instruction finishes at

This method is often used when the aim is to simply calculate the overall CPI and we are not dealing with breaking the CPI into its CPI stack components.
CPI = CoreComponent + (L1HitsPerInstruction*L1Latency) + (L2HitsPerInstr*L2Latency) + (L3HitsPerInstr*L3Latency) + (MemHitsPerInstr*MemLatency) –(B)
Another way to say this is
CPI = Time spent by all instructions which do not have to even go to the L1Cache(s)
+ time spent by the fraction which hit L1 and face the L1Latency on top of the CoreComponent
+ time spent by the fraction which hit L2 and face the L2Latency on top of the CoreComponent
+ time spent by the fraction which hit L3 and face the L3Latency on top of the CoreComponent
+ time spent by the fraction which hit memory and face memory latency on top of the CoreComponent
Notice that each component looked at this way is the product of the fraction of instructions which hit in a particular level of the hierarchy and the time it takes to service those instructions.
CAVEAT: There is a distinct difference between the CoreComponent component in equations A and equation B. In equations A, CoreComponent applies to all the instructions, since all the instructions pass through the core. In equation B, the coreComponent applies to only those instructions which do not have to leave the core to access L1Cache (or caches, since Level 1 is usually broken into separate instruction and data caches). To simplify this, simulation models of the core often are tightly coupled with at least the L1cache or L1caches. There is a significant amount of overlap in accessing the caches, especially the level 1 cache and therefore the time spent in accessing one instruction or data granule might be able to hide the time to access another. This could adversely affect the CPI calculations. That is why this style of breaking up the CPI into its components and adding them up, only works as an estimate. A realistic CPI can only be realized by building in a more detailed understanding of the paralellism in a workload into the equations above, or by just creating a more accurate simulation model with the core model working hand-in-hand with the memory nest model. One way to make the above equations a little bit closer to reality is by only considering loads (i.e. both instructions and data loads) for the missesPerReference, hitsPerInstruction etc. Stores usually do not directly affect CPI since they are handled by mechanisms that are non-critical to CPI calculations.
Posted By: Anil Krishna at 2:30PM on Sunday, April 30th, 2006
Posted in Tutorials | No Comments »