Notes on Memory Consistency and Cache Coherence

June 27th, 2008 admin

Here are some of my notes on the topic of memory consistency and cache coherence, and how uniprocessor and multiprocessor cores have to be built to support the consistency models. Most of this was written up when I was preparing for my Qualifying Exam at NC State University last semester. This is a relatively complicated topic to understand well, and there might still be several mistakes in how I understood the ideas. Also, this might only make sense, and be interesting, to people familiar with these areas of computer architecture. Here’s the pdf file: Notes on Memory Consistency and Cache Coherence

Posted in Information, Tutorials | No Comments »

Setup and Hold Violations in Digital Systems

May 19th, 2008 admin

I wrote this up when trying to prepare for my PhD Qualifying Examination this past semester (Spring 2008). It is a pdf file. You can read it here.

Posted in Tutorials | No Comments »

Finally, one Contact List to rule them all

July 29th, 2007 admin

I have a Yahoo email account. I have a Google email account. I have a few more that I do not use much. Each email application or service provider typically has an Address Book or Contacts List, which we can use to list the names, email addresses and other information about people. For a while, I have been thinking about getting the contact lists organized. There were several layers to the word “organized” and I was apprehensive about starting to peel those layers. The first layer of the problem was figuring out which mail server I wanted to stick to. The next thought was to create a superset of all the contact lists, currently scattered across applications and mail servers, at one place. The next issue was to find a way to update the contact list quickly, rather than clicking around a web-based Address Book or Contact List applications such as the one provided by Yahoo and Google Mail. Then there was the hope that I could keep a copy of the contact list locally on my personal computer, in case, at some point, I did not have internet access to get to the Yahoo Address Book.

This list of requirements seemed formidable in itself, yet, what made me skeptical of a final solution, was one last requirement I had. I had maintained a list of birthdays and anniversaries in a text file separate from the contact lists in the mail servers I mentioned. It was a simple text file and a simple Perl script I wrote could go through this text file everyday and send me an email if it found any upcoming event. I wanted to retain the ability to do such scripting and not have to maintain a separate text file version of the contact list, just for the purposes of being able to run such a reminder script.

After collecting and formulating these thoughts over a long time, I finally spent a few minutes last week looking for a solution to the multi-layered problem. Searching on the internet revealed that there WAS a relatively easy solution that fixes ALL the above problems, including giving me the ability to run a simple script to extract birthday and anniversary information! Here is the solution. Yahoo and Google Address Books allow the existing contacts-list to be exported as a CSV (Comma Separated Variable) file, or a CSV file to be imported to populate the Contact List or Address Book application. A CSV file, as the name suggests, is just a regular text file, with many fields belonging to a record typed across a single line, with the comma symbol (”,”) separating the fields. A new record starts in a new line. The file can be opened with a regular text-editor such as Notepad, Wordpad or Textpad in Windows and vi, pico and emacs in Unix. The file may also be opened using Microsoft Excel spread sheet and the fields show up in separate column and the lines show up in separate rows. This solves the problem of easily modifying the contact list in bulk and storing the contact list as a local file on your personal computer. The CSV file is compatible across Yahoo and Google, and probably across many other applications like Microsoft Outlook and Orkut (web-based networking application). The CSV file can then be imported into Yahoo Mail, Google Mail or other such applications. Problem solved. Single contact list. Storable and updateable locally. Uploadable to multiple web-based servers.

The CSV file based common contact list also allowed me to enter the anniversary and birthday in appropriate columns. I wrote a script called contact.py in the Python scripting language to read the contact list file as a simple text file (in the CSV format) and search for upcoming events. This allowed me to get rid of the earlier text file I had my Perl script read. The CSV file, I called it contactlist.csv, was truly the one file I needed to retain for all my address-book related needs. Whenever I want to add a new contact or update information about an existing contact, I update the local copy of the contact list, contactlist.csv, and then import it into Yahoo Mail and Google Mail to keep them up to date. I have noticed that before I import the latest contactlist.csv file into Yahoo or Google, I need to delete all the existing contacts from Yahoo and Google, respectively. Once, we have an empty contact list on the mail server, the importing of contactlist.csv recreates the complete list. Not starting with an empty contact list on the mail servers, creates duplicates, probably because the “import” function is not smart enough to recognize duplicates.

Here is an example of what a few rows from the CSV file contactlist.csv looks like. It gives us idea of what the fields are. The example also shows that all the fields in a CSV file need not be filled. A field can be left empty if we do not know the information relating to that field for a given contact. Also, I use xxxx for the year field of a date (such as a birthday or an anniversary date), in case I do not know the year. This is OK because the script that parses this CSV file, called contact.py, and which is shown later, does not use the year field to determine if an anniversary is approaching. It only uses the day and month parts of the field.

First,Middle,Last,Nickname,Email,Messenger ID,Home,Work,Pager,Fax,Mobile,Other,Yahoo! Phone,Alternate Email 1,Alternate Email 2,Personal Website,Business Website,Title,Company,Work Address,Work City,Work State,Work ZIP,Work Country,Home Address,Home City,Home State,Home ZIP,Home Country,Birthday,Anniversary,Custom 1,Custom 2,Custom 3,Custom 4,Comments,Messenger ID1,Messenger ID2,Messenger ID3,Messenger ID4,Messenger ID5,Messenger ID6,Messenger ID7,Messenger ID8,Messenger ID9,Skype ID,IRC ID,ICQ ID,Google ID,MSN ID,AIM ID,QQ ID
Shahrukh,Mayur,Khan,srk,srk@bollywood.com,,,,,,,,,,,,,,,,,,,,,,,,,1/2/xxxx,,,,,,,,,,,,,,,,,,,,,,

Here is the contact.py Python script which then works on the CSV file called contactlist.csv with contents as shown above, and sends email to your email account. You might have to appropriately fix some of the fields in the script to get it to work. I present it here just as a hint.

import csv, datetime, re
from string import split
filename = “contactlist.csv”
warnZone = 8 #number of days before which email reminder should be sent

daysInMonth = [’31′,’28′,’31′,’30′,’31′,’30′,’31′,’31′,’30′,’31′,’30′,’31′];
def dayOfYear(month, day):
#print “%s %s” %(month, day)
doy = 0
for n in range(0,int(month)):
if(n == int(month)-1):
doy = doy+int(day)
return doy
else:
doy = doy+int(daysInMonth[n])

now = datetime.datetime.now()
today_month = now.strftime(”%m”)
today_day = now.strftime(”%d”)
today_doy = dayOfYear(today_month, today_day)
#print “%s %s %s” %(today_month, today_day, today_doy)

reader = csv.reader(open(filename))
content = “”
for row in reader:
firstname = (row[0])
middlename = (row[1])
lastname = (row[2])
anniversary = (row[30])
birthday = (row[29])
anni_split = anniversary.split(’/')
bday_split = birthday.split(’/')
#print “len anni_split %s” %(len(anni_split))
#print “len bday_split %s” %(len(bday_split))
if(len(anni_split)>1): #keeps “a/b/c, gets rid of “A”, as in 1st row
anni_month = anni_split[0]
anni_day = anni_split[1]
anni_doy = dayOfYear(anni_month, anni_day)
diff = anni_doy - today_doy
if((anni_doy >= today_doy and anni_doy <= today_doy + warnZone) or (anni_doy <= today_doy + warnZone - 365)):
# print “%s %s %s’s anniversary is on %s/%s” %(firstname, middlename, lastname, anni_month, anni_day)
content += firstname+” “+middlename+” “+lastname+”\’s anniversary is on “+anni_month+” “+anni_day+”\n”
if(len(bday_split)>1): #keeps “a/b/c, gets rid of “A”, as in 1st row
bday_month = bday_split[0]
bday_day = bday_split[1]
bday_doy = dayOfYear(bday_month, bday_day)
diff = bday_doy - today_doy
if((bday_doy >= today_doy and bday_doy <= today_doy + warnZone) or (bday_doy <= today_doy + warnZone - 365)):
# print “%s %s %s’s birthday is on %s/%s” %(firstname, middlename, lastname, anni_month, anni_day)
content += firstname+” “+middlename+” “+lastname+”\’s birthday is on “+bday_month+” “+bday_day+”\n”

#print “%s” %(content)

import smtplib
smtpserver = ‘mailserver.department.company.com’
AUTHREQUIRED = 0
RECIPIENTS = [’gol345die@gmail.com’]
SENDER = [’con789vey@po.doc.com’]
session = smtplib.SMTP(smtpserver)
smtpresult = session.sendmail(SENDER, RECIPIENTS, content)
if smtpresult:
errstr = “”
for recip in smtpresult.keys():
errstr = “”"Could not deliver mail to : %s Server said: %s %s %s”"” % (recip, smtpresult[recip][0], smtpresult[recip][1], errstr)
raise smtplib.SMTPException, errstr

Posted in Information, Tutorials | No Comments »

Cycles per Instruction and its components

April 30th, 2006 admin

Cycles per Instruction is one way to evaluate the microarchitecture’s effect on a processor’s performance. Performance also depends on what constitutes a cycle (frequency of the machine) and what constitutes an instruction (the architecture of the machine). Interestingly, the ratio of these two, i.e. Cycles per Instruction, tells us about something other than those two factors. CPI tells us how many cycles, at the given frequency, the machine takes to chomp down instructions of the given architecture. In other words CPI tells us about the microarchitectural robustness of a design. It tells us, given a certain cycle period (frequency) and a certain group of instructions (architecture), how many cycles does it take to execute an “average” instruction. The “average” instruction depends on the workload (the program) being run through the processor. And therefore, when quoting CPI’s the workload to which that CPI applies is also quoted. The smaller the CPI the better performing a computer system is, all other factors such as the frequency and architecture remaining the same. Of course, other factors such as area, complexity and power must be considered, in addition to performance and the definition of “better” changes accordingly.

CPI is affected by several factors. Architects and designers often need to know where to start when trying to improve performance by making microarchtectural changes. The overall CPI has to be broken into components and the component that is contributing the largest is usually attacked first, since improvement in that design component could potentially reduce the CPI by the largest amount. I will come back to this point again with an example, shortly. For this write-up I will break the CPI into two main components. The “core” component and the “memory-nest” component. The core component deals with all the microarchitectural decision within the processing unit, which in modern microprocessors includes several pipelines (superscalar machines), each with several stages and associated logic - instruction fetch, decode, issue, register access, execute, write-back etc. The memory-nest component deals with getting appropriate instructions and data from the memory. Several levels of caching is provided to take advantage of certain properties seen in programs. These properties are spatial-locality (if a certain memory location is used, something nearby will be used soon) and temporal locality (if a certain memory location is used, it will likely be used again). To keep the data close at hand, and to avoid the potentially hundreds of processor cycles worth of time to go to the memory every time a new instruction or data is needed, instructions and data are stored in caches. There are some great resources on the web and in books which talk about caches in more detail. For this write-up it suffices to say that caches usually are arranged in a hierarchy. Level 1 is the smallest cache but is the fastest to access and is called the L1Cache. Level 2 cache is bigger than the Level 1 cache but because it is bigger and is usually only accessed after it it known that the instruction or data the core was looking for was not available in the L1Cache, it is slower. It is called the L2Cache. Usually the hierarchy has 2-3 levels of caching.

Therefore, if the CPI component is mostly affected not by the design of the core itself, but because of the sizes or geometries of the caches, the designers would be better served by the information that the memory-nest, or more accurately, a certain level cache should be the first place to look for potential improvements, rather than the core. This brings us to the two main topics of this article. First, what quantities in the cache must be measured to help understand its effect on the CPI. Secondly, from these statistics collected on the caches, how do we build the overall CPI number.

Cache Statistics - What they are, what they mean

When modeling caches there are 4 statistics that are often talked about. These are all not completely independent, and are, therefore, easy to get confused about. The 4 main performance-realted statistics collected for a cache are:

MissesPerReference
HitsPerReference
MissesPerInstruction
HitsPerInstruction

MissesPerReference indicates the ratio of how many times, of all the times the cache was accessed, was the cache unable to provide the data requested. HitsPerReference is the complement. It is the ratio of how many times, out of all the times the cache was accessed, was the cache able to provide the data requested. The complementary nature of these quantities makes it necessary to only quote one number.

MissesPerReference + HitsPerReference = 1

MissesPerInstruction is not that straightforward. It is the ratio of how many times the accesses to the cache missed, out of all the “instructions in the workload”! This is a strange measure at first, and takes some getting used to. It’s the ratio if a design-dependent variable quantity - cache misses, to an unchanging quantity, the number of instructions in the workload. In the earlier measure of missesPerReference, the denominator, the number of references to the cache, was variable depending on what happened before this cache was reached by the instruction flow. The number of reference to this cache depends on how well the previous levels of cache did. HitsPerInstruction, similarly is the ratio of the number of hits in the cache to the total number of instructions in the workload. MissesPerInstructions and HitsPerInstruction do NOT add up to 1! They add up to the number of references made to the cache per instruction.

These 4 quantities are depicted by an example in Figure1.

Figure1

Why collect PerInstruction statistics

So the question is, “what is the reason we need these two ways of looking at a cache’s performance?”. Why doesn’t the simpler PerReference style of collecting stats suffice? The answer is that the PerReference numbers are great for designing the cache under study provided the number of references does not change, which implies, the design before the cache in study remains unchanged. The PerInstruction numbers are useful to get an idea of how much the cache under study is contributing to the overall CPI. We will shortly see how that overall CPI is calculated from the PerInstruction numbers. Before the PerReference numbers can be used to improve cache designs, the PerInstruction numbers must point out that it is worthwhile to go after a cache design because the cache is adding a big component to the CPI of the system. And we want to keep the CPI low as far as possible. That is why we want to keep track of the PerInstruction statistics.

How to breakdown overall CPI into components

There are two ways to break the overall CPI, or another way to look at it is, two way to build the overall CPI.
One way is to look at what fraction of instructions “pass through” each level (whether they hit or miss at that level is immaterial) in the hierarchy and add up the cycles they spend at each level. The idea is depicted in Figure 2.
The other way is to look at what fraction of instructions “finish at” each level (i.e. they hit) and add up the cycles those fractions take to finish their journey through the memory hierarchy. The idea is depicted in Figure 3.

Method 1: Breaking CPI into components for each level of the memory hierarchy an instruction passes through

Figure2

This method is often used to represent a CPI stack, which assigns a portion of the CPI to each level. This is very useful in figuring out which level of the cache to go after for improvements.

CPI = CoreComponent + L1Component + L2Component + L3Component + MemComponent

Here the overall CPI is reached by adding CPI contributions at each level of the memory hierarchy. Simulation models for the core often generate “infinite” CPIs. Infinite CPIs are used to keep the effect of the memory hierarchy separate from the effect of the core design, so separate teams can work on both models in parallel. The effect of the memory hierarchy on the CPI has to be added in to the “infinite” CPI to get the overall CPI.

infL1CPI = CoreComponent + L1Component
infL2CPI = CoreComponent + L2Component + L2Component
and so on.

The calculation of infL1CPI assumes the L1 always hits, but has the normal hit latency. So it is an infinitely large L1, but miraculously the directory lookup and array access takes time equal to a realistic sized L1.
The calculation of infL2CPI assumes the L2 always hits, but has the normal hit latency. The L1 is realistically modeled, the L2 is the infinite L2 with correct hit latency modeled.

Going back to
CPI = CoreComponent + L1Component + L2Component + L3Component + MemComponent
Say we have cache models for the L1,L2, L2 caches and the memory. So how do we get the L1Component, L2Component, L3Component and MemComponent, such that we can get the overall CPI for the system. This is where the hitsPerInstruction and missesPerInstruction numbers come in very handy.
L1Component = L1ReferencePerInstruction * L1Latency –(A1)
L2Component = L2ReferencesPerInstruction * (L2Latency - L1Latency) –(A2)
L2Component = L3ReferencesPerInstruction * (L3Latency - L2Latency) –(A3)
MemComponent = MemReferencePerInstruction * (MemLatency - L3Latency) –(A4)

Notice that each component is a product of the fraction of instructions that reference that level and the amount of time those instructions which reference that level are guaranteed to spend at that lavel. Two observations here:
1. the ReferencesPerInstruction at each level is the sum of hitsPerInstruction and missesPerInstruction for that level
2. an easy way to determine the common amount of extra time introduced by a level into the overall execution of an instruction is to find the difference between the time when an instruction hits at that level and one level before.

Method 2: Breaking CPI into components for each level of the memory hierarcy an instruction finishes at

Figure3

This method is often used when the aim is to simply calculate the overall CPI and we are not dealing with breaking the CPI into its CPI stack components.

CPI = CoreComponent + (L1HitsPerInstruction*L1Latency) + (L2HitsPerInstr*L2Latency) + (L3HitsPerInstr*L3Latency) + (MemHitsPerInstr*MemLatency) –(B)

Another way to say this is
CPI = Time spent by all instructions which do not have to even go to the L1Cache(s)
+ time spent by the fraction which hit L1 and face the L1Latency on top of the CoreComponent
+ time spent by the fraction which hit L2 and face the L2Latency on top of the CoreComponent
+ time spent by the fraction which hit L3 and face the L3Latency on top of the CoreComponent
+ time spent by the fraction which hit memory and face memory latency on top of the CoreComponent

Notice that each component looked at this way is the product of the fraction of instructions which hit in a particular level of the hierarchy and the time it takes to service those instructions.

CAVEAT: There is a distinct difference between the CoreComponent component in equations A and equation B. In equations A, CoreComponent applies to all the instructions, since all the instructions pass through the core. In equation B, the coreComponent applies to only those instructions which do not have to leave the core to access L1Cache (or caches, since Level 1 is usually broken into separate instruction and data caches). To simplify this, simulation models of the core often are tightly coupled with at least the L1cache or L1caches. There is a significant amount of overlap in accessing the caches, especially the level 1 cache and therefore the time spent in accessing one instruction or data granule might be able to hide the time to access another. This could adversely affect the CPI calculations. That is why this style of breaking up the CPI into its components and adding them up, only works as an estimate. A realistic CPI can only be realized by building in a more detailed understanding of the paralellism in a workload into the equations above, or by just creating a more accurate simulation model with the core model working hand-in-hand with the memory nest model. One way to make the above equations a little bit closer to reality is by only considering loads (i.e. both instructions and data loads) for the missesPerReference, hitsPerInstruction etc. Stores usually do not directly affect CPI since they are handled by mechanisms that are non-critical to CPI calculations.

Posted By: Anil Krishna at 2:30PM on Sunday, April 30th, 2006

Posted in Tutorials | No Comments »

PseudoLRU Replacement Scheme

October 10th, 2005 admin

I am taking ECE521, a Computer Architecture course at NCSU and am thoroughly enjoying it. One of the topics that was touched upon in the class, and with which I later struggled for a few hours was how pseudoLRU replacement policy works . I could not find useful tutorials on the net and decided to write one for myself. Feel free to use it as is if you find it helpful, or let me know and I can send you the ppt version.

Posted in Tutorials | No Comments »

Branch instructions in the PowerPC Architecture

May 31st, 2005 admin

1 Must-Have Branches and Good-To-Have Branches

Branch instructions show up in a program when sequential execution of instructions in memory is not feasible or possible.

1.1 Not Possible category (Must-have-branches):

When the program’s execution direction can go multiple directions depending on a condition, it becomes imperative to have some kind of branching. This type of branch is called a conditional branch. A condition must be checked to decide which direction the branch would take.

1.2 Not Feasible category (Good-to-have branches):

When a certain sequence of instructions repeats multiple times, it might be put into a loop. That way the instruction sequence is just stored once in memory and executed by jumping back to the start of the loop upon hitting the end of the loop. It is sometimes possible to avoid looping (and compilers often use a technique called loop unrolling) by just laying the instruction sequence out repeatedly in memory and running through it without any branches. It is not possible to unroll the loop effectively if the number of times the loop is executed is unknown up until the point the execution reaches the loop. In that case the example falls under the not possible category, since it requires a condition evaluation to determine how many times to execute the sequence of instructions. Often a sequence of instructions that is used several times and/or at several places in execution is put into a “subroutine” and “called” when required. This kind of subroutine/function calls and returns from those functions to mainline execution, require branch instructions, since these instructions are stored only one place in memory and not necessarily after the calling instruction. These branches could also be avoided by not having function calls and instead inlining code, which is another technique compilers sometimes use to reduce branches. Unconditional branches are branches where the program does not have to check for any condition and can branch to the target address unequivocally. Unconditional branches are branches that are taken no matter what. The instructions executed upon branching could have been simply moved into the main line execution to avoid the branch. That is not feasible because the memory holding the instructions could have a large number of copies of the same sequence in a lot of places, thereby utilizing a lot more memory than necessary.

2 Architectural resources required to handle branches

Hardware that supports PowerPC architecture provides three registers that are used by the various flavors of branch instructions

2.1 Condition Register

This 32-bit register has 8 fields (or 4 bits each) and basically stores bits that describe the state the system is in. There are instructions provided by the PowerPC architecture that can set/clear/perform operations on bits in this register. The bits then are used by conditional branch instructions to determine whether to branch or not (non-branching path is known as the flow through path). Such usage of the condition register is required only by the conditional branches.

2.2 Link Register

When branching to a subroutine it is important to know how to join back with the main line execution when the subroutine is executed. The Link Register is the 64-bit register provided by the hardware to store the address of the instruction that follows the branch instruction. This is the address where execution joins the main line upon completing the subroutine. Since not all branches intend to join back the main line (or, in other words, they are not subroutine calls) this storing of the next instruction’s address into the Link Register has to be requested by the branch instruction (or, in other words, the software).

2.3 Count Register

This 64-bit register is used to hold a loop count, which can be checked as a condition for deciding the branch direction, and also be decremented by a branch instruction. This is clearly useful when a branch is used to determine if the requisite number of loops of a sequence of instructions have been executed.

The Count Register can also be used for an entirely different purpose - to store a target address, although such usage might be a redundancy of the PowerPC architecture.


3 The Branch Instructions – basic (orthogonal) and extended (redundant)

Sections 3.1, 3.2, 3.3 and 3.4 describe the 4 sets of basic branch instructions. These are all the instructions that are required for branching in PowerPC software. Section 3.5 discusses branch instructions that are special cases of the basic instructions, and were used often enough to be rightfully given their own opcodes. These are extended mnemonics, which are good to have but not imperative, since one of the basic instructions could be used for an identical operation.

3.1 b, ba, bl, bla : Unconditional branches to relatively nearby addresses

The target address has to be relatively nearby, to permit use of this instruction because the address is specified in an immediate field in the instruction opcode itself.

The instructions ending with the ‘a’ indicate that the target address is absolute as specified in the instruction opcode itself.

The instructions which do not end in ‘a’ indicate that the target address is relative to the address of the current instruction. So the target address is the sum of the current instructions address and the immediate value in the opcode.

How near is “nearby”?

24 bits in the opcode specify the address. This is extended by a 0b00 (two 0 bits) to make it an address aligned at 4 bytes, which is how every instruction in the machine must be placed in memory. The 26 bits specify 64MB of space. Therefore “nearby” refers to 32MB on either side of the current instruction (relative addressing) or 0-64MB of the programs address range (absolute addressing).

The instructions ending in ‘l’ indicate that the next sequential instruction’s address must be saved into the link register. This is useful when a branch is made to a subroutine and at the end of the subroutine the program need only indicate its wish to return to the main flow of execution. It does not have to know the return address for it can use the address stored in the link register as the target.

Target address is known for these instructions, and can be calculated way in advance in the pipeline, since it is part of the instruction opcode, and in the extreme case might need to be added to the current instruction address.


3.2 bc, bca, bcl, bcla : Conditional branches to relatively nearby addresses

Just as is the case with unconditional branches, for conditional branches too the instructions ending in a and l, mean absolute addressing and branch with link, respectively.

The difference, compared to the unconditional branches, is that two more fields are required BO and BI. Each is 5 bits. BI indicates which bit in the CR (Condition Register) must be looked at to make the branching decision. BO indicates a couple of things.

Firstly it specifies what rule should be used when looking at the BI bit in CR to make the branching decision. So say the BI indicates bit 2. BO might indicate a rule that says branch if the CR bit 2 is 0. BO bits can be used to indicate a slightly more complicated rule such as decrement the Count Register by 1 and then branch if the Count Register holds a non-zero value and the CR bit 2 is 0. There are a few other combinations of conditions that can be checked using the bits in the BO field. These conditions are various combinations of

- Decrementing or not decrementing the CTR (Count Register)

- Branching if the CTR is zero or non-zero and a CR bit is 0 or 1

- Branching if the CTR is zero or non-zero

- Branching if a CR bit is 0 or 1

Secondly, the BO field also provides bits that can be used by the compiler to place hints about the direction the branch is likely going to take. This can help speculatively fetching instructions if the branch is predicted taken by the BO field and the target address is known.

Target address is known for these instructions since it is part of the instruction opcode, and in the extreme case might need to be added to the current instruction address.

How near is “nearby” in this case?

14 bits in the opcode specify the address. This is extended by a 0b00 (two 0 bits) to make it an address aligned at 4 bytes, which is how every instruction in the machine must be placed in memory. The 16 bits specify 64KB of space. Therefore “nearby” refers to 32KB on either side of the current instruction (relative addressing) or 0-64KB of the programs address range (absolute addressing).

If the BO field has 1’s in certain positions (1z1zz – where bit denoted by z is ignored), it indicates that the branch is always taken. So that means it is unconditional! So the four instructions described in section 3.1 on unconditional branch instructions, are actually redundant to a limited extent. The reason they are not completely redundant is because those instructions have 10 more bits for the target address field than the conditional branch instructions described in this section.

3.3 bclr, bclrl

These instructions represent conditional branches to the address specified in the link register. The bclrl, as indicated by the l at the end of the opcode, puts in the address of the next sequential instruction to the current instruction in the link register or on the link stack, after figuring out the target address form the link register.

The reason unconditional branches to the link register do not require additional PowerPC instructions (like blr or blrl) is because there is a way to specify a condition in the BO field of the instruction, that makes it “always taken”, or in other words, unconditional.

The target address for this instruction depends on the value residing in the Link Register (or Link Stack) at the time this instruction is being executed (and all previous instructions, that write the Link Register have done so). Therefore, due to the dependency on the Link Register, the target address may not be calculated very early in the pipeline, unless there are no older instructions in the pipeline that might write to the Link Register. Even in that case, an access to the Link Register is required to read its value. The instruction does not contain the target address as opposed to the cases in sections 3.1 and 3.2.

In addition to the BO and BI fields described in section 3.2, there is a 2-bit field in the opcode, BH. The BH field provides additional hints the hardware to assist it in fetching the correct target address ahead of time. The hints are of the type “this is a subroutine return”, “this is not a subroutine return so likely the same target address would be used as was used the last time this branch was taken” or “the target address is not predictable”. This is especially useful, when the compiler uses bclr but it is not for a subroutine call. In that case it might be useful to let the hardware know that there is a good chance that the target used the last time this branch was executed would likely be the target used this time too. The hardware can fetch the instructions from that path without having to wait for the Link Register to be ready and read to find the target address, provided the hardware maintains some kind of a cache to remember the previous targets.

3.4 bcctr, bcctrl

These instructions represent conditional branches to the address specified in the count register. The count register is used here to hold a target address, rather than a count. The name “count register”, is inaccurate in this case. The bcctrl, as indicated by the l at the end of the opcode, puts in the address of the next sequential instruction to the current instruction in the link register or on the link stack.

The reason unconditional branches to the count register do not require additional PowerPC instructions (like bctr or bctrl) is because there is a way to specify a condition in the BO field of the instruction, that makes it “always taken”, or in other words, unconditional.

The target address for this instruction depends on the value residing in the Count Register at the time this instruction is being executed (and all previous instructions, that write the Count Register have done so). Therefore, due to this dependency on the Count Register, the target address may not be calculated very early in the pipeline, unless there are no older instructions in the pipeline that might write to the Count Register. Even in that case, an access to the Count Register is required to read its value. The instruction does not contain the target address as opposed to the cases in sections 3.1 and 3.2.

In addition to the BO and BI fields described in section 3.2, there is a 2-bit field in the opcode, BH. The BH field provides additional hints the hardware to assist it in fetching the correct target address ahead of time. The hints are of the type “this is not a subroutine return so likely the same target address would be used as was used the last time this branch was taken” or “the target address is not predictable”. As seen in section 3.3, the BH field has more use as a hint for the branch to link register instructions.

3.5 Extended Mnemonics

Special cases of conditional instructions (sections 3.2, 3.3 and 3.4) have been given their own mnemonics for ease of readability of assembly listings.

3.5.1 blt

blt target is the same as bc 12, 0, target

0 is the BI field, which indicates the 0th bit in the 32-bit CR.

12 is the BO field, which, looking at the detailed meaning of the bits indicates “branch if CRBI is 1”, which in this case means “branch if CR0 is 1”.

Bit 0 of the CR is set to 1 when the result of a fixed-point computation is negative.

So effectively “bc 12,0,target” means “branch to the target if a fixed point computation yielded a negative result”. This happens often enough that “branch to target if less than zero” was given a new mnemonic “blt target”.

3.5.2 bne

bne cr2, target is the same as bc 4, 10, target

10 is the BI field, which indicates the 10th bit in the 32-bit CR. Notice that the 10th bit of the CR is in CR field 2.

4 is the BO field, which, looking at the detailed meaning of the bits indicates “branch if CRBI is 0″, which in this case means “branch if CR10 is 0″.

Bit 2 of a specific field in the CR is set to 1 when the result of a compare is equal to 0. The “specific field” has to be specified by the compare instruction. Result of a compare is 0 when the values being compared are equal. So if the bit in the CR is set to 0, that means the values being compared are not equal.

So effectively “bc 4,10,target” means “branch to target register when bit 10 of the CR is 0”, which shows up under the special circumstance meaning “branch to the target if a compare operation that writes to CR field 2 evaluated the operands to be not equal”. This happens often enough that “branch to target if less than zero” was given a new mnemonic “blt cri, target”. i is whatever CR field the compare used.

3.5.3 bdnz

bdnz target is the same as bc 16, 0, target

BI is 0.

BO is 16, which indicates, decrement CTR and branch is CTR is non-zero. There is no dependence on the CR!

“bc 16,0,target” therefore is “branch after decrement if non zero”, which has been given the special mnemonic bdnz.


3.5.4 And a few more extended mnemonics …

There are a few other extended mnemonics that extend from the base instruction in the same way as the instructions in sections 3.5.1, 3.5.2 and 3.5.3

Just like bc extends to blt, bne and bdnz, bclr extends to bltlr, bnelr and bdnzlr; and bcctr extends to bltctr and bnectr.

Note - PowerPC Architecture, POWER Architecture and PowerPC are registered trademarks of International Business Machines Corporation. Other trademarks used are owned by the respective owners.

Posted in Tutorials | No Comments »

Microsoft Word and embedding fonts into the file

December 1st, 2004 admin

I have often had problem where I create a document in a certain font and the recepient complains that the fonts are hideous. That happens because sometimes I use fonts that the recepient’s computer does not have and so it decides to use a font it thinks matches my font, but the choice is often a sad one. Recently I was working with my father on my wedding invitation (True! Kavita and I are getting married on February 13th 2005 and the news deserves a separate blog entry) and faced the problem of not being sure if he is seeing the same font that I intended for him to see the invitation in (he was going to get the cards printed back in Visakhapatnam, India, so he better know exactly what I had in mind). And I came to learn after looking for advice on the internet, that you can embed fonts right into the document just before you save it. This works in PowerPoint too. The way to do this is do a File->Save As, before a keen eye catches at the top right of the Save As window an option called Tools. Upon clicking that you will find the “Embed True Type Fonts” choice(PowerPoint) either right there or under General Options(Word).

Posted in Family, Tutorials | No Comments »

Microprocessors and microarchitectural innovations

November 2nd, 2004 admin

Added to the articles section a presentation I put together after reading an interesting paper by Andreas Moshovos and Gurindar Sohi. Check it out if you are interested in how microprocessors try to be smart about running programs fast here.

Posted in Tutorials | No Comments »