Step | Associated Steps (and checkoff criteria) | Points |
---|---|---|
1 | 2. Assembly instruction: arithmetic | 20 |
2 | 3. Assembly instruction: bit-wise/logical | 30 |
3 | 4. Assembly instruction: memory | 50 |
Β | Total | 100 |
You will not use the microcontroller to complete this lab.
In this lab we will briefly discuss the process of program generation to understand how C source files and assembly source files get translated into program that a machine (or a bare-metal simulator like QEMU) could understand.
We will also walk through several types of assembly instructions used with RISC-V 32-Bit, specifically from the RISC-V 32-bit Instruction Set Architecture (ISA). In addition, some assembly directives will also be discussed to let you be prepared for the next two labs.
The ISA used in the lab is same as the one covered in the book βComputer Organization and Design (RISC-V Edition) (2nd Edition)β. Hence the book serves as the reference for the lab.
A reference manual for the RISC-V ISA can be found here. Refer to the βUnprivilegedβ version, i.e., Volume 1.
You might have written some programs in C or other languages like JavaScript or Python before, just like the one below:
// HelloWorld.c
#include <stdio.h>
int main (int argc, char * argv[]) {
printf("Hello World!");
int x = 1 + 2;
return 0;
}
With a click of button in an IDE or some terminal commands, Voila! The program just works. Nevertheless, how exactly does the CPU recognize and execute the program?
For us, we could just read the source code above and know this is just printing the string Hello World!
to the screen.
Unfortunately, computers cannot understand English, even a small set of the English language like a programming language.
Instead, what the CPU understands is machine language, which is just a set of bits that map to some operations in hardware.
For instance, in the 32-bit RISC-V ISA, or simply RISCV32 (the one we are using) 0000000 01011 01100 000 01101 0110011
gets decoded as a 32-bit addition of the values in registers x11
and x12
, and saves the sum into x13
. Recognition of these bits is hard-wired on the CPU chip and are the machine language representation of the entire instruction set architecture.
But hey, the bit strings are just 1
and 0
, how did we get those from the program we just wrote?
Well you might already know the answer: through compilers and assemblers like those used by gcc
or clang
.
The compiler is programmed to understand specific programming languages and translate programs written in them to the assembly language corresponding to the hardware platform. For instance, if we compile the example above in RISCV32, it might become something like this:
li x11, 1
li x12, 2
add x13, x11, x12
sw x13, -20(sp)
The step of generating assembly from a high level language is done by the compiler. The assembly then needs to be translated into the 1βs and 0βs that correspond to machine code the computer understands, which is done by an assembler. You can also try seeing the machine instructions for your personal computer with the command gcc -S [C source file]
if you are using Linux.
The type of assembly generated depends on the architecture of your host computer - it might generate Intelβs x86 assembly instead of RISC-V, for example. When you run just gcc
on your own machine, youβre generating x86 assembly, allowing your program to run on your own computer. But it is also possible to cross-compile, i.e. you have a compiler that was compiled to run on x86 machines, but when it compiles programs, it does so for RISC-V machines. The riscv64-unknown-elf-gcc
command is an example of such a cross-compiler.
All of these details are generally hidden from the application developer who just relies on the compiler/assembler toolchain to generate binaries that the computer can understand.
Note: If you do try it with a modern compiler, the
int x = 1 + 2
part will likely just become:li a5, 3
sw a5, -20(s0)
This is because the compiler is actually smart enough to evaluate the value for you! Saving time for the actual execution!
Using the same environment you set up on eceprog
, or a lab machine, you can use the following instruction to generate the assembly file for the C program we gave you above.
riscv64-unknown-elf-gcc -S HelloWorld.c
You will now observe that a HelloWorld.s file is generated when you run this command. This file is the assembly language file which is seen by the Assembler.
Note: The reason why even though we are using RISC-V 32-bit ISA, we have compiler which has βriscv64β in its name is because it can compile both 32 and 64 bit RISCV. Note also that this compile wonβt readily be available on any Linux machine. It is already set up on your lab machines and
eceprog
for the purpose of the lab.
Converting assembly instructions to the bitstrings is handled by yet another program called an assembler.
The assembler will do the final translation from plain text assembly to bit strings:
As a reference, here is the HelloWorld.c main function in assembly:
...
.LC0:
.string "Hello World!"
.text
.align 1
.globl main
.type main, @function
main:
addi sp,sp,-48
sd ra,40(sp)
sd s0,32(sp)
addi s0,sp,48
mv a5,a0
sd a1,-48(s0)
sw a5,-36(s0)
lui a5,%hi(.LC0)
addi a0,a5,%lo(.LC0)
call printf
...
Now, we invoke the assembler and check the file content:
# NOTE: This is just an example. You are not expected to
# understand these instructions or what it is doing
# yet.
# The following example can be run on eceprog or your lab machine.
# Assemble the helloworld assembly code into an object
# file called HelloWorld.o
riscv64-unknown-elf-gcc HelloWorld.s -o HelloWorld.o
# Examine the HelloWorld.o content
# Use objdump to disassemble the object file
# You could also use xxd to examine the object file
# xxd: create a hexdump of the binary file
# xxd HelloWorld.o
riscv64-unknown-elf-objdump -D HelloWorld.o > HelloWorldObjectFile.txt
The following are the contents of the text file generated.
...
00000000000101a2 <main>:
101a2: 7179 add sp,sp,-48
101a4: f406 sd ra,40(sp)
101a6: f022 sd s0,32(sp)
101a8: 1800 add s0,sp,48
101aa: 87aa mv a5,a0
101ac: fcb43823 sd a1,-48(s0)
101b0: fcf42e23 sw a5,-36(s0)
101b4: 67f1 lui a5,0x1c
101b6: 79078513 add a0,a5,1936 # 1c790 <__clzdi2+0x48>
101ba: 14c000ef jal 10306 <printf>
101be: 478d li a5,3
101c0: fef42623 sw a5,-20(s0)
101c4: 4781 li a5,0
101c6: 853e mv a0,a5
101c8: 70a2 ld ra,40(sp)
101ca: 7402 ld s0,32(sp)
101cc: 6145 add sp,sp,48
101ce: 8082 ret
...
Generally, you cannot just run the generated object file on a computer running OS (though you might able to run it on a baremetal CPU). There are some additional tasks that need to be done to create an executable file and since we have to use the stdio.h
header file, we will have to link to it so that the program know where to find printf()
. To do so, simply use riscv64-unknown-elf-gcc HelloWorld.o -o HelloWorld
, which will automatically handle all the tasks of invoking the compiler, assembler, and a final tool called a linker that will create the final executable binary.
# Again you can use the riscv gcc compiler in order to generate an executable.
riscv64-unknown-elf-gcc HelloWorld.o -o HelloWorld
./HelloWorld
> Hello World!
If you have multiple source files, the linker will manage merging these object files together into a single executable. We actually use this feature to in the following exercises to link the autograder object file with your solutions. This way the grader could run your code and see if the results are correct.
The following figure provides an overview of the flow necessary to generate an executable from the source code you are familiar with.
βββββββββββββββ Compiler ββββββββββββ Assembler βββββββββββββββ
β Source Code ββββββββββββββΊβ Assembly ββββββββββββββΊβ Object File β
βββββββββββββββ ββββββββββββ ββββββββ¬βββββββ
Linker β
ββββββββββββββββββββββββββββ
β
ββββββββΌββββββββ
β Executable β
ββββββββββββββββ
In this sections, we will discuss the arithmetic instructions of RISCV32 ISA and write some code with them.
The set-up for the lab has already been provided to you. Complete the steps given in this markdown file before you continue further with Lab 8.1.
Fill in the subroutine body for q2_1_example
with instructions that will set the value of x10
to x10 + x11 + x12 + x13
. For instance, if you set the values x10 through x13 like this:
li x10, 1
li x11, 2
li x12, 4
li x13, 8
You should expect that the x10 register will contain the value 15 when execution reaches the nop following the subroutine invocation. It does not matter what values are left in x11, x12, and x13 after the return from example. Remember to use only the registers x10 through x13 when you write your instructions.
The operation cannot be implemented with a single instruction. You must compose multiple instructions to produce the result.
.global q2_1_example
q2_1_example:
/* Enter your code after this comment */
add x11, x10, x11 // now, x10 = x10 + x11
add x11, x11, x12 // now, x12 = x11 + x12
add x11, x11, x13 // finally, x10 = x10 + x11 + x12 + x13
add x10, x11, 0 // Move the results to x10
/* Enter your code above this comment */
ret
You should copy this into the example subroutine in the file, and trace through the execution with the debugger to make sure you understand how it works.
There are usually many ways to write the same high-level operation in assembly language. The fewer instructions you can use, the faster the code will run to completion. Here is another solution for the previous problem that has fewer instructions:
.global q2_1_example
q2_1_example:
/* Enter your code after this comment */
add x10, x10, x11 // now, x10 = x10 + x11
add x12, x12, x13 // now, x12 = x11 + x12
add x10, x10, x12 // finally, x10 = x10 + x11 + x12 + x13
/* Enter your code above this comment */
ret
Since some registers are reused, it may be a little more difficult to understand. You should study it to discover how it works. For todayβs exercises, it does not matter how slowly your solution works (within reason). What does matter is that no registers other than x10, x11, x12, or x13 are modified by the code.
Note: For the purpose of this lab we refer to the registers with their x name. You can write your solution using the ABI names of the registers as well. To get the ABI names, see page 137, Table 25.1 in the reference manual. TL:DR; x10 - x17 maps to a0-a7.
For a quadratic equation $ax^2 + bx + c = 0$, its discriminant, commonly represented by the Greek symbol Delta, determines if the equation has any real roots:
\[\begin{align*} \Delta &= b^2-4ac \end{align*}\]If $\Delta = 0$, the equation has two identical real roots; if $\Delta > 0$, two distinct real roots exist; if $\Delta < 0$, no real roots exist.
Use the assembly function q2_2_delta
to write the assembly necessary to compute the discrminiant of a quadratic equation with the coefficents given in registers x10-x12
. You will need to put the final result back in x10
when the function return. Specifically, the coefficients and registers mapping is:
// For the equation ax^2 + bx + c = 0
a: reg x10
b: reg x11
c: reg x12
// Final result need to be put in register x0
Delta: reg x10
After completing the problem, you could build and run the lab1 executable. The autograder will grade your function by compared its result with the C equivalent function on random inputs.
Hint: You might find the instructions
sub
,li
,mul
to be useful. A reference manual to the RISCV32 can be found here. Use the unprivileged version of the manual.
For a quicker reference, you can check out the Reference Data card weβll post on Piazza. You may want to bookmark that one.
For two vectors in $\mathbb{R}^2$ space, $\vec{A} = (a_1, a_2)$ and $\vec{B} = (b_1, b_2)$, the dot product between them is commonly defined as:
\[\begin{align*} \vec{A} \cdot \vec{B} = a_1b_1 + a_2b_2 \end{align*}\]In this problem, you will fill up the assembly function q2_3_dot_product
to compute the dot product of two 2D integer vectors with their components specified in registers x10-x13
. Similar to previous problem, you will need to put the result back in x10
when the function return. The mapping of the arguments and registers mapping is:
// For two vector A, B
// A = (a1, a2)
// B = (b1, b2)
a1: reg x10
a2: reg x11
b1: reg x12
b2: reg x13
// Final result
a1b1 + a2b2: reg x10
After completing the problem, you could build and run the lab1 executable. The autograder will grade your function by compared its result with the C equivalent function on random inputs.
Hint: You might find the instructions
add
andmul
to be useful.
In this section, we will use some of the logical/bit-wise instructions of RISC-V ISA and write some code with them.
Often in embedded programming, you will need to extract one byte from a word and perform some operations on it.
In this problem, you will need to extract the MSB (most significant byte, the byte to the leftmost)
and the LSB (least significant byte, the byte at the rightmost position) of a 32-bit word.
There are two functions you will need to implement: q3_1_MSB
amd q3_1_LSB
.
Both functions will be passed in with the word in register x10
, and you will put the result back in x10
.
Example: Suppose the given word is
0x11223344AABBCCDD
, its MSB will be0x11
and its LSB will be0xDD
.
Note: Be careful to remember that the B in LSB and MSB is byte, not bit!
Hint: You will need to use logical shift operation and bit-wise AND operation to extract one byte from a 32-bit word. You might find the instructions
srli
andand
useful. Also, you might also want to consider loading a register with value0xFF
.
In embedded programming, usually you will need to configure the peripherals associated with the microcontroller. For instance, you might need to enable certain pins on the development board or configure the clock frequency of the CPU. A typical way to enable or disable them is by setting or resetting certain bits in control registers, which can hold a 32-bit or 64-bit value, with each bit corresponding to a different function. You can imagine each bit as a tiny switch to control one part of the microcontroller.
Another common name for these individual bits (tiny switches) is flag, which is a value with only one bit set to 1 and the rest to 0, like 0x1000000
or 0b0001 0000 0000 0000
.
Keep in mind that the flag is not the actual bit position that you need to shift to - it is the already-shifted value. Therefore, if you needed to turn on bit 17 in a specific control register, the flag value given to you will be 1 << 17
.
In this problem, you will implement 3 functions to:
x10
corresponding to a flag x11
to 1
(function q3_2_flag_set
)x10
corresponding to a flag x11
to 0
(function q3_2_flag_reset
)x10
corresponding to a flag x11
(function q3_2_flag_toggle
)For all three of them, the value will be passed in x10
and the flag will be passed in x11
. You will again save the modified value to x10
.
Hint: Instruction
or
,xor
,xori
, andand
might come in handy. Also note that there is no special instruction for inversion in RISC-V. The developer made the decision to use xori with 0xFFFFFFFF to get the inverse instead in order to keep with the three-operand format of the instructions.
Note: You could reuse what you have in section 3.1 to extract LSB and MSB
In this problem you will implement the function q3_3_swap_byte
that swap the LSB and MSB of a 32-bit word passed in from x10
and store the swapped word in x10
.
Note: You should only swap the LSB and MSB and leave the rest of the bytes intact.
Hint:
slli
could shift to the left andsrli
will shift to the right, both will pad with0
. Also you might want to review what you have done for 3.2 Turning a flag on/off.
So far what you have done in Assembly was manipulating the register values. However, as you might already know, there are simply not enough registers to hold all the variables within a program. This is where memory steps in.
With memory, we could store the values in our registers to memory and leave space to perform other calculations. When we are done, we could just loaded the values back from memory and continue doing our work. You can think of memory as a gigantic hash table or array with address
as the key or index:
// Pseudo-code for memory store and load
// Storing to memory, or writing
Mem[addr] = variable
// Loading from memory, or reading
variable = Mem[addr]
In this section, we will work with memory instructions to perform some simple stores and loads, but first, we will have a crash course on assembly directives that helps programs know where a variable is in memory and what size does it has.
In this lab, you will only work with the .global
and .asciz
directives. You could safely ignore the others.
.global SYMBOL
: make the symbol SYMBOL
visible to other object files during linking, which could either be a variable or a function..asciz
/.string "SOME_STRING"
: specify a null-terminating C-string
\0
char to the end of the string.In a high-level view, these directives provide some meta info beyond the actual instructions so that both assembler and linker can utilize to generate the executable.
In this problem, you will implement the assembly function q4_1_toupper
that will take in an address of a lowercase letter and save its uppercase form to the same address like the C program below:
// Assume we have this function implemented
void toupper(char *c) {...}
char chr = 'a';
char *string = "hello";
toupper(&chr);
toupper(string);
// After the above function calls, we will have
chr = 'A'
string = "Hello"
The address will be passed in register x10
, you will need to read the value at the address in x10
, modify it, and then store back to the same address.
In addition, the autograder will use your function to modify the string at label lowercase_string
and print it out! You could also change the string to whatever you like and observe the effect (autograder wonβt use this string for grading).
Note: You can safely assume that the character at the address passed in will always be lowercase.
Hint#1: You would want to look at instructions
lb
,li
,sub
andsb
for this task.
Hint#2: Check out the ASCII table (try
man ascii
on Linux or Mac terminal) and find the relation between lowercase and uppercase letter!
Swap:
Β Β Β Β Β Β take part in an exchange of
Β Β Β Β Β Β β from Oxford dictionary
In this section, you will implement the assembly function q4_2_swap
that will take two 32-bit integers addresses and swap the content, similar to the C program below:
void swap(int32_t *a, int32_t *b) {...}
int32_t x = 362;
int32_t y = 270;
swap(&x, &y);
// After the above function call
x = 270
y = 362
The two addresses will be passed in register x10
and x11
respectively.
Sometimes, strings are just too long.
In this problem, you will implement the assembly function q4_3_cutter
that will take two arguments, a C character string starting address and a cut position, passed in from x10
and x11
. The function will be similar to the C program below:
void cutter(char * str, uint32_t pos) {
*(str + pos) = '\0'
}
char string[50] = "Hello World!";
cutter(string, 5);
printf(string);
// After the above function call, printf will print
// 'Hello' instead of 'Hello World!' to the console
// since the bit at position 5 gets replaced with the
// null character '\0'.
Hint: You do not need to load the string.
From Gulliverβs Travels by Jonathan Swift
It is allowed on all hands, that the primitive way of breaking eggs before we eat them, was upon the larger end: but his present Majestyβs grandfather, while he was a boy, going to eat an egg, and breaking it according to the ancient practice, happened to cut one of his fingers. Whereupon the Emperor his father published an edict, commanding all his subjects, upon great penalties, to break the smaller end of their eggs.
Although there is not a physical egg within a computer, we did find a way to start a βwarβ on which end should we break the egg: the Big-Endian or the Little-Endian.
Before we actually explain these two vague terms, letβs consider how does a computer store a 32-bit integer inside memory.
We all know that (or know by now) that modern computer store data at byte (8 bits) granularity inside memory. However, for a 32-bit integer like 0x11223344
, it has 4
bytes, in what order should we store it? There are 4! = 24
ways to do so, and we could just store it randomly. But to make our lives easier, and to make the computerβs life easier, we should probably store them in consecutive order starting at a memory address.
Therefore, we store one byte at address x
, and then store the next byte in order at x + 1
, and continue until we finish all 4
bytes. However, there exist two ways to do this - we store the MSB first, or the LSB first.
address | x |
x + 1 |
x + 2 |
x + 3 |
---|---|---|---|---|
LSB first | 0x44 |
0x33 |
0x22 |
0x11 |
MSB first | 0x11 |
0x22 |
0x33 |
0x44 |
Note: MSB refers to most significant byte and LSB refers to least significant byte, check 3.1 for more explanation on them.
If we store the MSB first, it is Big-Endian; if we store the LSB first, it is Little-Endian.
In this part you will need to implement the function q4_4_cvt_endian
that will take in a 32-bit integer and reverse its endianness. That is, if it is little-endian, we want it to be big-endian, and vice-versa.
Again like the previous problem, the integer will be passed by reference with its address passed in register x10
. You will have to store the reversed endianness integer back the same address.
Hint: You might find yourself copy pasting a lot, which is normal, as we wonβt cover loops until next lab.
The following tidbit is more relevant to computer architecture if youβre curious.
It may seem tedious to do this without loops, but implementing a function like this is actually a common compiler optimization called loop-unrolling that is actually faster compared to a loop.
Loops are typically implemented with instructions called branches. These are instructions that tell the CPU to jump to a different part of the program, either with no conditions, or based on the result of a condition (branch-if-equal, branch-if-less-than, etc.).
Modern CPUs are implemented with a component called branch predictors. These try to predict the next instruction to execute. The CPU may start loading the instructions at the part of the program indicated by the predictor, even while the branch/compare instruction itself is being executed. With a correct prediction, a CPU can execute instructions much faster than if it had to wait on the result of the branch instruction. If the prediction is wrong, the CPU will have to βflush its pipelineβ of instructions and start over. This is called a branch mispredict, and can reduce the efficiency of a program. Loop unrolling reduces the number of branches and therefore reduces the number of branch mispredictions. This is why compilers will often unroll loops for you if the limits of the loop are known at compile-time.