2020-03-16
This online book is still being written; this is not the final version. Feedback is welcome via the feedback form.
This book is packed with more content than really fits in a single volume. In it, we will create a program that translates source code written in C into assembly language, that is, a C compiler. The compiler itself is also developed in C. The immediate goal is self-hosting, i.e. compiling the compiler's own source code with the compiler we build.
To keep the explanation of compilers from becoming too difficult, I decided to spread the various topics progressively throughout the book. Here is why.
Compilers can be conceptually divided into multiple stages: parsing, intermediate passes, and code generation. A common textbook approach is to explain each topic in chapters, but books with this approach tend to become too narrow and deep in the middle, making it difficult for readers to follow.
Also, with a development method that builds one stage at a time, you cannot run the compiler until all stages are complete, so a fundamental mistake in your understanding or code is hard to notice until the whole thing starts working. Worse, you don't really know what the next stage expects as input until you have written it, so you don't really know what the previous stage should output. Another problem is that it is hard to stay motivated when you can't compile any code at all until everything is finished.
In this book, I take a different approach to avoid that trap. Early on, you will implement an "original language" with a very simple specification. The language is so simple that you hardly need to know anything about writing compilers to implement it. From there, you will keep adding features to that original language throughout the book, eventually growing it into something consistent with C.
With such an incremental development method, you build the compiler step by step in small commits. Under this method, the compiler is, in some sense, "complete" at every commit. At one stage it may be no more than a calculator; at another, a very limited subset of C; at yet another, a language that can almost be called C. The point is that at every stage we aim for a language whose specification is reasonable for its level of completeness; we do not single out a few features during development just to make it look like C.
We will also explain data structures, algorithms, and computer science knowledge in stages according to the development stage.
Incremental development achieves the goal that, at any point in time while reading this book, the reader has a complete knowledge of how to create a reasonable language at that level. This is much better than the state where only some topics of compiler creation are extremely detailed. By the time you finish reading this book, you will be well-versed in all the topics.
This book also explains how to write large programs from scratch. The skill of creating large programs is a unique skill that is different from learning data structures and algorithms, but I don't think there are many books that explain such things. Also, even if someone explains it to you, you won't really know whether a development method is good or bad unless you actually experience it. This book is designed so that the process of developing your own language into C language will give you practical experience of a good development method.
If the author's plan is successful, by reading this book, readers will not only learn about techniques for creating compilers and the CPU instruction set, but also learn how to break down large programs into small steps and create them little by little. You'll learn testing techniques, version control techniques, and even how to prepare for an ambitious project like writing a compiler.
The intended audience for this book is ordinary C programmers. You don't need to be a super C programmer who knows the C language specification well. It is sufficient that you understand pointers and arrays, and that you can at least take some time to read small C programs written by others.
When writing this book, I tried not only to explain the language and CPU specifications but also, as much as possible, to explain why those designs were chosen. I have also interspersed columns about compilers, CPUs, the computer industry, and their history, which should make for an enjoyable read.
Creating a compiler is a lot of fun. In the beginning, my homemade language could only do ridiculously simple things, but as I continued developing it, it quickly grew to resemble C, to the point that even I was surprised, and it started working as if by magic. During development I am often surprised when large test code that I didn't expect to compile compiles without errors and runs perfectly correctly. Such code is not easy to understand even when I look at the generated assembly myself. Sometimes I even feel that my compiler is more intelligent than I, its author, am. A compiler is a program that, even when you know how it works, still makes you wonder why it works so well. I'm sure you too will fall for its charm.
So, without further ado, let's jump into the world of compiler development with the author!
Column: Why C language?
Among the many programming languages available, why choose C for this book? And why not an entirely homegrown language? There is no reason it absolutely has to be C, but if you have to pick a language for learning how to write a compiler that outputs native code, I think C is one of the few reasonable choices.
Interpreted languages don't allow you to learn much about the lower layers. On the other hand, C is usually compiled into assembly, so by creating a compiler, you can learn about the CPU's instruction set and how programs work, as well as C itself.
C is widely used, so once you have a working compiler, you can have fun compiling third-party source code downloaded from the internet. For example, you can build and run the mini-Unix xv6. If your compiler becomes mature enough, it should even be possible to compile the Linux kernel. This kind of enjoyment is not possible with minor or homegrown languages.
C++ is a statically typed language that compiles to a native machine language like C and is at least as widely used as C. However, the language specifications for C++ are so large that it is impossible to easily create your own compiler, so it is not a realistic option.
Designing and implementing an original language is good for honing your language-design sense, but there is a pitfall: anything that is hard to implement can simply be left out of your own language's specification. That escape is not available with a language like C, whose specification is given as a standard. I think this constraint is actually quite good for learning.
In the text, functions, expressions, commands, and the like, such as main, foo=3, and make, are displayed in a monospaced font.
Code that spans multiple lines is displayed in a frame using a monospaced font, as shown below.
If the boxed code is a shell command meant to be typed verbatim by the user, a line starting with $ represents a prompt. Type the rest of that line (but not the $ itself) into the shell. The other lines represent the output from the commands you entered. For example, the block below shows what happens when the user types the string make and presses Enter; the line starting with make: is the output from the make command:
$ make
make: Nothing to be done for `all'.
This book assumes a 64-bit Linux environment running on a so-called ordinary PC such as Intel or AMD. Please install development tools such as gcc and make in advance according to the distribution you are using. If you are using Ubuntu, you can install the commands used in this book by running the following command.
$ sudo apt update
$ sudo apt install -y gcc make git binutils libc6-dev
Although macOS is fairly compatible with Linux at the assembly source level, it is not fully compatible (specifically, a feature called "static linking" is not supported). Although it's possible to create a C compiler for macOS using the information in this book, if you try it you'll likely run into a number of minor incompatibilities. It is not recommended to simultaneously learn techniques for creating a C compiler and the differences between macOS and Linux. When something doesn't work, it's hard to know which understanding is wrong.
Therefore, this book does not cover macOS. On macOS, please use some kind of virtual environment to set up Linux. If this is your first time preparing a Linux virtual environment, refer to Appendix 3, which summarizes how to build a development environment using Docker.
Windows is not compatible with Linux at the assembly source level. However, Windows 10 can run Linux as if it were an ordinary application, and by using this you can develop on Windows. That Linux-compatible environment is an application called Windows Subsystem for Linux (WSL). When working through this book on Windows, please install WSL and develop inside it.
Column: Cross compiler
The machine on which the compiler runs is called the "host," and the machine on which the code output by the compiler runs is called the "target." Although both are 64-bit Linux environments in this book, the host and target do not necessarily have to be the same.
A compiler whose host and target are different is called a cross-compiler. For example, a compiler that runs on Windows that generates an executable file for Raspberry Pi is a cross compiler. Cross-compilers are often used when the target machine is too weak or specialized to run the compiler.
Rui Ueyama (@rui314) is the original author and current maintainer of the high-speed linker lld, which is used as the standard linker for creating executables in many OSes and projects, including Android (Q and later), FreeBSD (12 and later), the Nintendo Switch, Chrome, and Firefox. (So there is a good chance that a binary created by a tool he wrote is already on your computer.) He is also the author of the compact C compiler 8cc. He mainly writes essays about software on note.
Column: Compiler that compiles the compiler
Self-referential situations, such as a C compiler written in C, are not uncommon. Many language implementations other than C are written using the language itself.
If there is already an implementation of language X, there is no logical contradiction in creating a new X compiler using the language itself. If you decide to self-host, you can simply develop with your existing compiler and switch when you're done with your own. That's exactly what we try to do in this book.
But what if you don't have an existing compiler? In that case, you have no choice but to write in another language. When writing your first compiler for the X language with the intention of self-hosting, you will need to write it using an existing Y language that is different from X, and once the compiler is complete, you will need to rewrite the compiler itself from the Y language to the X language.
If you trace how the compilers for today's complex programming languages were themselves compiled, and the compilers that compiled those, and so on, you should eventually arrive, in the early days of computing, at a simple assembler that someone wrote directly in machine code. We don't know whether there was one such assembler or several, the ultimate ancestors of all existing language implementations, but there is no doubt that modern compilers descend from a very small number of ancestors. Executable files other than compilers are also usually generated by compilers, so almost all existing executables are indirect descendants of those original assemblers. It's an interesting story, similar to the origin of life.
In this chapter, the goal is to get a rough idea of the components that make up a computer and what kind of code should be output from the C compiler we create. I will not go into details about specific CPU instructions yet. It is important to understand the concept first.
The components that make up a computer can be broadly divided into CPU and memory. Memory is a device that can hold data, and a CPU is a device that performs some processing while reading and writing from that memory.
Conceptually, memory appears to the CPU as a huge array of randomly accessible bytes. When the CPU accesses memory, it specifies which byte it wants by a number, and that number is called an "address." For example, "reading 8 bytes of data from address 16" means reading the 8 bytes that start at the 16th byte of this byte array.
Both the programs the CPU executes and the data those programs read and write reside in memory. The CPU keeps the "address of the currently executing instruction" internally: it reads the instruction at that address, does what it says, then reads and executes the next instruction. The address of the currently executing instruction is called the "program counter" (PC) or the "instruction pointer" (IP). The format of the programs the CPU executes is called "machine code."
The program counter does not necessarily advance linearly to the next instruction. A type of CPU instruction called a branch instruction allows you to set the program counter to any address other than the next instruction. This function enables if statements, loops, etc. Setting the program counter to a location other than the next instruction is called "jumping" or "branching."
In addition to the program counter, the CPU also has a small data storage area. For example, Intel and AMD processors have 16 areas that can hold 64-bit integers. This area is called a " register ". Memory is an external device to the CPU, and reading and writing from it takes some time, but registers exist inside the CPU and can be accessed without delay.
Many machine-language instructions take the values of two registers, perform an operation, and write the result back to a register. A running program therefore repeatedly reads data from memory into registers, operates between registers, and writes results back to memory.
The set of instructions a particular machine language provides is collectively called the "instruction set architecture" (ISA), or simply the "instruction set." There is not just one instruction set; in principle, each CPU can be designed with its own. However, since the same program cannot run across CPUs without machine-language-level compatibility, there are not many variations in practice. PCs use the x86-64 instruction set from Intel and its compatible chip maker AMD. Although x86-64 is one of the major instruction sets, it does not dominate the market alone; for example, iPhone and Android devices use an instruction set called ARM.
Column: x86-64 instruction set name
x86-64 is also sometimes called AMD64 , Intel 64 , x64 , etc. There is a historical reason why the same instruction set has multiple names like this.
The x86 instruction set was created by Intel in 1978, but it was AMD that expanded it to 64 bits. Around 2000, when 64-bit processors were becoming a necessity, Intel was working hard on a completely new instruction set called Itanium, and didn't bother working on the 64-bit version of x86 that would compete with it. Taking advantage of this gap, AMD formulated and released specifications for the 64-bit version of x86. That's x86-64. After that, AMD renamed x86-64 to AMD64, probably due to branding strategy.
Later, Itanium's failure became obvious and Intel had no choice but to make a 64-bit version of x86, but by then actual AMD64 chips were already on the market, and pushing an extended instruction set incompatible with AMD's would have been difficult, so Intel decided to adopt an AMD-compatible instruction set. It is said that there was also pressure from Microsoft to maintain compatibility. Intel thus adopted an instruction set almost identical to AMD64, naming it IA-32e. The choice of the name IA-32e (Intel Architecture 32 extensions), rather than anything containing "64," seems to reflect Intel's reluctance, since Itanium was supposed to be the core of its 64-bit strategy. After Intel abandoned Itanium entirely, IA-32e was renamed to the plainer Intel 64. Microsoft calls x86-64 "x64," perhaps because they dislike long names.
For the reasons listed above, x86-64 has many different names.
Open source projects often prefer the name x86-64, which does not include the name of a specific company. This book also uses the name x86-64 throughout.
Machine language is read directly by the CPU, so it is designed solely for the CPU's convenience, not for human readability. Writing machine code directly with a binary editor is not impossible, but it is very hard work. That is why assembly language was invented. Assembly corresponds almost one-to-one with machine language, yet is far easier for humans to read.
For compilers that output native binaries rather than targeting a virtual machine or interpreter, the usual goal is to output assembly. Even compilers that appear to emit machine language directly often run an assembler behind the scenes after emitting assembly. The C compiler created in this book also outputs assembly.
Converting assembly code into machine language is sometimes called "compiling," but it is also sometimes called "assembling " to emphasize that the input is assembly.
You may have seen assembly before; if not, now is a good time to take a look. Let's use the objdump command to disassemble a handy executable and display the machine code it contains as assembly. Below is the result of disassembling the ls command.
$ objdump -d -M intel /bin/ls
/bin/ls: file format elf64-x86-64
Disassembly of section .init:
0000000000003d58 <_init@@Base>:
3d58: 48 83 ec 08 sub rsp,0x8
3d5c: 48 8b 05 7d b9 21 00 mov rax,QWORD PTR [rip+0x21b97d]
3d63: 48 85 c0 test rax,rax
3d66: 74 02 je 3d6a <_init@@Base+0x12>
3d68: ff d0 call rax
3d6a: 48 83 c4 08 add rsp,0x8
3d6e: c3 ret
...
In my environment, the ls command contains about 20,000 machine-language instructions, so the disassembled output is nearly 20,000 lines long. Only the first part is shown here.
Assembly basically has one line per machine-language instruction. For example, consider the following line:
3d58: 48 83 ec 08 sub rsp,0x8
What does this line mean? 3d58 is the memory address containing this machine code. In other words, when the ls command runs, the instruction on this line is placed at address 0x3d58 in memory and executed when the program counter reaches 0x3d58. The next four hexadecimal numbers are the actual machine code; the CPU reads this data and executes it as an instruction. sub rsp,0x8 is the assembly corresponding to that machine code. The CPU instruction set is explained in a separate chapter, but this particular instruction subtracts 8 from the register called RSP.
To get an idea of what kind of output the C compiler is producing, let's compare the C code and the corresponding assembly code. As the simplest example, consider the following C program.
int main() {
  return 42;
}
Suppose this program is saved in a file called test1.c. You can then verify that it really returns 42 by compiling and running it like this:
$ cc -o test1 test1.c
$ ./test1
$ echo $?
42
In C, the value returned by main is the exit code of the whole program. The exit code is not displayed on the screen, but it is implicitly set in the shell variable $?, so you can see a command's exit code by running echo $? immediately after the command finishes. You can see that 42 is indeed returned here.
Now, the assembly program corresponding to this C program looks like this:
.intel_syntax noprefix
.globl main
main:
mov rax, 42
ret
This assembly defines a global label main, followed by the code of the main function. Here we set the value 42 in a register called RAX and then return from main. There are 16 registers that can hold integers, including RAX, and the convention is that the value left in RAX when a function returns becomes the function's return value; that is why we set RAX here.
Let's actually assemble and run this assembly program. The conventional extension for assembly files is .s, so write the assembly code above into a file named test2.s and run the following commands.
$ cc -o test2 test2.s
$ ./test2
$ echo $?
42
42 is now the exit code, just like in C.
Roughly speaking, a C compiler is a program that reads C code like test1.c and outputs assembly like test2.s.
As a slightly more complex example, let's look at how code with function calls is translated into assembly.
A function call differs from a simple jump: after the called function finishes, execution must return to where it left off. The address to return to is called the "return address." If function calls were only one level deep, the return address could be kept in a CPU register, but since calls can nest arbitrarily deep, return addresses must be saved in memory. In practice, return addresses are stored on a stack in memory.
A stack can be implemented with just a single variable holding the address of the top of the stack; the storage that holds the top of the stack is called the "stack pointer." To support programming with functions, x86-64 provides a register dedicated to the stack pointer and instructions that use it. Putting data onto the stack is called a "push," and taking data off the stack is called a "pop."
Now, let's look at an example of a function call. Consider the following C code:
int plus(int x, int y) {
  return x + y;
}

int main() {
  return plus(3, 4);
}
The assembly corresponding to this C code looks like this:
.intel_syntax noprefix
.globl plus, main
plus:
add rsi, rdi
mov rax, rsi
ret
main:
mov rdi, 3
mov rsi, 4
call plus
ret
The first line is a directive that selects the assembly syntax. The second line, starting with .globl, tells the assembler that the two functions plus and main are visible to the whole program rather than being file-scoped. You can ignore this for now.
Look at main first. In the C code, plus is called from main with two arguments. The convention is that the first argument goes in the RDI register and the second in the RSI register, so the first two lines of main set those values accordingly.
call is the instruction that invokes a function. Specifically, call pushes the address of the next instruction (here, ret) onto the stack, and then jumps to the address given as its operand. Therefore, once the call instruction executes, the CPU starts executing the plus function.
Next, look at the plus function. It consists of three instructions.
add is the addition instruction. Here, the result of adding the RSI and RDI registers is written to the RSI register. x86-64 integer arithmetic instructions usually take only two registers, so the result overwrites the value in the first operand's register.
A function's return value is supposed to be placed in RAX. Since we want to return the addition result, we copy the value from RSI to RAX; that is what the mov instruction does here. mov is short for "move," but it does not actually move data; it simply copies it.
At the end of plus, ret is executed to return from the function. Specifically, ret pops one address off the stack and jumps to that address. In other words, ret undoes what call did and resumes execution in the calling function; ret is defined as the inverse of the call instruction.
When you return from plus, execution resumes at main's ret instruction. The original C code returns plus's return value from main as is. Since plus's return value is already in RAX, simply returning from main makes it main's return value as well.
This chapter provided an overview of how a computer works internally and what a C compiler should do. At first glance, assembly and machine language look like a jumbled mass of data far removed from C, but many readers may have found that they actually reflect the structure of C more directly than expected.
This book has not yet explained much concrete machine language, so you probably don't know the meaning of each individual instruction in the assembly displayed by objdump, but you can likely sense that each instruction does only a small amount of work. At this stage, that sense is enough.
The main points of this chapter are summarized below in bullet points.
Column: Online compiler
Looking at C code and its compiled output is a good way to learn assembly, but repeatedly editing source code, compiling it, and checking the assembly output can be surprisingly tedious. There is a very convenient website that saves you the hassle: Compiler Explorer (commonly known as godbolt). When you type code into the text box on the left half of the screen, the corresponding assembly output appears in real time on the right half. It is a handy site whenever you want to check what kind of assembly a piece of C code compiles to.
In this chapter, as the first step in writing a C compiler, we will support the four arithmetic operations and related operators so that we can compile expressions such as:
30 + (4 - 2) * -5
This may seem like a simple goal, but it is actually quite difficult. Arithmetic expressions have structure, for example parenthesized subexpressions bind first and multiplication binds tighter than addition, and unless the compiler understands that structure somehow, it cannot compute the correct result. Yet the expression given as input is just a flat string of characters, not structured data. To evaluate an expression correctly, we must analyze that sequence of characters and recover the structure hidden in it.
These parsing problems are quite difficult to solve without any prior knowledge. In fact, these problems were once thought to be difficult problems, and intensive research was conducted and various algorithms were developed, especially from the 1950s to the 1970s. Thanks to that work, parsing is now a fairly simple problem once you know how to do it.
This chapter describes one of the most common parsing algorithms, recursive descent parsing. The C/C++ compilers that you use every day, such as GCC and Clang, also use recursive descent parsing.
The need to read text with some kind of structure often arises when programming, not just with compilers. The techniques you will learn in this chapter can be used directly for such problems. It is no exaggeration to say that the syntax analysis techniques you will learn in this chapter are techniques that will last a lifetime. Read this chapter to understand algorithms and add parsing skills to your programmer's toolbox.
Consider the simplest subset of the C language. What kind of language do you imagine? Is it a language that has only a main function? Or a language consisting of a single expression? Taken to the extreme, I think it's fair to say that a language consisting of a single integer is the simplest imaginable subset.
In this step, let's implement the simplest language first.
The program you create in this step is a compiler that reads one number from its input and outputs assembly that exits with that number as the program's exit code. That is, the input is simply a string such as 42, and when it reads that, the compiler outputs assembly like this:
.intel_syntax noprefix
.globl main
main:
mov rax, 42
ret
.intel_syntax noprefix is an assembler directive that selects the Intel syntax used in this book from among the multiple ways of writing assembly. Be sure to put this line at the top of the assembly your compiler produces. The other lines are as explained in the previous chapter.
At this point, the reader may think, "this program can hardly be called a compiler." Honestly, I think so too. However, this program accepts a language consisting of a single number as input and outputs code corresponding to that number, so by definition it is a perfectly fine compiler. Even a simple program like this quickly becomes surprisingly complex as you extend it, so let's complete this step first.
This step is actually very important in the overall development procedure, because what we create here will serve as the skeleton for all future development. In this step, in addition to the compiler itself, we will also create a build file (Makefile) and automated tests, and set up a git repository. Let's look at these tasks one by one.
The C compiler created in this book is called 9cc. cc is an abbreviation for C compiler. The number 9 doesn't have any particular meaning, but the C compiler I created before was called 8cc, so I decided to name it 9cc because it's the next work. Of course, you can give it any name you like. However, don't think too much about the name in advance and be unable to start writing the compiler. You can change names later, including GitHub repositories, so there is no problem starting with any name.
Column: Intel notation and AT&T notation
In addition to the Intel notation used in this book , an assembler notation called AT&T notation is also widely used, mainly in Unix. By default, gcc and objdump output assembly using AT&T notation.
In AT&T notation, the result register comes second, so two-operand instructions are written with their operands in reverse order. Register names are written with a % prefix, like %rax, and literal numbers with a $ prefix, like $42. Also, memory references use their own notation based on () instead of [].
Below are some examples for comparison.
mov rbp, rsp // Intel
mov %rsp, %rbp // AT&T
mov rax, 8 // Intel
mov $8, %rax // AT&T
mov [rbp + rcx * 4 - 8], rax // Intel
mov %rax, -8(%rbp,%rcx,4) // AT&T
In the compiler I created this time, I decided to use Intel notation for readability. Intel's instruction set manual uses Intel notation, which has the advantage that you can write the manual's description directly into code. The expressive power is the same for both AT&T notation and Intel notation. No matter which notation is used, the generated machine language instruction sequence is the same.
Normally we would give input to the compiler as a file, but to avoid having to open and read files, we will pass the code directly as the first command-line argument. A C program that reads its first argument as a number and embeds it into a fixed assembly template can be written easily, as follows.
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
if (argc != 2) {
fprintf(stderr, "wrong number of arguments\n");
return 1;
}
printf(".intel_syntax noprefix\n");
printf(".globl main\n");
printf("main:\n");
printf(" mov rax, %d\n", atoi(argv[1]));
printf(" ret\n");
return 0;
}
Create an empty directory called 9cc, and inside it create a file 9cc.c with the above contents. Then run 9cc as shown below to check that it works.
$ cc -o 9cc 9cc.c
$ ./9cc 123 > tmp.s
The first line compiles 9cc.c to create the executable 9cc. The second line passes 123 to 9cc as input and writes the generated assembly to the file tmp.s. Let's check the contents of tmp.s.
$ cat tmp.s
.intel_syntax noprefix
.globl main
main:
mov rax, 123
ret
As you can see, the assembly is generated correctly. You can create an executable file by passing an assembly file like this to an assembler.
On Unix, cc (or gcc) is a front end for many languages, not just C and C++: it determines the language from the extension of a given file and invokes the appropriate compiler or assembler. Therefore, just as when compiling 9cc, you can assemble an assembler file with the .s extension by passing it to cc.
The following is an example of assembling and running the generated executable file.
$ cc -o tmp tmp.s
$ ./tmp
$ echo $?
123
In the shell, the exit code of the previous command can be accessed through the variable $?. In the example above, the number 123 is displayed, which is the same as the argument given to 9cc, so it is working correctly. Try giving numbers other than 123 in the range 0 to 255 (Unix process exit codes must be 0 to 255) and check that 9cc actually works.
Many readers may have never written tests for hobby programming, but in this book we will write code to test new code each time we extend the compiler. Writing tests may seem tedious at first, but you'll soon start to appreciate them. If you don't write test code, you'll end up having to run the same tests by hand every time to check the operation, but doing it by hand is much more troublesome.
I think much of the impression that writing tests is tedious stems from the fact that test frameworks can be heavyweight and the philosophy around testing sometimes dogmatic. For example, testing frameworks such as JUnit have a variety of useful features, but they take time to install and learn. Therefore, this book does not introduce such testing frameworks. Instead, we'll write a very simple handwritten "test framework" as a shell script and use it to write our tests.
Below is the test shell script, test.sh. The shell function assert takes two arguments, an expected output value and an input value; it actually assembles the result of running 9cc on the input and compares the actual result with the expected value. After defining assert, the script uses it to verify that both 0 and 42 compile correctly.
#!/bin/bash
assert() {
  expected="$1"
  input="$2"

  ./9cc "$input" > tmp.s
  cc -o tmp tmp.s
  ./tmp
  actual="$?"

  if [ "$actual" = "$expected" ]; then
    echo "$input => $actual"
  else
    echo "$input => $expected expected, but got $actual"
    exit 1
  fi
}

assert 0 0
assert 42 42

echo OK
Create test.sh with the above content and make it executable with chmod a+x test.sh. Then run test.sh. If no errors occur, it prints OK at the end and exits, as shown below.
$ ./test.sh
0 => 0
42 => 42
OK
If an error occurs, test.sh does not print OK. Instead, it displays the expected and actual values for the failing test, as shown below.
$ ./test.sh
0 => 0
42 expected, but got 123
When you want to debug the test script, run it with bash's -x option. With -x, bash displays a trace of the execution, as shown below.
$ bash -x test.sh
+ assert 0 0
+ expected=0
+ input=0
+ ./9cc 0
+ cc -o tmp tmp.s
+ ./tmp
+ actual=0
+ '[' 0 = 0 ']'
+ echo '0 => 0'
0 => 0
+ assert 42 42
+ expected=42
+ input=42
+ ./9cc 42
+ cc -o tmp tmp.s
+ ./tmp
+ actual=42
+ '[' 42 = 42 ']'
+ echo '42 => 42'
42 => 42
+ echo OK
OK
The "testing framework" we use throughout this book is simply a shell script like the one above. It may seem too simple compared to a full-fledged testing framework such as JUnit, but its simplicity matches the simplicity of 9cc itself, which is exactly what we want. The essence of automated testing is just running the code you've written and mechanically comparing the results, so start by testing without overthinking it.
Throughout this book, you will build 9cc hundreds or even thousands of times. The work of creating the 9cc executable and then running the test script is the same every time, so it is convenient to leave it to a tool. The make command is commonly used for this purpose.
When run, make reads a file named Makefile in the current directory and executes the commands written in it. A Makefile consists of rules, each a line ending with a colon followed by the commands for that rule. The following Makefile automates the commands we want to run in this step.
CFLAGS=-std=c11 -g -static
9cc: 9cc.c
test: 9cc
./test.sh
clean:
rm -f 9cc *.o *~ tmp*
.PHONY: test clean
Create the above file with the name Makefile in the same directory as 9cc.c. Then, just by running make, 9cc will be built, and by running make test, the tests will run. Since make understands file dependencies, there is no need to run make after changing 9cc.c and before running make test: if the executable 9cc is older than 9cc.c, make will rebuild 9cc before running the tests.
make clean is a rule for deleting temporary files. You could delete temporary files manually with rm, but you might accidentally delete a file you didn't want to delete, so I decided to write it into the Makefile as a utility.
Note that when writing a Makefile, the indentation must be a tab character; four or eight spaces will result in an error. This is simply bad syntax design, but make is an old tool developed in the 1970s, and it has traditionally been this way.
Be sure to pass the -static option to cc. This option is explained in the chapter on dynamic linking; you don't need to think about what it means right now.
This book uses git as the version control system. Throughout this book we will create the compiler step by step; for each step, make a git commit with a commit message. The commit message can be in whatever language you are comfortable with, but make sure the first line is a one-line summary of what actually changed. If you want to add a longer explanation, leave one blank line after the first line and write the explanation below it.
Only files that you write by hand should be put under version control. Files generated by running 9cc can be regenerated by running the same command, so there is no need to track them. In fact, including such files makes each commit's diff unnecessarily long, so they should be excluded from version control and kept out of the repository.
In git, you can write patterns for files to be excluded from version control in a file called .gitignore. Create a .gitignore with the following content in the same directory as 9cc.c, so that git ignores temporary files, editor backup files, and so on.
*~
*.o
tmp*
a.out
9cc
If this is your first time using git, tell git your name and email address. They will be recorded in the commit log. Below is an example using the author's name and email address; replace them with your own.
$ git config --global user.name "Rui Ueyama"
$ git config --global user.email "ruiu@cs.stanford.edu"
To make a commit with git, you first need to git add the files that have changed. Since this is our first commit, we will first create a git repository with git init and then git add all the files we have created so far.
$ git init
Initialized empty Git repository in /home/ruiu/9cc
$ git add 9cc.c test.sh Makefile .gitignore
Then commit with git commit.
$ git commit -m "Create a compiler that compiles a single integer"
The -m option specifies a commit message. Without -m, git launches an editor. You can check that the commit succeeded by running git log -p:
$ git log -p
commit 0942e68a98a048503eadfee46add3b8b9c7ae8b1 (HEAD -> master)
Author: Rui Ueyama <ruiu@cs.stanford.edu>
Date: Sat Aug 4 23:12:31 2018 +0000
Create a compiler that compiles a single integer
diff --git a/9cc.c b/9cc.c
new file mode 100644
index 0000000..e6e4599
--- /dev/null
+++ b/9cc.c
@@ -0,0 +1,16 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv) {
+ if (argc != 2) {
...
Finally, let's upload the git repository we have created so far to GitHub. There's no particular reason you must upload it to GitHub, but there's also no reason not to, and GitHub is useful as a backup for your code. To upload, create a new repository named 9cc (in this example it was created under the rui314 user) and add it as a remote repository with the following command:
$ git remote add origin git@github.com:rui314/9cc.git
Then run git push to push the contents of your repository to GitHub. After git push finishes, open GitHub in a browser and verify that your source code has been uploaded.
This completes the first step of creating the compiler. The compiler at this stage is too simple to really be called a compiler, but it is a genuine program that contains all the elements a compiler needs. From here on we will keep extending its functionality and, hard as it may be to believe, grow it into a proper C compiler. Enjoy completing this first step.
Risks of emulating “good design” compilers
In the compiler world there are well-regarded open-source programs such as GCC and LLVM. Such programs are believed to be built on a "good design" that lets them efficiently produce code of excellent quality. When writing your own compiler, it is a natural idea to model its internal structure on such compilers.
If you try to model your compiler on such large existing compilers, you will first design a general-purpose intermediate language like LLVM's and then write a frontend and backend that handle it. If you have studied compiler optimization, you will also want to add an intermediate form such as SSA and optimization passes that assume SSA. It is fun to envision "the ultimate compiler" that combines every best practice.
However, the development style of first completing a "good design" blueprint like that of a large compiler and only then starting to implement it almost never works in practice.
Large compilers were not built from the start on a well-abstracted design. They started out with many ad-hoc parts and gradually evolved into their current form. Imitating only the finished form means skipping the evolutionary process that made the design work.
The storage hierarchy and economics of computers
When writing programs, you notice that storage everywhere involves a trade-off between size and speed. Registers provide only a few hundred bytes of storage, but can be accessed from inside the CPU with no delay. Main memory, built from DRAM, takes more than 100 clock cycles to access from the CPU, but can be several GiB in size. Between registers and DRAM sit caches implemented on the CPU chip; at each level, L1, L2, and L3, they range from small and fast to relatively large and slow.
This speed/capacity trade-off is not limited to the CPU and main memory. SSDs are larger than DRAM and roughly 1000 times slower. HDDs are larger and slower than SSDs. Internet storage such as Dropbox or Google Drive can be far larger than HDDs, and takes even longer to access.
Why is there a general law that fast storage is small and slow storage is large?
One reason is that for some kinds of storage, capacity and speed are inherently in a trade-off. For example, more registers might seem strictly better, but adding registers enlarges the circuit and reduces the silicon available for other features. Also, more registers require wider register-specifier fields in the instruction set, making instructions longer and the instruction cache less efficient. Beyond a certain point, adding registers yields almost no speedup, so more is not always better.
For external storage such as SSDs and HDDs, storage that is both fast and large, and storage that is both slow and small, do in fact exist. But when a fast, large-capacity technology appears, inferior technologies are driven out of the market, and likewise there is no point in making slow, small storage, so the technologies actually on the market are only the small-and-fast and the large-and-slow. In a computer museum you can see real examples of memory technologies, such as core memory and mercury delay lines, that are small and slow compared with today's mainstream.
If there were a nonvolatile storage that was both the fastest and the largest, it could become the ultimate single memory technology, but unfortunately no such device exists today. Unless a memory technology with the best characteristics on every axis is developed, building computers by layering storage systems with uneven performance characteristics into a hierarchy is the natural engineering choice.
Reference implementation
In this step, we will extend the compiler from the previous step so that it accepts expressions containing addition and subtraction, such as 2+11 and 5+20-4, rather than just a single value such as 42.
You could compute an expression such as 5+20-4 at compile time and embed the resulting number (here, 21) in the assembly, but that would be an interpreter rather than a compiler, so instead we output assembly that performs the additions and subtractions. The assembly instructions for addition and subtraction are add and sub. add takes two registers, adds their contents, and writes the result to its first operand register. sub is the same as add except that it subtracts. Using these instructions, 5+20-4 can be compiled as follows:
.intel_syntax noprefix
.globl main
main:
  mov rax, 5
  add rax, 20
  sub rax, 4
  ret
In this assembly, mov sets RAX to 5, then 20 is added to RAX, and then 4 is subtracted. The value of RAX when ret executes should therefore be 21, the value of 5+20-4. Let's run it and check. Save the above as tmp.s, assemble it, and run it.
$ cc -o tmp tmp.s
$ ./tmp
$ echo $?
21
As shown above, 21 is displayed correctly.
Now, how should we create this assembly file? If we regard these expressions involving addition and subtraction as a "language," that language can be defined as follows: a number, followed by zero or more occurrences of either + followed by a number or - followed by a number. Translating this definition into C yields a program like the following:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "wrong number of arguments\n");
    return 1;
  }

  char *p = argv[1];

  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");
  printf("  mov rax, %ld\n", strtol(p, &p, 10));

  while (*p) {
    if (*p == '+') {
      p++;
      printf("  add rax, %ld\n", strtol(p, &p, 10));
      continue;
    }

    if (*p == '-') {
      p++;
      printf("  sub rax, %ld\n", strtol(p, &p, 10));
      continue;
    }

    fprintf(stderr, "unexpected character: '%c'\n", *p);
    return 1;
  }

  printf("  ret\n");
  return 0;
}
The program is a little longer, but the first half and the final ret line are the same as before. Code for reading terms has been added in the middle. Since the program no longer reads just a single number, after reading a number we need to know how far it has read. atoi does not return the number of characters it consumed, so with atoi we would not know where the next term begins. Therefore we use the strtol function from the C standard library.
strtol reads a number and then updates the pointer passed as its second argument to point at the character just after the last character it read. Therefore, after reading a number, if the next character is + or -, p points at that character. The program above takes advantage of this fact, reading terms one after another in the while loop and outputting one line of assembly for each term it reads.
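To see strtol's pointer-advancing behavior in isolation, here is a small sketch that evaluates the same number-plus-minus expressions directly instead of emitting assembly. The function eval_addsub is a hypothetical helper for illustration only, not part of 9cc:

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helper: evaluate a "number (+|- number)*" expression.
// strtol reads a number starting at p and then sets p to point just
// past the last digit it consumed, which tells us where to resume.
long eval_addsub(const char *s) {
  char *p = (char *)s;
  long result = strtol(p, &p, 10);
  while (*p == '+' || *p == '-') {
    if (*p == '+') {
      p++;
      result += strtol(p, &p, 10); // p advances past the digits again
    } else {
      p++;
      result -= strtol(p, &p, 10);
    }
  }
  return result;
}
```

Each loop iteration mirrors one line of generated assembly: read a sign, then let strtol consume the next number and advance p.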
Now, let's run this modified compiler. After updating 9cc.c, you can build a new 9cc simply by running make. An example run is shown below.
$ make
$ ./9cc '5+20-4'
.intel_syntax noprefix
.globl main
main:
mov rax, 5
add rax, 20
sub rax, 4
ret
It looks like the assembly is being output successfully. To test this new feature, let's add a line like this to test.sh:
assert 21 "5+20-4"
Once you've done this, commit your changes to git. To do so, run the following command:
$ git add test.sh 9cc.c
$ git commit
When you run git commit, an editor will start; write a message such as "Add addition and subtraction", save, and exit the editor. Use git log -p to confirm that the commit looks as expected. Finally, run git push to push the commit to GitHub, and this step is complete!
Reference implementation
The compiler created in the previous step has one drawback: if the input contains whitespace, an error occurs at that point. For example, if you give a string with spaces such as 5 - 3, a space character is found where the compiler is trying to read + or -, and compilation fails.
$ ./9cc '5 - 3' > tmp.s
unexpected character: ' '
There are several ways to resolve this issue. One obvious method is to skip whitespace characters before attempting to read + or -. There's nothing particularly wrong with that approach, but in this step we'll solve the problem differently: we'll split the input into words before reading the expression.
Just like Japanese or English, mathematical expressions and programming languages can be thought of as consisting of sequences of words. For example, 5+20-4 can be thought of as being made up of five words: 5, +, 20, -, and 4. These "words" are called "tokens." The whitespace between tokens exists only to separate them and is not part of any word, so it is natural to remove whitespace characters when splitting a string into tokens. Splitting a string into a sequence of tokens is called "tokenizing."
There are other benefits to splitting the input into tokens. When an expression is divided into tokens, each token can be classified and given a type. For example, + and - are symbols that mean exactly what they look like, while the string 123 means the numeric value 123. By interpreting each token when tokenizing, instead of merely splitting the string, you have less to think about when consuming the token sequence.
In the current grammar, which allows addition and subtraction, there are three kinds of tokens: +, -, and numbers. Furthermore, for implementation convenience, it is useful to define one special kind that represents the end of the token sequence (just as a string ends with '\0'). We'll make the tokens a linked list connected by pointers so that we can handle input of any length.
Although it is a bit long, below is an improved version of the compiler with a tokenizer introduced.
#include <ctype.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Token kinds
typedef enum {
  TK_RESERVED, // punctuator
  TK_NUM,      // integer token
  TK_EOF,      // token representing the end of input
} TokenKind;

typedef struct Token Token;

// Token type
struct Token {
  TokenKind kind; // token kind
  Token *next;    // next input token
  int val;        // if kind is TK_NUM, its value
  char *str;      // token string
};

// The token currently being looked at
Token *token;

// Reports an error.
// Takes the same arguments as printf.
void error(char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vfprintf(stderr, fmt, ap);
  fprintf(stderr, "\n");
  exit(1);
}

// If the next token is the expected punctuator, advance one token
// and return true. Otherwise return false.
bool consume(char op) {
  if (token->kind != TK_RESERVED || token->str[0] != op)
    return false;
  token = token->next;
  return true;
}

// If the next token is the expected punctuator, advance one token.
// Otherwise report an error.
void expect(char op) {
  if (token->kind != TK_RESERVED || token->str[0] != op)
    error("expected '%c'", op);
  token = token->next;
}

// If the next token is a number, advance one token and return its value.
// Otherwise report an error.
int expect_number() {
  if (token->kind != TK_NUM)
    error("expected a number");
  int val = token->val;
  token = token->next;
  return val;
}

bool at_eof() {
  return token->kind == TK_EOF;
}

// Creates a new token and links it after cur.
Token *new_token(TokenKind kind, Token *cur, char *str) {
  Token *tok = calloc(1, sizeof(Token));
  tok->kind = kind;
  tok->str = str;
  cur->next = tok;
  return tok;
}

// Tokenizes the input string p and returns the token list.
Token *tokenize(char *p) {
  Token head;
  head.next = NULL;
  Token *cur = &head;

  while (*p) {
    // Skip whitespace characters
    if (isspace(*p)) {
      p++;
      continue;
    }

    if (*p == '+' || *p == '-') {
      cur = new_token(TK_RESERVED, cur, p++);
      continue;
    }

    if (isdigit(*p)) {
      cur = new_token(TK_NUM, cur, p);
      cur->val = strtol(p, &p, 10);
      continue;
    }

    error("cannot tokenize");
  }

  new_token(TK_EOF, cur, p);
  return head.next;
}

int main(int argc, char **argv) {
  if (argc != 2) {
    error("wrong number of arguments");
    return 1;
  }

  // Tokenize
  token = tokenize(argv[1]);

  // Output the first half of the assembly
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // The expression must start with a number, so check that
  // and output the first mov instruction
  printf("  mov rax, %d\n", expect_number());

  // Consume the token sequence `+ <number>` or `- <number>`
  // while outputting assembly
  while (!at_eof()) {
    if (consume('+')) {
      printf("  add rax, %d\n", expect_number());
      continue;
    }

    expect('-');
    printf("  sub rax, %d\n", expect_number());
  }

  printf("  ret\n");
  return 0;
}
The code is not very short, about 150 lines, but it doesn't do anything tricky, so you should be able to understand it by reading from the top.
Let me explain some programming techniques used in the code above.
The token sequence being read by the parser is represented by the global variable token, and the parser consumes the input by walking down this linked list. Programming with a global variable like this may not look like clean style, but in practice, treating the input token sequence as a stream, the way standard input is treated, often makes parser code easier to read, which is why that style is adopted here. The code that touches token directly is confined to functions such as consume and expect, so that other functions never touch it directly.
The tokenize function builds a linked list. When constructing a linked list, the code becomes simpler if you create a dummy head element, keep connecting new elements after it, and return head.next at the end. With this method most of the memory allocated for the head element is wasted, but the cost of allocating a local variable is essentially zero, so there is no need to worry about it.
calloc, like malloc, is a function that allocates memory. Unlike malloc, calloc zeroes out the memory it allocates. We use calloc here to save ourselves the effort of zeroing out the elements.
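The dummy-head technique can be sketched in isolation. This is a hypothetical example (IntNode and build_list are illustrative names, not part of 9cc) showing the same pattern tokenize uses: link every new element after cur, then return head.next:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct IntNode IntNode;
struct IntNode {
  int val;
  IntNode *next;
};

// Build a singly linked list from an array using a dummy head:
// new nodes are appended after cur, and head.next is the real list.
// No special case is needed for the first element.
IntNode *build_list(const int *vals, int n) {
  IntNode head;
  head.next = NULL;
  IntNode *cur = &head;

  for (int i = 0; i < n; i++) {
    IntNode *node = calloc(1, sizeof(IntNode));
    node->val = vals[i];
    cur->next = node;
    cur = node;
  }
  return head.next;
}
```

Without the dummy head, the loop would need an extra branch to handle the empty-list case; the dummy element removes that branch entirely.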
This improved version should now be able to skip whitespace, so let's add a one-line test like the following to test.sh.
assert 41 " 12 + 34 - 5 "
The exit code of a Unix process is a number between 0 and 255, so when writing tests, make sure that the result of the entire expression falls within the range 0 to 255.
Adding the test files to your git repository completes this step.
Reference implementation
With the compiler we have created so far, if the input is grammatically incorrect, the only thing it can tell us is that there is an error somewhere. In this step, let's improve on that. Specifically, we will make it display intuitive error messages like the ones below.
$ ./9cc "1+3++" > tmp.s
1+3++
    ^ expected a number
$ ./9cc "1 + foo + 5" > tmp.s
1 + foo + 5
    ^ cannot tokenize
To display such error messages, we need to be able to tell at which byte of the input the error occurred. To do this, let's save the entire program string in a variable named user_input, and define a new error-reporting function that receives a pointer into the middle of that string. The code is shown below.
// Input program
char *user_input;

// Reports an error location and exits.
void error_at(char *loc, char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);

  int pos = loc - user_input;
  fprintf(stderr, "%s\n", user_input);
  fprintf(stderr, "%*s", pos, " "); // print pos spaces
  fprintf(stderr, "^ ");
  vfprintf(stderr, fmt, ap);
  fprintf(stderr, "\n");
  exit(1);
}
The pointer error_at receives points into the string that represents the entire input. By taking the difference between that pointer and the pointer to the beginning of the input, we can tell at which byte of the input the error occurred and mark it prominently with ^.
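The column computation can be checked in isolation. In this sketch, error_column and column_of_char are hypothetical helpers for illustration; the real error_at prints the marker rather than returning a value:

```c
#include <assert.h>
#include <string.h>

// Hypothetical helper: given the start of the input and a pointer into
// it, compute the 0-based column where '^' should be printed.
int error_column(const char *input, const char *loc) {
  return (int)(loc - input);
}

// Convenience wrapper: the column of the first occurrence of
// character c in input, or -1 if it does not occur.
int column_of_char(const char *input, char c) {
  const char *loc = strchr(input, c);
  return loc ? error_column(input, loc) : -1;
}
```

For the input "1 + foo + 5", the offending 'f' sits at column 4, which is exactly how many spaces error_at prints before the ^ marker.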
This step is completed by updating the code to save argv[1] into user_input, and by replacing calls such as error("expected a number") with error_at(token->str, "expected a number").
A practical compiler should also have tests for its behavior on erroneous input, but at this point the error messages exist only to aid debugging, so there is no need to write tests for them at this stage.
Reference implementation
Column: Source code formatter
In the same way that text full of orthographic errors is hard to read, source code with incorrect indentation or inconsistent spacing is hard to read, and code at that level is not clean code. For inconsequential matters like code formatting, apply fixed mechanical rules so that the code can be read without distraction.
When developing with multiple people, you have to agree on a format, but in this book you are developing alone, so you can pick whichever of the major styles you like; it doesn't matter which.
Some recently developed languages provide official language formatters to eliminate the need for an arbitrary but non-essential discussion about what format to choose. For example, the Go language has a command called gofmt that formats source code neatly. gofmt does not have an option to select a formatting style, so it can only be formatted in the "Go official format". By intentionally not giving you a choice, Go completely solves the problem of formatting.
There is a formatter called clang-format for C and C++ , but this book does not specifically recommend the use of such tools. Instead of writing oddly formatted code and having to format it later, try writing code that looks consistent from the beginning.
Column: Security bug caused by indentation error
Incorrect indentation of source code has caused serious security issues in iOS and macOS. The code where the bug occurred is shown below.
if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
goto fail;
goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
goto fail;
Can you spot the bug? At first glance this looks like ordinary code, but if you look closely, you will notice that the second goto statement from the bottom is not guarded by an if statement and is therefore always executed.
Unfortunately, this code was in a function that validates TLS certificates. As a result, most of the certificate-validation code was unconditionally skipped by that goto, causing iOS/macOS to accept invalid certificates as valid (allowing spoofing of HTTPS sites). The bug was discovered and fixed in 2014. It is called the "goto fail" bug, after the extra goto fail statement that caused it.
Now we would like to add multiplication, division, and grouping parentheses, that is, *, /, and (), to the language, but doing so poses one big technical challenge: the rule that multiplication and division in an expression must be computed first. For example, the expression 1+2*3 must be interpreted as 1+(2*3), not as (1+2)*3. Such rules about which operators "bind" first are called "operator precedence."
How should we handle operator precedence? The compiler we have created so far just reads the token sequence from the beginning and outputs assembly, so if we simply extend it by adding * and /, 1+2*3 would end up being compiled as (1+2)*3.
Existing compilers naturally handle operator precedence well. A compiler's parser is very powerful and can correctly interpret arbitrarily complex code as long as it follows the grammar. This behavior may give the impression of an intellect surpassing humans, but computers cannot actually read text the way humans do; parsing must be performed by some purely mechanical procedure. How exactly does it work?
In this chapter, let's take a break from coding and learn some parsing techniques. This chapter describes parsing techniques in the following order:
In programming language parser implementations, the input is usually a flat sequence of tokens and the output is a tree representing a nested structure. The compiler created in this book also follows that structure.
In C, grammatical elements such as if and while can be nested, and a tree is a natural way to represent such structure.
Arithmetic expressions have a structure in which things inside parentheses are computed first, and multiplication and division are computed before addition and subtraction. Although this structure may not look like a tree at first glance, a tree is actually a very natural way to represent it. For example, the expression 1*(2+3) can be thought of as represented by the following tree.
[Tree representing 1*(2+3)]
If we adopt the rule of computing from the leaves of the tree upward, the tree above represents multiplying 1 by the result of 2+3. In other words, the concrete evaluation order of 1*(2+3) is expressed in the tree itself.
Here is another example. The tree below represents 7-3-3.
[Tree representing 7-3-3]
In this tree, the rule that subtraction must be computed from the left is expressed explicitly in the shape of the tree. In other words, the tree represents the expression (7-3)-3 = 1, not 7-(3-3) = 7. If it represented the latter, the tree would be deeper on the right rather than on the left.
Operators that must be computed from the left are called "left-associative" operators, and operators that must be computed from the right are called "right-associative" operators. In C, most operators are defined as left-associative; assignment with = is the main exception.
A tree structure can express an arbitrarily long expression by growing deeper. The next tree represents 1*2+3*4*5.
[Tree representing 1*2+3*4*5]
A tree like the ones above is called a "syntax tree." In particular, a syntax tree that is as compact as possible, with no redundant elements such as grouping parentheses left in the tree, is called an "abstract syntax tree" (AST). All of the syntax trees above are abstract syntax trees.
Since the abstract syntax tree is an internal representation of the compiler, it can be defined however is convenient for the implementation. However, since arithmetic operators such as addition and multiplication are defined as operations on two operands, a left-hand side and a right-hand side, it is natural for any compiler to represent them as a binary tree. On the other hand, the statements in a function body, which are executed in sequence and can be any number, are naturally represented as a node with a flat list of children.
The goal in parsing is to construct an abstract syntax tree. The compiler first parses the input token sequence into an abstract syntax tree, which in turn converts into assembly.
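To make the tree representation concrete, here is a minimal sketch of a binary syntax tree together with a bottom-up evaluator. The names here (NodeKind, Node, new_node, new_num, eval) are illustrative; the book defines its own node type when the parser is built in a later step, and the real compiler walks the tree to emit assembly rather than evaluating it:

```c
#include <stdlib.h>

// A minimal AST sketch for arithmetic expressions.
typedef enum { ND_ADD, ND_SUB, ND_MUL, ND_DIV, ND_NUM } NodeKind;

typedef struct Node Node;
struct Node {
  NodeKind kind;
  Node *lhs;
  Node *rhs;
  int val; // used only when kind == ND_NUM
};

Node *new_node(NodeKind kind, Node *lhs, Node *rhs) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = kind;
  node->lhs = lhs;
  node->rhs = rhs;
  return node;
}

Node *new_num(int val) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = ND_NUM;
  node->val = val;
  return node;
}

// Evaluate the tree bottom-up: children first, then the operator,
// which is exactly the "compute from the leaves" rule in the text.
int eval(Node *node) {
  if (node->kind == ND_NUM)
    return node->val;
  int l = eval(node->lhs);
  int r = eval(node->rhs);
  switch (node->kind) {
  case ND_ADD: return l + r;
  case ND_SUB: return l - r;
  case ND_MUL: return l * r;
  default:     return l / r; // ND_DIV
  }
}
```

Note that the tree shape alone fixes the evaluation order: the node for 1*(2+3) contains no parentheses, yet evaluating it bottom-up necessarily computes 2+3 first.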
Now, let's learn how to write syntax in programming languages. Most of the syntax of programming languages is defined using production rules . Production rules are rules that recursively define grammars.
Let's think about natural language for a moment. Natural-language grammar has a nested structure. For example, if you replace the noun "flowers" in the sentence "flowers are pretty" with the noun phrase "red flowers," the sentence is still correct, and if you further expand "red" into "slightly red," it is still a correct sentence. You can also embed it in another sentence, as in "I thought the slightly red flowers were pretty."
Think of these grammars as being defined by rules such as "a 'sentence' consists of a 'subject' and a 'predicate'" and "a 'noun phrase' is a 'noun,' or an 'adjective' followed by a 'noun phrase.'" Then, by starting from "sentence" and expanding it according to the rules, you can generate an infinite number of sentences that are valid under the defined grammar.
Conversely, by finding the expansion procedure that matches an existing sentence, you can determine what structure that string has.
The above ideas were originally devised for natural languages, but because they have a high affinity with data handled by computers, production rules are used in various areas of computers, including programming languages.
Column: Chomsky's generative grammar
The idea of generative grammar was invented by a linguist named Noam Chomsky . His ideas have had a huge impact on linguistics and computer science.
According to Chomsky's hypothesis, the reason humans can speak is that humans are born with specialized circuits in their brains for acquiring production rules. Humans are able to speak languages because they have the ability to acquire recursive language rules. Nonhuman animals do not have the ability to acquire language, and he believed that this is because the circuitry for acquiring production rules does not exist in the brains of nonhuman animals. Although Chomsky's claims have not been proven or disproven nearly 60 years after his hypothesis was published, it is still considered to be quite persuasive.
One notation for describing production rules compactly and readably is BNF (Backus–Naur form) and its extended version, EBNF (Extended BNF). In this book, we will describe the syntax of C using EBNF. This section first describes BNF and then the EBNF extensions.
In BNF, each production rule is expressed in the form A = α₁α₂⋯. This means that the symbol A can be expanded into α₁α₂⋯. Here α₁α₂⋯ is a sequence of zero or more symbols, which can contain both symbols that cannot be expanded further and symbols that can be expanded further (i.e., that appear on the left-hand side of some production).
A symbol that cannot be expanded further is called a "terminal symbol," and a symbol that appears on the left-hand side of some production rule and can be expanded is called a "nonterminal symbol." Grammars defined by such production rules are called "context-free grammars."
A nonterminal symbol may match multiple productions. For example, if both the rules A = α₁ and A = α₂ exist, it means that A can be expanded either way.
The right-hand side of a production rule can be empty. Under such a rule, the symbol on the left-hand side expands to a zero-length symbol string (i.e., to nothing). However, simply omitting the right-hand side would be hard to read, so in such cases the convention in BNF is to write ε (epsilon) on the right-hand side as a symbol indicating that there is nothing there. This book also follows this convention.
Strings are enclosed in double quotes and written like "foo". Strings are always terminal symbols.
The above are the basic BNF rules. In addition to BNF rules, EBNF allows you to write complex rules concisely using the following symbols.
Notation | Meaning |
---|---|
A* | zero or more repetitions of A |
A? | A or ε |
A \| B | A or B |
( ... ) | grouping |
For example, given the rule A = ("fizz" | "buzz")*, A is a string in which "fizz" or "buzz" is repeated zero or more times; that is, A can be expanded to any of the following, and so on:

""
"fizz"
"buzz"
"fizzfizz"
"fizzbuzz"
"buzzfizz"
"buzzbuzz"
"fizzfizzfizz"
"fizzfizzbuzz"
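As a quick illustration of how such a production can be turned into code, here is a minimal sketch in C (the function name matches_a is my own, not from the book) that checks whether a string matches A = ("fizz" | "buzz")*:

```c
#include <stdbool.h>
#include <string.h>

// Returns true if s consists of zero or more repetitions of
// "fizz" or "buzz", mirroring the production A = ("fizz" | "buzz")*.
bool matches_a(const char *s) {
  while (*s != '\0') {
    if (strncmp(s, "fizz", 4) == 0 || strncmp(s, "buzz", 4) == 0)
      s += 4;          // consumed one repetition, keep going
    else
      return false;    // input does not continue with either alternative
  }
  return true;         // the whole string was consumed
}
```

The loop corresponds to the `*` in the production, and the two `strncmp` calls to the two alternatives inside the parentheses.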
Column: BNF and EBNF
Although plain, non-Extended BNF does not have the concise notations *, ?, |, and ( ... ), the set of sentences that can be generated with BNF and with EBNF is the same.
This is because EBNF can be converted to BNF by rewriting as follows.
EBNF | Corresponding BNF |
---|---|
A = α* | A = αA and A = ε |
A = α? | A = α and A = ε |
A = α \| β | A = α and A = β |
A = α (β₁β₂⋯) γ | A = α B γ and B = β₁β₂⋯ |
For example, when you use the production rules A = αA and A = ε to generate the sentence ααα, the order of expansion is:

A → αA → ααA → αααA → ααα
In this way, notations such as * and ? are just shortcuts, but a shorter notation is easier to understand, so when a concise notation is available it is normal to use it and write things compactly.
As an example of writing a grammar using EBNF, consider the following production rule (num is assumed to be defined elsewhere as a symbol representing a number):

expr = num ("+" num | "-" num)*

In this grammar, an expr consists of one num, followed by zero or more occurrences of "+" num or "-" num. This rule in fact represents the grammar of addition and subtraction expressions. By starting from expr and expanding it, you can create any string of additions and subtractions, such as 1, 10+5, or 42-30+2. Please verify this by expanding the rule yourself.
In addition to using arrows to represent the expansion order, these expansion steps can also be represented as a tree. The syntax trees for the expressions above are shown below.

[Figure: syntax tree of 1]
[Figure: syntax tree of 10+5]
[Figure: syntax tree of 42-30+2]

Representing the expansion as a tree makes it easier to see which nonterminal symbol is expanded into which symbols.
A syntax tree like the one shown above, which includes all the tokens in the input and has a perfect one-to-one match to the grammar, is sometimes called a "concrete syntax tree . " This term is often used to contrast with abstract syntax trees.
Note that the above concrete syntax tree does not express, in the shape of the tree, the rule that addition and subtraction are evaluated from the left. In the grammar explained here, that rule is not expressed in EBNF but is instead written as a side note in the language specification: "addition and subtraction are evaluated from left to right." The parser takes both the EBNF and this side note into account, reads the token sequence representing the expression, and constructs an abstract syntax tree that correctly represents the evaluation order of the expression.
Therefore, in the above grammar, the concrete syntax tree represented by the EBNF and the abstract syntax tree output by the parser only roughly match. It would be possible to define the grammar so that the abstract and concrete syntax trees have as much of the same structure as possible, but that would make the grammar redundant and make it harder to see how to write the parser. The above grammar is a practical way of expressing grammar that strikes a balance between the rigor of a formal grammar description and the readability of a natural-language supplement.
Production rules are a very powerful tool for expressing grammar. With a little ingenuity in the grammar, operator precedence can also be expressed in production rules. The grammar is shown below:

expr = mul ("+" mul | "-" mul)*
mul = num ("*" num | "/" num)*

The previous rule expanded expr directly to num, but the new rule expands expr to num via mul. mul is the production rule for multiplication and division, and addition and subtraction treat a whole mul as a single component. In this grammar, the rule that multiplication and division come first is naturally expressed in the syntax tree. Let's look at some concrete examples.
[Figure: syntax tree of 1*2+3]
[Figure: syntax tree of 1+2*3]
[Figure: syntax tree of 1*2+3*4*5]

In the trees above, multiplication always appears closer to the leaves than addition. Indeed, there is no rule that goes back from mul to expr, so there is no way to create a tree with an addition below a multiplication. It feels almost uncanny that such a simple rule expresses precedence so well as a tree structure. Readers are encouraged to compare the production rules with the syntax trees and confirm that the trees are correct.
Production rules can also be recursive. Below is a grammar that adds precedence-overriding parentheses to the four arithmetic operations.
expr = mul ("+" mul | "-" mul)*
mul = primary ("*" primary | "/" primary)*
primary = num | "(" expr ")"
Comparing this grammar with the previous one, we can see that primary, which is either num or "(" expr ")", can now appear where only num was allowed before. In other words, in the new grammar, an expression enclosed in parentheses binds as tightly as a single number. Let's look at an example.
The following is the syntax tree for 1*2.

[Figure: syntax tree of 1*2]

The following is the syntax tree for 1*(2+3).

[Figure: syntax tree of 1*(2+3)]

Comparing the two trees, only the expansion of the right branch of mul differs. primary, which appears at the ends of the expansion, can expand either to a single number or to an arbitrary expression enclosed in parentheses, and this is clearly reflected in the tree structure. Isn't it a bit impressive that the precedence of parentheses can be handled by such a simple production rule?
Given the production rules of C, it is possible, by expanding them one after another, to mechanically generate any C program that is correct from the perspective of the production rules. But what we want to do in 9cc is actually the opposite. We are given a C program as a string from the outside, and we want to find the expansion procedure, or equivalently the syntax tree, that yields exactly the input string.
In fact, for certain classes of production rules, given a rule it is possible to mechanically write code that finds the syntax tree matching a sentence generated from that rule. The "recursive descent parsing" method described here is one such technique.
As an example, let's consider the grammar of the four arithmetic operations again, repeated below.
expr = mul ("+" mul | "-" mul)*
mul = primary ("*" primary | "/" primary)*
primary = num | "(" expr ")"
The basic strategy when writing a recursive descent parser is to map each nonterminal symbol directly to a function. The parser will therefore have three functions, expr, mul, and primary, each of which parses the token sequence just as its name suggests.
Let's think about it in concrete terms. The input passed to the parser is a sequence of tokens. Since we want to create and return an abstract syntax tree from the parser, let's define the types of nodes in the abstract syntax tree. The node types are shown below.
// Kinds of nodes in the abstract syntax tree
typedef enum {
  ND_ADD, // +
  ND_SUB, // -
  ND_MUL, // *
  ND_DIV, // /
  ND_NUM, // integer
} NodeKind;

typedef struct Node Node;

// Node of the abstract syntax tree
struct Node {
  NodeKind kind; // kind of the node
  Node *lhs;     // left-hand side
  Node *rhs;     // right-hand side
  int val;       // used only if kind is ND_NUM
};
lhs and rhs stand for left-hand side and right-hand side, respectively.
Next, let's define functions to create new nodes. This grammar has two kinds of nodes: binary operators, which take a left and a right side, and numbers, so we will prepare one function for each.
Node *new_node(NodeKind kind, Node *lhs, Node *rhs) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = kind;
  node->lhs = lhs;
  node->rhs = rhs;
  return node;
}

Node *new_node_num(int val) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = ND_NUM;
  node->val = val;
  return node;
}
Now, let's write a parser using these functions and data types. + and - are left-associative operators. A function that parses left-associative operators can be written with the following pattern:
Node *expr() {
  Node *node = mul();
  for (;;) {
    if (consume('+'))
      node = new_node(ND_ADD, node, mul());
    else if (consume('-'))
      node = new_node(ND_SUB, node, mul());
    else
      return node;
  }
}
consume is the function defined in the previous step: if the next token in the input stream matches the argument, it advances one token and returns true.
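For reference, the helpers assumed in this chapter (consume, expect, expect_number) were defined in an earlier step. A minimal self-contained sketch consistent with how they are used here might look like the following; note that it reads directly from a string rather than from a real token list, which is a simplification of my own:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

// Current read position in the input. The real compiler keeps a list of
// Token structures instead; a raw string keeps this sketch self-contained.
static const char *p;

// If the next token is the punctuator `op`, advance past it and return true.
static bool consume(char op) {
  while (isspace((unsigned char)*p))
    p++;
  if (*p != op)
    return false;
  p++;
  return true;
}

// Ensure the next token is the punctuator `op`, or abort with an error.
static void expect(char op) {
  if (!consume(op)) {
    fprintf(stderr, "expected '%c'\n", op);
    exit(1);
  }
}

// Read and return the next token, which must be a number.
static int expect_number(void) {
  while (isspace((unsigned char)*p))
    p++;
  if (!isdigit((unsigned char)*p)) {
    fprintf(stderr, "expected a number\n");
    exit(1);
  }
  int val = 0;
  while (isdigit((unsigned char)*p))
    val = val * 10 + (*p++ - '0');
  return val;
}
```

The key property the parser relies on is that consume advances only on a match, while expect and expect_number abort on a mismatch.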
Please read the expr function carefully. You should be able to see that the production rule expr = mul ("+" mul | "-" mul)* maps directly onto the function calls and the loop. In the abstract syntax tree returned by expr, the operators are left-associative, i.e., the tree grows deeper toward the left.
Let's also define mul, the function used by expr. * and / are left-associative operators as well, so mul can be written with the same pattern. The function is shown below.
Node *mul() {
  Node *node = primary();
  for (;;) {
    if (consume('*'))
      node = new_node(ND_MUL, node, primary());
    else if (consume('/'))
      node = new_node(ND_DIV, node, primary());
    else
      return node;
  }
}
The function-call structure in the code above corresponds directly to the production rule mul = primary ("*" primary | "/" primary)*.
Finally, let's define the primary function. The rule primary = num | "(" expr ")" does not involve left-associative operators, so primary is written differently:
Node *primary() {
  // If the next token is "(", it should be "(" expr ")"
  if (consume('(')) {
    Node *node = expr();
    expect(')');
    return node;
  }

  // Otherwise it should be a number
  return new_node_num(expect_number());
}
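Putting the pieces together, here is a self-contained sketch of the whole parser. The string-based consume/expect helpers stand in for the real tokenizer from the previous step, and the eval function at the end is my own addition that walks the resulting tree to check the parse; neither is part of the book's 9cc code as-is:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

// --- simplified input handling (stands in for the real tokenizer) ---
static const char *p;

static bool consume(char op) { if (*p == op) { p++; return true; } return false; }
static void expect(char op) { if (*p != op) exit(1); p++; }
static int expect_number(void) {
  if (!isdigit((unsigned char)*p)) exit(1);
  int v = 0;
  while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
  return v;
}

// --- AST, same shape as in the text ---
typedef enum { ND_ADD, ND_SUB, ND_MUL, ND_DIV, ND_NUM } NodeKind;
typedef struct Node Node;
struct Node { NodeKind kind; Node *lhs, *rhs; int val; };

static Node *new_node(NodeKind kind, Node *lhs, Node *rhs) {
  Node *n = calloc(1, sizeof(Node));
  n->kind = kind; n->lhs = lhs; n->rhs = rhs;
  return n;
}
static Node *new_node_num(int val) {
  Node *n = calloc(1, sizeof(Node));
  n->kind = ND_NUM; n->val = val;
  return n;
}

// --- recursive descent: one function per nonterminal ---
static Node *expr(void);
static Node *mul(void);
static Node *primary(void);

static Node *expr(void) {
  Node *node = mul();
  for (;;) {
    if (consume('+')) node = new_node(ND_ADD, node, mul());
    else if (consume('-')) node = new_node(ND_SUB, node, mul());
    else return node;
  }
}
static Node *mul(void) {
  Node *node = primary();
  for (;;) {
    if (consume('*')) node = new_node(ND_MUL, node, primary());
    else if (consume('/')) node = new_node(ND_DIV, node, primary());
    else return node;
  }
}
static Node *primary(void) {
  if (consume('(')) { Node *node = expr(); expect(')'); return node; }
  return new_node_num(expect_number());
}

// --- evaluate the AST directly, to check that the parse is correct ---
static int eval(Node *node) {
  switch (node->kind) {
  case ND_NUM: return node->val;
  case ND_ADD: return eval(node->lhs) + eval(node->rhs);
  case ND_SUB: return eval(node->lhs) - eval(node->rhs);
  case ND_MUL: return eval(node->lhs) * eval(node->rhs);
  case ND_DIV: return eval(node->lhs) / eval(node->rhs);
  }
  return 0; // unreachable
}

static int parse_and_eval(const char *s) { p = s; return eval(expr()); }
```

Evaluating the tree is of course not what 9cc does (it emits assembly instead), but it is a quick way to convince yourself that precedence and parentheses come out right.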
Now that we have all the functions, can they really parse a token sequence? It may not be obvious at first glance, but this group of functions does parse token sequences correctly. As an example, consider the expression 1+2*3.
The first function to be called is expr. expr starts reading the input on the assumption that the whole input is an expression (which, in this case, it is). Function calls then proceed expr → mul → primary; primary reads the token 1, and a syntax tree representing 1 is returned.

Back in expr, consume('+') becomes true, so the + token is consumed and mul is called again. At this stage, the remaining input is 2*3.

As before, mul calls primary, which reads the token 2, but this time mul does not return immediately: consume('*') inside mul becomes true, so mul calls primary again, which reads the token 3. As a result, mul returns a syntax tree representing 2*3.

Back in the caller expr, the syntax tree representing 1 and the syntax tree representing 2*3 are combined into a syntax tree representing 1+2*3, which becomes the return value of expr. In other words, 1+2*3 was parsed correctly.
The diagram below shows the function-call relationships and the tokens read by each function when parsing 1+2*3. At the bottom is the expr call that reads the entire input 1+2*3; above it are two mul calls, one that reads 1 and one that reads 2*3.

[Figure: function-call relationships when parsing 1+2*3]

A slightly more complex example is shown below: the function-call relationships when parsing 1*2+(3+4).

[Figure: function-call relationships when parsing 1*2+(3+4)]

For programmers who are not used to recursion, recursive functions like the ones above may seem baffling. Honestly, even to me, who ought to be thoroughly familiar with recursion, it feels like a kind of magic that this code works. Recursive code feels strange even when you know how it works; that is probably just its nature. Try tracing the code in your head over and over until you are convinced that it works properly.
A parsing method that maps each production rule to one function, as described above, is called "recursive descent parsing." The above parser looks ahead only one token to decide which function to call or whether to return; a recursive descent parser that looks ahead only one token is called an LL(1) parser. A grammar for which an LL(1) parser can be written is called an LL(1) grammar.
In the previous chapter, we explained an algorithm for converting a token sequence into an abstract syntax tree. By choosing a grammar that takes operator precedence into account, we can now build abstract syntax trees in which * and / always appear closer to the leaves than + and -. But how should such a tree be converted to assembly? This chapter explains how.
First, let's consider why it cannot be converted to assembly in the same way as simple addition and subtraction. Our compiler so far, which handles only addition and subtraction, uses RAX as the result register and performs the additions and subtractions there. In other words, the compiled program keeps only one intermediate result at a time.
However, when multiplication and division are involved, there is no guarantee that there will be only one intermediate calculation result. Consider 2*3+4*5 as an example. In order to perform addition, both sides must be calculated, so 2*3 and 4*5 must be calculated before addition. In other words, in this case, the entire calculation cannot be performed unless two intermediate calculation results can be stored.
A machine called a "stack machine" can easily perform calculations like this. So let's set aside the abstract syntax tree created by the parser for a moment and learn about stack machines.
A stack machine is a computer that has a stack as a data storage area. Therefore, in a stack machine, the two basic operations are "pushing onto the stack" and "popping from the stack." A push puts a new element on top of the stack. Pop removes an element from the top of the stack.
Arithmetic instructions in a stack machine operate on the elements at the top of the stack. For example, a stack machine's ADD instruction pops two elements from the top of the stack, adds them, and pushes the result back onto the stack (to avoid confusion with x86-64 instructions, the instructions of our virtual stack machine are written in all capital letters). In other words, ADD replaces the two elements at the top of the stack with the single element obtained by adding them together.
The SUB, MUL, and DIV instructions, like ADD, replace the two elements at the top of the stack with the single element obtained by subtracting, multiplying, or dividing them.
The PUSH instruction pushes its argument onto the top of the stack. Although not used here, we can also define a POP instruction that removes the element at the top of the stack and discards it.
Now, let's consider calculating 2*3+4*5 using these instructions. Using the stack machine defined above, you should be able to calculate 2*3+4*5 with the following code.
// Compute 2*3
PUSH 2
PUSH 3
MUL

// Compute 4*5
PUSH 4
PUSH 5
MUL

// Compute 2*3 + 4*5
ADD
Let's take a closer look at this code. Assume the stack already contains some values; they are not important here, so we show them as "...". In the diagrams, the stack grows downward.
The first two PUSHes push 2 and 3 onto the stack, so by the time the first MUL is executed, the stack looks like this:
... |
2 |
3 |
MUL removes the two values at the top of the stack, 3 and 2, and pushes their product, 6, onto the stack. Therefore, after MUL executes, the stack looks like this:
... |
6 |
The next two PUSHes push 4 and 5, so just before the second MUL runs, the stack looks like this:
... |
6 |
4 |
5 |
Running MUL here removes 5 and 4 and replaces them with 20, the result of their multiplication. So after MUL runs:
... |
6 |
20 |
Notice how the results of 2*3 and 4*5 are now neatly placed on the stack. Running ADD in this state computes 6+20 and pushes the result onto the stack, so the stack should finally be in the following state:
... |
26 |
If we assume that the stack machine's calculation result is the value remaining at the top of the stack, then 26 is the result of 2*3+4*5, which means that the formula has been calculated correctly.
Stack machines can calculate not only this expression but any expression with multiple intermediate results. As long as the code for every subexpression keeps the promise that executing it leaves exactly one element on the stack, any subexpression can be compiled in the way shown above.
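The virtual stack machine described above can itself be simulated in a few lines of C, which may help in checking the walkthrough by hand. This is a sketch of my own; the names push_v, pop_v, add_i, and mul_i are not from the book:

```c
#include <assert.h>

// A tiny simulation of the virtual stack machine: a fixed array plus an
// index. sp is the number of elements currently on the stack.
static int stack[256];
static int sp;

static void push_v(int v) { assert(sp < 256); stack[sp++] = v; }
static int pop_v(void) { assert(sp > 0); return stack[--sp]; }

// ADD and MUL pop two elements and push the single combined result,
// exactly as the virtual instructions are specified in the text.
static void add_i(void) { int r = pop_v(), l = pop_v(); push_v(l + r); }
static void mul_i(void) { int r = pop_v(), l = pop_v(); push_v(l * r); }
```

Running the instruction sequence for 2*3+4*5 (PUSH 2, PUSH 3, MUL, PUSH 4, PUSH 5, MUL, ADD) against this simulator leaves 26 on top of the stack, matching the trace above.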
Column: CISC and RISC
x86-64 is an instruction set that gradually evolved from the 8086, released in 1978, and it is a typical "CISC"-style processor. Characteristics of CISC processors include: machine-language operations can take memory addresses as well as registers, machine-language instructions have variable length, and there are many complex instructions that are convenient for assembly programmers, doing in one instruction what would otherwise take several.
"RISC" was invented in the 1980s as a counterpoint to CISC. Characteristics of RISC processors include: operations are performed only between registers, the only memory operations are loads into and stores from registers, all machine-language instructions have the same length, and there are no complex instructions for the convenience of assembly programmers, only the simple instructions that compilers generate.
x86-64 is one of the few survivors of CISC, and almost all major processors other than x86-64 are based on RISC. Specifically, ARM, PowerPC, SPARC, MIPS, RISC-V (Risk Five), etc. are all RISC processors.
RISC has no operations between memory and registers like x86-64's. There are no register aliases, and no rules saying that a particular integer register is used in a special way by a particular instruction. To modern eyes, now that such instruction sets have become mainstream, the x86-64 instruction set looks old-fashioned.
RISC processors' simple design made them easy to speed up, and they took the processor industry by storm. So why did x86-64 survive? There was a huge market need for fast x86 processors that could leverage existing software assets, and Intel and the makers of Intel-compatible chips kept innovating to meet it. Intel converts x86 instructions into RISC-like internal instructions in the CPU's instruction decoder, making x86 internally a RISC processor. This made it possible to apply the same techniques that made RISC fast to x86 as well.
This section describes how to convert an abstract syntax tree into stack machine code. If you can do that, you will be able to parse an expression consisting of four arithmetic operations, assemble an abstract syntax tree, compile it into a stack machine using x86-64 instructions, and execute it. In other words, it will be possible to write a compiler that can perform the four arithmetic operations.
In a stack machine, computing a subexpression leaves a single value, whatever it is, on the top of the stack. For example, consider a tree with two subtrees A and B.
Here A and B stand for subtrees; each is actually a node of some kind, but the specific kind and shape do not matter when compiling the tree as a whole. To compile this tree, do the following:

1. Compile the left subtree
2. Compile the right subtree
3. Output code that replaces the two values at the top of the stack with the result of the operation applied to them

After executing the code emitted in step 1, whatever that code is, a single value representing the result of the left subtree is on the top of the stack. Similarly, after the code from step 2 runs, a single value representing the result of the right subtree is on the top of the stack. So, to compute the value of the whole tree, we just replace those two values with the combined result.
In this way, when compiling an abstract syntax tree for a stack machine, we think recursively, emitting assembly as we descend the tree. It may feel difficult for readers not used to recursion, but it is a standard technique for self-similar data structures such as trees.
Let's consider the example below: compiling the abstract syntax tree for 2+3*4.
The function that performs the code generation receives the root node of the tree.
Following the steps above, the first thing the function does is compile the left subtree. In this case, that means compiling the number 2. The result of evaluating 2 is 2, so the compilation result of that subtree is PUSH 2.

The code generation function then compiles the right subtree. This recursively compiles the left side of that subtree, outputting PUSH 3, and then its right side, outputting PUSH 4.
Returning from the recursive calls, the code generation function then outputs code matching each subtree's operator: first, code that replaces the two elements at the top of the stack with their product; then, code that replaces the two elements at the top of the stack with their sum. The result is the assembly below.
PUSH 2
PUSH 3
PUSH 4
MUL
ADD
Using this method, abstract syntax trees can be mechanically reduced to assembly.
So far we have been talking about virtual stack machines. The actual x86-64 is a register machine, not a stack machine. x86-64 operations are usually defined to operate between two registers, not the two values at the top of the stack. So in order to use stack machine techniques on x86-64, you need to emulate the stack machine in some sense with a register machine.
Emulating a stack machine with a register machine is relatively easy. What would be one instruction on a stack machine can be implemented using multiple instructions.
Let's explain a specific method for doing so.
First, prepare a register that points to the top element of the stack; this register is called the stack pointer. To pop the two values at the top of the stack, read the two elements the stack pointer points at and adjust the stack pointer by the number of elements removed. Likewise, to push, write to the memory area the stack pointer points at and adjust its value accordingly.
The x86-64 RSP register is designed to be used as a stack pointer. Instructions such as push and pop in x86-64 implicitly use RSP as the stack pointer, modifying its value and accessing the memory it points to. So when using the x86-64 instruction set like a stack machine, the straightforward choice is to use RSP as the stack pointer. Now, let's compile the expression 1+2 using x86-64 as a stack machine. Below is the x86-64 assembly.
// Push the left-hand side and the right-hand side
push 1
push 2

// Pop the left and right sides into RAX and RDI and add them
pop rdi
pop rax
add rax, rdi

// Push the result of the addition onto the stack
push rax
Since x86-64 has no instruction to "add the two elements RSP points at," we need to load them into registers, perform the addition there, and push the result back onto the stack. That is what the add in the code above is doing.
Similarly, implementing 2*3+4*5 on x86-64 looks like this:
// Compute 2*3 and push the result onto the stack
push 2
push 3
pop rdi
pop rax
imul rax, rdi
push rax

// Compute 4*5 and push the result onto the stack
push 4
push 5
pop rdi
pop rax
imul rax, rdi
push rax

// Add the two values at the top of the stack,
// i.e. compute 2*3+4*5
pop rdi
pop rax
add rax, rdi
push rax
In this way, by using x86-64 stack manipulation instructions, you can run code that is quite similar to a stack machine even on x86-64.
The following gen function implements this method in C:
void gen(Node *node) {
  if (node->kind == ND_NUM) {
    printf(" push %d\n", node->val);
    return;
  }

  gen(node->lhs);
  gen(node->rhs);

  printf(" pop rdi\n");
  printf(" pop rax\n");

  switch (node->kind) {
  case ND_ADD:
    printf(" add rax, rdi\n");
    break;
  case ND_SUB:
    printf(" sub rax, rdi\n");
    break;
  case ND_MUL:
    printf(" imul rax, rdi\n");
    break;
  case ND_DIV:
    printf(" cqo\n");
    printf(" idiv rdi\n");
    break;
  }

  printf(" push rax\n");
}
Although this is not a particularly important point for parsing or code generation, the code above uses the idiv instruction, which has a tricky specification, so let's explain it.

idiv performs signed division. If x86-64's idiv had a straightforward specification, the code above would have been written idiv rax, rdi, but no division instruction taking two registers exists on x86-64. Instead, idiv implicitly takes RDX and RAX, treats them together as a single 128-bit integer, divides it by the 64-bit value in the argument register, sets the quotient in RAX, and sets the remainder in RDX. The cqo instruction sign-extends the 64-bit value in RAX to 128 bits and sets it in RDX and RAX, which is why the code above calls cqo before idiv.
Well, this concludes the explanation of stack machines. By now, you should be able to perform complex parsing and translate the resulting abstract syntax tree into machine code. Let's go back to writing a compiler to put that knowledge to use!
Column: The size of multiplication and division results
In general, multiplying two n-digit numbers requires 2n digits to represent the result. For example, the largest product of two 3-digit decimal numbers is 999×999=998001, a 6-digit number. When multiplication is implemented in hardware in the straightforward way, it actually produces a result with twice as many digits as the argument registers. On x86-64, rather than discarding part of that result, the upper half is stored in RDX and the lower half in RAX.
If division is implemented straightforwardly, the dividend needs twice as many digits as the divisor. You could start the calculation with the upper digits filled with zeros, but x86-64 lets you specify the upper digits using RDX. Division also necessarily computes the remainder at the same time, and the specification is that it is stored in RDX.
Many RISC processors do not implement features like those of x86-64 above. On RISC, it is common that multiplication stores only the lower half of the result in a register, and that for division the upper digits are implicitly initialized to zero.
Comparing such instruction sets with x86-64, the x86-64 specification, which wastes none of the computed value, may look superior, but in practice this is not really the case. For example, even though at the assembly level multiplication always computes a double-width result, C and other languages provide no standard way to access the upper half of the result, so in practice it cannot be used.
Also, many RISC processors have separate instructions for computing the upper and lower halves of a product, making it possible to obtain a double-width result using two instructions. When those instructions are executed consecutively, the hardware often recognizes the pattern and executes them internally as a single instruction as an optimization. So on modern processors, RISC is not at a particular disadvantage compared to x86-64 even when handling large numbers.
The x86-64 instruction set has its current form largely for historical reasons, and multiplication and division are one example of this.
Column: Optimizing compiler
The x86-64 assembly used for illustration in this chapter may look quite inefficient. For example, a push instruction that puts a number on the stack followed by a pop of that value straight into a register could be a single mov instruction. Some readers may feel the urge to optimize away such redundancy. But don't give in to that temptation. At the very beginning of code generation, it is better to output redundant code in exchange for ease of implementation in the compiler.

Optimization passes can be added to 9cc later if necessary. It is not difficult to rescan the generated assembly and replace instructions matching particular patterns with other instructions. For example, rules such as "replace a push immediately followed by a pop with a mov" or "if consecutive add instructions add immediate values to the same register, replace them with one add of the sum of the immediates" can be applied mechanically to replace redundant code with more efficient code without changing its meaning.
Mixing code generation and optimization complicates the compiler. And if the code is convoluted from the beginning, it becomes harder, not easier, to add optimization passes later. As Donald Knuth said, "premature optimization is the root of all evil." In the compiler you write, consider ease of implementation only. Don't worry about obvious verbosity in the output; it can be removed later.
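As a taste of what such a later peephole pass might look like, here is a toy sketch of my own that operates on two adjacent lines of assembly text; a real pass would work on an instruction list rather than raw strings, and the function name and line format here are assumptions:

```c
#include <stdio.h>
#include <string.h>

// A toy peephole rule over two adjacent assembly lines: rewrite
//   push <imm> ; pop <reg>   =>   mov <reg>, <imm>
// Returns 1 and fills `out` if the rule applies, 0 otherwise.
static int peephole(const char *line1, const char *line2,
                    char *out, size_t n) {
  long imm;
  char reg[16];
  if (sscanf(line1, " push %ld", &imm) == 1 &&
      sscanf(line2, " pop %15s", reg) == 1) {
    snprintf(out, n, "  mov %s, %ld", reg, imm);
    return 1;
  }
  return 0;
}
```

The point is that the replacement is purely local and meaning-preserving, which is why it can be bolted on after the naive code generator is finished.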
In this chapter, we modify the compiler created in the previous chapters and extend it so that it can handle four-arithmetic expressions including precedence parentheses. All the necessary parts already exist, so only a small amount of new code is needed: modify the compiler's main function to use the newly created parser and code generator. The code should look like the following.
int main(int argc, char **argv) {
  if (argc != 2) {
    error("incorrect number of arguments");
    return 1;
  }

  // Tokenize and parse
  user_input = argv[1];
  token = tokenize(user_input);
  Node *node = expr();

  // Output the first half of the assembly
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // Generate code while descending the abstract syntax tree
  gen(node);

  // The value of the entire expression should remain at the top of the
  // stack, so load it into RAX and make it the function's return value
  printf(" pop rax\n");
  printf(" ret\n");
  return 0;
}
At this point, expressions consisting of addition, subtraction, multiplication, and priority parentheses should compile correctly. Let's add some tests.
assert 47 '5+6*7'
assert 15 '5*(9-6)'
assert 4 '(3+5)/2'
For convenience of explanation, the story so far has proceeded as if *, /, and () were implemented all at once, but in practice it is better not to implement them in one go. Since we already had a working compiler that could do addition and subtraction, first introduce the abstract syntax tree and the code generator based on it without breaking that functionality. Since no new features are added at that point, no new tests are needed. After that, implement *, /, and (), including tests.
Reference implementation
Column: Memory management in 9cc
If you have read this far in this book, you may be wondering how memory management works in this compiler. The code we've seen so far uses calloc (a variant of malloc) but never calls free; in other words, allocated memory is never freed. Isn't that a bit sloppy?
In reality, this design of ``setting the memory management policy to not perform memory management'' was intentionally chosen by the author after considering various trade-offs.
The advantage of this design is that by not freeing memory, you can write code as if it were a language with a garbage collector. This not only eliminates the need to write memory management code, but also eliminates the mysterious bugs associated with manual memory management.
On the other hand, there are virtually no problems caused by not freeing, considering that it is run on a computer like a normal PC. A compiler is a short-lived program that simply reads a single C file and outputs assembly. All allocated memory is automatically released by the OS when the program ends. Therefore, the only issue is how much memory to allocate in total, but according to my actual measurements, even when compiling a fairly large C file, the memory usage is only about 100MiB. Therefore, not freeing is a realistically effective strategy. For example, the D language compiler DMD adopts a policy of only mallocing and not freeing based on the same idea. 1
The subtraction operator `-` can be written not only between two terms, as in `5-3`, but also before a single term, as in `-3`. Similarly, the `+` operator can be written with the left-hand side omitted, as in `+3`. An operator like this that takes only one term is called a "unary operator". An operator that takes two terms is called a "binary operator".
Besides `+` and `-`, C has other unary operators, such as `&` for taking the address of a value and `*` for dereferencing a pointer, but in this step we will implement only `+` and `-`.
Unary `+` and unary `-` use the same symbols as binary `+` and `-`, but their definitions differ. Binary `-` is defined as an operation that subtracts the right-hand side from the left-hand side, but unary `-` has no left-hand side in the first place, so the binary definition does not apply as it stands. In C, unary `-` is defined as an operation that negates the sign of its operand. Unary `+` is an operator that returns its operand unchanged; it is an operator with no real use, and it exists only as a counterpart to unary `-`.
It is appropriate to think of unary and binary `+`/`-` as distinct operators that happen to share a name, with similar but different definitions. Whether an occurrence is unary or binary is determined by context. The new grammar containing unary `+`/`-` looks like this:
expr = mul ("+" mul | "-" mul)*
mul = unary ("*" unary | "/" unary)*
unary = ("+" | "-")? primary
primary = num | "(" expr ")"
The new grammar adds a nonterminal `unary`, and `mul` now uses `unary` instead of `primary`. `X?` is EBNF syntax for an optional element, i.e. `X` occurs zero or one time. The rule `unary = ("+" | "-")? primary` therefore says that the nonterminal `unary` is a `primary` optionally preceded by a single `+` or `-`.
Check that expressions such as `-3` and `-(3+5)` match the new grammar. The syntax tree for `-3*+5` is shown below.
Let's change the parser to follow this new grammar. As usual, the parser change consists of mapping the grammar directly onto function calls. The function that parses `unary` is shown below.
Node *unary() {
if (consume('+'))
return primary();
if (consume('-'))
return new_node(ND_SUB, new_node_num(0), primary());
return primary();
}
Here, I decided to replace `+x` with `x` and `-x` with `0-x` at the parsing stage. As a result, no code generator changes are required for this step.
This step is completed by writing a few tests and checking them in along with the code that adds unary `+`/`-`. When writing tests, keep the results in the range 0-255, since they are returned as process exit codes. An expression such as `-10+20` uses a unary `-` but has a positive overall value, so use tests of that kind.
Reference implementation
Column: Unary plus and the pros and cons of grammar
The unary `+` operator was not present in the original C compiler; it was officially added to the language when C was standardized by ANSI (the American National Standards Institute) in 1989. Since unary `-` exists, having unary `+` as well certainly gives the grammar more symmetry, but in practice unary `+` has no particular use.
On the other hand, adding unary `+` to the grammar has a side effect. Suppose someone new to C means to use the `+=` operator but accidentally writes `i =+ 3`. Without unary `+`, this is simply an invalid expression, but with unary `+` it is interpreted the same as writing `i = +3`, and the compiler silently accepts it as a legal assignment expression that assigns 3 to `i`.
ANSI's C standardization committee understood this issue and still decided to add unary `+` to the language. What do you think? If you had been on the C standards committee at the time, would you have supported it or opposed it?
In this section, we implement `<`, `<=`, `>`, `>=`, `==`, and `!=`. Although these comparison operators may appear to have special meaning, they are actually ordinary binary operators that take two integers and return one, just like `+` and `-`. Just as `+` returns the result of adding its two sides, `==` returns 1 if its two sides are equal and 0 if they differ.
The symbol tokens we have dealt with so far have all been one character long, and our code assumes that, but to handle comparison operators such as `==` we need to generalize the code. Let's add a `len` member to the token structure so that a token can store the length of its string. The new `Token` structure type is shown below.
struct Token {
  TokenKind kind; // Token kind
  Token *next;    // Next input token
  int val;        // If kind is TK_NUM, its value
  char *str;      // Token string
  int len;        // Token length
};
This change requires modifying functions such as `consume` and `expect` so that they take strings instead of characters. Here is `consume` after the change:
bool consume(char *op) {
if (token->kind != TK_RESERVED ||
strlen(op) != token->len ||
memcmp(token->str, op, token->len))
return false;
token = token->next;
return true;
}
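For reference, `expect` can be changed along the same lines. The following is a sketch based on the same pattern as `consume` above; the error-reporting helper `error_at` is an assumption here (substitute whatever error routine your tokenizer uses):

```c
// Sketch: expect() after generalizing to multi-character operators.
// error_at is assumed to be the tokenizer's error-reporting helper.
void expect(char *op) {
  if (token->kind != TK_RESERVED ||
      strlen(op) != token->len ||
      memcmp(token->str, op, token->len))
    error_at(token->str, "expected \"%s\"", op);
  token = token->next;
}
```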
When tokenizing symbols that consist of multiple characters, longer tokens must be recognized first. For example, if the remaining string starts with `>=`, and you check for `>` before checking for `>=` with something like `strncmp(p, ">=", 2)`, the input will incorrectly be tokenized as `>` followed by `=`.
To add support for comparison operators to the parser, consider what the grammar looks like with them added. Listing the operators that have appeared so far in order of precedence, from lowest to highest:

1. `==` `!=`
2. `<` `<=` `>` `>=`
3. `+` `-`
4. `*` `/`
5. unary `+` unary `-`
6. `()`
Precedence can be expressed in a generative grammar by mapping operators of different precedence to different nonterminals. If we work this out in the same way as for `expr` and `mul`, the new grammar with comparison operators added is as follows.
expr = equality
equality = relational ("==" relational | "!=" relational)*
relational = add ("<" add | "<=" add | ">" add | ">=" add)*
add = mul ("+" mul | "-" mul)*
mul = unary ("*" unary | "/" unary)*
unary = ("+" | "-")? primary
primary = num | "(" expr ")"
Here `equality` represents `==` and `!=`, and `relational` represents `<`, `<=`, `>`, and `>=`. These nonterminals can be mapped directly onto functions using the usual pattern for parsing left-associative operators.
Note that in the grammar above, `expr` and `equality` are separated to make it explicit that an entire expression is an `equality`. The right-hand side of `equality` could have been written directly as the right-hand side of `expr`, but I think the grammar above is easier to read.
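As an example of that pattern, a parsing function for `equality` might look like the following sketch. It follows the style of the book's parser; the node kinds `ND_EQ` and `ND_NE` are assumed names, by analogy with `ND_SUB` used earlier:

```c
// Sketch: left-associative parsing of == and !=, mirroring the rule
//   equality = relational ("==" relational | "!=" relational)*
Node *equality() {
  Node *node = relational();
  for (;;) {
    if (consume("=="))
      node = new_node(ND_EQ, node, relational());
    else if (consume("!="))
      node = new_node(ND_NE, node, relational());
    else
      return node;
  }
}
```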
Column: Simple and verbose code vs. advanced and concise code
In recursive descent parsing, you write code that corresponds almost exactly to the production rules, so functions that parse similar rules end up looking similar. The `equality`, `relational`, `add`, and `mul` functions we have written so far should all have a similar shape.
It's probably natural to think about how the common patterns of these functions can be better abstracted using metaprogramming techniques such as C macros, C++ templates, higher-order functions, and code generation. In fact, it is possible to do such a thing. However, this book does not intentionally do that. The reason is as follows.
Simple code is easy to understand, even if it is somewhat verbose. If you end up making similar changes to similar functions later on, it's actually not that big of a deal. Highly abstracted code, on the other hand, can be difficult to understand because you first need to understand the abstraction mechanism and then how to use it. For example, if I had started this book by writing a function that uses metaprogramming to generate functions for recursive descent parsing, this book would have been much more difficult.
You don't always have to aim to write elegant and concise code. When you aim for that, you tend to end up making the code as hard to understand as it can possibly be.
The person writing the code becomes an expert on that code, so they tend to believe that code which looks concise and lean from an expert's point of view is good code. Most readers of the code, however, are not as familiar with it as the author, so you need to question your own sense as the code's writer to some extent. One important technique for creating programs that are easy to understand and maintain is to deliberately write plain code, even when you could write something cleverer.
On x86-64, comparisons are done using the cmp instruction. The code that pops two integers from the stack, performs a comparison, and sets RAX to 1 if they are identical, and 0 otherwise, looks like this:
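The assembly block is missing from this copy; judging from the step-by-step description that follows, it would have read along these lines:

```
pop rdi
pop rax
cmp rax, rdi
sete al
movzb rax, al
```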
The code is a bit voluminous for such a short piece of assembly, so let's walk through it step by step.
The first two lines pop the values off the stack. The third line compares the popped values. Where does the comparison result go? On x86-64, the result of a comparison instruction is stored in a special "flags register". The flags register is updated every time an integer arithmetic or comparison instruction executes, and it contains a bit indicating whether the result was 0, a bit indicating whether an overflow occurred, a bit indicating whether the result was negative, and so on.
The flags register is not an ordinary integer register, so to set the comparison result in RAX, you need to copy specific bits of the flags register into RAX. The `sete` instruction does this: it sets the specified register (AL here) to 1 if the two values examined by the preceding `cmp` instruction were equal, and to 0 otherwise.
AL is a register name that has not appeared in this book so far; AL is simply an alias for the lower 8 bits of RAX. Therefore, when `sete` sets a value in AL, RAX is updated as well. However, when RAX is updated via AL, the upper 56 bits keep their original value, so to set the whole of RAX to 0 or 1, the upper 56 bits must be cleared to zero. The `movzb` instruction does that. It would be convenient if `sete` could write directly to RAX, but `sete` is specified to take only an 8-bit register as its argument, so the comparison uses these two instructions to place the result in RAX.
You can implement the other comparison operators by using other instructions in place of `sete`: use `setl` for `<`, `setle` for `<=`, and `setne` for `!=`. `>` and `>=` do not need to be supported by the code generator: simply swap the two sides in the parser, and read `>` as `<` and `>=` as `<=`.
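Put together, the code-generator cases for the comparison operators might look like the following sketch. The node-kind names `ND_EQ`, `ND_NE`, `ND_LT`, and `ND_LE` are assumptions following the book's naming; the two operands are assumed to have already been popped into RAX and RDI:

```c
// Sketch: inside the code generator's switch, after
//   pop rdi
//   pop rax
// have been emitted for the two operands.
case ND_EQ:
  printf("  cmp rax, rdi\n");
  printf("  sete al\n");
  printf("  movzb rax, al\n");
  break;
case ND_NE:
  printf("  cmp rax, rdi\n");
  printf("  setne al\n");
  printf("  movzb rax, al\n");
  break;
case ND_LT: // "<" ; ">" is handled by swapping operands in the parser
  printf("  cmp rax, rdi\n");
  printf("  setl al\n");
  printf("  movzb rax, al\n");
  break;
case ND_LE: // "<=" ; ">=" likewise becomes "<=" with swapped operands
  printf("  cmp rax, rdi\n");
  printf("  setle al\n");
  printf("  movzb rax, al\n");
  break;
```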
Reference implementation
Column: Flag registers and hardware
This x86-64 design, in which the result of a value comparison is implicitly stored in a special register distinct from the ordinary integer registers, may seem confusing at first. In fact, some RISC processors deliberately avoid having a flags register and instead provide instructions that place the result of a comparison in an ordinary register. RISC-V, for example, is such an instruction set.
However, from a hardware implementer's point of view, a flags register is very easy to add in a simple implementation. When performing an integer operation, you branch the wires carrying the result and connect them to extra logic: checking whether the result is zero (all lines are 0) or negative (the top bit's line is 1) is all it takes to set the corresponding bits of the flags register. That is exactly how a CPU with a flags register is implemented, and the flags register is updated every time an integer operation is performed.
In such a design, the flags register is updated not only by `cmp` but also by `add`, `sub`, and so on. In fact, `cmp` is essentially a special `sub` that updates only the flags register. If you execute `sub rax, rdi` and then inspect the flags register, you can tell the relative magnitude of RAX and RDI, but RAX would be updated as a side effect, so `cmp` is provided as a variant of `sub` that does not write to the integer register.
In software, computing something extra always takes extra time, but in hardware, branching wires and using a few extra transistors does not in itself incur a time penalty. So in a naive hardware implementation, the cost of updating the flags register on every operation is essentially zero.
Up to this point, development has proceeded with a file structure consisting of one C file and one test shell script. There is nothing wrong with this structure, but since the source code keeps getting longer, I decided to split it into multiple C files to make it easier to read. In this step, we split the single file 9cc.c into the following files:

- `9cc.h`: header file
- `main.c`: `main` function
- `parse.c`: parser
- `codegen.c`: code generator

The `main` function is small, so it could have been put into one of the other C files, but since it semantically belongs to neither `parse.c` nor `codegen.c`, I decided to give it a file of its own.
This chapter explains the concept of separate compilation and its significance, and then explains the specific steps.
Separate compilation means writing a program split across multiple source files and compiling them separately. With separate compilation, the compiler reads fragments of the program and outputs corresponding fragments, rather than the whole program. A file containing program fragments that cannot be executed on their own is called an "object file" (extension: `.o`).
In separate compilation, the object files are finally joined together into a single executable file. The program that combines object files into one executable is called a linker.
Let's understand why we need separate compilation. Actually, technically there is no necessity to split the source. If you give the compiler all the source code at once, it is theoretically possible for the compiler to output a complete executable without the aid of a linker.
However, such an approach requires the compiler to know literally all the code the program uses. For example, standard library functions such as `printf` are usually written in C by the library's authors, but to avoid the linking step you would have to feed the source code of those functions to the compiler every single time. Compiling the same functions over and over is mostly a waste of time, so standard libraries are usually distributed as precompiled object files that do not need recompiling. In other words, even a program consisting of a single source file is in fact using separate compilation, as long as it uses the standard library.
Without separate compilation, changing a single line would require recompiling all the code. Compiling tens of thousands of lines can take tens of seconds, and a large project can have more than 10 million lines of source code, so compiling it as one unit could take more than a day and require memory on the order of hundreds of GiB. Such a build procedure is unrealistic.
Another problem is that simply writing all functions and variables in one file makes it difficult for humans to manage.
Separate compilation is necessary for the reasons mentioned above.
Column: Linker history
The linker's job of stitching multiple fragments of machine-language routines into a single program has been needed since the dawn of computing. In 1947, John Mauchly (project leader of ENIAC, the first digital computer) wrote a program that relocated subprograms read from tape and combined them into a single program.
Even on the earliest computers, it was desirable to write a general-purpose subroutine once and use it from many programs, and that requires a linker to combine program fragments into an executable. Since 1947 was a time when assemblers were not yet in use and code was written directly in machine language, programmers actually wanted a linker before they wanted an assembler.
With separate compilation, the compiler sees only part of the program's code at a time, but that does not mean the compiler can compile just any fragment of a program. For example, consider the following code.
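The code block does not survive in this copy; the following is a sketch of the kind of code the text describes (only the name `struct Foo` comes from the text; the members and the function are my illustration):

```c
struct Foo {
  int a;
  int b;
};

// To compile this function, the compiler must know the layout of
// struct Foo: without the definition above, it cannot compute the
// offset of the member b within the struct.
int get_b(struct Foo *p) {
  return p->b;
}
```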
In the above code, if you know the type of struct Foo, you can output the assembly corresponding to this code, but otherwise you will not be able to compile this function.
When compiling separately, each C file must contain enough information to be compiled on its own. However, if every file contained all the code from the other files, it would no longer be separate compilation, so the information has to be chosen selectively.
As an example, consider what information needs to be included in order to output code that calls a function in another C file. The compiler requires the following information:
- A function call uses the `call` instruction to jump to the beginning of another function. Depending on the argument types, the compiler may also have to convert an integer to a floating-point number, and it should report an error if the type or number of arguments is wrong. The compiler therefore needs the function's number of arguments and the type of each argument.
- The address to jump to is not known at separate-compilation time, but the compiler can output a `call` instruction that jumps to address 0 and record a note in the object file saying, in effect, "the address at byte X of this object file must be corrected to the address of this function". The linker reads this information, decides the layout of the executable file, and then binary-patches the program fragments to fix up the jump destinations (this operation is called "relocation"). Separate compilation therefore needs the function's name, but not its address.

Putting these requirements together, we have enough information to call a function as long as we have everything except the function body `{ ... }`. A function without its body is called a function "declaration". The declaration tells the compiler only the type and the name; it contains none of the function's code. For example, below is the declaration of `strncmp`:
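The code line is missing from this copy; the standard declaration of `strncmp` reads:

```c
#include <stddef.h> // for size_t; in practice this declaration lives in <string.h>

int strncmp(const char *s1, const char *s2, size_t n);
```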
By looking at the above line, the compiler can learn of the existence of `strncmp` and its type. A declaration accompanied by the function's code is called a "definition".
You can add the `extern` keyword to a function declaration to mark it explicitly as a declaration. However, in the case of functions, a declaration can be distinguished from a definition by the absence of the function body, so `extern` is not required.
Note that you only need to know the type of the argument, so the name can be omitted in the declaration, but it is common to write the name in the declaration to make it easier for humans to understand.
Consider struct types as another example. If there are two or more C files that use the same structure, you must write a declaration for the same structure in each C file. If a structure is used only in one C file, other C files do not need to know about its existence.
In C, the declarations needed when compiling other C files are collected in a header file (extension `.h`). If you write a declaration in `foo.h` and write `#include "foo.h"` in another C file that needs it, the `#include` line is replaced by the contents of `foo.h`.
`typedef` and the like are also used to give the compiler type information. If they are used in multiple C files, they too must be written in the header file.
The compiler does not output any assembly when it reads the declaration. A declaration is the information necessary to use a function or variable contained in another file, and does not itself define the function or variable.
Based on this discussion of separate compilation, you can now understand what is actually going on behind the common advice "when using `printf`, write `#include <stdio.h>` as a magic incantation".
The C standard library is implicitly passed to the linker, so the linker can link object files containing calls to `printf` into an executable. The compiler, on the other hand, knows nothing about `printf` by default: `printf` is not a built-in function, and there is no rule that standard library header files are loaded automatically, so immediately after startup the compiler knows nothing about `printf`. From that state, including the header file that ships with the C standard library lets the compiler learn of the existence of `printf` and its type, which makes it possible to compile calls to `printf`.
Column: One-pass compiler and forward declaration
In C, declarations are sometimes required even when all functions are written in one file. The C language specification allows the compiler to compile each function one by one from the beginning without reading the entire file. Therefore, any function must be able to compile using only the information written up to the point where the function appears in the file. Therefore, if you want to use a function defined at the end of a file, you need to write a declaration for that function in advance. Such a declaration is called a " forward declaration."
You can avoid writing most forward declarations by adjusting the order in which functions are written in files, but forward declarations are essential if you want to write functions that are mutually recursive.
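For example, mutually recursive functions cannot be ordered so that each is defined before its first use, so one of them needs a forward declaration (this example is my illustration, not from the original text):

```c
#include <stdbool.h>

// Forward declaration: is_odd is called before its definition appears.
bool is_odd(int n);

bool is_even(int n) {
  return n == 0 ? true : is_odd(n - 1);
}

bool is_odd(int n) {
  return n == 0 ? false : is_even(n - 1);
}
```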
The C specification's allowance for compiling without reading the whole file made sense when main memory was very small, but today it is an outdated rule. If compilers were a little smarter, declarations for definitions written in the same file could be made unnecessary. However, this behavior is part of the language specification, so it has to be kept in mind.
When the object files are finally assembled and passed to the linker, they must contain just enough information to construct the program as a whole.
If a program contains only declarations of a function `foo` and no definition, the individual C files containing calls to `foo` still compile normally. However, when the linker finally tries to build the complete program, the address of `foo` that should be filled in cannot be resolved, and the link fails with an error.
Errors that occur when linking are called link errors .
Link errors also occur when multiple object files contain the same function or variable: the linker has no way of knowing which of the duplicates to choose. Duplicate-definition errors often arise from mistakenly writing a definition in a header file. Header files are included from multiple C files, so a definition in a header is in effect the same as writing duplicate definitions in multiple C files. To resolve this kind of error, write only the declaration in the header file, and put the definition itself in exactly one C file.
Column: Duplicate definitions and link errors
When there are duplicate definitions, the linker may choose one and ignore the remaining definitions. With such a linker, duplicate definitions will not result in an error.
Even in real object files, whether or not duplication is allowed can be chosen per definition, and inline functions and the results of C++ template expansion are placed in object files in a way that permits duplication. Object file formats and linker behavior are surprisingly complex and full of exceptions, but those behaviors are just that: exceptions. By default, duplicate definitions result in an error.
Our compiler doesn't yet have global variables, so we don't have assembly examples for them yet, but global variables are almost the same as functions at the assembly level. Therefore, like functions, global variables also have a distinction between definitions and declarations. If the body of a variable is duplicated in multiple C files, it usually results in a linking error.
Global variables are allocated in non-executable memory areas by default, so jumping there will cause your program to crash with a segmentation fault, but other than that there is essentially no difference between data and code. At runtime, you can read the function as data like a global variable, or you can execute the data as code by changing the memory attributes to allow execution and jumping to the data.
Let's verify with some actual code that both functions and global variables are, at bottom, just data residing in memory. In the code below, the identifier `main` is defined as a global variable whose contents are x86-64 machine code.
foo.c
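The contents of foo.c are not preserved in this copy; based on the disassembly shown below, the file would have been along these lines:

```c
// foo.c: the identifier main defined as data, not as a function.
// The bytes are x86-64 machine code for "mov rax,0x2a" and "ret".
char main[] = "\x48\xc7\xc0\x2a\x00\x00\x00\xc3";
```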
Save the above C code in a file, compile it, and inspect the contents with `objdump`. By default, `objdump` displays the contents of global variables only in hexadecimal, but you can pass the `-D` option to force data to be disassembled as code.
$ cc -c foo.c
$ objdump -D -M intel foo.o
Disassembly of section .data:
0000000000000000 <main>:
0: 48 c7 c0 2a 00 00 00 mov rax,0x2a
7: c3 ret
The default behavior of mapping data into non-executable memory can be changed by passing the `-Wl,--omagic` option at link time. Let's generate an executable file using this option.
$ cc -static -Wl,--omagic -o foo foo.o
Functions and variables are just labels in assembly and share a single namespace, so when combining object files, the linker does not care which labels are functions and which are data. Thus, even though `main` is defined as data at the C level, the link succeeds just as if `main` were a function.
Let's run the generated file.
$ ./foo
$ echo $?
42
As shown above, the value 42 is correctly returned: the contents of the global variable `main` were executed as code.
In C syntax, a global variable becomes a declaration if you add `extern` to it. Below is the declaration of a global variable `foo` of type int.
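The code line is missing from this copy; the declaration would be:

```c
extern int foo;
```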
When writing a program that uses `foo`, the above line should be written in the header file. Then define `foo` in exactly one of the C files. Below is the definition of `foo`.
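The code line is missing from this copy; the definition would be:

```c
int foo;
```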
Note that in C, global variables without an initializer are initialized to 0, so such a definition is semantically the same as one initialized with 0, `{0, 0, ...}`, `"\0\0\0\0..."`, and so on.
When writing an initializer, as in `int foo = 3`, write it only in the definition. A declaration only tells the compiler the variable's type, so no initializer is needed there: the compiler outputs no assembly when it sees the declaration of a global variable, so it does not need to know on the spot how the contents are initialized.
When the initializer is omitted, the declaration and definition of a global variable differ only in the presence or absence of `extern` and thus look similar, but a declaration and a definition are different things. Make sure the distinction is clear in your mind here.
Column: Intel CPU F00F bug
Intel Pentium processors before 1997 had a serious bug that caused the CPU to hang completely when executing the 4-byte instruction `F0 0F C7 C8`.
Officially, no assembly instruction corresponds to this byte sequence, but written out as assembly it would be the instruction `lock cmpxchg8b eax`. `0F C7 C8` is `cmpxchg8b eax`, an instruction that atomically exchanges an 8-byte value between a register and memory (even on a multi-core system, the intermediate state cannot be observed by other cores). `F0` is the `lock` prefix, which makes the immediately following instruction atomic. However, since `cmpxchg8b` is already atomic, `lock cmpxchg8b eax` is a redundant and illegal encoding. Because such an assembly instruction is not supposed to exist grammatically, the byte sequence `F0 0F C7 C8` would never appear in a normal program, and Intel was unable to detect this bug before mass-producing the processor.
Using the hack of writing the main function as data, the code that reproduces the F00F bug can be written in C in one line:
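The code line is not preserved in this copy; given the byte sequence above, the one-liner would have been:

```c
char main[] = "\xF0\x0F\xC7\xC8";
```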
This function is harmless on modern x86, but on a Pentium back in 1997, anyone could easily hang the entire system with this one-line program.
The F00F bug is not a big problem if an individual has the PC entirely to themselves, but when a CPU is shared, as in what we now call the cloud, the bug is fatal. At first the F00F bug was thought to be unfixable except by recalling and replacing the CPUs, but a tricky workaround at the level of the OS kernel's exception handler was later devised, and, fortunately for Intel, a product recall was avoided.
Try splitting the file using the structure shown at the beginning of this chapter. `9cc.h` is the header file. Depending on the structure of the program, one `.h` file is sometimes prepared per `.c` file, but extra declarations do no particular harm, so there is no need to manage dependencies at that granularity here. Prepare a single `9cc.h` and `#include "9cc.h"` it from every C file.
Now that the program spans multiple files, let's update the `Makefile` as well. The `Makefile` below compiles and links all the .c files in the current directory to produce an executable called 9cc. It assumes the project has a single header file, 9cc.h, which every .c file includes.
CFLAGS=-std=c11 -g -static
SRCS=$(wildcard *.c)
OBJS=$(SRCS:.c=.o)
9cc: $(OBJS)
$(CC) -o 9cc $(OBJS) $(LDFLAGS)
$(OBJS): 9cc.h
test: 9cc
./test.sh
clean:
rm -f 9cc *.o *~ tmp*
.PHONY: test clean
Note that the indentation in a `Makefile` must be a tab character.
make is a powerful tool, and you do not necessarily need to master it, but being able to read the `Makefile` above is useful in many situations. So in this section we will walk through it.
In a `Makefile`, a rule consists of a target, a colon, zero or more dependency files after the colon, and zero or more tab-indented command lines. The name before the colon is called the "target"; the file names after the colon are called the dependency files.
When you run `make foo`, make tries to create the file `foo`. If the specified target file already exists, make re-runs the target's rule only when the target file is older than its dependency files. This enables behavior such as regenerating the binary only when the source code has changed.
`.PHONY` is a special target used to mark dummy targets. `make test` and `make clean` are, after all, not run in order to create files named `test` or `clean`, but make does not normally know that, so if a file named `test` or `clean` happened to exist, `make test` or `make clean` would do nothing. By listing such dummy targets under `.PHONY`, you tell make that the rule's commands should be executed regardless of whether a file with that name exists.
`CFLAGS`, `SRCS`, and `OBJS` are variables.
`CFLAGS` is a variable recognized by make's built-in rules and holds the command-line options passed to the C compiler. Here we pass the following flags:

- `-std=c11`: tell the compiler that the source code is written in C11, the latest C standard
- `-g`: output debug information
- `-static`: link statically

`wildcard`, used on the right-hand side of `SRCS`, is a function provided by make that expands to the file names matching its argument. Here, `$(wildcard *.c)` expands to `main.c parse.c codegen.c`.
The right-hand side of `OBJS` uses a substitution reference to produce the value of `SRCS` with each .c replaced by .o. Since `SRCS` is `main.c parse.c codegen.c`, `OBJS` becomes `main.o parse.o codegen.o`.
With this in mind, let's trace what happens when you run `make 9cc`. Since make tries to build the target given as its argument, creating the file `9cc` is the overall goal of the command (when no argument is given, the first rule is chosen, so here `9cc` can be omitted). make achieves this by walking the dependencies and building any files that are missing or out of date.
9cc's dependent files are the .o files corresponding to the .c files in the current directory. If a .o file remains from a previous run of make and its timestamp is newer than that of the corresponding .c file, make won't bother running the same command again: it runs the compiler to generate a .o file only if that file does not exist or its .c file is newer.
The rule $(OBJS): 9cc.h says that all .o files depend on 9cc.h. Therefore, if you change 9cc.h, all .o files will be recompiled.
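Putting the pieces together, a Makefile matching the description above would look roughly like the following sketch. The exact recipe bodies (such as a test.sh test runner and the clean file list) are illustrative assumptions, not the book's exact listing:

```make
CFLAGS=-std=c11 -g -static
SRCS=$(wildcard *.c)
OBJS=$(SRCS:.c=.o)

# Link the compiler binary from all object files.
9cc: $(OBJS)
	$(CC) -o 9cc $(OBJS) $(LDFLAGS)

# Every object file depends on the shared header.
$(OBJS): 9cc.h

test: 9cc
	./test.sh

clean:
	rm -f 9cc *.o *~ tmp*

.PHONY: test clean
```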
Column: Various meanings of the static keyword
In C, the static keyword is mainly used for two purposes:

1. Attached to a local variable, so that its value is preserved across calls to the function.
2. Attached to a global variable or function, to limit its scope to the file.

Although these two uses have little in common, they share the same keyword, which is one of the sources of confusion when learning C. Ideally, different keywords would have been used, say persistent for usage 1 and private for usage 2. More ideally still, for usage 2 it might have been better to make private the default and attach public to variables and functions meant to have global scope.

The reason C reuses keywords is compatibility with code assets written in the past. If you add a new keyword such as private to the language, existing programs that use that word as the name of a variable or function no longer compile. C's designers didn't want that, so instead of adding keywords they chose to reuse existing keywords in different contexts.

If, at some point in the 1970s, the decision had been made to add new keywords rather than reuse existing ones, a huge amount of later code changes and confusion could have been avoided; but it is hard to say what one would actually have decided in that position. This is quite a difficult problem.
In this chapter, we will implement functions and local variables. We also implement a simple control structure. After completing this chapter, you will be able to compile code like this:
// add up the numbers from m to n
sum(m, n) {
  acc = 0;
  for (i = m; i <= n; i = i + 1)
    acc = acc + i;
  return acc;
}

main() {
  return sum(1, 10); // returns 55
}
Although there is still a gap between the above code and C, I think it can be said that it has come quite close to C.
Up to the previous chapter, we were able to create a compiler for a language that can perform four arithmetic operations. In this section, we'll add functionality to the language to allow the use of variables. Specifically, the goal is to be able to compile multiple statements containing variables as follows.
a = 3;
b = 5 * 6 - 8;
a + b / 2;
We will use the result of the last expression as the result of the entire program. This language feels much more like a "real language" than one that can only do arithmetic.

In this chapter, we will first explain how variables can be implemented, and then implement them step by step.
Variables in C live in memory. You could say that a variable is a memory address with a name. By naming memory addresses, we can say things like "access the variable a" instead of "access address 0x6080 in memory."
However, a function's local variables must exist separately for each function call. If we considered only ease of implementation, it might seem attractive to fix each variable's address, e.g. "place the local variable a of function f at address 0x6080," but then the function would not work correctly when called recursively. In C, local variables are therefore placed on the stack so that each function call gets its own set of local variables.
Let's look at the contents of the stack using a concrete example. Suppose a function f has local variables a and b, and some other function calls f with the call instruction. The call instruction pushes the return address onto the stack, so when f is called, the top of the stack contains that return address. Assume the stack also already contains some other values; their specific contents are not important here, so we represent them with "...". As a diagram:

...
return address ← RSP
Here, the notation "← RSP" indicates that the current value of the RSP register points to that address. Assume a and b are each 8 bytes in size.

The stack grows downward, so to reserve space for a and b from this state, RSP must be decreased by two variables' worth of space, 16 bytes in total. Doing so gives:
...
return address
a
b ← RSP

With this layout, a can be accessed as the value at RSP+8 and b as the value at RSP.
The memory area that is secured for each function call is called a "function frame" or "activation record."
How many bytes to subtract from RSP, and in what order to place the variables in the reserved area, are invisible to other functions, so the compiler is free to decide them however it likes.

Local variables are basically implemented in this simple way.
However, this scheme has one drawback, so real implementations use one more register. Recall that RSP can change while a function is running: 9cc (like other compilers) pushes and pops the intermediate results of expressions via RSP, so the value of RSP changes frequently, and a and b cannot be accessed at fixed offsets from RSP.

A common way to solve this is to keep, in a register separate from RSP, a pointer that always points to the start of the current function frame. Such a register is called a "base register," and the value it holds is called the "base pointer." By convention, x86-64 uses the RBP register as the base register.
The base pointer must not change during function execution (that is the whole point of having one). Since calling another function must not leave RBP holding a different value on return, each function call must save the original base pointer and write it back before returning.
The figure below shows the state of the stack when calling functions using a base pointer. Suppose a function g with local variables x and y calls f. While g is running, the stack looks like this:

...
return address of g
RBP at the time of calling g ← RBP
x
y ← RSP
Calling f from here results in the following state:

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RBP
a
b ← RSP
In this way, a can always be accessed at RBP-8 and b at RBP-16, whatever the current value of RSP. To create this stack state, the compiler should output the following assembly at the beginning of each function.
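Collected in one place, the prologue for a function with 16 bytes of local variables is the three-instruction sequence whose effects are traced one instruction at a time below:

```
push rbp
mov rbp, rsp
sub rsp, 16
```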
The standard instruction sequence that the compiler outputs at the beginning of a function is called the "prologue." Note that the 16 here must in practice be a value matching the number and size of each function's variables.
Let's confirm that executing the above code, starting with RSP pointing at the return address, creates the expected function frame. The state of the stack after each instruction is shown below.

The stack immediately after f is called with call:

...
return address of g
RBP at the time of calling g ← RBP
x
y
return address of f ← RSP
The stack after running push rbp:

...
return address of g
RBP at the time of calling g ← RBP
x
y
return address of f
RBP at the time of calling f ← RSP
The stack after running mov rbp, rsp:

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RSP, RBP
The stack after running sub rsp, 16:

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RBP
a
b ← RSP
When returning from a function, we write the original value back into RBP, point RSP at the return address, and execute the ret instruction (ret pops an address off the stack and jumps to it). The code to do this is simple:
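Collected in one place, the epilogue is the three-instruction sequence traced one instruction at a time below:

```
mov rsp, rbp
pop rbp
ret
```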
The standard instruction sequence that the compiler outputs at the end of a function is called the "epilogue."

The state of the stack while the epilogue runs is shown below. The stack area below the address RSP points to is regarded as invalid data, so it is omitted from the diagrams.
The stack before running mov rsp, rbp:

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RBP
a
b ← RSP
The stack after running mov rsp, rbp:

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RSP, RBP
The stack after running pop rbp:

...
return address of g
RBP at the time of calling g ← RBP
x
y
return address of f ← RSP
The stack after running ret:

...
return address of g
RBP at the time of calling g ← RBP
x
y ← RSP
Thus, executing the epilogue restores the stack to the state the calling function g had. The call instruction pushes the address of the instruction following itself onto the stack, and the ret at the end of the epilogue pops that address and jumps to it, so execution resumes in g at the instruction after the call. This matches exactly the behavior of functions as we know them.

This is how function calls and function-local variables are implemented.
Column: Direction of stack growth
The x86-64 stack grows from higher addresses toward lower ones, as explained above. It might feel more natural for the stack to grow in the opposite direction, upward, so why are stacks designed to grow downward?

In fact, there is no technical necessity for the stack to grow downward. On most real CPUs and ABIs, the stack starts at a high address and grows downward, but there are a few architectures, though decidedly in the minority, in which the stack grows the other way. On the 8051 microcontroller, in PA-RISC's ABI, and on Multics, for example, the stack grows toward higher addresses.
However, the design of the stack growing downwards is not particularly unnatural.
Immediately after power-on, when the CPU starts executing a program from a blank state, the address at which execution begins is usually fixed by the CPU's specification, and a common design is to start at a low address such as address 0. In that case, the program's code is usually placed together at low addresses. To keep the growing stack from overlapping the program's code, it makes sense to place the two as far apart as possible: put the stack at the high end of the address space and let it grow toward the middle. Designed that way, the stack grows downward.
Of course, you can imagine CPU designs different from the above, and for some of them growing the stack upward would be the more natural layout. Honestly, it could go either way; the reality is simply that the general industry convention is for the machine stack to grow downward.
Now that you know how to implement variables, let's start implementing them. However, supporting an arbitrary number of variables right away would be too difficult, so in this step we limit variables to a single lowercase letter and assume that all of them always exist: variable a is at RBP-8, b at RBP-16, c at RBP-24, and so on. Since there are 26 letters in the alphabet, if we decide to push RSP down by 26 × 8 = 208 bytes when a function is entered, space is reserved for every single-letter variable.
Let's implement it now. First, modify the tokenizer so that it can tokenize single-letter variables in addition to the existing grammar elements. To do this we add a new token type, TK_IDENT; since the variable name can be read from the str member, there is no need to add any new member to the Token type.

Modify the tokenizer to create a TK_IDENT token when it sees a lowercase letter. You should be able to do this by adding an if statement to your tokenizer.
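As a self-contained sketch of that change, the following miniature tokenizer shows only the loop structure and the new branch; the Tok struct and TK_* names are illustrative stand-ins for the book's Token machinery:

```c
#include <ctype.h>

// Illustrative stand-ins for the book's token types.
typedef enum { TK_RESERVED, TK_IDENT, TK_NUM } TokKind;
typedef struct { TokKind kind; const char *str; } Tok;

// Tokenize input into `out`, returning the number of tokens produced.
int tokenize_sketch(const char *p, Tok *out) {
  int i = 0;
  while (*p) {
    if (isspace((unsigned char)*p)) { p++; continue; }
    if ('a' <= *p && *p <= 'z') {       // new: single-letter variable
      out[i].kind = TK_IDENT;
      out[i].str = p;
      i++; p++;
      continue;
    }
    if (isdigit((unsigned char)*p)) {   // existing: number
      out[i].kind = TK_NUM;
      out[i].str = p;
      i++;
      while (isdigit((unsigned char)*p)) p++;
      continue;
    }
    out[i].kind = TK_RESERVED;          // existing: punctuation
    out[i].str = p;
    i++; p++;
  }
  return i;
}
```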
With recursive descent parsing, once you know the grammar you can map it mechanically onto function calls. So, to work out what changes the parser needs, we should consider what the grammar looks like once variable names (identifiers) are added.
Let's call an identifier ident. Like num, it is a terminal symbol. Variables can be used anywhere a number can be used, so if we write num | ident wherever num appeared before, we get a grammar that allows variables in the same places as numbers.
In addition, we need to add assignment expressions to the grammar. Variables would be pointless if they could not be assigned to, so we want a grammar that allows expressions such as a=1. Here, let's adopt a C-compatible syntax, which also permits expressions like a=b=1.

Furthermore, we want to be able to write multiple statements separated by semicolons, so the resulting new grammar is as follows.
program = stmt*
stmt = expr ";"
expr = assign
assign = equality ("=" assign)?
equality = relational ("==" relational | "!=" relational)*
relational = add ("<" add | "<=" add | ">" add | ">=" add)*
add = mul ("+" mul | "-" mul)*
mul = unary ("*" unary | "/" unary)*
unary = ("+" | "-")? primary
primary = num | ident | "(" expr ")"
First, check that programs such as 42; and a=b=2; a+b; conform to this grammar. Then modify the parser you have built so far so that it can parse the grammar above. At this stage, expressions like a+1=5 can also be parsed, and that is correct behavior: weeding out such semantically invalid expressions is the job of a later pass. There is nothing particularly tricky about modifying the parser; as before, you should be able to do it just by mapping the grammar elements onto function calls.

Since we now chain multiple expressions separated by semicolons, we need somewhere to store the multiple nodes produced by parsing. For now, prepare the following global array and store the parsed nodes there in order. If you put NULL after the last node, you can tell where the sequence ends. Part of the new code to add is shown below.
Node *code[100];

Node *assign() {
  Node *node = equality();
  if (consume("="))
    node = new_node(ND_ASSIGN, node, assign());
  return node;
}

Node *expr() {
  return assign();
}

Node *stmt() {
  Node *node = expr();
  expect(";");
  return node;
}

void program() {
  int i = 0;
  while (!at_eof())
    code[i++] = stmt();
  code[i] = NULL;
}
The abstract syntax tree needs to be able to express a new node representing a local variable. To that end, let's add a new node kind and a new member to the node. For example, it could look like the following. With this data structure, the parser creates and returns ND_LVAR nodes for identifier tokens.
typedef enum {
  ND_ADD,    // +
  ND_SUB,    // -
  ND_MUL,    // *
  ND_DIV,    // /
  ND_ASSIGN, // =
  ND_LVAR,   // local variable
  ND_NUM,    // integer
} NodeKind;

typedef struct Node Node;

// AST node type
struct Node {
  NodeKind kind; // node kind
  Node *lhs;     // left-hand side
  Node *rhs;     // right-hand side
  int val;       // used only if kind == ND_NUM
  int offset;    // used only if kind == ND_LVAR
};
offset is a member representing the local variable's offset from the base pointer. At the moment, variable a is at RBP-8, b at RBP-16, and so on: each local variable sits at a fixed position determined by its name, so the offset can be decided at parsing time. Below is the code that reads an identifier and returns an ND_LVAR node.
Node *primary() {
  ...
  Token *tok = consume_ident();
  if (tok) {
    Node *node = calloc(1, sizeof(Node));
    node->kind = ND_LVAR;
    node->offset = (tok->str[0] - 'a' + 1) * 8;
    return node;
  }
  ...
Column: ASCII code
In ASCII, characters are assigned to the numbers 0 through 127. The table of ASCII character assignments is shown below.
0 | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL |
8 | BS | HT | LF | VT | FF | CR | SO | SI |
16 | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB |
24 | CAN | EM | SUB | ESC | FS | GS | RS | US |
32 | SP | ! | " | # | $ | % | & | ' |
40 | ( | ) | * | + | , | - | . | / |
48 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
56 | 8 | 9 | : | ; | < | = | > | ? |
64 | @ | A | B | C | D | E | F | G |
72 | H | I | J | K | L | M | N | O |
80 | P | Q | R | S | T | U | V | W |
88 | X | Y | Z | [ | \ | ] | ^ | _ |
96 | ` | a | b | c | d | e | f | g |
104 | h | i | j | k | l | m | n | o |
112 | p | q | r | s | t | u | v | w |
120 | x | y | z | { | | | } | ~ | DEL |
The codes 0-31 are control characters. Nowadays, with the exception of NUL and the newline character, there is almost no occasion to use them, and most control characters are just a waste of prime real estate in the character code. But when ASCII was created in 1963, these control characters really were in common use. During the development of the ASCII standard there was even a proposal to include more control characters in place of the lowercase letters.

Digits are assigned to 48-57, uppercase letters to 65-90, and lowercase letters to 97-122. Notice that each of these groups is assigned consecutive codes: 0123456789 and abcdefg... are contiguous in the character code. Placing characters that have a defined order at consecutive positions may seem only natural, but in EBCDIC, a major character code of the era, the letters of the alphabet were not contiguous, owing to the influence of punched cards.
In C, a character is just a small integer value; writing a character literal means the same thing as writing the corresponding code as a number. In other words, assuming ASCII, 'a' is equivalent to 97 and '0' to 48. In the code above, there was an expression that subtracted 'a' from a letter treated as a number; doing so computes how many positions a given letter is from a. This technique works precisely because the letters are contiguous in ASCII.
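For instance, a couple of one-line helpers make the arithmetic concrete; this is a generic illustration, not code from 9cc:

```c
#include <assert.h>

// Character arithmetic works because '0'-'9' and 'a'-'z' occupy
// consecutive codes in ASCII.
int letter_index(char c) { // how many letters past 'a' (0 for 'a')
  return c - 'a';
}

int digit_value(char c) {  // numeric value of a digit character
  return c - '0';
}
```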
Unlike other binary operators, assignment expressions require special handling of the value on the left side, so let's explain that here.
Not just any expression is allowed on the left-hand side of an assignment expression. For example, 1=2 is not allowed: 1 cannot be changed into 2. Assignments such as a=2 are allowed, but expressions such as (a+1)=2 are illegal. Pointers and structs do not yet exist in 9cc, but if they did, an assignment through a pointer such as *p=2, or to a struct member such as a.b=2, would have to be accepted as legitimate. How can we distinguish valid expressions from invalid ones?
There is a simple answer to that question. In C, only expressions that specify memory addresses can appear on the left side of assignment expressions.
Variables exist in memory and have addresses, so variables can be written on the left-hand side of assignments. Similarly, a pointer dereference such as *p can also appear on the left-hand side, since the value of p is an address. Accessing a struct member as in a.b can also appear on the left-hand side, because it refers to the memory address at the member's offset from the struct's start position in memory.

On the other hand, the result of an expression such as a+1 is not a variable, so it cannot serve as an expression that designates a memory address. Such temporary values may exist only in registers, not in memory, and even when they are in memory, they normally cannot be accessed at a fixed offset from some known variable. For this reason, writing &(a+1) to take the address of a+1's result is not allowed and results in a compilation error, and such expressions cannot be written on the left-hand side of an assignment.
A value that can be written on the left-hand side is called a left value, and a value that cannot is called a right value; in English they are known as lvalues and rvalues, respectively. In our language at this point, only variables are lvalues; every other value is an rvalue.
Code generation for variables starts from the lvalue. If a variable appears as the left-hand side of an assignment, compute the variable's address as the value of that left-hand side, and store the result of evaluating the right-hand side at that address; this implements assignment expressions. If the variable appears in any other context, convert the lvalue to an rvalue: compute the variable's address in the same way, then load the value from that address; this yields the variable's value.
The code generation up to this point has only accessed memory at the top of the stack, but local variables require access to any location on the stack. This section explains how to access memory.
The CPU can load and store values from any address in memory, not just the top of the stack.
When loading a value from memory, use the syntax mov dst, [src]. This instruction means "treat the value of the src register as an address, load the value from there, and store it into dst." For example, mov rdi, [rax] loads a value from the address in RAX and puts it in RDI.

When storing, use the syntax mov [dst], src, which means "treat the value of the dst register as an address and store the value of the src register there." For example, mov [rdi], rax stores the value of RAX at the address contained in RDI.
push and pop are instructions that access memory by implicitly treating RSP as an address, so they can in fact be rewritten as sequences of ordinary memory-access instructions. For example, pop rax is the same as the two instructions

mov rax, [rsp]
add rsp, 8

and push rax is the same as the two instructions

sub rsp, 8
mov [rsp], rax
Using what we've learned so far, let's modify the code generator to handle expressions containing variables. This change adds a function that evaluates an expression as an lvalue; in the code below, gen_lval does that. When the given node refers to a variable, gen_lval computes the address of the variable and pushes it onto the stack; otherwise it reports an error. This rules out expressions such as (a+1)=2.
When using a variable as an rvalue, first evaluate it as an lvalue, then treat the result of the calculation at the top of the stack as an address, and load the value from that address. The code is shown below.
void gen_lval(Node *node) {
  if (node->kind != ND_LVAR)
    error("left-hand side of assignment is not a variable");

  printf(" mov rax, rbp\n");
  printf(" sub rax, %d\n", node->offset);
  printf(" push rax\n");
}

void gen(Node *node) {
  switch (node->kind) {
  case ND_NUM:
    printf(" push %d\n", node->val);
    return;
  case ND_LVAR:
    gen_lval(node);
    printf(" pop rax\n");
    printf(" mov rax, [rax]\n");
    printf(" push rax\n");
    return;
  case ND_ASSIGN:
    gen_lval(node->lhs);
    gen(node->rhs);

    printf(" pop rdi\n");
    printf(" pop rax\n");
    printf(" mov [rax], rdi\n");
    printf(" push rdi\n");
    return;
  }

  gen(node->lhs);
  gen(node->rhs);

  printf(" pop rdi\n");
  printf(" pop rax\n");

  switch (node->kind) {
  case ND_ADD:
    printf(" add rax, rdi\n");
    break;
  case ND_SUB:
    printf(" sub rax, rdi\n");
    break;
  case ND_MUL:
    printf(" imul rax, rdi\n");
    break;
  case ND_DIV:
    printf(" cqo\n");
    printf(" idiv rdi\n");
  }

  printf(" push rax\n");
}
Now that we have all the parts, let's update the main function and actually run the compiler.
int main(int argc, char **argv) {
  if (argc != 2) {
    error("wrong number of arguments");
    return 1;
  }

  // Tokenize and parse.
  // The result is stored in code.
  user_input = argv[1];
  tokenize();
  program();

  // Print the first half of the assembly.
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // Prologue: reserve space for 26 variables.
  printf(" push rbp\n");
  printf(" mov rbp, rsp\n");
  printf(" sub rsp, 208\n");

  // Generate code statement by statement, starting from the first.
  for (int i = 0; code[i]; i++) {
    gen(code[i]);

    // Evaluating an expression should leave exactly one value
    // on the stack, so pop it to keep the stack from growing.
    printf(" pop rax\n");
  }

  // Epilogue: the result of the last expression is left in RAX,
  // which becomes the program's return value.
  printf(" mov rsp, rbp\n");
  printf(" pop rbp\n");
  printf(" ret\n");
  return 0;
}
In the previous chapter, we fixed variable names to one letter and treated the 26 local variables a through z as always existing. This section adds support for identifiers longer than one character, so that code using variables with names such as foo and bar can be compiled.
Variables can be used without being defined. Therefore, the parser must determine whether each identifier has been seen before, and automatically allocate a variable in the stack area if it is new.
First, change your tokenizer to read multi-character identifiers as TK_IDENT tokens.

We will represent the variables as a linked list: one variable is represented by a struct called LVar, and a pointer called locals holds the first element. Expressed in code, it looks like this:
typedef struct LVar LVar;

// local variable
struct LVar {
  LVar *next; // next variable or NULL
  char *name; // variable name
  int len;    // length of the name
  int offset; // offset from RBP
};

// the list of local variables
LVar *locals;
When the parser encounters a TK_IDENT token, it checks whether that identifier has appeared before. By searching locals by variable name, you can tell whether the variable already exists. If it has appeared before, use that variable's offset as is. If it is a new variable, create a new LVar, give it a fresh offset, and use that offset.
Below is a function that searches for variables by name.
// Search for a variable by name. Returns NULL if not found.
LVar *find_lvar(Token *tok) {
  for (LVar *var = locals; var; var = var->next)
    if (var->len == tok->len && !memcmp(tok->str, var->name, var->len))
      return var;
  return NULL;
}
In your parser, you should just add code like the following:
Token *tok = consume_ident();
if (tok) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = ND_LVAR;

  LVar *lvar = find_lvar(tok);
  if (lvar) {
    node->offset = lvar->offset;
  } else {
    lvar = calloc(1, sizeof(LVar));
    lvar->next = locals;
    lvar->name = tok->str;
    lvar->len = tok->len;
    lvar->offset = locals ? locals->offset + 8 : 8; // first variable gets offset 8
    node->offset = lvar->offset;
    locals = lvar;
  }
  return node;
}
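To see the list in action on its own, here is a self-contained version of the lookup-or-register logic above; lvar_offset is a hypothetical helper standing in for the parser code, with the token reduced to a plain name/length pair:

```c
#include <stdlib.h>
#include <string.h>

// Linked list of local variables, as in the LVar struct above.
typedef struct LVar LVar;
struct LVar {
  LVar *next;  // next variable or NULL
  char *name;  // variable name
  int len;     // length of the name
  int offset;  // offset from RBP
};

static LVar *locals; // variables seen so far

// Search for a variable by name. Returns NULL if not found.
static LVar *find_lvar(char *name, int len) {
  for (LVar *var = locals; var; var = var->next)
    if (var->len == len && !memcmp(name, var->name, len))
      return var;
  return NULL;
}

// Return an existing variable's offset, or register a new variable
// in the next free 8-byte slot and return its fresh offset.
int lvar_offset(char *name, int len) {
  LVar *lvar = find_lvar(name, len);
  if (lvar)
    return lvar->offset;
  lvar = calloc(1, sizeof(LVar));
  lvar->next = locals;
  lvar->name = name;
  lvar->len = len;
  lvar->offset = (locals ? locals->offset : 0) + 8;
  locals = lvar;
  return lvar->offset;
}
```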
Column: Frequency of appearance of machine language instructions
If you look at the assembly output by 9cc, you will notice that there are many data-movement instructions such as mov and push, and relatively few instructions that do "real computation" such as add and mul. One reason is that 9cc does no optimization and emits useless movement instructions, but in fact even optimizing compilers emit data-movement instructions most of all. Below is a graph of the results of disassembling all the executable files in /bin in my environment and counting the instructions.

As you can see, mov instructions alone account for about 30% of all instructions. A computer is a data-processing machine, and the most frequent part of data processing is moving data. If you consider "moving data to the right place" to be one of the essences of data processing, this preponderance of mov instructions seems a natural result, but I suspect many readers will have found it surprising.
In this chapter, we'll add return statements to the language.

We will allow a return statement to be written anywhere in the program. As in ordinary C, execution stops at the return and returns from the function. For example, a program that begins with the statement return 5; returns 5, regardless of the statements that follow it.
To implement this feature, let's first consider what happens to the grammar when return is added. Previously a statement could only be an expression, but the new grammar also allows a statement of the form return <expr>;. The stmt rule therefore becomes:

stmt = expr ";" | "return" expr ";"
To implement this, the tokenizer, parser, and code generator all need to be tweaked a bit.
First, let's make the tokenizer recognize return and represent it with a token of type TK_RETURN. Since there are only a limited number of tokens with special grammatical meaning (keywords such as return, while, and int), it is easiest to give each of them its own token type like this.
It might seem that all you have to do to determine whether the next token is return is check whether the remaining input string starts with return, but then an identifier such as returnx would be mis-tokenized as return followed by x. So, in addition to checking that the input starts with return, we must make sure that the character after it is not a character that can form part of a token.
Below is a function that determines whether a given character is a token character, i.e. an alphanumeric character or an underscore.
int is_alnum(char c) {
  return ('a' <= c && c <= 'z') ||
         ('A' <= c && c <= 'Z') ||
         ('0' <= c && c <= '9') ||
         (c == '_');
}
Using this function, you can add the following code to tokenize so that return is tokenized as TK_RETURN.
if (strncmp(p, "return", 6) == 0 && !is_alnum(p[6])) {
  tokens[i].ty = TK_RETURN;
  tokens[i].str = p;
  i++;
  p += 6;
  continue;
}
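The effect of the extra is_alnum check can be demonstrated in isolation; starts_with_return below is a hypothetical predicate extracted from the tokenizer branch above:

```c
#include <string.h>

static int is_alnum(char c) {
  return ('a' <= c && c <= 'z') ||
         ('A' <= c && c <= 'Z') ||
         ('0' <= c && c <= '9') ||
         (c == '_');
}

// Returns 1 if the input begins with the keyword "return"
// as a standalone token, 0 otherwise.
int starts_with_return(const char *p) {
  return strncmp(p, "return", 6) == 0 && !is_alnum(p[6]);
}
```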
Next, let's modify the parser so that it can parse token sequences containing TK_RETURN. To do this, first add ND_RETURN, a node kind representing a return statement. Then modify the function that parses a statement so that it can parse return. As usual, the grammar can be mapped directly onto function calls. The new stmt function is shown below.
Node *stmt() {
  Node *node;

  if (consume(TK_RETURN)) {
    node = calloc(1, sizeof(Node));
    node->kind = ND_RETURN;
    node->lhs = expr();
  } else {
    node = expr();
  }

  if (!consume(';'))
    error_at(tokens[pos].str, "expected ';'");
  return node;
}
Since ND_RETURN nodes are created only here, we decided to allocate and initialize the node on the spot.
Finally, modify the code generator so that it outputs appropriate assembly for ND_RETURN nodes. Part of the new gen function is shown below.
void gen(Node *node) {
  if (node->kind == ND_RETURN) {
    gen(node->lhs);
    printf(" pop rax\n");
    printf(" mov rsp, rbp\n");
    printf(" pop rbp\n");
    printf(" ret\n");
    return;
  }
  ...
In the code above, gen(node->lhs) emits the code for the expression that computes the return value. That code should leave exactly one value on the top of the stack. The assembly that follows gen(node->lhs) pops that value off the stack, places it in RAX, and returns from the function.
The functions implemented up to the previous chapter always output exactly one ret instruction at the end. Implementing return statements as described in this chapter results in one extra ret per return statement. It would be possible to merge these ret instructions into one, but to keep the implementation simple we decided to output multiple rets. There is no point in fussing over such details at this stage; prioritizing implementation simplicity matters more. Being able to write difficult code is a useful skill, but sometimes an even more useful skill is not making the code difficult in the first place.
Column: Grammar hierarchy
"Regular expressions" are often used to determine whether input matches some kind of rule, but more complex grammars cannot be expressed using regular expressions. For example, it is impossible in principle to write a regular expression that determines whether parentheses are balanced within a string.
Context-free grammars (grammars that can be expressed in BNF) are more powerful than regular expressions. For example, the language of strings with balanced parentheses, which regular expressions cannot handle, can be written in BNF as S → SS | "(" S ")" | ε. However, like regular expressions, context-free grammars have their limits, and they cannot express all the rules that appear in ordinary programming languages. For example, the rule "variables must be declared before they are used" is part of the C grammar, but such a rule cannot be expressed with a context-free grammar.
If you write a C compiler, then unless the compiler has a bug, the inputs the compiler accepts are exactly the valid C programs, and the inputs it rejects are the invalid ones. In other words, with the power of an ordinary computer it is possible to decide whether an input matches the C grammar, so the compiler as a whole is a recognizer for a class of grammars more powerful than context-free grammars. A grammar for which membership can always be decided with a YES/NO answer like this is called decidable.
You can also consider grammars that are not decidable. For example, take the question "given a computer program as input, does the program eventually finish executing, or does it keep running forever?" (assuming it runs on an idealized computer with infinite memory). It has been proven that this question cannot, in general, be answered YES or NO without actually running the program. That is, for the question of whether a program halts, you can answer YES if it does halt, but you cannot answer NO, because if the program does not halt it simply keeps running forever. A class of grammars for which the decision machine may return YES/NO but may also fail to finish executing at all is called Turing-recognizable.
In other words, there is a hierarchy of grammars: regular expressions < context-free grammars < decidable < Turing-recognizable. These grammar hierarchies are widely studied as part of computer science. The famous unsolved problem P ≟ NP is also a problem related to this hierarchy.
So far we have been building the compiler incrementally. In a sense, this development process can be said to follow the history of C.
If you look at the current version of C, you'll find parts that don't make much sense or are unnecessarily complex, but these things can't be understood without looking at history. Many of the puzzling aspects of current C begin to make sense when you read the early C code and look at the early form of C and the subsequent development of the language and compilers.
C was developed in 1972 as a language for Unix. Source code from 1972 and 1973, a very early period in the history of C, survives on tape, and the files read from it are published on the Internet. Let's take a look at the code of the C compiler of that time. Below is a function that takes a printf-style format message and prints it as a compilation error message.
error(s, p1, p2) {
  extern printf, line, fout, flush, putchar, nerror;
  int f;
  nerror++;
  flush();
  f = fout;
  fout = 1;
  printf("%d: ", line);
  printf(s, p1, p2);
  putchar('\n');
  fout = f;
}
It looks like a somewhat strange, C-like language that is not quite C. At the time, C really was such a language. The first thing you notice when reading this code is that, just as in the early stages of the compiler we are creating, there are no types for function return values or arguments. Here, s is supposed to be a pointer to a string, and p1 and p2 integers, but on the machines of the time everything was the same size, so variables are untyped like this.
The second line contains the declarations of the global variables and functions referenced by error. At the time, C compilers had no header files or C preprocessor, so programmers had to tell the compiler about the existence of variables and functions in this way.
Like our current compiler, functions are only checked to see if the name exists, not to see if the types or number of arguments match. After putting the expected number of arguments on the stack, they could just jump to the function body and the function call would be successful, so they probably thought that was fine.
fout is a global variable that holds the number of the file descriptor to write to. Back then fprintf did not exist yet, and in order to write a string to standard error instead of standard output, you had to switch the output destination via a global variable.
printf is called twice in error. The second printf passes two values in addition to the format string. So what happens when displaying an error message whose format string takes only one value?
In fact, this error function works correctly even if you simply call it with fewer arguments. Recall that argument checking for function calls did not exist at this point. Arguments such as s, p1 and p2 simply refer to the first, second and third words from the stack pointer, and the compiler does not care whether a value actually corresponding to p2 was passed. printf accesses only as many extra arguments as there are conversion specifiers such as %d or %s in the format string, so if the message contains only one %d, p2 is never accessed at all. Therefore there is no problem even if the number of arguments does not match.
In this way, the early C compiler has many similarities with 9cc at its current stage.
Let's look at another code example. The code below is a function that copies the passed string into a statically allocated area and returns a pointer to the beginning of that area. In other words, it is a strdup-like function that uses static storage.
copy(s)
char s[]; {
  extern tsp;
  char tsp[], otsp[];
  otsp = tsp;
  while(*tsp++ = *s++);
  return(otsp);
}
At that time, the int *p syntax for pointer declarations had not yet been devised. Instead, a pointer type was declared as int p[]. The thing that looks like a variable definition between the function argument list and the function body is there to declare s as a pointer type.
There are other things worth mentioning about this early C compiler.
The && and || operators did not exist yet. At this time, & and | behaved context-dependently: they acted as logical operators only inside conditional expressions such as those of if.
The += operator was written =+. This syntax had the problem that if you intended to assign -1 to i and wrote i=-1 without spaces, it would be interpreted as i =- 1 and decrement i, resulting in unintended behavior.
In addition to the above, C in the early 1970s lacked various features. However, this C compiler was itself written in C, as you can see from the source code above. C was already self-hosted before it even had structures.
By looking at old source code, you can also deduce why some of C's confusing syntax ended up in its current form. Parsing variable definitions is easy if the syntax guarantees that the variable name always comes after a keyword such as extern, auto, int or char. Parsing a pointer is easy if [] simply comes right after the variable name. And it seems natural that, as this syntax evolved along the lines seen in this early compiler, it would end up in its current, unnecessarily complex form.
Well, what Unix and C co-inventor Dennis Ritchie was doing around 1973 was truly incremental development. He was writing a compiler in C while developing C itself. The current version of C is not some finished product that reached a special endpoint as the language kept adding features; it simply became "complete" as a language at the point where Dennis Ritchie decided it had enough features. That is all there is to it.
Our compiler, too, does not pursue the finished form from the start. Since the completed form of C has no special meaning, there is probably no point in aiming for it specifically. Continuing to develop a language that has a reasonable set of features at every point in time, and eventually growing it into C, is the time-honored development method used by the original C compiler. Let's proceed with development with confidence!
Column: Rob Pike's 5 Rules of Programming
9cc is influenced by Rob Pike's programming philosophy. Rob Pike is a former colleague of C creator Dennis Ritchie, the creator of the Go language, and, together with Unix creator Ken Thompson, a co-designer of Unicode's UTF-8 encoding.
I quote Rob Pike's 5 Rules of Programming.
Further chapters are currently being written. I think I wrote the chapters up to this point carefully, but honestly the chapters from here on have not yet reached a publishable level. However, readers who have made it this far can probably fill in the missing information themselves, and some may want guidance on how to proceed, so I am publishing them for what they are worth.
In this section, we will add control structures such as if, if ... else, while and for to the language. Although these control structures may seem complex at first glance, they are relatively easy to implement when compiling straight to assembly.
Since assembly has no counterpart to C's control structures, they are represented in assembly using branch instructions and labels. In a sense, this is the same as rewriting the control structures using goto. Just as a human can manually rewrite control structures into goto statements, the compiler can easily implement control structures simply by generating code according to a pattern.
There are various other control constructs, such as do ... while, goto, continue and break, but you do not need to implement them yet.
The new grammar with if, while and for added is shown below.
program = stmt*
stmt = expr ";"
| "if" "(" expr ")" stmt ("else" stmt)?
| "while" "(" expr ")" stmt
| "for" "(" expr? ";" expr? ";" expr? ")" stmt
| ...
...
When reading expr? ";", look ahead one token: if the next token is ;, assume the expr is absent; otherwise, read an expr.
if (A) B compiles into assembly like this:
code compiled from A   // the result should be at the top of the stack
pop rax
cmp rax, 0
je .LendXXX
code compiled from B
.LendXXX:
In other words, if (A) B is expanded into goto-style control flow like the above. Make sure that all labels are unique, for example by substituting serial numbers for XXX.
if (A) B else C compiles into assembly like this:
code compiled from A   // the result should be at the top of the stack
pop rax
cmp rax, 0
je .LelseXXX
code compiled from B
jmp .LendXXX
.LelseXXX:
code compiled from C
.LendXXX:
That is, if (A) B else C is expanded as above.
When parsing an if statement, look ahead one token after the if part to check whether else follows; if it does, compile it as if ... else, otherwise as a plain if.
while (A) B compiles like this:
.LbeginXXX:
code compiled from A
pop rax
cmp rax, 0
je .LendXXX
code compiled from B
jmp .LbeginXXX
.LendXXX:
In other words, while (A) B is expanded into goto-style control flow.
for (A; B; C) D compiles like this:
code compiled from A
.LbeginXXX:
code compiled from B
pop rax
cmp rax, 0
je .LendXXX
code compiled from D
code compiled from C
jmp .LbeginXXX
.LendXXX:
In other words, for (A; B; C) D is likewise expanded into goto-style control flow.
Note that labels starting with .L are specially recognized by the assembler and automatically become file-scoped. File-scoped labels can be referenced from within the same file, but not from other files. Therefore, if you start the labels the compiler generates for if, for and the like with .L, you do not have to worry about them colliding with labels in other files.
Compile a small loop with cc and use the assembly as a reference.
Column: Detecting run-time errors by the compiler
When writing programs in C, it is common to write data past the end of an array, or to corrupt unrelated data structures due to pointer bugs. These bugs can also become security holes, so the idea is to proactively detect bugs during runtime with the help of a compiler.
For example, if you pass the -fstack-protector option to GCC, the compiled function will write a random pointer-sized integer called a "canary" into the function frame in the prologue, and check in the epilogue that the value of the canary has not changed. This way, if the contents of the stack are overwritten unnoticed due to an array buffer overflow, the canary's value will almost certainly have changed, so the error can be detected when the function returns. If an error is detected, the program usually terminates immediately.
LLVM has something called TSan (ThreadSanitizer) that can output code that detects at runtime when multiple threads are accessing shared data structures without properly securing locks. Additionally, LLVM's UBSan (UndefinedBehaviorSanitizer) can output code that detects at runtime whether you have inadvertently stepped on undefined behavior in C. For example, signed integer overflow is undefined behavior in C, so UBSan will report an error if a signed integer overflow occurs.
Since TSan and the like slow programs down several times over, it is unreasonable to add them to the compile options of frequently used programs, but features with relatively low runtime cost, such as the stack canary, may be enabled by default depending on the environment.
Dynamic error detection with this kind of compiler support has been actively researched in recent years, and has contributed significantly to writing reasonably secure programs in languages such as C and C++ that are not memory safe.
This step adds support for blocks, { ... }, inside which you can write multiple statements. A block is officially called a compound statement, but because that is such a long phrase, it is usually just called a block.
Blocks have the effect of combining multiple statements into a single statement. The implementation up to the previous step allowed only one statement to be executed when the conditional expression of an if or while was true, but by implementing blocks in this step, you will be able to write multiple statements grouped by {} there, just as in C.
The function body is also actually a block. The grammar requires that a function body be a block. The { ... } written in a function definition is in fact syntactically the same thing as the { ... } written after if or while.
The grammar with blocks added is shown below.
In this grammar, if a stmt starts with "{", zero or more stmt may appear until the closing "}". To parse stmt* "}", call stmt in a while loop until "}" appears, and return the results as a vector.
To implement blocks, add ND_BLOCK, a node kind representing a block. A Node of that kind must be populated with a vector containing the statements in the block. Given a node of kind ND_BLOCK, the code generator should generate code for the statements contained in the node, in order. Note that each statement leaves one value on the stack, so do not forget to pop it each time.
In this step, our goal is to recognize function calls with no arguments, such as foo(), and compile them to call foo.
The new grammar with function calls is shown below.
By looking ahead one token after reading an ident, you can tell whether it is a variable name or a function name.
For testing, prepare a C file containing something like int foo() { printf("OK\n"); }, compile it into an object file with cc -c, and link it with the output of your compiler. If everything is correct, the whole thing should link properly, and you should be able to confirm that the function you want to call is actually called.
Once that works, the next step is to be able to write function calls with arguments, like foo(3, 4). There is no need to check the number or types of the arguments. Simply evaluating the arguments in order leaves the values to be passed on the stack, so we can copy them into registers in the order specified by the x86-64 ABI and call the function. More than six arguments need not be supported.
For testing, you should be able to check the behavior by preparing a function like int foo(int x, int y) { printf("%d\n", x + y); } and linking it as before.
The x86-64 function-call ABI is easy (as long as you follow the method above), but there is one caveat: RSP must be a multiple of 16 at the time of a function call. Since push and pop change RSP in 8-byte units, RSP is not necessarily a multiple of 16 when you issue a call instruction. If this promise is not kept, functions that assume RSP is a multiple of 16 will fail in mysterious ways, seemingly only half the time. Make sure to adjust RSP to a multiple of 16 before calling a function.
Once this is done, we can move on to defining functions. However, C function definitions are hard to parse, so we will not implement them all at once. Currently our language only has the int type, so instead of the syntax int foo(int x, int y) { ... } we will implement the syntax foo(x, y) { ... }, with the type names omitted.
The called side needs to be able to access the arguments by the names x and y, but values passed in registers cannot currently be accessed by name. What we do is compile as if local variables named x and y exist, and in the function prologue write the register values into the stack areas for those local variables. After that, arguments and local variables can be handled without any distinction.
Until now, the behavior was as if the whole program were implicitly wrapped in main() { ... }. We now abolish that, and all code is written inside some function. Then, when parsing the top level, the first token read should always be a function name, followed by the argument list and then the function body, which makes it straightforward to parse.
After completing this step, you will be able to calculate and display the Fibonacci sequence using recursion, which should make it a lot more interesting.
The C language specification defines specifications at the source code level. For example, language specifications specify how functions can be defined, and which files should be included to declare which functions. On the other hand, the language specifications do not specify what kind of machine language the source code written to comply with the standard will be converted into. This makes sense, since the C language standard was not determined with any particular instruction set in mind.
Therefore, at first glance it may seem that there is no need to clearly determine machine language level specifications, but in reality, specifications are determined to some extent for each platform. This specification is called ABI ( Application Binary Interface ).
The way functions have been called so far in this book assumed that the arguments are placed in registers in a particular order, and that the return value is set in RAX. Such rules about how to call functions are called "function calling conventions". Calling conventions are part of the ABI.
In addition to how arguments and return values are passed, the C-language ABI also covers things such as the sizes of types like int and long.
The ABI is just a software-level convention, so one different from what is explained in this book is conceivable; however, code that is not ABI-compatible cannot call other code or be called by it, so in practice CPU vendors and OS vendors define a standard ABI for each platform. For x86-64 there are two widely used ABIs: the System V ABI, used on Unix and macOS, and the Microsoft ABI, used on Windows. Note that these two calling conventions do not differ out of necessity; they were simply created by different people.
So far in this book, we have used our own compiler to call functions compiled with another compiler. This was possible because our C compiler and another compiler had the same ABI.
At this point, let's understand how integers, especially negative integers, are represented on a computer. This chapter explains how unsigned numbers are represented, and how signed numbers are represented using two's complement representation.
In this book, binary bit patterns are written like 0b0001_1010, with a 0b prefix and an underscore every four digits for readability. The 0b prefix is actually a compiler-specific extension accepted by many C compilers (although they usually do not allow the underscores).
The representation of an unsigned integer is the same as an ordinary binary number. Just as a decimal number has, from the lowest digit, the 1s place, 10s place, 100s place, 1000s place, ... (that is, the 10^0, 10^1, 10^2, 10^3, ... places), a binary number has, from the lowest digit, the 1s place, 2s place, 4s place, 8s place, ... (that is, the 2^0, 2^1, 2^2, 2^3, ... places).
For example, the value of the unsigned integer represented by the bit pattern 0b1110 can be found by looking at the positions of the 1 bits. Here the second, third and fourth digits, that is, the 2s, 4s and 8s places, are 1, so 0b1110 represents 2 + 4 + 8 = 14. Some example diagrams are shown below.
When you add 1 to an unsigned integer, the values cycle as shown in the following graph. This is an example of a 4-bit integer.
When the result of an operation has too many digits and therefore differs from the result you would get with infinitely many bits, this is called "overflow". For example, with 8-bit integers, 1 + 3 does not overflow, but 200 + 100 and 20 - 30 do, yielding 44 and 246 respectively. Mathematically speaking, the result is the remainder after dividing by 2^8 = 256.
Column: Interesting bugs caused by overflow
Numerical overflow can sometimes cause unexpected bugs. Here we will introduce bugs that were present in the first version of the game "Civilization".
Civilization is a strategy simulation game that pits civilizations against each other, where you choose a character like Genghis Khan or Queen Elizabeth to conquer the world or win the space race.
A bug in the original Civilization was that Gandhi, a proponent of non-violence, would suddenly launch nuclear attacks. The cause was the logic that when a civilization adopts democracy, its aggressiveness decreases by 2. In the original Civilization, Gandhi's aggressiveness was the lowest of all leaders, 1, but when, as the game progressed, the Indian civilization adopted democracy, subtracting 2 caused an overflow and his aggressiveness became 255: the supposedly least aggressive Gandhi suddenly became an extremely aggressive player. By that point the game has usually progressed far enough technologically that each civilization has nuclear weapons, so as a result Gandhi abruptly starts a nuclear war on some turn. This "Nuclear Gandhi" was found rather amusing and became a standard feature of later Civilization titles, but in the first game it was an unintentional bug.
Signed integers use two's complement representation, which treats the most significant bit specially. An n-digit two's complement integer represents the same number as the unsigned interpretation in every digit except the most significant one, whose place value is -2^(n-1) instead of 2^(n-1).
Specifically, if we consider a 4-digit binary number, each digit and the number it represents are as shown in the table below.
| digit | 4 | 3 | 2 | 1 |
| unsigned | 8 | 4 | 2 | 1 |
| signed | -8 | 4 | 2 | 1 |
As with unsigned values, the signed value represented by a bit pattern can be determined by looking at the positions of the 1 bits. For example, interpreting 0b1110 as a 4-digit signed integer, the second, third and fourth digits, that is, the 2s, 4s and -8s places, are 1, so 0b1110 represents 2 + 4 + (-8) = -2. Some example diagrams are shown below.
Under this rule, unless the most significant bit is turned on, a signed integer represents the same number as if it were interpreted as an unsigned integer. For a 4-bit integer, 0-7 have the same bit pattern whether signed or unsigned. On the other hand, if the 4th bit is on, the bit pattern represents any number from -8 to -1 (0b1000 to 0b1111). If the most significant bit is on, it becomes a negative number, so the most significant bit is sometimes called the "sign bit. "
When you add 1 to a signed integer, the values cycle as shown in the following graph. This is an example of a 4-bit integer.
Understanding the above rules will help explain a variety of seemingly strange behaviors of signed integers commonly encountered in programming.
Readers have probably experienced that when you add 1 to a signed integer, the number turns from a large number to an extremely small number when it overflows. If you consider 2's complement representation, you can understand exactly what is happening. For example, for an 8-bit signed integer, the largest number is 0b0111_1111 or 127. Adding 1 to this gives 0b1000_0000, which is -128 in two's complement representation. This is the largest negative number in absolute value.
For example, if main in a test of unary - returns -3, the exit code of the whole program will be 253. This is because, while main sets RAX to -3, that is, 0b1111_1111_1111_1111_1111_1111_1111_1101, the receiving side treats only the low 8 bits of RAX as meaningful and interprets them as an unsigned integer, so the result is as if 0b1111_1101, that is, 253, had been the return value.
In this way, what number a given bit pattern represents depends on the assumptions of the reader. For example, the characters in a paper book are really just ink stains; they only have meaning because people read them as sentences. In the same way, what is in a computer's memory is just a string of on/off bits and has no inherent meaning in itself. In order to exchange a number, the method of interpretation must match between the side that writes the value and the side that reads it.
Note that in two's complement representation, one more negative number can be represented than positive numbers. For example, with an 8-bit integer, -128 is representable but +128 is just outside the representable range. This imbalance between the positive and negative ranges is unavoidable given the mechanism: n bits can represent 2^n patterns, which is always an even number, but once one bit pattern is allocated to 0, an odd number of patterns remain, so either the positive or the negative side must get one more.
In computers, the operation of widening the bit width of a number often occurs. For example, if you want to read an 8-bit number from memory and set it in a 64-bit register, you need to stretch the 8-bit value to 64 bits.
When you're working with unsigned integers, extending the value is easy, just fill the high-order bits with 0s. For example, expanding the 4-bit value 0b1110 = 14 to 8 bits becomes 0b0000_1110 = 14.
On the other hand, when dealing with signed integers, filling the upper bits with 0 will change the number. For example, if you intend to extend the 4-bit value 0b1110 = -2 to 8 bits and change it to 0b0000_1110, it will become 14. This is not even a negative number since the sign bit is not set in the first place.
When extending a signed integer, if the sign bit is 1, the new high-order bits must be filled with all 1s, and if the sign bit is 0, the new high-order bits must be filled with all 0s. This operation is called " sign extension ." For example, if we sign-extend the 4-bit value 0b1110 = -2 to 8 bits, we get 0b1111_1110 = -2, which shows that the bit width has been successfully expanded.
You can think of an unsigned integer as having an infinite run of 0s to the left of the number, which are simply brought in when widening.
Similarly, you can think of a signed integer as having an infinite run of copies of its sign bit to the left of the number, which are brought in when widening.
In this way, when trying to fit a number into a wider bit width, you need to be aware of whether the value you are currently dealing with is unsigned or signed.
Column: Representation of negative numbers that do not require sign extension
Two's complement is the commonly used way to represent signed integers in computers, but it is not the only way to map positive and negative integers to bit patterns. For example, consider negabinary (base -2) numbers, in which the digits represent, from the lowest, (-2)^0, (-2)^1, (-2)^2, and so on. Below is a comparison table of what each digit represents in the 4-bit case.
| digit | 4 | 3 | 2 | 1 |
| unsigned | 8 | 4 | 2 | 1 |
| two's complement | -8 | 4 | 2 | 1 |
| negabinary | -8 | 4 | -2 | 1 |
A 4-bit negabinary number can represent a total of 16 integers from -10 to 5, as shown below.
| 5 | 0b0101 |
| 4 | 0b0100 |
| 3 | 0b0111 |
| 2 | 0b0110 |
| 1 | 0b0001 |
| 0 | 0b0000 |
| -1 | 0b0011 |
| -2 | 0b0010 |
| -3 | 0b1101 |
| -4 | 0b1100 |
| -5 | 0b1111 |
| -6 | 0b1110 |
| -7 | 0b1001 |
| -8 | 0b1000 |
| -9 | 0b1011 |
| -10 | 0b1010 |
Negabinary numbers have drawbacks: carrying is awkward, and 0 does not sit near the middle of the representable range. On the other hand, they have the interesting property of not needing a sign bit, so when widening a negabinary number to more digits, the new high-order bits can always be filled with 0.
In this way, there are various ways to represent integers on a computer, not just 2's complement representation. Of these, two's complement representation is the easiest to handle in terms of hardware, and is used in almost all existing computers.
The details of two's complement representation are not strictly necessary knowledge for creating a compiler, but remembering a few tricks related to it can be useful in various situations. Here we explain a simple method for reversing the sign of a number.
In two's complement representation, if you invert all bits and add 1, the sign of the number will be reversed. For example, the steps to find the bit pattern from 3 to -3 for an 8-bit signed integer are as follows.
By remembering the above method, you can easily find the bit pattern of negative numbers.
Also, by applying the same operation to a bit pattern whose sign bit is set, turning it into a positive number, you can easily find the value that bit pattern represents. For example, it is tedious to work out what 0b1111_1101 represents by adding up place values, but inverting the bits and adding 1 gives 0b0000_0011, which tells us it represents the negation of 3, that is, -3.
The reason the above trick works is relatively simple. We have not defined two's complement arithmetic mathematically up to this point, so the explanation will be a little informal, but the idea is as follows.
Inverting all the bits is the same as subtracting the bit pattern from -1, that is, from the pattern whose bits are all 1. For example, inverting the bit pattern 0b0011_0011 can be written as the following subtraction:
1111 1111
- 0011 0011
= 1100 1100
In other words, reversing the bit pattern representing the number n is equivalent to calculating -1 - n. If we add 1 to that, we are calculating (-1 - n) + 1 = -n, which means we were able to find -n for n.
Column: Literal number base
The C standard allows numbers to be written in octal, decimal, or hexadecimal. Normally, when you write a number like 123, it becomes a decimal number, when you write it with a leading 0x like 0x8040, it becomes a hexadecimal number, and when you write it with a leading 0 like 0737, it becomes an octal number.
Many readers may think they have never used C's ability to write numbers in octal, but in this syntax even a plain 0 is octal, so every C programmer actually writes octal numbers all the time. This is just trivia, but if you think about it, there is a reason behind it that may or may not be deep.
To begin with, 0 is a somewhat special number to write. Normally, a number like 1 is not written 01 or 001 just because its tens or hundreds digit is 0, but if you applied that rule to 0 itself, you would end up with an empty string. Writing nothing when you mean 0 would be a practical problem, so a special rule allows you to write 0; and once you do that, the lone 0 ends up as a somewhat special case in the C grammar, classified under the octal literals, which all begin with 0.
In the previous chapters, we have developed a language that can perform meaningful calculations, but our language cannot even print "Hello world" yet. It's time to add strings so that programs can output meaningful messages.
C string literals are closely related to the char type, global variables, and arrays. Consider the following function as an example.
The above code compiles the same way as the code below, where msg is a unique identifier that does not collide with any other identifier.
Our compiler still lacks several features needed to support string literals. In this chapter, we will implement the following features in order, so that we can support string literals and display messages with printf:
unary & and unary *
We will also add the functions necessary to test the above functions in this chapter.
In this step, as the first step toward implementing pointers, we implement the unary & operator, which returns an address, and the unary * operator, which dereferences an address. These operators properly return or take pointer-type values, but since our compiler does not yet have any type other than integers, we will substitute integers for pointer types for now. That is, &x returns the address of the variable x as a plain integer, and *x treats the value of x as an address and reads the value stored at that address.
Implementing such an operator allows code like the following to work:
Also, by taking advantage of the fact that local variables are allocated contiguously in memory, it is possible to forcibly access variables on the stack indirectly through a pointer. The code below assumes that the variable y is located 8 bytes above the variable x on the stack.
In such an implementation, which does not distinguish between pointer types and integer types, an expression such as *4 reads the value at address 4, but let's accept that for now.
Implementation is relatively easy. The grammar with unary & and unary * added is shown below. Modify the parser to follow this grammar, reading unary & and unary * as nodes of type ND_ADDR and ND_DEREF, respectively.
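The grammar figure was omitted here; a plausible reconstruction, in the EBNF style used elsewhere in this book, is:

```
unary = "+"? primary
      | "-"? primary
      | "*" unary
      | "&" unary
```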
There are very few changes to the code generator. The changes are listed below.
case ND_ADDR:
gen_lval(node->lhs);
return;
case ND_DEREF:
gen(node->lhs);
printf(" pop rax\n");
printf(" mov rax, [rax]\n");
printf(" push rax\n");
return;
Until now, all variables and function return values were implicitly of type int, so we did not bother to define variables with type names, and assumed that every new identifier was a new variable name. We can no longer make that assumption, so let's fix this point first. Please implement the following features.
Make it possible to define variables in the form int x;. There is no need to support initialization expressions such as int x = 3;, nor multiple declarators such as int x, y;. Implement only the simplest possible form.
Function definitions, which until now were written in the form foo(x, y), will be modified to take the form int foo(int x, int y). Currently, the top level should contain only function definitions, so the parser should first read int, then the function name that should follow it, and then the parameter list consisting of repeated int <parameter name>. There is no need to support more complex syntax, and no need to do anything just to "prepare for future expansion." Write just enough code to read "int <function name>(<parameter list consisting of repeated int <parameter name>>)".
In this step, we will allow int, which was previously the only thing allowed as a type name, to be followed by zero or more *. In other words, definitions such as int *x and int ***x can now be parsed.
Of course, types such as "pointer to int" must be handled by the compiler. For example, if a variable x is a pointer to an int, the compiler must know that the expression *x has type int. Since types can be arbitrarily nested, like "pointer to pointer to pointer to int," they cannot be expressed with a fixed-size representation alone.
So what we're going to do is use a pointer. Until now, the only information associated with a variable in our map was its offset from the base pointer (RBP) on the stack. Change this so that a variable can also carry its type. Roughly speaking, the variable type should be a structure like the following.
Here, ty can take one of two values: int, or "pointer to." ptr_to is a meaningful member only when ty is "pointer to," in which case it holds a pointer to the Type object representing the pointed-to type. For example, for a "pointer to int," the data structure representing that type looks like this internally:
If it is a "pointer to a pointer to an int", it will look like this:
In this way, arbitrarily complex types can be expressed within the compiler.
How do we compile expressions where the left-hand side of an assignment is not a simple variable name, such as *p=3? The basic idea is the same as when the left-hand side is a simple variable: compile *p as an lvalue, so that the address held by p is produced. When compiling a syntax tree representing *p=3, we recursively descend the tree to generate code, and the first thing called is the code generator that compiles *p as an lvalue.
The code generator branches depending on the kind of syntax tree it is given. For a simple variable, as mentioned above, it emits code that produces the address of that variable; but here it is given a dereference operator, so it must do something different. When given a dereference operator, compile the syntax tree inside it as an "rvalue." That rvalue should compile into code that computes some address (otherwise the result could not be dereferenced), so you can simply leave that address on the stack.
Once you've completed this step, you should be able to compile statements like the following:
In this step, we make it possible to write expressions such as p+1 and p-5 for values p of pointer type. Although this looks like simple integer addition, it is actually quite a different operation: p+1 does not mean adding 1 to the address held by p, but rather yields a pointer to the element next to the one p points to, so we have to add the width of the data type the pointer points to. For example, if p points to an int, p+1 adds 4 to the address in our ABI. If p is a pointer to a pointer to an int, p+1 adds 8.
Therefore, when adding to or subtracting from a pointer, you need a way to know the size of the type. At the moment it is simply 4 for int and 8 for pointers, so write code accordingly.
At this stage we have no way to allocate contiguous memory (our compiler does not have arrays yet), so writing tests is a little tricky. You can enlist the help of an external compiler: do the malloc there, and write tests that call the helper function from your compiler's output. For example, you could test it like this.
Column: size of int or long
A data model such as the x86-64 System V ABI, in which ints are 32 bits and longs and pointers are 64 bits, is called LP64, meaning that longs and pointers are 64 bits. On the same x86-64 architecture, Windows uses LLP64, a data model in which ints and longs are 32 bits and long longs and pointers are 64 bits.
LP64 and LLP64 are not ABI compatible because their long sizes differ. For example, if you create a structure that contains long members, write the entire structure to a file as-is, and then read it back by casting the file's bytes to that structure, a file written on Unix cannot be read on Windows, and vice versa.
The C specification states that a "plain" int object has "the natural size suggested by the architecture of the execution environment." Put that way, it sounds as though int must be 64 bits on a 64-bit machine, but what is "natural" is a subjective question, and even 64-bit machines can usually perform 32-bit operations naturally, so it is not necessarily wrong to make int 32 bits even on a 64-bit machine. Realistically, making int 64 bits would cause the following problems:
For the reasons mentioned above, int is 32 bits on most existing 64-bit machines. However, ILP64, which has a 64-bit int, also exists. For example, the old Cray supercomputer was ILP64.
Although sizeof looks like a function, it is grammatically a unary operator. In C, most operators are symbols, but grammatically there is no particular reason why an operator must be a symbol; sizeof is in fact the exception.
Let's briefly review how the sizeof operator behaves. sizeof is an operator that returns the size, in bytes, of the type of its argument expression. For example, in our ABI, sizeof(x) returns 4 if x is an int and 8 if x is a pointer. Any expression can be written as the argument of sizeof: for example, sizeof(x+3) returns 4 if the type of the expression x+3 as a whole is int, and 8 if it is a pointer.
Our compiler does not have arrays yet, but if x is an array, sizeof(x) returns the total size of x in bytes. For example, if x is defined as int x[10], sizeof(x) returns 40. If x is defined as int x[5][10], then sizeof(x) is 200, sizeof(x[0]) is 40, and sizeof(x[0][0]) is 4.
The argument of sizeof is written only so that its type is known; it is not an expression that is actually executed. For example, if you write sizeof(x[3]), no access to x[3] actually occurs. The type of the expression x[3] is known at compile time, so the expression sizeof(x[3]) is replaced by the size of that type at compile time. Consequently, the concrete expression given to sizeof, such as x[3], no longer exists at runtime.
Examples of the sizeof operator in action are shown below.
int x;
int *y;
sizeof(x); // 4
sizeof(y); // 8
sizeof(x + 3); // 4
sizeof(y + 3); // 8
sizeof(*y); // 4
// The expression passed to sizeof can be anything
sizeof(1); // 4
// The result of sizeof is currently of type int, so this is the same as sizeof(int)
sizeof(sizeof(1)); // 4
Now let's implement the sizeof operator. To implement it, you will need to modify both the tokenizer and the parser.
First, modify the tokenizer to recognize the keyword sizeof as a token of type TK_SIZEOF.
Next, modify the parser to replace sizeof with an int constant. The grammar with the sizeof operator added is shown below. In this grammar, sizeof is a unary operator, defined with the same precedence as unary plus and unary minus. This is the same as C's actual syntax.
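The grammar figure was omitted here; a plausible reconstruction, in the EBNF style used elsewhere in this book, with sizeof at the same precedence as unary plus and minus:

```
unary = "+"? primary
      | "-"? primary
      | "*" unary
      | "&" unary
      | "sizeof" unary
```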
In this grammar, both sizeof(x) and sizeof x are allowed, and the same is true in real C.
In the parser, when the sizeof operator appears, parse the expression that is its argument as usual, then replace the node with the number 4 if the type associated with the resulting syntax tree is int, or 8 if it is a pointer. Since the parser replaces it with a constant, no changes to the code generator are needed.
In this step we will implement arrays. Up to this point we have only handled data small enough to fit in a register; this is the first time we deal with data larger than that.
However, C's syntax is restrictive when it comes to arrays. You cannot pass an array as a function argument or return an array as a function return value. If you write code with that intent, the array itself is not passed by value; instead, a pointer to the array is automatically created and passed. Directly assigning one array to another is also not supported (you must use memcpy).
Therefore, there is no need to pass data that does not fit in registers between functions. It is sufficient to be able to allocate a memory area larger than one word on the stack.
Please make it possible to read variable definitions such as the following.
The type of a above is an array: the array has length 10, and its element type is int. Like pointer types, array types can be arbitrarily complex, so, just as in step 7, use ptr_to to point to the type of the array element. The struct representing the type should look like this:
Here, array_size is a field that is meaningful only for array types; it holds the number of elements in the array.
Once you've done this, you should be able to allocate space for arrays on the stack easily. To find the size of an array in bytes, simply multiply the size of an array element in bytes by the number of elements. Until now, every variable was allocated one word in the stack area; change this so that an array gets the size it requires.
Arrays and pointers are often used in combination, and C's syntax lets them be used almost interchangeably without much distinction. But this backfires: it seems to make the relationship between arrays and pointers hard for programmers to understand. So let us explain that relationship here.
First, in C, arrays and pointers are completely different types.
A pointer (on x86-64) is an 8-byte value type. Just as operators such as + and - are defined for int, + and - are also defined (in a slightly different form) for pointers. In addition, pointers have the unary * operator, which lets you refer to what the pointer points to. Apart from unary *, there is nothing special about pointers: a pointer is an ordinary type, just like int.
An array, on the other hand, is a type that can occupy any number of bytes. Unlike pointers, very few operators are defined for arrays. The only ones are the sizeof operator, which returns the size of the array, and the & operator, which returns a pointer to the first element of the array. No other operators can be applied to arrays.
So why does an expression like a[3] compile? In C, a[3] is defined to be equivalent to *(a+3). But wasn't the + operator undefined for arrays?
This is where the rule that implicitly converts an array to a pointer comes into play: except when used as the operand of sizeof or unary &, an array is implicitly converted to a pointer to its first element. Therefore *(a+3) is an expression that dereferences the pointer to the first element of array a, advanced by 3, which has the same meaning as accessing the element at index 3 of the array.
In C, there is no operator for array access as such. C's [] is just shorthand for accessing array elements via pointers.
Likewise, the fact that an array passed as a function argument becomes a pointer to its first element, and the fact that you can write code as if assigning an array directly to a pointer, both follow from the same rule described above.
Therefore, when implementing most operators, the compiler must convert arrays to pointers. This should not be too difficult to implement: except when implementing sizeof or unary &, after parsing an operand you can assume that if it has type "array of T" it becomes "pointer to T." For array-typed values, the code generator should emit code that pushes the address of the value onto the stack.
Once this is complete, you should be able to run code like the following.
Column: Linguistic lawyer
A person with a thorough understanding of a formal language specification is sometimes called a "language lawyer," likening the language specification to law. The Jargon File, a dictionary of programmer slang, describes a language lawyer as follows.
The word language lawyer can also be used as a verb: language lawyering.
Skilled language lawyers are often respected by other programmers. When I worked on the C++ compiler team at Google, there was a person on the team who could be called the ultimate language lawyer, and whenever there was something about C++ I didn't understand, I would often just ask him (there are many things in the C++ specification that even people who build C++ compilers don't understand). He was in fact the person who implemented major parts of Clang, a leading C++ compiler, and a lead author of the C++ specification, making him one of the most knowledgeable people in the world about C++. I remember being struck by just how large and intricately detailed the C++ language specification is.
This book intentionally avoids delving into the details of the C language specification until the compiler is more mature. There is a reason for that. When implementing a programming language that has a specification, you do need to become something of a language lawyer, but paying too much attention to details from the start is not a desirable way to develop. Just as, when drawing a picture, you first complete a rough sketch of the whole rather than finishing one part in detail, when implementing a programming language you should keep a balance and avoid becoming too much of a language lawyer at first.
In C, x[y] is defined to be equivalent to *(x+y). Therefore, implementing subscripts is relatively simple: just have your parser read x[y] as *(x+y). For example, a[3] is *(a+3).
Under this grammar, 3[a] expands to *(3+a), so if a[3] works, 3[a] should also work; and indeed, in C, expressions like 3[a] are legal. Try it.
I would like to be able to write literal strings in programs soon. In C, a literal string is an array of char. Since we have already implemented arrays, that part is covered, but the difference is that string literals are not values on the stack: they reside at a fixed location in memory. Therefore, to implement string literals, we will first add global variables.
Until now, only function definitions were allowed at the top level. Let's change that syntax so that we can write global variables at the top level.
Variable definitions are somewhat tricky to parse because they look similar to function definitions. For example, compare the following four definitions.
The top two foos are variable definitions and the bottom two are function definitions, but you cannot tell them apart until you reach the identifier that will be the function or variable name and read the token after it. Therefore, first call the "read the first half of the type name" function, read the identifier that should come after it, and then look ahead one token. If the token read ahead is "(", you were reading a function definition; otherwise, you were reading a variable definition.
Put the names of parsed global variables in a map so you can look them up by name. Only when a name cannot be resolved as a local variable should you attempt to resolve it as a global variable. This naturally lets a local variable shadow a global variable of the same name.
The parser converts local variable references and global variable references to separate nodes in the abstract syntax tree. Since the names can be resolved at the parsing stage, the types can also be separated at that stage.
Until now, all variables were assumed to be on the stack, so variables were read and written relative to RBP (the base pointer). A global variable is not a value on the stack but lives at a fixed location in memory, so compile accesses to go directly to that address. Refer to the actual output of gcc for details.
Once you implement this, you may be surprised to find how different local and global variables really are. The reason they can be written without visual distinction is that the C language abstracts them well; internally, local and global variables are implemented quite differently.
An array was a type that could be larger than one word; a character is a type smaller than one word. By this step, you have probably already written a function that takes an object representing a type and returns that type's size in bytes. First add a character type, then modify that function to return 1 for it.
There is no need to implement literal characters (characters enclosed in single quotes) in this step. Avoid the urge to implement everything all at once and keep the changes as small as possible.
So in this step, a character really is just a small integer type. The instruction movsx ecx, BYTE PTR [rax] reads one byte from the address in RAX, sign-extends it, and puts it in ECX. If sign extension is not required, use movzx instead: movzx ecx, BYTE PTR [rax]. When writing, use an 8-bit register as the source register, as in mov [rax], cl.
Please refer to the actual compiler output .
Once you have implemented this step, you should be able to run code like this:
Column: Difference between 8-bit registers and 32-bit registers
Why do we need to use movsx or movzx when reading 1-byte values? When reading a 4-byte value, it was sufficient to simply load it into the lower 32-bit alias register such as EAX with a normal mov, so when reading a char, it seems sufficient to just load it into AL with a normal mov. But that doesn't work properly. The answer to this mystery lies in the x86-64 specifications.
On x86-64, loading into the lower 32-bit alias of a register resets the upper 32 bits to 0. However, loading into the lower 8-bit alias leaves the upper 56 bits at their previous values. This is an inconsistent specification, but x86-64 is an instruction set with a long history, and such inconsistencies exist in many places.
x86-64 evolved from a 16-bit processor, the 8086, to 32 bits and then 64 bits, so first there was AL, then EAX, and then RAX. In other words, there was an existing rule that loading into AL does not reset the upper 24 bits of EAX (they keep their values), and when the architecture was extended to 64 bits, a new rule was established that loading into EAX resets the upper 32 bits of RAX. There is a good reason why a specification that compromises consistency like this was chosen.
Modern processors examine dependencies between instructions and increasingly execute unrelated instructions (instructions that do not use the result of a previous instruction) in parallel. Suppose the instruction set did not reset the upper 32 bits: the upper 32 bits would simply exist as garbage that we ignore, while we keep using only the lower 32 bits. But this creates a false dependency between the instruction that produced the ignored upper 32 bits and subsequent instructions that use the same register. By zero-extending and resetting the upper 32 bits, the previous value is completely overwritten, which breaks the dependency. For this reason, when x86 was extended to 64 bits, a specification that was easier to make fast was chosen, even though it was known to compromise consistency.
This step parses the string enclosed in double quotes so that it can be compiled. Now that we have the necessary parts: arrays, global variables, and character types, I think it will be relatively easy to implement.
First, modify the tokenizer so that when it finds a double quote, it reads up to the next double quote to create a string token. There is no need to implement backslash escapes and the like in this step. It's important to go step by step, so resist implementing things even when they seem easy.
Assembly code representing string literal data cannot be emitted in the middle of the machine code being generated for execution; in the output assembly, global data and code must not be mixed. In other words, we want to output all string literals that appear in the code first, but walking the syntax tree just for that is a pain. An easy approach is to keep a vector of all string literals seen so far and simply append to it each time the parser encounters a string.
Please refer to the actual compiler output .
By now you should be able to output strings with printf. This is a good opportunity to use your own programming language to write something a little more sophisticated than test code. For example, how about writing a solver for the eight queens problem in your own language? It took humanity many decades to develop programming languages in which code at this level can be written easily. That it can be reimplemented in a matter of weeks is a great advance, both for humanity and for you.
(When calling a function that takes a variable number of arguments, the number of floating-point arguments is passed in AL. Our compiler does not have floating-point numbers yet, so always set AL to 0 before calling a function.)
Until now we have passed the C code directly as an argument string, but since inputs are gradually getting longer, it is time to change the compiler to take a file name as a command-line argument, like a normal C compiler. A function that opens a given file, reads its contents, and returns a string terminated with '\0' can be written concisely as follows:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Returns the contents of the given file
char *read_file(char *path) {
  // Open the file
  FILE *fp = fopen(path, "r");
  if (!fp)
    error("cannot open %s: %s", path, strerror(errno));

  // Find out the length of the file
  if (fseek(fp, 0, SEEK_END) == -1)
    error("%s: fseek: %s", path, strerror(errno));
  size_t size = ftell(fp);
  if (fseek(fp, 0, SEEK_SET) == -1)
    error("%s: fseek: %s", path, strerror(errno));

  // Read the file contents
  char *buf = calloc(1, size + 2);
  fread(buf, size, 1, fp);

  // Make sure the file always ends with "\n\0"
  if (size == 0 || buf[size - 1] != '\n')
    buf[size++] = '\n';
  buf[size] = '\0';
  fclose(fp);
  return buf;
}
For implementation convenience, data that always ends with a newline character is easier to handle than data that may end either with a newline or at EOF, so if the last byte of the file is not \n, we append a \n ourselves.
Strictly speaking, this function does not work for special files that cannot be accessed randomly. For example, if you pass /dev/stdin (the device file representing standard input) or a named pipe as the file name, you should see an error message like /dev/stdin: fseek: Illegal seek. In practice, however, this function should be fine. Modify your code to use this function to read the file's contents and treat them as the input.
Since input files usually contain multiple lines, you should also enhance the function to display error messages. When an error occurs, if you decide to display the input file name, the line number of the line with the error, and the contents of that line, the error message will be as follows.
foo.c:10: x = y + + 5;
                  ^ not an expression
A function that displays such an error message would look like this:
// Input file name
char *filename;

// Reports an error location and exits.
// Prints an error message in the following format:
//
// foo.c:10: x = y + + 5;
//               ^ not an expression
void error_at(char *loc, char *msg) {
  // Find the start and end of the line containing loc
  char *line = loc;
  while (user_input < line && line[-1] != '\n')
    line--;

  char *end = loc;
  while (*end != '\n')
    end++;

  // Find out which line number the found line is on
  int line_num = 1;
  for (char *p = user_input; p < line; p++)
    if (*p == '\n')
      line_num++;

  // Print the found line along with the file name and line number
  int indent = fprintf(stderr, "%s:%d: ", filename, line_num);
  fprintf(stderr, "%.*s\n", (int)(end - line), line);

  // Point at the error location with "^" and print the error message
  int pos = loc - line + indent;
  fprintf(stderr, "%*s", pos, ""); // print pos spaces
  fprintf(stderr, "^ %s\n", msg);
  exit(1);
}
Although this error-reporting routine is quite simple, it prints errors in a fairly professional-looking format.
Column: Error recovery
If the input code is grammatically incorrect, many compilers will try to skip over the error and continue parsing the rest of the code. The goal is to find as many errors as possible and not just one. The ability to recover from a parser error and continue parsing is called "error recovery."
Error recovery was a very important feature in old compilers. In the 1960s and 1970s, programmers time-shared large computers at computer centers; they had to bring in the code to be compiled and wait, sometimes overnight, before they could get the compiled results. In such an environment, one of the compiler's important jobs was to point out as many errors as possible in a single run. In older compiler textbooks, error recovery is one of the main topics in parsing.
Nowadays, development with compilers is more interactive, so error recovery is no longer such an important topic. The compiler we are developing prints only the first error message; in modern settings, that is often sufficient.
Our compiler has gradually evolved, and it is now possible to write fairly full-fledged code. What you want next is comments. This chapter implements comments.
There are two kinds of comments in C. One is called a line comment: everything from // to the end of the line is a comment. The other is called a block comment: /* is the start marker and */ is the end marker, and all characters inside a block comment are skipped until the two-character sequence */.
Grammatically, a comment is treated the same as a single space character, so it is natural for the tokenizer to skip comments the same way it skips whitespace. The code to skip comments is shown below.
void tokenize() {
  char *p = user_input;

  while (*p) {
    // Skip whitespace characters
    if (isspace(*p)) {
      p++;
      continue;
    }

    // Skip line comments
    if (strncmp(p, "//", 2) == 0) {
      p += 2;
      while (*p != '\n')
        p++;
      continue;
    }

    // Skip block comments
    if (strncmp(p, "/*", 2) == 0) {
      char *q = strstr(p + 2, "*/");
      if (!q)
        error_at(p, "unclosed block comment");
      p = q + 2;
      continue;
    }
    ...
Here we used strstr, a function in the C standard library, to find the end of the block comment. strstr searches for one string within another and returns a pointer to the first occurrence if found, or NULL if not.
Column: Line comment
Only block comments existed in the original C; line comments were officially added to the specification in 1999, almost 30 years after C was created. This was supposed to be a backward-compatible change, but in some subtle cases, code that used to work could end up with a different meaning.
Specifically, the code below reads as a/b if only block comments are supported, and as a if line comments are supported.
Column: Block comments and nesting
Block comments cannot be nested: /* has no special meaning inside a comment. So if you try to comment out code that already contains a block comment, as in
/* /* ... */ */
the first */ ends the comment, and the second */ causes a syntax error.
If you want to comment out a range of lines that may contain block comments, you can use the C preprocessor:
#if 0
...
#endif
In this step we will rewrite the tests to make make test faster. By the time you reach this step, you probably already have over 100 tests written as a shell script. The shell-script tests launch several processes per test: for each test, the script starts our compiler, the assembler, the linker, and the test binary itself.
Starting a process is not that fast, even for a small program, so doing this hundreds of times takes a considerable amount of time in total. Your test script is probably taking a few seconds to run.
The reason the tests were written as a shell script in the first place was that there was no other way to test properly: when the language was at calculator level, there was no if or ==, so it was impossible to verify within the language itself whether a computation was correct. But now it can be verified: we can compare results, and if they are wrong, print an error message (as a string) and exit.
Therefore, in this step, rewrite the tests that were written in shell scripts into C files.
With these steps, our compiler now supports all the major programming elements: functions, global variables, and local variables. Also, by learning about separate compilation and linking, you now understand how to compile a program in pieces and then combine them into a single file at the end.
This chapter describes how the OS executes executable files. By reading this chapter, you will be able to understand what kind of data is contained in the executable file and what happens before the main function is called.
This chapter also explains variable initialization expressions, how code like the one below is compiled, and adds support for initialization expressions to our compiler.
It may be surprising, but in order to support initialization expressions, knowledge of what happens before the program reaches `main` is essential.
This chapter describes a simple executable format in which all code and data are contained in one executable file. Such an executable file is called "statically linked." In contrast to static linking, an executable format called "dynamic linking" is also widely used, in which fragments of one program are split across multiple files that are combined in memory when the program is executed. I will explain dynamic linking in a separate chapter. First, let's understand the basic model: static linking.
An executable file consists of a file header and one or more regions called segments . An executable file usually has at least two segments, each containing separate executable code and data. A segment that contains executable code is called a `` text segment ,'' and a segment that contains other data is called a `` data segment .'' The actual executable file also contains other segments, but they are not needed to understand the mechanism, so they are omitted here.
Regarding terminology, note that although "text" here is the same word as in "text file," it has a different meaning: traditionally, at the lower layers, data representing machine code is called "text." Machine code is just a string of bytes, and text is also a kind of data, but when we speak of "text and data," "data" usually means "data other than text." In this chapter, when we say data, we mean data other than text.
The object file that is input to the linker contains text and data separately. The linker concatenates text read from multiple object files and places it into one text segment, and similarly concatenates data read from multiple object files and places it into one data segment.
The file header of an executable file contains the memory address where each segment should be placed during execution. When you run an executable file, a program called the OS's " program loader " or simply " loader " copies text and data from the executable file into memory according to that information.
The following diagram shows an executable file and how it is loaded into memory by the loader.
In the executable file shown in this figure, we assumed that the file header contained information to load the text segment at 0x41000 and the data segment at 0x50000.
The file header also contains information about the address at which execution should start. For example, if information is written that execution should start from 0x41040, the loader loads the executable file into memory as shown in the above figure, sets the stack pointer to 0x7fff_ffff_ffff_ffff, and then jumps to 0x41040. This will start execution of the user program.
The content of the text segment is obviously machine code, but what's in the data segment? The answer is that the data segment contains global variables and literal strings.
Local variables are not directly contained in either the text or data segment. Local variables are created dynamically by the program in the stack area, so they do not exist immediately after the executable file is loaded into memory.
In the C execution model, a program can start executing the main function by simply loading the executable file into memory almost unchanged. Therefore, global variables must be set to appropriate initial values simply by copying them from the executable file's data segment to memory.
Due to this restriction, in C, an initialization expression that uses a function call, for example, cannot be used for a global variable.

If you had a global variable that required dynamic initialization like that, someone would need to execute the initializing expression before the `main` function runs. However, C has no initialization mechanism that is activated before `main`, so such initialization is not possible.
In other words, the value of a global variable must be completed at link time and be stored as a byte sequence in the executable file. Such values are limited to the following expressions:
It is obvious that constant expressions such as literal numbers or strings can be stored as fixed values in the data segment.
The addresses of global variables and functions are not normally determined at compile time; they are usually determined when the linker completes the executable file. Nevertheless, it is legal to define a pointer-type global variable by initializing it with the address of another global variable, as in `int *x = &y;`. The linker decides the layout of the program's segments itself, so it knows the addresses at which functions and global variables will be loaded, and it can fill in the contents of `x` at link time.

Also, adding a constant to the address of a label is supported by linker functionality, so a definition like `int *x = &y + 3;` is also legal.
Expressions other than the above patterns cannot be written in a global initialization expression. For example, you cannot use the value (rather than the address) of a global variable, nor an expression that computes the difference between the addresses of two global variables, such as `ptrdiff_t x = &y - &z;`. Only the limited patterns listed above are allowed in global variable initialization expressions.
An example of an expression that can be written as a global variable initialization expression in C is as follows.
The assembly corresponding to each expression looks like this:
a:
.long 3
b:
.byte 0x66 // 'f'
.byte 0x6f // 'o'
.byte 0x6f // 'o'
.byte 0x62 // 'b'
.byte 0x61 // 'a'
.byte 0x72 // 'r'
.byte 0 // '\0'
c:
.quad a
d:
.quad b + 3
The consecutive `.byte` directives above can also be written using the `.ascii` notation, as `.ascii "foobar\0"`.
Column: Dynamic initialization of global variables
In C, the contents of global variables must be statically determined, but in C++, global variables can be initialized using arbitrary expressions. In other words, in C++, global variable initialization expressions are executed before the main function is called. It works by the following mechanism.
1. The compiler compiles each dynamic initialization expression into a function and emits a pointer to that function in a special section called `.init_array`.
2. The linker concatenates the `.init_array` sections of its input files and outputs them as an `.init_array` segment (so the `.init_array` segment contains an array of function pointers).
3. Before transferring control to `main`, the program's startup code calls the function pointers contained in `.init_array` one by one.

In this way, dynamic initialization of global variables is achieved through the cooperation of the compiler, linker, and program loader.
C could support dynamic initialization of global variables by using the same mechanisms as C++, but such functionality has been intentionally left out of the C language specification.
C's design choice is more restrictive for those who write programs, but it means the full language specification can be supported even in environments with a poor loader or no loader at all (such as code executed directly from ROM when the computer starts). So this is a question of trade-offs, not of one approach being strictly better than the other.
At first glance, an initialization expression appears to be just an assignment expression, but in reality, initialization expressions and assignment expressions are quite different syntactically, and there are some special ways of writing that are only allowed for initialization expressions. Let's make sure we understand this special way of writing.
First, an initialization expression can initialize an array. For example, the following declaration initializes `x[0]`, `x[1]`, and `x[2]` to 0, 1, and 2, respectively.

If an initialization expression is given, the length of the array can be omitted, because the length can be determined from the number of elements in the initializer. For example, the declaration above and the declaration below have the same meaning.
If the array length is explicitly given and only part of the initialization expression is given, the remaining elements must be initialized to zero. Therefore, the following two expressions have the same meaning.
In addition, as special syntax for initialization expressions for `char` arrays, the following way of writing, which uses a string literal as the initializer, is allowed.
The above expression has the same meaning as the following expression.
Initialization expressions for global variables must be computed at compile time. The result of the computation is either a simple sequence of bytes, or a pointer to a function or global variable; such a pointer may additionally carry one integer representing an offset.
A global variable that is not given any initialization expression must be initialized with all bits set to 0. This is required by the C specification.
If the initialization expression does not result in the above calculation, treat it as a compilation error.
Local variable initialization expressions look the same as global variable initialization expressions, but their meaning is very different. A local variable initialization expression is an expression that is executed on the spot. Therefore, its contents need not be determined at compile time.
Basically, a statement like `int x = 5;` compiles the same way as if you had written it as two separate statements: `int x; x = 5;`.

A statement like `int x[] = {1, 2, foo()};` compiles the same as:
The contents of local variables for which no initialization expression is given are undefined. Therefore such variables do not need to be initialized.
Column: word size
On x86-64, the term "word" can mean both 16-bit and 64-bit data. This is a confusing situation, but there are historical reasons for it.

The term "word" originally refers to the size of the largest integer or address that a computer can handle naturally. This is why 64 bits is called one word on x86-64, a 64-bit processor.

On the other hand, the 16-bit meaning comes from the 8086, a 16-bit processor. When Intel engineers extended the 8086 to 32 bits to create the 386 processor, they decided to call 32 bits a "double word" or dword so as not to change the size of a word. Similarly, when the 386 was extended to 64 bits to create x86-64, 64 bits came to be called a "quad word" or qword. This consideration for compatibility is what led to the two different meanings of "word."
So far in this book, we have only used a feature called static linking. Static linking is a straightforward execution model, so by focusing on that model, I was able to explain things like assembly code and the memory images of executable files in an easy-to-understand manner. However, static linking is not actually widely used when creating executable files; in reality, a feature called dynamic linking is used instead.
This chapter explains static linking and dynamic linking.
By default, compilers and linkers attempt to output executable files that use dynamic linking. Readers may have seen the following error when forgetting to pass the `-static` option to `cc` (if you haven't, try removing `-static` from the Makefile and then running `make`):
$ cc -o tmp tmp.s
/usr/bin/ld: /tmp/ccaRuuub.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
The linker attempts to link dynamically by default, but for dynamic linking to work, the compiler must output assembly code that supports it. 9cc currently does not output such code, so if you forget to add `-static`, an error like the one above is displayed. After reading this chapter, you should be able to understand what the above error means and what you need to do to resolve it.
A statically linked executable is a self-contained executable that does not require any other files to run. For example, the `printf` function is not written by the user but is included in the standard library `libc`; when you create a statically linked executable, the code for `printf` is copied from `libc` into the executable file. `libc` itself is not needed when running a statically linked program, because the necessary code and data inside `libc` have already been copied into the executable file.
Let's take a look at how the simple program below, `hello.c`, becomes a statically linked executable.
hello.c
To compile and link this file into an executable named `hello`, enter the following commands:
$ cc -c hello.c
$ cc -o hello hello.o
In the above commands, the first line compiles `hello.c` and creates the object file `hello.o`, and the second line links it into an executable file. It is possible to combine these two commands into one, `cc -o hello hello.c`, but when you start the compiler that way, the same things as the two commands above are done internally.
However, as we have seen so far in this book, the header file `stdio.h` does not contain the code for the body of `printf`. Therefore, when creating `hello.o`, the compiler knows the existence of the functions declared in `stdio.h` and their types, but has no knowledge of the actual code of `printf`. So it is impossible for `hello.o` to contain that code; in reality, `hello.o` contains only the definition of `main`. It is the role of the linker to complete the executable file by combining `hello.o` with an object file containing `printf`.
When the linker is started via `cc` in the second line, not only the files passed on the command line (`hello.o`) are passed to the linker, but also `/usr/lib/x86_64-linux-gnu/libc.a`. The `printf` function is contained in this `libc.a`. A `.a` file is an archive file, similar to `.tar` or `.zip`. Let's take a peek inside.
$ ar t /usr/lib/x86_64-linux-gnu/libc.a
...
printf_size.o
fprintf.o
printf.o
snprintf.o
sprintf.o
...
As you can see, the archive file contains many object files; among them, `printf.o` contains the code for `printf`.
The execution model for statically linked executables is simple. At runtime, only the executable file exists in memory, so each segment of the executable can simply be loaded at the default address determined at link time; that load cannot fail, because there is nothing in memory before the executable is loaded. Therefore, with static linking, the addresses of all global variables and functions can be determined at link time.
Static linking has the following advantages:

Static linking has the following disadvantages:
Dynamically linked executables, on the other hand, require other files to run: `.so` files (on Unix) or `.dll` files (on Windows). These files contain the code of functions such as `printf` and global variables such as `errno`. `.so` and `.dll` files are called dynamic libraries, or simply libraries, or DSOs (dynamic shared objects).
C's type syntax is notoriously, unnecessarily complex. Dennis Ritchie, the developer of C and co-author of *The C Programming Language* (commonly known as "K&R"), wrote that "C's declaration syntax, particularly when pointers to functions are involved, is often criticized."7
In this way, C's type syntax is a bad design that even its author implicitly admitted, but even so, the syntax is not that difficult once you understand the rules.
This chapter explains how to read C's type syntax. By following the step-by-step approach here, by the end of this chapter you should be able to decipher complex types such as `void (*x)(int)` and even `void (*signal(int, void (*)(int)))(int)`.
The types that can be expressed in C are relatively simple. In order to separate the complexity of a type's syntax from the complexity of the type itself, let's put syntax aside for a moment and just think about types.
Complex types such as pointers and arrays can be represented by diagrams that connect simple types with arrows. For example, the following diagram shows the type "pointer to a pointer to an int."

In Japanese, the type is read from the end point of the arrows back to the start point. In English, it is read the opposite way, following the direction of the arrows: "a pointer to a pointer to an int."
Let's say the variable `x` has the type shown in the diagram above. The simplest answer to the question "What is the type of `x`?" is "It's a pointer," because the type the first arrow points to is a pointer type. Note that `x` is, first of all, a pointer, not a type such as `int`. The answer to the next question, "What type is that pointer pointing to?", is again "a pointer," because following one more arrow leads to another pointer type. Finally, the answer to "What type is that pointer pointing to?" is "int."
The following figure shows an "array of pointers to int". The length of the array is 20. In the actual compiler, the length of an array is also expressed as a member of the type representing the array, as shown in the figure below.
If a variable `x` has the type shown in the diagram above, then `x` is an array of length 20, the elements of that array are pointers, and those pointers point to `int`.
Function types can also be represented graphically. The following figure shows the type of function that takes two arguments, an int and a pointer to an int, and returns a pointer to void.
Finally, let's look at a more complex example. The following figure shows a type called pointer to function that takes an int as an argument and returns a pointer to a function that returns an int. It's complicated when you put it into words, but when you put it in a diagram, you can see that it's just long and the structure is simple.
If a variable `x` has the type shown in the diagram above, then `x` is a pointer; the pointer points to a function; that function takes an `int` argument and returns a pointer; that pointer in turn points to a function; and the return type of that function is `int`.
Internally, the compiler represents types using the same method as in the diagrams above. In other words, complex types involving pointers, arrays, and functions are represented inside the compiler as a data structure of simple-type nodes connected by pointers, in the same order as shown in the diagrams. It is no exaggeration to say that this diagram is the true form of the type.
Although it is easier to understand the meaning of a type when it is represented in a diagram as shown above, it is tedious to draw a diagram every time to understand the type. In this section, we will consider a notation that can be written more compactly without compromising the clarity of the diagram.
Unless a function type is included, all boxes will be arranged in a daisy chain in the diagram without any branching. Therefore, if the type has only pointers or arrays, you should be able to express the diagram in text by writing the name of the type in the diagram from left to right.
Let's think about a concrete notation. We will represent the box for a pointer with the symbol `*`. We will write `[n]` for a box representing an array of length n, and for boxes representing built-in types such as int we simply write the type's name. Then the figure below can be represented by the string `* * int`.

Pointer, pointer, and int appear in order from the starting point of the arrows, so the notation is `* * int`. Conversely, given the notation `* * int`, we can draw the diagram above. In other words, this text representation is a notation that compactly writes down the same information as the diagram.
The figure below can be expressed by the string `[20] * int`.
For functions, let's write `func(argument types, ...) return-type`. For example, the type shown in the figure below is written as `func(int, * int) * void`. Readers are encouraged to check that this notation matches the diagram.
Finally, the type that the diagram below represents is `* func(int) * func() int`.
The notation described so far is probably the most straightforward and simplest textual representation of a type. In fact, the type syntax of the Go programming language is essentially the same as the notation described here. Go is a language whose development involves some of the people who created C, and its type syntax is a subtle improvement that takes advantage of the lessons learned from C.8
In this section, we will learn how to decipher C types by combining the type notation we learned above with the C type notation.
A C type can be decomposed into four parts, reading from the beginning: the base type, asterisks representing pointers, the identifier (or a nested type in parentheses), and parentheses representing functions or arrays.
For example, in `int x`, the base type is `int`, there are no asterisks representing pointers, the identifier is `x`, and there are no parentheses for functions or arrays. In `unsigned int *x()`, the base type is `unsigned int`, the asterisk representing a pointer is `*`, the identifier is `x`, and the parentheses `()` represent a function. In `void **(*x)()`, the base type is `void`, the asterisks representing pointers are `**`, the nested type is `*x`, and the parentheses `()` represent a function.
If what follows the pointer is just an identifier, the type is relatively easy to decipher.
When there are no parentheses for functions or arrays, the type notation and its meaning are as follows: The meaning will be written using the same notation as Go explained above.
| C type notation | meaning |
|---|---|
| `int x` | `int` |
| `int *x` | `* int` |
| `int **x` | `* * int` |
When there are function parentheses, the base type and the asterisk representing the pointer represent the return type. An example is shown below.
| C type notation | meaning |
|---|---|
| `int x()` | `func() int` |
| `int *x()` | `func() * int` |
| `int **x(int)` | `func(int) * * int` |
You can see that the function's type is just `func(...)` followed by the type obtained when the declaration is parsed without the function parentheses, that is, as `int x`, `int *x`, or `int **x`.
Similarly, when you see array parentheses, it means "array of." An example is shown below.
| C type notation | meaning |
|---|---|
| `int x[5]` | `[5] int` |
| `int *x[5]` | `[5] * int` |
| `int **x[4][5]` | `[4] [5] * * int` |
For example, for the type `int *x[5]`, `x` is an array of length 5, the elements of that array are pointers, and those pointers point to `int`. As with functions, you can see that the array's type is just `[...]` followed by the type of the declaration parsed without the array brackets.
When a pointer is followed by parentheses rather than an identifier, the parentheses represent a nested type. If you have nested types, you can parse the types inside and outside the parentheses separately and combine them later to get the overall type.
As an example, let's parse the declaration `int (*x)()`.

If you interpret the type by treating the entire first parenthesized part as a single identifier (let's call it `y`), the declaration looks like `int y()`, so the type outside the parentheses is `func() int`. On the other hand, the part inside the parentheses, `*x`, represents the type `* ___`. No base type is written inside the parentheses, so it is not a complete type; here we use the notation `___` to stand for the missing base part.
By parsing the inside and outside of the parentheses of `int (*x)()` separately, we obtained two types: `func() int` and `* ___`. The overall type is created by fitting the outer type into the missing part `___` of the inner type. In this case, the overall type is `* func() int`. In other words, in the declaration `int (*x)()`, `x` is a pointer; the type the pointer points to is a function; and the return type of that function is `int`.
Let's take another example: `void (*x[20])(int)`. If we treat the first parenthesized part as an identifier, we get the same pattern as `void y(int)`, so the part outside the parentheses represents the type `func(int) void`. The part inside the parentheses, `*x[20]`, represents `[20] * ___`. Combining the two gives `[20] * func(int) void`. That is, `x` is an array of length 20, the type of its elements is pointer, the pointers point to functions that take an `int`, and the type those functions return is `void`.
You can read any complex type using the method described here. As an extreme example, let's look at the type of the Unix `signal` function. `signal` is famous for having a type you cannot understand at a glance. The declaration of `signal` is shown below.
Even complex types like this can be deciphered by breaking them down. First, consider the inside and the outside of the first parentheses separately. If we treat the first parenthesized part as an identifier, we get the same pattern as `void y(int)`, so the outside of the parentheses represents the type `func(int) void`.
The type in the parentheses is `*signal(int, void (*)(int))`. This has no base type; an asterisk `*` representing a pointer; an identifier `signal`; a pair of parentheses representing a function; and two arguments. So the rough shape of the type is `func(arg1, arg2) * ___`. Now consider the types of the arguments:

- The first argument is `int`. Its type is simply `int`.
- The second argument is `void (*)(int)`. This consists of a base type `void`, a nested part `*` enclosed in parentheses, and a pair of parentheses representing a function. Treating the first parenthesized part as an identifier, the outer type is `func(int) void`. The part inside the parentheses is an asterisk representing a pointer, so its type is `* ___`. Combining the two, the type of the second argument is `* func(int) void`.

Putting these argument types into the argument part of `func`, we get the type `func(int, * func(int) void) * ___`. Finally, combining the type inside the parentheses with the type outside, we get the final result: `func(int, * func(int) void) * func(int) void`.
That is, `signal` is a function that takes two arguments:

1. an `int`
2. a pointer to a function that takes an `int` and returns `void`

The return type of `signal` is a pointer, which points to a function that takes an `int` and returns `void`.
Column: Intent of C type syntax
C's type syntax was designed with the idea that types would be easier to understand if declarations were written the same way the variables are used. Under this design policy, the declaration `int *x[20]`, for example, means that when you write the expression `*x[20]`, its type is `int`. Reading a declaration amounts to solving the problem: what type must `x` have for `int foo = *x[20]` to type-check?

If we think of the declaration `int *(*x[20])()` as the problem of determining the type of `x` so that `int foo = *(*x[20])()` has no type error, then first of all `x` must be an array, and its elements must be safe to dereference with `*`. Since the dereferenced value is called as a function, the pointer's target must be a function; and since the return value is dereferenced with `*` again, the function must return a pointer.

C's type syntax may seem nonsensical, but there is a consistent design principle behind it. Even so, I must say that in the end it turned out to be a bad design.
| C type notation | meaning |
|---|---|
| `int x` | `int` |
| `int *x` | `* int` |
| `int x[]` | `[] int` |
| `int x()` | `func() int` |
| `int **x` | `* * int` |
| `int (*x)[]` | `* [] int` |
| `int (*x)()` | `* func() int` |
| `int *x[]` | `[] * int` |
| `int x[][]` | `[] [] int` |
| `int *x()` | `func() * int` |
| `int ***x` | `* * * int` |
| `int (**x)[]` | `* * [] int` |
| `int (**x)()` | `* * func() int` |
| `int *(*x)[]` | `* [] * int` |
| `int (*x)[][]` | `* [] [] int` |
| `int *(*x)()` | `* func() * int` |
| `int **x[]` | `[] * * int` |
| `int (*x[])[]` | `[] * [] int` |
| `int (*x[])()` | `[] * func() int` |
| `int *x[][]` | `[] [] * int` |
| `int x[][][]` | `[] [] [] int` |
| `int **x()` | `func() * * int` |
| `int (*x())[]` | `func() * [] int` |
| `int (*x())()` | `func() * func() int` |
int **x() |
func() * * int |
int (*x())[] |
func() * [] int |
int (*x())() |
func() * func() int |
The main text of this book was written in Markdown format. I used Pandoc to convert Markdown to HTML , Graphviz to create the syntax tree diagram , and draw.io to create the other diagrams .
This chapter summarizes the x86-64 instruction set features used by the compiler created in this book. In this chapter, we have used the following abbreviations for brevity.
- `src`, `dst`: two arbitrary registers of the same size
- `r8`, `r16`, `r32`, `r64`: 8-bit, 16-bit, 32-bit, and 64-bit registers, respectively
- `imm`: immediate value
- `reg1:reg2`: notation for a large value that cannot be stored in one register (such as a 128-bit value), with `reg1` holding the upper bits and `reg2` the lower bits
The table below lists the 64-bit integer registers and their alias names.
| 64 | 32 | 16 | 8 |
|---|---|---|---|
| RAX | EAX | AX | AL |
| RDI | EDI | DI | DIL |
| RSI | ESI | SI | SIL |
| RDX | EDX | DX | DL |
| RCX | ECX | CX | CL |
| RBP | EBP | BP | BPL |
| RSP | ESP | SP | SPL |
| RBX | EBX | BX | BL |
| R8 | R8D | R8W | R8B |
| R9 | R9D | R9W | R9B |
| R10 | R10D | R10W | R10B |
| R11 | R11D | R11W | R11B |
| R12 | R12D | R12W | R12B |
| R13 | R13D | R13W | R13B |
| R14 | R14D | R14W | R14B |
| R15 | R15D | R15W | R15B |
The usage in ABI is as follows. Registers that do not need to be restored to their original values when returning from a function are marked with a ✔.
| register | usage | |
|---|---|---|
| RAX | Return value / number of arguments | ✔ |
| RDI | 1st argument | ✔ |
| RSI | 2nd argument | ✔ |
| RDX | 3rd argument | ✔ |
| RCX | 4th argument | ✔ |
| RBP | Base pointer | |
| RSP | Stack pointer | |
| RBX | (nothing special) | |
| R8 | 5th argument | ✔ |
| R9 | 6th argument | ✔ |
| R10 | (nothing special) | ✔ |
| R11 | (nothing special) | ✔ |
| R12 | (nothing special) | |
| R13 | (nothing special) | |
| R14 | (nothing special) | |
| R15 | (nothing special) | |
When calling a function with the `call` instruction, RSP must be a multiple of 16 (16-byte aligned). Function calls that do not meet this condition are not ABI compliant and may cause some functions to crash.
| instruction | description |
|---|---|
| `mov dst, [r64]` | Load the value at the address pointed to by r64 into dst |
| `mov [r64], src` | Store the value of src at the address pointed to by r64 |
| `push r64/imm` | Decrease RSP by 8 and store r64/imm at the address RSP points to |
| `pop r64` | Load the value at the address RSP points to into r64 and increase RSP by 8 |
| `call label` | Push RIP onto the stack and jump to label |
| `call r64` | Push RIP onto the stack and jump to the address in r64 |
| `ret` | Pop an address off the stack and jump to it |
| `leave` | Equivalent to `mov rsp, rbp` followed by `pop rbp` |
| instruction | description |
|---|---|
| `cmp reg1, reg2/imm` `je label` | If reg1 == reg2/imm, jump to label |
| `cmp reg1, reg2/imm` `jne label` | If reg1 != reg2/imm, jump to label |
| `cmp reg1, reg2/imm` `jl label` | If reg1 < reg2/imm, jump to label (signed comparison) |
| `cmp reg1, reg2/imm` `jle label` | If reg1 <= reg2/imm, jump to label (signed comparison) |
| `cmp reg1, reg2/imm` `sete al` `movzb eax, al` | RAX = (reg1 == reg2/imm) ? 1 : 0 |
| `cmp reg1, reg2/imm` `setne al` `movzb eax, al` | RAX = (reg1 != reg2/imm) ? 1 : 0 |
| `cmp reg1, reg2/imm` `setl al` `movzb eax, al` | RAX = (reg1 < reg2/imm) ? 1 : 0 (signed comparison) |
| `cmp reg1, reg2/imm` `setle al` `movzb eax, al` | RAX = (reg1 <= reg2/imm) ? 1 : 0 (signed comparison) |
add dst, src/imm | dst = dst + src/imm |
sub dst, src/imm | dst = dst - src/imm |
mul src | RDX:RAX = RAX * src |
imul dst, src | dst = dst * src |
div r32 | EAX = EDX:EAX / r32 EDX = EDX:EAX % r32 |
div r64 | RAX = RDX:RAX / r64 RDX = RDX:RAX % r64 |
idiv r32/r64 | signed version of div |
cqo | Sign extend RAX to 128 bits and store in RDX:RAX |
and dst, src | dst = src & dst |
or dst, src | dst = src | dst |
xor dst, src | dst = src^dst |
neg dst | dst = -dst |
not dst | dst = ~dst |
shl dst, imm/CL | Shift dst to the left by the value of imm or the CL register (when specifying the shift amount with a register, only CL can be used) |
shr dst, imm/CL | Logically shift dst to the right by the value of imm or the CL register. The upper bits shifted in are cleared to zero |
sar dst, imm/CL | Arithmetically shift dst to the right by the value of imm or the CL register. The upper bits shifted in are copies of the sign bit of the original dst |
lea dst, [src] | Compute the address [src] and store the result of the address calculation itself in dst, without accessing memory |
movsb dst, r8 | Sign extend r8 and store to dst |
movzb dst, r8 | Store r8 to dst without sign extension |
movsw dst, r16 | Sign extend r16 and store to dst |
movzw dst, r16 | Store r16 to dst without sign extension |
Git was originally developed for Linux kernel version control. The Linux kernel is a huge project with thousands of developers, so Git has a rich set of features to meet the complex workflows involved. These functions are useful, but in individual development where you are the only committer, there is no need to fully utilize Git's functions. There's a lot to learn when it comes to mastering Git, but this book covers just a few things you'll need to remember. For those who are new to development using Git, we have prepared a cheat sheet below.
git add <filename>
Add a newly created file to the repository
git commit -a
Commit all changes made to the working tree at once (an editor will open, so enter a commit message)
git reset --hard
Undo all changes made to the working tree since the last commit
git log -p
View past commits
git push
Push the repository upstream, such as GitHub
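The cheat sheet above can be tried end to end in a throwaway repository. This is only a sketch: the directory /tmp/9cc-demo and the user name and email are arbitrary placeholders, and it assumes git is installed.

```shell
# Create a scratch repository and make two small commits.
rm -rf /tmp/9cc-demo && mkdir -p /tmp/9cc-demo && cd /tmp/9cc-demo
git init -q
git config user.name demo
git config user.email demo@example.com
echo 'int main() { return 0; }' > main.c
git add main.c                             # register the new file
git commit -q -m "Add main.c"              # first commit
echo '// TODO: add tests' >> main.c
git commit -q -a -m "Note missing tests"   # -a commits tracked changes
git log --oneline                          # shows the two commits
```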
For those who are new to version control systems, let's explain a little about Git and version control system concepts.
Git is a database management tool that stores file change history. This database is called a repository. When you clone a repository from GitHub or elsewhere, the repository is downloaded, and then the latest directory tree of the default branch is extracted from the repository under the current directory.
The directory tree extracted from the repository is called the "working tree." Although you use an editor to edit and compile source files in the working tree, the working tree itself is not part of the repository. The working tree is like a file extracted from a zip file, and no matter how many changes you make to it, the original repository remains intact.
Changes you make to your work tree are written back to the repository in units called "commits" at convenient points. This updates the database, and you can then make other changes. When using Git, development progresses by repeatedly changing files and committing them.
The commit message can be written in Japanese, so please write it properly. For example, a one-line message such as "Add * and / as operators, but do not handle operator precedence" is fine.
Divide commit units into as small a size as possible. For example, if you are changing code and want to do a little refactoring, commit the refactoring as a separate commit. It is undesirable to combine two or more separate features into one commit.
You don't need to use Git's advanced features. For example, you shouldn't need to use branches.
Be sure to commit code that adds a feature together with the code that tests it. Also, before committing, run the tests to make sure that existing features are not broken and that the new feature works properly. In other words, the goal is that the repository compiles and passes its tests no matter which commit you check out. However, if you accidentally commit something that does not pass the tests, you do not have to rewrite the Git commit log to fix it; simply put the fix in the next commit.
The Git documentation describes many tricky features, but if you build a mental model of how Git stores data, those features become much easier to understand. Therefore, here we will explain the internal structure of Git.
Git is a type of file system implemented as a user program. The structure of a Git database is very similar to a file system. However, the difference is that while normal file systems use file names to access files, Git uses the hash value of the file as the name.
This system where the name is determined based on the contents of the file is called a content-addressable system. In a content-addressable file system, files with the same name have the same content. Also, files with different contents cannot have the same name (same hash value). This is ensured by using a cryptographically secure hash function. These file systems have the property that there is no need to name each file separately, and once the name is determined, the contents of the file are also uniquely determined.
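The content-addressable scheme described above can be checked directly. A Git blob's name is the SHA-1 hash of a short header (the word "blob", the content size, and a NUL byte) followed by the file contents; the sketch below reproduces this, and the hash of "hello\n" matches what git hash-object reports.

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Compute the name Git gives a file (blob) with these contents."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# The same content always gets the same name; different contents never
# collide in practice, thanks to the cryptographic hash function.
print(git_blob_hash(b"hello\n"))
# → ce013625030ba8dba906f756967f9e9ca394464a (same as `git hash-object`)
```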
A commit is also a file within Git. In addition to the commit message, that file contains the hash values of the files belonging to that commit and the hash value of the previous commit.
In order to retrieve a file from the Git file system, you must know the hash value of the file you want.
It may seem like a chicken-and-egg problem that you cannot retrieve a commit file without knowing its hash value, but in reality the repository also contains, in addition to the file system, a list that maps names to commit hashes. For example, a repository contains information such as "the hash value of the commit named master (the history expanded into the working tree by default) is da39a3ee5e...". Using this information, Git can expand the files belonging to master into the working tree.
Internally, "committing" means adding the files that have changed to Git's internal file system, adding in the same way a commit file that contains the hash values of those files and the hash value of the previous commit, and finally updating the name list so that it points to the hash value of the new commit file.
For example, to make master's last commit disappear (although it is better not to do this), you look at the commit file pointed to by master, get the hash value of the previous commit, and overwrite master's entry with that hash value. A "branch" simply means that two or more commits share the same previous commit, and both of them are listed in the name list.
Such content-addressable version control systems also have security benefits. The name of a commit (the hash value of the commit file) covers the hash values of all files belonging to that commit and the hash value of the previous commit file. The previous commit file in turn contains the hash of the commit before it, so in the end the hash values of all the commits leading up to the latest commit are included in the calculation of the latest commit's hash value. Therefore, it is practically impossible to secretly modify the commit contents or history without changing the hash value. That is an interesting property.
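The tamper-evidence argument above can be sketched with a toy model. This is not Git's real commit format, just an illustration of the principle that a commit's name covers its parent's name, so rewriting history changes every later hash.

```python
import hashlib

def toy_commit(parent: str, tree: str, message: str) -> str:
    # Toy model: a commit's name is a hash over its tree, parent, and message.
    data = f"tree {tree}\nparent {parent}\n\n{message}".encode()
    return hashlib.sha1(data).hexdigest()

c1 = toy_commit("0" * 40, "tree-a", "first commit")
c2 = toy_commit(c1, "tree-b", "second commit")

# Rewriting the first commit gives it a new name, which changes the
# second commit's name as well: history cannot be edited in place.
c1_bad = toy_commit("0" * 40, "tree-a", "first commit (tampered)")
print(c2 != toy_commit(c1_bad, "tree-b", "second commit"))  # → True
```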
When learning about Git's features, always keep this content-addressable file system in mind. I'm sure it will make things easier to understand.
This appendix explains how to use Docker to build and run Linux applications on macOS.
Docker is software that provides a Linux virtual environment. Whereas a full-featured VM emulates PC hardware and runs a regular OS on top of it, Docker provides only the Linux system-call interface directly within the virtual environment. This difference gives Docker the advantage of faster startup and lighter weight overall compared to full virtualization. Docker is often used when deploying applications to computers in the cloud.
When developing Linux applications on macOS using Docker, you can consider the following two system configurations: doing all development work, including editing, inside the Docker environment, or editing source files on macOS and using Docker only to build, run, and test.
The former configuration is simple in structure because it is the same as preparing a Linux machine separate from the Mac for development, but setup takes some work because you need to prepare a development environment on Linux, including an editor. With the latter configuration, on the other hand, you do not "live" inside Docker, so there is little environment to set up, but going back and forth between the inside and outside of Docker is a bit troublesome.
Although you can choose either of the above two configurations, in this book we will choose the latter configuration to avoid describing the setup steps for a Linux environment. Therefore, only the commands you want to run in the Linux environment will be explicitly executed in Docker.
To set up a Linux development environment using Docker, first download and install Docker Desktop for Mac. You can then run the following command to create a Docker image named compilerbook:
$ docker build -t compilerbook https://www.sigbus.info/compilerbook/Dockerfile
A "Docker image" or "image" is a collection of all the files and settings required for a Linux environment. What actually launches a Docker image is called a "Docker container" or simply "container" (same as the relationship between an executable file and a process).
To create a container and run a command inside it, supply the image name and the command as arguments to the docker run command. The following is an example of running the ls / command inside a compilerbook container:
$ docker run --rm compilerbook ls /
bin
root
dev
etc
...
It is possible to keep a container running in the background, but since our usage is purely interactive, we passed the --rm option, which makes the container exit as soon as the command finishes. Therefore, each time you enter the above command, a container is created and then destroyed.
In order to compile source files and run make inside a container, you need to make the source files you are editing outside Docker visible to the container.
You can make a path in the outside environment visible inside Docker by giving docker run an option of the form -v <source>:<dest>, which mounts <source> at <dest> inside the container. You can also specify the current directory for the command with the -w option. By using these options, you can run make with the directory containing the source files as the current directory.
Let's assume the source files are located in a subdirectory named 9cc of your home directory. To run make test in that directory from within the container, run the following command:
$ docker run --rm -v $HOME/9cc:/9cc -w /9cc compilerbook make test
Please run the build and test commands as described above.
If you want to start a shell inside the container and use it interactively, run docker run with the -it option:
$ docker run --rm -it -v $HOME/9cc:/9cc compilerbook
The Docker container you created above comes with a set of development tools installed, but you may want to install additional applications. Here's how.
Docker containers are temporary objects. Even if you create a container and install an application from a shell inside it, those changes will not be written back to the original image. This property of Docker ensures that your application starts from the same fresh state every time, but this property is a liability when you want to make changes to the image.
If you have made changes to a container and want to write them back to the image, you must explicitly run the docker commit command.
As an example, let's say you want to install the curl command. In that case, first create a container as follows:
$ docker run -it compilerbook
Note that we are not passing the --rm option this time. After that, install curl using apt from the shell inside the container as follows, and leave the container with the exit command:
$ sudo apt update
$ sudo apt install -y curl
$ exit
Since we did not specify the --rm option when running docker run, the container remains in a stopped state even after we exit its shell. You can display stopped containers with the docker container ls -a command:
$ docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a377e570d1da compilerbook "/bin/bash" 7 seconds ago Exited (0) 5 seconds ago pedantic_noyce
You should see that there is a container with ID a377e570d1da that ran the compilerbook image. You can write this container back to the image using the docker commit command:
$ docker commit a377e570d1da compilerbook
You can make changes to the image using the steps above.
Even if stopped containers and old images remain, they cause no particular problem beyond consuming some disk space, but if they bother you, you can delete them by running docker system prune.
http://www.drdobbs.com/cpp/increasing-compiler-speed-by-over-75/240158941
"DMD does memory allocation in a bit of a sneaky way. Since compilers are short-lived programs, and speed is of the essence, DMD just mallocs away, and never frees."
Linkers and Loaders, ISBN 978-1558604964, John R. Levine (1999), Chapter 1.2
"Programmers were using libraries of subprograms even before they used assemblers. By 1947, John Mauchly, who led the ENIAC project, wrote about loading programs along with subprograms selected from a catalog of programs stored on tapes, and of the need to relocate the subprograms' code to reflect the addresses at which they were loaded. Perhaps surprisingly, these two basic linker functions, relocation and library search, appear to predate even assemblers, as Mauchly expected both the program and subprograms to be written in machine language. The relocating loader allowed the authors and users of the subprograms to write each subprogram as though it would start at location zero, and to defer the actual address binding until the subprograms were linked with a particular main program."
https://parisc.wiki.kernel.org/images-parisc/b/b2/Rad_11_0_32.pdf The 32-bit PA-RISC Run-time Architecture Document, v. 1.0 for HP-UX 11.0, Chapter 2.2.3
"When a process is initiated by the operating system, a virtual address range is allocated for that process to be used for the call stack, and the stack pointer (GR 30) is initialized to point to the low end of this range. As procedures are called, the stack pointer is incremented to allow the called procedure frame to exist at the address below the stack pointer. When procedures are exited, the stack pointer is decremented by the same amount."
https://www.acsac.org/2002/papers/classic-multics.pdf Thirty Years Later: Lessons from the Multics Security Evaluation
"Third, stacks on the Multics processors grew in the positive direction, rather than the negative direction. This meant that if you actually accomplished a buffer overflow, you would be overwriting unused stack frames, rather than your own return pointer, making exploitation much more difficult."
https://en.wikipedia.org/wiki/ASCII#cite_note-Mackenzie_1980-1
"There was some debate at the time whether there should be more control characters rather than the lowercase alphabet."
http://catb.org/~esr/jargon/html/L/language-lawyer.html
"language lawyer: n. A person, usually an experienced or senior software engineer, who is intimately familiar with many or most of the numerous restrictions and features (both useful and esoteric) applicable to one or more computer programming languages. A language lawyer is distinguished by the ability to show you the five sentences scattered through a 200-plus-page manual that together imply the answer to your question 'if only you had thought to look there'. Compare wizard, legal, legalese."
C Programming Language, 2nd Edition (ISBN 978-0131103627), 5.12, p. 122
"C is sometimes castigated for the syntax of its declarations, particularly ones that involve pointers to functions."
Go's Declaration Syntax by Rob Pike