Introduction to C compiler creation for those who want to know about low layers

Rui Ueyama <ruiu@cs.stanford.edu>

2020-03-16


Introduction

This online book is still being written. This is not the final version.

This book is packed with more content than usually fits in a single volume. In it, we will create a C compiler: a program that converts source code written in the C language into assembly language. The compiler itself will also be developed in C. The immediate goal is self-hosting, i.e. being able to compile the compiler's own source code with the compiler we write.

In this book, I decided to explain various topics in a progressive manner throughout the book, so as not to make the explanation of compilers too difficult. Here's why:

Compilers can be conceptually divided into multiple stages: parsing, intermediate passes, and code generation. A common textbook approach is to devote a chapter to each stage, but books organized that way tend to become too narrow and deep in the middle chapters, making it difficult for readers to follow.

Also, if you develop one stage at a time, you cannot run the compiler until every stage is complete, so the downside is that it is hard to notice a fundamental mistake in your understanding or your code until the whole thing starts working. In the first place, you don't really know what the next stage expects as input until you write it yourself, so you don't really know what the previous stage should output. Another problem is that it is hard to stay motivated when you cannot compile any code at all until everything is finished.

In this book, I decided to take a different approach to avoid that trap. Early in the book, you will implement an "original language" with a very simple specification. The language is so simple that you hardly need to know anything about writing compilers to implement it. From there, the reader keeps adding features to this original language throughout the book, eventually growing it into something consistent with C.

In this incremental development method, you create the compiler step by step, making small commits. With this method, the compiler is always, in some sense, "complete" at every commit. At one stage it may be nothing more than a calculator; at another, a very limited subset of C; at yet another, a language that can almost be called C. The point is that at every stage we aim for a language with a reasonable specification matching its level of completion at that point. During development, we do not build out just a few features to make it superficially look like C.

We will also explain data structures, algorithms, and computer science knowledge in stages according to the development stage.

Incremental development achieves the goal that, at any point while reading this book, the reader has complete knowledge of how to create a reasonable language at that level. This is much better than knowing only a few compiler topics in extreme detail. By the time you finish the book, you will be well versed in all of them.

This book also explains how to write large programs from scratch. The skill of creating large programs is a unique skill that is different from learning data structures and algorithms, but I don't think there are many books that explain such things. Also, even if someone explains it to you, you won't really know whether a development method is good or bad unless you actually experience it. This book is designed so that the process of developing your own language into C language will give you practical experience of a good development method.

If the author's plan is successful, by reading this book, readers will not only learn about techniques for creating compilers and the CPU instruction set, but also learn how to break down large programs into small steps and create them little by little. You'll learn testing techniques, version control techniques, and even how to prepare for an ambitious project like writing a compiler.

The intended audience for this book is ordinary C programmers. You don't need to be a super C programmer who knows the C language specification well. It is sufficient that you understand pointers and arrays, and that you can at least take some time to read small C programs written by others.

When writing this book, I tried not only to explain the language specification and CPU specifications, but also, as much as possible, to explain why those designs were chosen. I have also interspersed columns about compilers, CPUs, the computer industry, and their history that should interest readers, making it an enjoyable read.

Creating a compiler is a lot of fun. In the beginning, my homemade language could only do ridiculously simple things, but as I continued developing it, it quickly grew to resemble C, to the point that even I was surprised, and it started working as if by magic. During actual development, I am often surprised when large test code that I didn't expect to compile well compiles without errors and works perfectly correctly. Some of that code is hard to understand even when I look at the compiled assembly myself. Sometimes I even feel like my compiler is more intelligent than I, its author, am. A compiler is a program that, even when you know how it works, still makes you wonder why it works so well. I'm sure you too will fall in love with its charm.

So, without further ado, let's jump into the world of compiler development with the author!

Column: Why C language?

Among the many programming languages available, why did I choose C for this book? And why not a homegrown language? There is no reason it absolutely has to be C, but if you have to choose a language for learning how to create a compiler that outputs native code, I think C is one of the most common and reasonable choices.

Interpreted languages don't teach you much about the lower layers. C, on the other hand, is usually compiled to assembly, so by writing a C compiler you learn about the CPU's instruction set and how programs work, as well as about C itself.

C is widely used, so once you have a working compiler, you can have fun compiling third-party source code downloaded from the internet. For example, you can build and play with the mini-Unix xv6. If the compiler is mature enough, it should even be possible to compile the Linux kernel. This kind of enjoyment is not possible with minor or homegrown languages.

C++ is a statically typed language that compiles to a native machine language like C and is at least as widely used as C. However, the language specifications for C++ are so large that it is impossible to easily create your own compiler, so it is not a realistic option.

Designing and implementing an original language is good for honing your language-design sense, but it has a pitfall: anything that is hard to implement can simply be left out of the language specification. That is not possible with a language like C, whose specification is given as a standard. I think this constraint is actually quite good for learning.

Conventions used in this book

Functions, expressions, commands, and so on are displayed in a monospaced font in the text, for example main, foo=3, and make.

Code that spans multiple lines is displayed in a frame using a monospaced font, as shown below.

int main() {
  printf("Hello world!\n");
  return 0;
}

If boxed code is a shell command meant to be typed verbatim by the user, a line starting with $ represents a prompt; type the rest of that line (but not the $ itself) into the shell. Lines that do not start with $ represent output from the command you entered. For example, the block below shows what happens when the user types make and presses Enter: the make command prints the output make: Nothing to be done for `all'.

$ make
make: Nothing to be done for `all'.

Development environment assumed in this book

This book assumes a 64-bit Linux environment running on a so-called ordinary PC such as Intel or AMD. Please install development tools such as gcc and make in advance according to the distribution you are using. If you are using Ubuntu, you can install the commands used in this book by running the following command.

$ sudo apt update
$ sudo apt install -y gcc make git binutils libc6-dev

Although macOS is fairly compatible with Linux at the assembly source level, it is not fully compatible (specifically, a feature called "static linking" is not supported). Although it is possible to create a C compiler for macOS using the information in this book, if you try you will likely run into a number of minor incompatibilities. It is not a good idea to simultaneously learn how to write a C compiler and the differences between macOS and Linux: when something doesn't work, it is hard to tell which of the two you have misunderstood.

Therefore, this book does not cover macOS. On macOS, please use some kind of virtual environment to prepare a Linux environment. If this is your first time setting up a Linux virtual environment, please refer to Appendix 3, which summarizes how to create a development environment using Docker.

Windows is not compatible with Linux at the assembly source level. However, Windows 10 can run Linux as if it were just another application, and by using this feature you can develop on Windows. That Linux-compatible environment is called the Windows Subsystem for Linux (WSL). To work through this book on Windows, please install WSL and develop inside it.

Column: Cross compiler

The machine on which the compiler runs is called the "host," and the machine on which the code output by the compiler runs is called the "target." Although both are 64-bit Linux environments in this book, the host and target do not necessarily have to be the same.

A compiler whose host and target are different is called a cross-compiler. For example, a compiler that runs on Windows that generates an executable file for Raspberry Pi is a cross compiler. Cross-compilers are often used when the target machine is too weak or specialized to run the compiler.

About the author

Rui Ueyama (@rui314). He is the original author and current maintainer of the high-speed linker lld, which is used as the standard linker for creating executable files in many OSes and projects, including Android (Q and later), FreeBSD (12 and later), the Nintendo Switch, Chrome, and Firefox. (So there is a good chance that a binary created by a tool I wrote is already on your computer.) He is also the author of the compact C compiler 8cc. I mainly write essays about software on note.

Column: Compiler that compiles the compiler

Self-referential situations, such as a C compiler written in C, are not uncommon. Many language implementations other than C are written using the language itself.

If an implementation of language X already exists, there is no logical contradiction in writing a new X compiler in X itself. If you intend to self-host, you simply develop with the existing compiler and switch to your own once it is done. That is exactly what we attempt in this book.

But what if no existing compiler is available? In that case, you have no choice but to write in another language. When writing your first compiler for language X with the intention of self-hosting, you need to write it in some existing language Y different from X, and once the compiler is complete, rewrite the compiler itself from Y into X.

If you trace how the compilers for today's complex programming languages were themselves compiled, through the compilers that compiled them and so on, you should eventually arrive, in the early days of computing, at a simple assembler that someone wrote directly in machine code. We don't know whether there was one such assembler or several, but there is no doubt that modern compilers, the descendants of those ultimate ancestors of all existing language implementations, started from a very small number of them. Executable files other than compilers are also usually generated by compilers, so almost all existing executable files are indirect descendants of the original assemblers. This is an interesting story, similar to the origin of life.


Machine language and assembler

In this chapter, the goal is to get a rough idea of the components that make up a computer and what kind of code our C compiler should output. I will not yet go into the details of specific CPU instructions; it is important to understand the concepts first.

CPU and memory

The components that make up a computer can be broadly divided into CPU and memory. Memory is a device that can hold data, and a CPU is a device that performs some processing while reading and writing from that memory.

Conceptually, memory appears to the CPU as a huge array of randomly accessible bytes. When the CPU accesses memory, it specifies which byte it wants using a number, and that number is called an "address." For example, "reading 8 bytes of data from address 16" means reading the 8 bytes that start at the 16th byte of that array-like memory.

Both the programs the CPU executes and the data those programs read and write reside in memory. The CPU keeps the "address of the currently executing instruction" internally; it reads the instruction at that address, does what it says, and then reads and executes the next instruction. The address of the instruction currently being executed is called the "program counter" (PC) or "instruction pointer" (IP). The format of the programs that the CPU executes is called "machine code."

The program counter does not necessarily advance linearly to the next instruction. A type of CPU instruction called a branch instruction allows you to set the program counter to any address other than the next instruction. This function enables if statements, loops, etc. Setting the program counter to a location other than the next instruction is called "jumping" or "branching."

In addition to the program counter, the CPU also has a small data storage area. For example, Intel and AMD processors have 16 areas that can hold 64-bit integers. This area is called a " register ". Memory is an external device to the CPU, and reading and writing from it takes some time, but registers exist inside the CPU and can be accessed without delay.

Many machine language instructions take a form that performs some operation on the values of two registers and writes the result back to a register. Therefore, when a program runs, the CPU reads data from memory into registers, performs operations between registers, and writes the results back to memory.

The instructions of a specific machine language are collectively called the "instruction set architecture" (ISA) or "instruction set." There is no single instruction set; each CPU can in principle define its own. However, the same program cannot run across CPUs without machine-language-level compatibility, so in practice there are not many variations. PCs use an instruction set called x86-64, from Intel and its compatible-chip maker AMD. Although x86-64 is one of the major instruction sets, it does not dominate the market alone; for example, iPhone and Android use an instruction set called ARM.

Column: x86-64 instruction set name

x86-64 is also sometimes called AMD64, Intel 64, x64, and so on. There is a historical reason why the same instruction set has multiple names like this.

The x86 instruction set was created by Intel in 1978, but it was AMD that extended it to 64 bits. Around 2000, when 64-bit processors were becoming a necessity, Intel was betting on a completely new instruction set called Itanium and did not bother working on a 64-bit version of x86 that would have competed with it. Seizing that opening, AMD drew up and released a specification for a 64-bit extension of x86. That is x86-64. AMD later renamed x86-64 to AMD64, probably as a branding strategy.

After that, Itanium's failure became obvious, and Intel had no choice but to make a 64-bit version of x86. But by then a number of actual AMD64 chips had already shipped, and it would have been difficult to push an extended instruction set incompatible with AMD's, so Intel decided to adopt one compatible with AMD64. It is said that there was also pressure from Microsoft to maintain compatibility. Intel adopted an almost exactly identical instruction set to AMD64, naming it IA-32e. Calling it IA-32e (Intel Architecture 32 extensions) rather than something containing "64" seems to reflect Intel's awkward position at the time that Itanium was to be its flagship 64-bit CPU. After Intel finally abandoned Itanium entirely, IA-32e was renamed to the plainer Intel 64. Microsoft calls x86-64 "x64," perhaps because they don't like long names.

For the reasons listed above, x86-64 has many different names.

Open source projects often prefer the name x86-64, which does not include the name of a specific company. This book also uses the name x86-64 throughout.

What is an assembler?

Machine language is read directly by the CPU, so it is designed for the CPU's convenience, not for human readability. Writing machine code with a binary editor is not impossible, but it is very painful. That is why the assembler was invented. Assembly is a language that corresponds almost one-to-one with machine language but is far easier for humans to read.

For compilers that output native binaries rather than targeting a virtual machine or interpreter, the usual goal is to output assembly. Even compilers that appear to output machine code directly often just run an assembler behind the scenes after emitting assembly. The C compiler created in this book also outputs assembly.

Converting assembly code into machine language is sometimes called "compiling," but it is also called "assembling" to emphasize that the input is assembly.

You may have seen assembly before. If you haven't, now is a good time to take a look. Let's use the objdump command to disassemble a suitable executable file and display the machine code it contains as assembly. Below is the result of disassembling the ls command.

$ objdump -d -M intel /bin/ls
/bin/ls:     file format elf64-x86-64

Disassembly of section .init:

0000000000003d58 <_init@@Base>:
  3d58:  48 83 ec 08           sub    rsp,0x8
  3d5c:  48 8b 05 7d b9 21 00  mov    rax,QWORD PTR [rip+0x21b97d]
  3d63:  48 85 c0              test   rax,rax
  3d66:  74 02                 je     3d6a <_init@@Base+0x12>
  3d68:  ff d0                 call   rax
  3d6a:  48 83 c4 08           add    rsp,0x8
  3d6e:  c3                    ret
...

In my environment, the ls command contains about 20,000 machine language instructions, so the disassembled result is nearly 20,000 lines long. Only the first part is shown here.

Assembly basically has one machine language instruction per line. For example, consider the following line:

  3d58:  48 83 ec 08           sub    rsp,0x8

What does this line mean? 3d58 is the memory address containing this machine code. In other words, when the ls command runs, the instruction on this line is placed at address 0x3d58 in memory, and it is executed when the program counter is 0x3d58. The next four hexadecimal numbers are the actual machine code; the CPU reads this data and executes it as an instruction. sub rsp,0x8 is the assembly corresponding to that machine code. The CPU instruction set is explained in a separate chapter, but this particular instruction subtracts 8 from a register called RSP.

C and its assembler counterpart

A simple example

To get an idea of what kind of output a C compiler produces, let's compare C code with the corresponding assembly code. As the simplest example, consider the following C program.

int main() {
  return 42;
}

Write this program to a file named test1.c. You can then verify that main actually returns 42 by compiling and running it as follows:

$ cc -o test1 test1.c
$ ./test1
$ echo $?
42

In C, the value returned by the main function is the exit code of the entire program. The exit code is not displayed on screen, but it is implicitly stored in the shell variable $?, so you can see a command's exit code by running echo $? immediately after the command ends. You can see here that 42 is returned correctly.

Now, the assembly program corresponding to this C program is as follows.

.intel_syntax noprefix
.globl main
main:
        mov rax, 42
        ret

This assembly defines a global label main, followed by the code of the main function. Here we set the value 42 in a register called RAX and then return from main. There are 16 registers in total that can hold integers, RAX among them, and the convention is that the value left in RAX when a function returns becomes the function's return value. That is why we set RAX here.

Let's actually assemble and run this assembly program. The conventional file extension for assembly is .s, so write the assembly code above to test2.s and run the following commands:

$ cc -o test2 test2.s
$ ./test2
$ echo $?
42

42 is now the exit code, just like in C.

Roughly speaking, a C compiler is a program that reads C code like test1.c and outputs assembly like test2.s.

Example with function call

As a slightly more complex example, let's look at how code with function calls is translated into assembly.

A function call differs from a simple jump: after the called function finishes, execution must return to where it left off. The address to return to is called the "return address." If function calls were only one level deep, the return address could be kept in a spare CPU register, but since calls can nest arbitrarily deeply, return addresses must be saved in memory. In practice, return addresses are stored on a stack in memory.

A stack can be implemented with just one variable that holds the address of the top of the stack. The storage that holds the address of the top of the stack is called the "stack pointer." To support programming with functions, x86-64 provides a register dedicated to the stack pointer and instructions that use it. Putting data on the stack is called a "push," and taking data off the stack is called a "pop."

Now, let's look at an example of a function call. Consider the following C code.

int plus(int x, int y) {
  return x + y;
}

int main() {
  return plus(3, 4);
}

The assembly corresponding to this C code looks like this:

.intel_syntax noprefix
.globl plus, main

plus:
        add rsi, rdi
        mov rax, rsi
        ret

main:
        mov rdi, 3
        mov rsi, 4
        call plus
        ret

The first line is a directive that selects the assembly syntax. The second line, starting with .globl, tells the assembler that the two functions plus and main are visible to the entire program rather than having file scope. You can ignore this for now.

Look at main first. In the C code, main calls plus with the arguments 3 and 4. The convention is that the first argument is placed in the RDI register and the second in the RSI register, so the first two lines of main set those values accordingly.

call is an instruction that calls a function. Specifically, call does the following:

- It pushes the address of the next instruction (the return address) onto the stack.
- It jumps to the address given as its operand.

Therefore, once the call instruction is executed, the CPU starts executing the plus function.

Now look at the plus function. It consists of three instructions.

add is an instruction that performs addition. Here, the result of adding the RSI and RDI registers is written to the RSI register. Since x86-64 integer arithmetic instructions usually take only two registers, the result is saved by overwriting the value of one of the argument registers.

The return value of a function is to be stored in RAX. Since we want to return the result of the addition, we need to copy the value from RSI into RAX, which is what the mov instruction does here. mov is an abbreviation of "move," but it does not actually move data; it simply copies it.

At the end of plus, we execute ret to return from the function. Specifically, ret does the following:

- It pops the return address off the stack.
- It jumps to that address.

In other words, ret is the instruction that undoes what call did and resumes execution of the calling function. ret is defined as the inverse of call.

After returning from plus, the next instruction executed in main is its ret. The original C code was supposed to return the return value of plus from main as is. Here, the return value of plus is already in RAX, so by returning from main as is, that value becomes main's return value.

Summary of this chapter

This chapter provided an overview of how a computer works internally and what a C compiler should do. At first glance, assembly or machine language looks like a jumble of data far removed from C, but many readers may have found that it actually reflects the structure of C more directly than expected.

This book has not yet explained much specific machine language, so you probably don't know the meaning of each individual instruction in the assembly displayed by objdump, but you can likely imagine that each instruction does only a small amount of work. At this stage of the book, that sense is enough.

The main points of this chapter are summarized below in bullet points.

- A computer consists broadly of a CPU and memory, and memory looks to the CPU like a huge array of bytes addressed by number.
- The CPU reads the instruction at the address held in the program counter, executes it, and moves on to the next instruction; branch instructions change the program counter, which is what enables if statements and loops.
- The CPU also has a small number of fast storage areas called registers; x86-64 has 16 registers that hold 64-bit integers.
- Assembly corresponds almost one-to-one with machine language, and a C compiler's job is to output assembly.
- Function calls use a stack in memory: call pushes the return address and jumps to the function, and ret pops the return address and jumps back.
- By convention, the first two integer arguments are passed in RDI and RSI, and the return value is left in RAX.

Column: Online compiler

Looking at C code and its compiled output is a good way to learn assembly, but editing source code, compiling it, and checking the assembly output over and over can be surprisingly tedious. There is a very good website that saves you that hassle: Compiler Explorer (commonly known as godbolt). When you enter code in the text box on the left half of the screen, the corresponding assembly output is displayed in real time on the right half. It is a handy place to check what assembly a piece of C code is converted to.


Creating a calculator-level language

In this chapter, as the first step in writing a C compiler, we will support arithmetic operators such as the four basic operations so that we can compile expressions like the following:

30 + (4 - 2) * -5

This may seem like a simple goal, but it is actually quite difficult. Arithmetic expressions have structure: the part in parentheses takes precedence, and multiplication binds tighter than addition. Unless you grasp that structure somehow, you cannot compute the result correctly. However, the expression given as input is just a flat string of characters, not structured data. To evaluate an expression correctly, you need to analyze that character sequence and tease out the structure hidden in it.

These parsing problems are quite difficult to solve without prior knowledge. In fact, they were once considered hard problems, and intensive research was conducted and various algorithms developed, especially from the 1950s to the 1970s. Thanks to that work, parsing is now a fairly simple problem once you know the techniques.

This chapter describes one of the most common parsing algorithms, recursive descent parsing. The C/C++ compilers that you use every day, such as GCC and Clang, also use recursive descent parsing.

The need to read text with some kind of structure arises often in programming, not just in compilers. The techniques you will learn in this chapter apply directly to such problems. It is no exaggeration to say that the parsing techniques in this chapter will last you a lifetime. Read this chapter to understand the algorithms and add parsing to your programmer's toolbox.

Step 1: Create a language that compiles one integer

Consider the simplest possible subset of the C language. What kind of language do you imagine? A language that has only a main function? A language consisting of a single expression? Ultimately, I think it's fair to say that a language consisting of a single integer is the simplest subset imaginable.

In this step, let's implement the simplest language first.

The program you create in this step is a compiler that reads one number from its input and outputs assembly that exits with that number as the program's exit code. That is, the input is simply a string such as 42, and on reading it the compiler outputs assembly like this:

.intel_syntax noprefix
.globl main

main:
        mov rax, 42
        ret

.intel_syntax noprefix is an assembler directive that selects the Intel syntax used in this book from among the multiple ways of writing assembly. Be sure to include this line at the beginning of the assembly produced by the compiler you write. The other lines are as explained in the previous chapter.

At this point, the reader may think, "This program can hardly be called a compiler." Honestly, I think so too. However, this program accepts a language consisting of a single number as input and outputs code corresponding to that number, so by definition it is a perfectly fine compiler. Even a simple program like this quickly becomes quite complex as you extend it, so let's complete this step first.

This step is actually very important to the overall development procedure, because what we create here becomes the skeleton for all future development. In this step, in addition to the compiler itself, we also write a build file (Makefile) and automated tests, and set up a git repository. Let's look at each of these tasks one by one.

The C compiler created in this book is called 9cc. "cc" is an abbreviation for C compiler. The 9 has no particular meaning, but the C compiler I wrote previously was called 8cc, so I named this one 9cc as its successor. Of course, you can give yours any name you like. Just don't agonize over the name in advance to the point of never starting on the compiler; you can rename things later, including GitHub repositories, so any name is fine to start with.

Column: Intel notation and AT&T notation

In addition to the Intel notation used in this book, an assembly notation called AT&T notation is also widely used, mainly on Unix. By default, gcc and objdump output assembly in AT&T notation.

In AT&T notation, the destination register comes second, so two-operand instructions take their operands in the reverse order. Register names are written with a % prefix, like %rax. Numbers are written with a $ prefix, like $42.

Also, memory references are written in AT&T notation's own syntax, using () instead of []. Some examples for comparison are shown below.

mov rbp, rsp   // Intel
mov %rsp, %rbp // AT&T

mov rax, 8     // Intel
mov $8, %rax   // AT&T

mov [rbp + rcx * 4 - 8], rax // Intel
mov %rax, -8(rbp, rcx, 4)    // AT&T

For the compiler created in this book, I decided to use Intel notation for readability. Intel's instruction set manual uses Intel notation, which has the advantage that you can transcribe the manual's descriptions directly into code. The expressive power of AT&T and Intel notation is the same: whichever you use, the generated machine instructions are identical.

Creating the compiler body

Normally we give input to the compiler as a file, but to avoid having to open and read files, we'll pass the code directly as the first command-line argument. A C program that reads the first argument as a number and embeds it in a fixed assembly template can be written as follows.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "wrong number of arguments\n");
    return 1;
  }

  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");
  printf("  mov rax, %d\n", atoi(argv[1]));
  printf("  ret\n");
  return 0;
}

Create an empty directory named 9cc and create a file 9cc.c in it with the above content. Then run 9cc as shown below to check that it works.

$ cc -o 9cc 9cc.c
$ ./9cc 123 > tmp.s

The first line compiles 9cc.c and creates the executable file 9cc. The second line passes the input 123 to 9cc to generate assembly and writes it to the file tmp.s. Let's check the contents of tmp.s.

$ cat tmp.s
.intel_syntax noprefix
.globl main
main:
  mov rax, 123
  ret

As you can see, it's generated well. You can create an executable file by passing the assembly file created in this way to an assembler.

On Unix, cc (or gcc) is a front end for many languages, not just C and C++: it determines the language from the extension of a given file and launches the appropriate compiler or assembler. Therefore, just as when compiling 9cc, you can assemble an assembly file by passing a file with the .s extension to cc. The following is an example of assembling and running the generated file.

$ cc -o tmp tmp.s
$ ./tmp
$ echo $?
123

In the shell, the exit code of the previous command can be accessed through the variable $?. In the example above, the number displayed is 123, the same value given as the argument to 9cc, which means it's working correctly. Try giving numbers other than 123 in the range 0 to 255 (Unix process exit codes are defined to be 0 to 255) and see whether 9cc actually works.

Creating automated tests

Many readers may have never written tests for hobby programming, but in this book we will write code to test new code each time we extend the compiler. Writing tests may seem tedious at first, but you'll soon start to appreciate them. If you don't write test code, you'll end up having to run the same tests by hand every time to check the operation, but doing it by hand is much more troublesome.

I think a lot of the impression that writing tests is tedious stems from the fact that test frameworks are exaggerated and the philosophy of testing is sometimes dogmatic. For example, testing frameworks like JUnit have a variety of useful functions, but they take time to install and learn how to use. Therefore, this chapter does not introduce such testing frameworks. Instead, I'm going to write a very simple handwritten "test framework" in shell scripts and use it to write my tests.

Below is a shell script for testing, test.sh. The shell function assert takes two arguments, an expected value and an input value; it compiles the input with 9cc, assembles the result, runs it, and compares the actual result with the expected value. After defining assert, the script uses it to verify that both 0 and 42 compile correctly.

#!/bin/bash
assert() {
  expected="$1"
  input="$2"

  ./9cc "$input" > tmp.s
  cc -o tmp tmp.s
  ./tmp
  actual="$?"

  if [ "$actual" = "$expected" ]; then
    echo "$input => $actual"
  else
    echo "$input => $expected expected, but got $actual"
    exit 1
  fi
}

assert 0 0
assert 42 42

echo OK

Create test.sh with the above content and run chmod a+x test.sh to make it executable. Now let's actually run test.sh. If no errors occur, test.sh displays OK at the end and exits, as shown below.

$ ./test.sh
0 => 0
42 => 42
OK

If an error occurs, test.sh does not display OK. Instead, it displays the expected and actual values for the failed test, as shown below.

$ ./test.sh
0 => 0
42 expected, but got 123

When you want to debug the test script, run it with bash's -x option. With -x, bash displays an execution trace as shown below.

$ bash -x test.sh
+ assert 0 0
+ expected=0
+ input=0
+ cc -o 9cc 9cc.c
+ ./9cc 0
+ cc -o tmp tmp.s
+ ./tmp
+ actual=0
+ '[' 0 '!=' 0 ']'
+ assert 42 42
+ expected=42
+ input=42
+ cc -o 9cc 9cc.c
+ ./9cc 42
+ cc -o tmp tmp.s
+ ./tmp
+ actual=42
+ '[' 42 '!=' 42 ']'
+ echo OK
OK

The "testing framework" we use throughout this book is simply a shell script like the one above. It may seem too simple compared to a full-fledged testing framework such as JUnit, but its simplicity is well matched to the simplicity of 9cc itself, so this is preferable. The essence of automated testing is just running the code you've written in one go and mechanically comparing the results, so start by writing tests without overthinking it.

Build with make

Throughout this book, you will build 9cc hundreds or even thousands of times. The work of creating the 9cc executable and then running the test script is the same every time, so it is convenient to leave it to a tool. The make command is commonly used for this purpose.

When run, make reads the file named Makefile in the current directory and executes the commands written in it. A Makefile consists of rules, each a line ending with a colon, followed by the sequence of commands for that rule. The following Makefile automates the commands we want to run in this step.

CFLAGS=-std=c11 -g -static

9cc: 9cc.c

test: 9cc
        ./test.sh

clean:
        rm -f 9cc *.o *~ tmp*

.PHONY: test clean

Create the above file with the name Makefile in the same directory as 9cc.c. Then 9cc will be built just by running make, and the tests will run with make test. Since make understands file dependencies, there is no need to run make after changing 9cc.c and before running make test: make rebuilds 9cc before running the tests only if the executable 9cc is older than 9cc.c.

make clean is a rule for deleting temporary files. You could delete temporary files manually with rm, but it would be a problem if you accidentally deleted a file you wanted to keep, so I decided to write it into the Makefile as a utility.

Note that when writing a Makefile, indentation must be a tab character; four or eight spaces will result in an error. This is simply bad syntax design, but make is an old tool developed in the 1970s, and it has traditionally been this way.

Be sure to pass the -static option to cc. This option is explained in the chapter on dynamic linking; you don't need to think about what it means for now.

Version control with git

This book uses git as the version control system. Throughout this book, we will create the compiler step by step, and for each step, please make a git commit with a commit message. Write the message in whatever language you prefer, but make sure the first line is a one-line summary of what actually changed. If you want to add a more detailed explanation, leave one blank line after the first line and write the explanation below it.

Git should only track files that you create by hand. Files generated by running 9cc can be regenerated by running the same commands, so there is no need to put them under version control. In fact, including such files makes each commit's diff unnecessarily large, so they should be excluded from version control and kept out of the repository.

In git, you can write patterns for files to exclude from version control in a file called .gitignore. Create a .gitignore with the following content in the same directory as 9cc.c so that git ignores temporary files, editor backup files, and so on.

*~
*.o
tmp*
a.out
9cc

If this is your first time using git, tell git your name and email address. The name and email address you give here will be recorded in the commit log. Below is an example of setting the author's name and email address; please substitute your own.

$ git config --global user.name "Rui Ueyama"
$ git config --global user.email "ruiu@cs.stanford.edu"

To make a commit with git, you first need to git add the files that have changed. Since this is our first commit, we first create a git repository with git init and then git add all the files we have created so far.

$ git init
Initialized empty Git repository in /home/ruiu/9cc
$ git add 9cc.c test.sh Makefile .gitignore

Then commit with git commit.

$ git commit -m "Create a compiler that compiles a single integer"

The -m option specifies the commit message. Without -m, git launches an editor. You can check that the commit succeeded by running git log -p:

$ git log -p
commit 0942e68a98a048503eadfee46add3b8b9c7ae8b1 (HEAD -> master)
Author: Rui Ueyama <ruiu@cs.stanford.edu>
Date:   Sat Aug 4 23:12:31 2018 +0000

    Create a compiler that compiles a single integer

diff --git a/9cc.c b/9cc.c
new file mode 100644
index 0000000..e6e4599
--- /dev/null
+++ b/9cc.c
@@ -0,0 +1,16 @@
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv) {
+  if (argc != 2) {
...

Finally, let's upload the git repository we have created to GitHub. There's no particular reason you must upload to GitHub, but there's also no reason not to, and GitHub is useful as a backup of your code. To upload, create a new repository named 9cc (in this example I created the repository under the user rui314) and add it as a remote with the following command:

$ git remote add origin git@github.com:rui314/9cc.git

Then running git push will push the contents of your repository to GitHub. After running git push, open GitHub in a browser and verify that your source code has been uploaded.

This completes the first step of creating the compiler. The compiler at this step is too simple to really be called a compiler, but it is a complete program containing all the elements a compiler needs. From here we will keep enhancing this compiler and, hard as it may be to believe, develop it into a respectable C compiler. Enjoy completing the first step.

Reference implementation

Step 2: Create a compiler that can add and subtract

In this step, we will extend the compiler created in the previous step so that it accepts expressions containing addition and subtraction, such as 2+11 and 5+20-4, rather than just single values such as 42.

You could evaluate an expression like 5+20-4 at compile time and embed the resulting number (21 in this case) in the assembly, but that would make it more of an interpreter than a compiler, so instead we output assembly that performs the addition and subtraction when run. The assembly instructions for addition and subtraction are add and sub. add takes two registers, adds their contents, and writes the result to the register given as its first operand. sub is the same as add except that it subtracts. Using these instructions, 5+20-4 can be compiled as follows.

.intel_syntax noprefix
.globl main

main:
        mov rax, 5
        add rax, 20
        sub rax, 4
        ret

In the above assembly, mov sets RAX to 5, then we add 20 to RAX and subtract 4 from it. The value of RAX when ret executes should therefore be 21, the value of 5+20-4. Let's run it and check. Save the above to the file tmp.s, assemble it, and run it.

$ cc -o tmp tmp.s
$ ./tmp
$ echo $?
21

21 was displayed correctly as shown above.

Now, how should we create this assembly file? If we regard these expressions involving addition and subtraction as a "language", the language can be defined as follows: first there is one number, followed by zero or more "terms", where a term is a + or - sign followed by a number.

If you translate this definition into C code, you get a program like this:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fprintf(stderr, "wrong number of arguments\n");
    return 1;
  }

  char *p = argv[1];

  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");
  printf("  mov rax, %ld\n", strtol(p, &p, 10));

  while (*p) {
    if (*p == '+') {
      p++;
      printf("  add rax, %ld\n", strtol(p, &p, 10));
      continue;
    }

    if (*p == '-') {
      p++;
      printf("  sub rax, %ld\n", strtol(p, &p, 10));
      continue;
    }

    fprintf(stderr, "unexpected character: '%c'\n", *p);
    return 1;
  }

  printf("  ret\n");
  return 0;
}

The program is a little longer now, but the first half is the same as before. Code for reading terms has been added in the middle. Since the program no longer reads just a single number, after reading a number we need to know how many characters were consumed. atoi does not return the number of characters it read, so with atoi we could not tell where the next term begins. Therefore we use the C standard library function strtol instead.

strtol reads a number and then updates the pointer given as its second argument to point to the character after the last one it read. So after reading a number, if the next character is + or -, p should point to that character. The program above takes advantage of this by reading terms one after another in a while loop, outputting one line of assembly for each term it reads.

Now let's run this modified compiler. After updating 9cc.c, you can build a new 9cc simply by running make. An example run is shown below.

$ make
$ ./9cc '5+20-4'
.intel_syntax noprefix
.globl main
main:
  mov rax, 5
  add rax, 20
  sub rax, 4
  ret

It looks like the assembly is being output successfully. To test this new feature, let's add a test line like this to test.sh:

assert 21 "5+20-4"

Once you've done this, commit your changes to git. To do so, run the following command:

$ git add test.sh 9cc.c
$ git commit

When you run git commit, the editor starts; write "Add addition and subtraction", save, and exit the editor. Use git log -p to confirm that the commit looks as expected. Finally, run git push to push the commit to GitHub, and this step is complete!

Reference implementation

Step 3: Introduce tokenizer

The compiler created in the previous step has one shortcoming: if the input contains whitespace characters, an error occurs at that point. For example, if you give it a string containing spaces, such as 5 - 3, a space character is found where + or - is expected, and compilation fails.

$ ./9cc '5 - 3' > tmp.s
予期しない文字です: ' '

There are several ways to resolve this issue. One obvious method is to skip whitespace characters before attempting to read + or -. There is nothing particularly wrong with that approach, but in this step we will solve the problem differently: we will split the input into words before reading the expression.

Just like Japanese or English, mathematical expressions and programming languages can be thought of as consisting of sequences of words. For example, 5+20-4 can be viewed as made up of five words: 5, +, 20, -, and 4. Such a "word" is called a "token". The whitespace between tokens exists only to separate them and is not part of any word, so it is natural to discard whitespace characters when splitting a string into tokens. Splitting a string into a sequence of tokens is called "tokenizing".

There are other benefits to splitting the input into tokens. When an expression is divided into tokens, each token can be classified and given a type. For example, + and - are punctuators that mean exactly what they look like, while a string such as 123 means the number 123. By interpreting each token during tokenization, rather than merely splitting the input into substrings, there is less to think about when consuming the token sequence later.

In our current grammar of addition and subtraction expressions, there are three kinds of tokens: +, -, and numbers. In addition, for implementation convenience, it is helpful to define one special kind representing the end of the token sequence (just as a string ends with '\0'). We will make the tokens a linked list connected by pointers so that input of arbitrary length can be handled.

Although it is a bit long, I will post below an improved version of the compiler that introduces a tokenizer.

#include <ctype.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Token kinds
typedef enum {
  TK_RESERVED, // punctuators
  TK_NUM,      // integer tokens
  TK_EOF,      // token representing the end of input
} TokenKind;

typedef struct Token Token;

// Token type
struct Token {
  TokenKind kind; // token kind
  Token *next;    // next input token
  int val;        // if kind is TK_NUM, its value
  char *str;      // token string
};

// The token currently being looked at
Token *token;

// Reports an error.
// Takes the same arguments as printf.
void error(char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vfprintf(stderr, fmt, ap);
  fprintf(stderr, "\n");
  exit(1);
}

// If the next token is the expected punctuator, advance one token
// and return true. Otherwise return false.
bool consume(char op) {
  if (token->kind != TK_RESERVED || token->str[0] != op)
    return false;
  token = token->next;
  return true;
}

// If the next token is the expected punctuator, advance one token.
// Otherwise report an error.
void expect(char op) {
  if (token->kind != TK_RESERVED || token->str[0] != op)
    error("expected '%c'", op);
  token = token->next;
}

// If the next token is a number, advance one token and return its value.
// Otherwise report an error.
int expect_number() {
  if (token->kind != TK_NUM)
    error("expected a number");
  int val = token->val;
  token = token->next;
  return val;
}

bool at_eof() {
  return token->kind == TK_EOF;
}

// Create a new token and link it to cur.
Token *new_token(TokenKind kind, Token *cur, char *str) {
  Token *tok = calloc(1, sizeof(Token));
  tok->kind = kind;
  tok->str = str;
  cur->next = tok;
  return tok;
}

// Tokenize the input string p and return the token list.
Token *tokenize(char *p) {
  Token head;
  head.next = NULL;
  Token *cur = &head;

  while (*p) {
    // Skip whitespace characters
    if (isspace(*p)) {
      p++;
      continue;
    }

    if (*p == '+' || *p == '-') {
      cur = new_token(TK_RESERVED, cur, p++);
      continue;
    }

    if (isdigit(*p)) {
      cur = new_token(TK_NUM, cur, p);
      cur->val = strtol(p, &p, 10);
      continue;
    }

    error("cannot tokenize");
  }

  new_token(TK_EOF, cur, p);
  return head.next;
}

int main(int argc, char **argv) {
  if (argc != 2) {
    error("wrong number of arguments");
    return 1;
  }

  // Tokenize the input
  token = tokenize(argv[1]);

  // Print the first half of the assembly
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // The expression must start with a number, so check that
  // and emit the first mov instruction.
  printf("  mov rax, %d\n", expect_number());

  // Emit assembly while consuming the token sequence
  // `+ <number>` or `- <number>`.
  while (!at_eof()) {
    if (consume('+')) {
      printf("  add rax, %d\n", expect_number());
      continue;
    }

    expect('-');
    printf("  sub rax, %d\n", expect_number());
  }

  printf("  ret\n");
  return 0;
}

The code is about 150 lines, which is not very short, but it doesn't do anything tricky, so you should be able to understand it by reading from the top.

Let me explain some programming techniques used in the code above.

This improved version should now be able to skip whitespace, so let's add the following line of tests to test.sh.

assert 41 " 12 + 34 - 5 "

The exit code of a Unix process is a number between 0 and 255, so when writing tests, make sure that the result of the entire expression falls within the range 0 to 255.

Add the updated files to your git repository to complete this step.

Reference implementation

Step 4: Improve error messages

With the compiler we have created so far, if the input is grammatically incorrect, all we can tell is that there is an error somewhere. In this step, let's improve on that. Specifically, we will make the compiler display intuitive error messages like the ones below.

$ ./9cc "1+3++" > tmp.s
1+3++
    ^ expected a number

$ ./9cc "1 + foo + 5" > tmp.s
1 + foo + 5
    ^ cannot tokenize

To display such error messages, we need to be able to tell at which byte of the input the error occurred. To do this, let's save the entire input string in a variable named user_input and define a new error-reporting function that receives a pointer into that string. The code is shown below.

// Input program
char *user_input;

// Reports an error location
void error_at(char *loc, char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);

  int pos = loc - user_input;
  fprintf(stderr, "%s\n", user_input);
  fprintf(stderr, "%*s", pos, " "); // print pos spaces
  fprintf(stderr, "^ ");
  vfprintf(stderr, fmt, ap);
  fprintf(stderr, "\n");
  exit(1);
}

The pointer error_at receives points into the string representing the entire input. Taking the difference between that pointer and the pointer to the beginning of the input tells us at which byte of the input the error is located, so we can mark it prominently with ^.

To complete this step, update the code to save argv[1] into user_input and replace calls such as error("expected a number") with error_at(token->str, "expected a number").

A production-quality compiler should also include tests for its behavior on erroneous input, but at this point the error messages exist only to aid debugging, so there is no need to write tests for them at this stage.

Reference implementation

Column: Source code formatter

Just as prose riddled with orthographic errors is hard to read, source code with broken indentation or inconsistent spacing becomes hard to read; code like that is not clean code. For inconsequential matters like code formatting, apply fixed mechanical rules so that the code can be read without distraction.

When developing with multiple people, you have to agree on a format, but in this book you are developing alone, so you can pick whichever of the major styles you prefer.

Some recently developed languages provide an official formatter to eliminate the inconsequential but contentious discussion of which format to choose. For example, the Go language has a command called gofmt that formats source code neatly. gofmt has no option to select a formatting style, so code can only be formatted in the official Go style. By deliberately offering no choice, Go completely solves the formatting problem.

There is a formatter called clang-format for C and C++, but this book does not specifically recommend the use of such tools. Instead of writing oddly formatted code and having to format it later, try writing code that looks consistent from the beginning.

Column: Security bug caused by indentation error

Incorrect indentation of source code has caused serious security issues in iOS and macOS. The code where the bug occurred is shown below.

if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
    goto fail;
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;

Can you spot the bug? At first glance this looks like ordinary code, but if you look closely, you will notice that the second goto statement from the bottom is not governed by any if statement and is therefore always executed.

Unfortunately, this code was in a function that validates TLS certificates. As a result, most of the certificate validation code was unconditionally skipped by that goto, causing iOS/macOS to accept invalid certificates as valid (allowing HTTPS site spoofing). The bug was discovered and fixed in 2014. It is known as the "goto fail" bug, after the duplicated goto fail statement that caused it.

Grammar description method and recursive descent parsing

Now we would like to add multiplication, division, and grouping parentheses, that is *, /, and (), to the language, but doing so poses one big technical challenge: by rule, multiplication and division in an expression must be computed first. For example, 1+2*3 must be interpreted as 1+(2*3), not as (1+2)*3. Rules about which operators "stick together" first are called "operator precedence".

How should we handle operator precedence? The compiler we have created so far just reads the token sequence from the beginning and outputs assembly, so if you simply extend it to add * and /, you will end up compiling 1+2*3 as (1+2)*3.

Existing compilers naturally handle operator precedence well. A compiler's parser is very powerful and correctly interprets arbitrarily complex code as long as it follows the syntax. This behavior may give the impression of an intelligence surpassing humans, but in reality computers cannot read text the way humans do; parsing must be carried out by some purely mechanical procedure. How exactly does it work?

In this chapter, let's take a break from coding and learn some parsing techniques. This chapter describes parsing techniques in the following order:

  1. First, understand the end goal by learning the data structure of a parser's output.
  2. Learn the notation used to define grammar rules.
  3. Learn techniques for writing a parser based on grammar rules.

Representation of grammatical structure using tree structure

In programming language parser implementations, the input is usually a flat sequence of tokens and the output is a tree representing a nested structure. The compiler created in this book also follows that structure.

In C, grammatical elements such as if and while can be nested inside one another, and a tree structure is a natural way to represent such nesting.

Numeric expressions have a structure in which things inside parentheses are computed first, and multiplication and division are computed before addition and subtraction. Such a structure may not look like a tree at first glance, but a tree is actually a very natural way to represent the structure of an expression. For example, the expression 1*(2+3) can be thought of as represented by the following tree.

A tree representing 1*(2+3)

If we adopt the rule of computing from the leaves of the tree upward, the tree above represents multiplying 1 by 2+3. In other words, the tree itself expresses the concrete evaluation order of 1*(2+3).

Here's another example. The tree below represents 7-3-3.

A tree representing 7-3-3

In the tree above, the rule that subtraction must be computed from the left is expressed explicitly in the shape of the tree. In other words, the tree represents the expression (7-3)-3 = 1, not 7-(3-3) = 7. If it represented the latter, the tree would be deeper on the right rather than on the left.

Operators that must be computed from the left are called "left-associative", and operators that must be computed from the right are called "right-associative". In C, most operators are defined as left-associative; assignment (=) is the main exception.

A tree structure can express an arbitrarily long expression by growing deeper. The next tree represents 1*2+3*4*5.

A tree representing 1*2+3*4*5

A tree like the ones above is called a "syntax tree". In particular, a syntax tree that is as compact as possible, with no redundant elements such as grouping parentheses left in the tree, is called an "abstract syntax tree" (AST). All of the syntax trees above qualify as abstract syntax trees.

Since the abstract syntax tree is an internal representation of the compiler, it can be defined however suits the implementation. However, since arithmetic operators such as addition and multiplication are defined as operating on two operands, a left-hand side and a right-hand side, it is natural for any compiler to use a binary tree for them. On the other hand, the statements in a function body, of which there may be any number executed in sequence, are naturally represented by a node whose children form a flat list.

The goal of parsing is to construct an abstract syntax tree. The compiler first parses the input token sequence into an abstract syntax tree, and then converts that tree into assembly.

Defining grammar using production rules

Now let's learn how the syntax of programming languages is written down. Most programming language syntax is defined using production rules, which are rules that define a grammar recursively.

Let's think about natural language for a moment. In Japanese, as in English, grammar has a nested structure. For example, if you replace the noun "flower" in the sentence "the flower is pretty" with the noun phrase "red flower", the sentence remains correct, and if you further expand "red" into "slightly red", it is still a correct sentence. You can also embed it in another sentence, as in "I thought the slightly red flowers were pretty."

Think of such grammars as being defined by rules like "a 'sentence' consists of a 'subject' and a 'predicate'" or "a 'noun phrase' consists of a 'noun', or of an 'adjective' followed by a 'noun phrase'". Then, starting from "sentence" and expanding according to the rules, you can generate an infinite number of sentences that are valid in the defined grammar.

Or, conversely, by thinking about the expansion procedure that matches an existing sentence, you can think about what kind of structure that string has.

These ideas were originally devised for natural languages, but because they fit the kind of data computers handle so well, production rules are used in many areas of computing, including programming languages.

Column: Chomsky's generative grammar

The idea of generative grammar was invented by the linguist Noam Chomsky. His ideas have had a huge impact on both linguistics and computer science.

According to Chomsky's hypothesis, humans can speak because they are born with brain circuitry specialized for acquiring production rules: the ability to acquire recursive language rules is what makes language possible. Nonhuman animals lack language, he argued, because that circuitry does not exist in their brains. Chomsky's claims have been neither proven nor disproven in the roughly 60 years since the hypothesis was published, but they are still considered quite persuasive.

Description of production rules using BNF

One notation for describing production rules compactly and understandably is BNF (Backus–Naur form) and its extended version, EBNF (Extended BNF). In this book, we will describe the syntax of C using EBNF. This section first explains BNF and then the EBNF extensions.

In BNF, each production rule is expressed in the form A = α₁α₂⋯. This means that the symbol A can be expanded into α₁α₂⋯. Here α₁α₂⋯ is a sequence of zero or more symbols, which may contain both symbols that cannot be expanded further and symbols that can (that is, symbols appearing on the left-hand side of some production).

A symbol that cannot be expanded further is called a "terminal symbol", and a symbol that appears on the left-hand side of some production rule and can be expanded is called a "nonterminal symbol". Grammars defined by such production rules are called "context-free grammars".

A nonterminal symbol may have multiple productions. For example, if both A = α₁ and A = α₂ exist as rules, A can be expanded either way, to α₁ or to α₂.

The right-hand side of a production rule can be empty; such a rule expands the symbol on the left-hand side to a zero-length symbol string (i.e., to nothing). However, simply omitting the right-hand side makes the rule hard to read, so the standard BNF convention is to write ε (epsilon) on the right-hand side as a symbol indicating that there is nothing. This book also follows that convention.

Strings are enclosed in double quotes, like "foo". Strings are always terminal symbols.

The above are the basic BNF rules. In addition to BNF rules, EBNF allows you to write complex rules concisely using the following symbols.

Notation   Meaning
A*         zero or more repetitions of A
A?         A or ε
A | B      A or B
( ... )    grouping

For example, given A = ("fizz" | "buzz")*, A is a string in which "fizz" or "buzz" is repeated zero or more times; that is, A can expand to the empty string, "fizz", "buzz", "fizzfizz", "fizzbuzz", "buzzfizz", "buzzbuzz", and so on.

Column: BNF and EBNF

Although plain, non-Extended BNF does not have concise notations such as *, ?, |, and ( ... ), the set of sentences that can be generated with BNF and with EBNF is the same. This is because EBNF can be converted to BNF by rewriting as follows.

EBNF                Corresponding BNF
A = α*              A = αA and A = ε
A = α?              A = α and A = ε
A = α | β           A = α and A = β
A = α (β₁β₂⋯) γ     A = α B γ and B = β₁β₂⋯

For example, when you use the production rules A = αA and A = ε to generate the sentence ααα, the order of expansion is A → αA → ααA → αααA → ααα.

In this way, notations such as * and ? are just shortcuts, but shorter notation is easier to read, so when a concise notation is available it is normal to use it and write the grammar compactly.

Simple production rules

As an example of writing a grammar using EBNF, consider the following production rule.

expr = num ("+" num | "-" num)*

Assume num is defined elsewhere as a symbol representing a number. In this grammar, an expr consists of one num first, followed by zero or more occurrences of "+" num or "-" num. This rule in fact describes the grammar of addition and subtraction expressions.

By starting from expr and expanding it, you can derive any string of additions and subtractions, such as 1, 10+5, or 42-30+2. Please check the expansion results below.

expr → num → "1"
expr → num "+" num → "10" "+" "5"
expr → num "-" num "+" num → "42" "-" "30" "+" "2"

In addition to using arrows to represent each expansion order, these expansion steps can also be represented in a tree structure. The syntax tree for the above expression is shown below.

(Figure: syntax tree of 1)
(Figure: syntax tree of 10+5)
(Figure: syntax tree of 42-30+2)

By representing it in a tree structure, it is now easier to understand which nonterminal symbol is expanded into which symbol.

A syntax tree like the ones shown above, which includes all the tokens in the input and corresponds one-to-one to the grammar, is sometimes called a "concrete syntax tree." The term is usually used in contrast to abstract syntax trees.

Note that the above concrete syntax trees do not express, in tree form, the rule that addition and subtraction are evaluated from the left. In the grammar described here, that rule is not written in EBNF; instead it is stated as a side note in the language specification, along the lines of "addition and subtraction are evaluated from left to right." The parser, taking both the EBNF and that note into account, reads the token sequence for an expression and constructs an abstract syntax tree that properly represents the expression's evaluation order.

Therefore, in the above grammar, the concrete syntax tree represented by EBNF and the abstract syntax tree output from the parser only roughly match. It is possible to define a grammar so that the abstract and concrete syntax trees have as much of the same structure as possible, but this would make the grammar redundant and make it difficult to understand how to write a parser. The above grammar is an easy-to-use method of expressing grammar that strikes a balance between the rigor of formal grammar description and the ease of understanding supplemented by natural language.

Expressing operator precedence using production rules

Production rules are very powerful tools for expressing grammar. Operator precedence can also be expressed in production rules by devising the grammar. Its syntax is shown below.

expr = mul ("+" mul | "-" mul)*
mul  = num ("*" num | "/" num)*

The previous grammar expanded expr directly to num, but this one expands expr to num via mul. mul is the production rule for multiplication and division, and addition and subtraction treat a whole mul as a single component. In this grammar, the rule that multiplication and division bind tighter is naturally expressed in the syntax tree. Let's look at some concrete examples.

(Figure: syntax tree of 1*2+3)
(Figure: syntax tree of 1+2*3)
(Figure: syntax tree of 1*2+3*4*5)

In the tree structures above, multiplication always appears closer to the leaves than addition. Since there is no rule that expands mul back to expr, there is no way to build a tree with addition below multiplication; it feels almost magical that such a simple pair of rules captures precedence so well in the tree structure. Readers are encouraged to verify the syntax trees by checking them against the production rules.

Productions involving recursion

Production rules can also be recursive. Below is a grammar that adds precedence parentheses to the four arithmetic operations.

expr    = mul ("+" mul | "-" mul)*
mul     = primary ("*" primary | "/" primary)*
primary = num | "(" expr ")"

Comparing this grammar with the previous one, primary now appears where only num was allowed before, and primary can be either num or "(" expr ")". In other words, in this new grammar, an expression enclosed in parentheses binds as tightly as a single number. Let's look at an example.

The following is the syntax tree for 1*2.

(Figure: syntax tree of 1*2)

The following is the syntax tree for 1*(2+3).

(Figure: syntax tree of 1*(2+3))

Comparing the two trees, only the expansion of the right branch of mul differs. The rule that primary, which appears at the leaves, can expand either to a single number or to an arbitrary expression in parentheses is clearly reflected in the tree structure. Isn't it somewhat impressive that parentheses' precedence can be handled with such a simple production rule?

Recursive descent parsing

Given the production rules of the C language, we can mechanically generate, by expanding them one after another, any C program that is correct from the grammar's point of view. What we want to do in 9cc, however, is the opposite: given a C program as a string from outside, we want to find the expansion procedure that produces that string, that is, the structure of the syntax tree whose leaves spell out the input.

In fact, for certain classes of production rules, given a rule it is possible to mechanically write code that finds a syntax tree matching a sentence generated by that rule. The "recursive descent parsing" technique described here is one such method.

As an example, let's consider the grammar of the four arithmetic operations. I will repost the grammar of the four arithmetic operations.

expr    = mul ("+" mul | "-" mul)*
mul     = primary ("*" primary | "/" primary)*
primary = num | "(" expr ")"

The basic strategy when writing a recursive descent parser is to map each nonterminal symbol directly to a function. The parser will therefore have three functions: expr, mul, and primary. Each function parses, from the token sequence, exactly the construct its name suggests.

Let's think about it in concrete terms. The input passed to the parser is a sequence of tokens. Since we want to create and return an abstract syntax tree from the parser, let's define the types of nodes in the abstract syntax tree. The node types are shown below.

// Kinds of abstract syntax tree nodes
typedef enum {
  ND_ADD, // +
  ND_SUB, // -
  ND_MUL, // *
  ND_DIV, // /
  ND_NUM, // integer
} NodeKind;

typedef struct Node Node;

// Abstract syntax tree node type
struct Node {
  NodeKind kind; // node kind
  Node *lhs;     // left-hand side
  Node *rhs;     // right-hand side
  int val;       // used only if kind is ND_NUM
};

lhs and rhs stand for left-hand side and right-hand side, respectively.

Let's also define functions to create new nodes. This grammar has two kinds of nodes: binary operators, which take left and right sides, and numbers, so we prepare one function for each of the two kinds.

Node *new_node(NodeKind kind, Node *lhs, Node *rhs) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = kind;
  node->lhs = lhs;
  node->rhs = rhs;
  return node;
}

Node *new_node_num(int val) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = ND_NUM;
  node->val = val;
  return node;
}

Now let's write the parser using these functions and data types. + and - are left-associative operators. A function that parses a left-associative operator follows this pattern:

Node *expr() {
  Node *node = mul();

  for (;;) {
    if (consume('+'))
      node = new_node(ND_ADD, node, mul());
    else if (consume('-'))
      node = new_node(ND_SUB, node, mul());
    else
      return node;
  }
}

consume is the function defined in the previous step: if the next token in the input stream matches the argument, it advances by one token and returns true.

Read the expr function carefully. You can see that the production rule expr = mul ("+" mul | "-" mul)* maps directly onto the function calls and the loop. In the abstract syntax tree returned by expr, the operators are left-associative, meaning the tree grows deeper toward the left of the returned node.

Let's also define mul, the function expr uses. * and / are also left-associative operators, so they can be written with the same pattern. The function is shown below.

Node *mul() {
  Node *node = primary();

  for (;;) {
    if (consume('*'))
      node = new_node(ND_MUL, node, primary());
    else if (consume('/'))
      node = new_node(ND_DIV, node, primary());
    else
      return node;
  }
}

The function calls in the code above correspond directly to the production rule mul = primary ("*" primary | "/" primary)*.

Finally, let's define primary. The rule primary = num | "(" expr ")" does not involve a left-associative operator, so primary is written in a different pattern from the functions above.

Node *primary() {
  // If the next token is "(", it should be "(" expr ")"
  if (consume('(')) {
    Node *node = expr();
    expect(')');
    return node;
  }

  // Otherwise it should be a number
  return new_node_num(expect_number());
}

Now that all the functions are in place, can this really parse a token string? It may not be obvious at first glance, but these functions do parse token sequences correctly. As an example, consider the expression 1+2*3.

The first function called is expr. expr begins reading the input on the assumption that the whole input is an expression (which in this case it is). Function calls then proceed expr → mul → primary, the token 1 is read, and a syntax tree representing 1 is returned up to expr.

Then consume('+') in expr returns true, so the + token is consumed and mul is called again. The remaining input at this point is 2*3.

As before, mul calls primary, which reads the token 2, but this time mul does not return immediately: consume('*') in mul returns true, so mul calls primary again, which reads the token 3. The result is that mul returns a syntax tree representing 2*3.

Back in expr, the syntax tree representing 1 and the syntax tree representing 2*3 are combined into a syntax tree representing 1+2*3, which becomes the return value of expr. In other words, 1+2*3 was parsed correctly.

The diagram below shows the function call relationships and the tokens each function reads. In the diagram there is one expr, representing the call that reads the entire input 1+2*3. There are two muls: one represents the call that reads 1, the other the call that reads 2*3.

(Figure: function call relationships when parsing 1+2*3)

A slightly more complex example is shown below: the function call relationships when parsing 1*2+(3+4).

(Figure: function call relationships when parsing 1*2+(3+4))

For programmers unfamiliar with recursion, recursive functions like the ones above may seem confusing. Honestly, even to me, who should be quite familiar with recursion, it still feels like a kind of magic that this code works. Recursive code can feel strange even when you know exactly how it works; perhaps that is simply its nature. Try tracing the code in your head repeatedly until you are convinced it works.

A parsing method that maps each production rule to one function, as above, is called "recursive descent parsing." The parser above looks ahead only one token to decide which function to call or whether to return; a recursive descent parser that looks ahead at most one token is called an LL(1) parser, and a grammar that an LL(1) parser can handle is called an LL(1) grammar.

Stack machine

In the previous chapter, we explained an algorithm for converting a token sequence into an abstract syntax tree. By choosing a grammar that takes operator precedence into account, we can now build abstract syntax trees in which * and / always sit lower in the tree than + and -. But how should such a tree be converted to assembly? This chapter explains how.

First, let's consider why multiplication cannot be converted to assembly the same way as addition and subtraction. The compiler that handled only addition and subtraction used RAX as the result register and performed its additions and subtractions there. In other words, the compiled program held only one intermediate result at a time.

However, when multiplication and division are involved, there is no guarantee that there will be only one intermediate calculation result. Consider 2*3+4*5 as an example. In order to perform addition, both sides must be calculated, so 2*3 and 4*5 must be calculated before addition. In other words, in this case, the entire calculation cannot be performed unless two intermediate calculation results can be stored.

A computer called a "stack machine" can perform calculations like this with ease. Let's set the parser's abstract syntax tree aside for a moment and learn about stack machines.

Stack machine concept

A stack machine is a computer that has a stack as a data storage area. Therefore, in a stack machine, the two basic operations are "pushing onto the stack" and "popping from the stack." A push puts a new element on top of the stack. Pop removes an element from the top of the stack.

Arithmetic instructions in a stack machine operate on the elements at the top of the stack. For example, a stack machine's ADD instruction pops two elements from the top of the stack, adds them, and pushes the result back onto the stack (to avoid confusion with x86-64 instructions, the virtual stack machine's instructions are written in all capitals). In other words, ADD replaces the two elements at the top of the stack with the single element obtained by adding them.

Likewise, the SUB, MUL, and DIV instructions, like ADD, replace the two elements at the top of the stack with the single element obtained by subtracting, multiplying, or dividing them.

The PUSH instruction pushes its argument onto the top of the stack. Although it is not used here, we could also define a POP instruction that removes the top element of the stack and discards it.

Now, let's consider calculating 2*3+4*5 using these instructions. Using the stack machine defined above, you should be able to calculate 2*3+4*5 with the following code.

// Compute 2*3
PUSH 2
PUSH 3
MUL

// Compute 4*5
PUSH 4
PUSH 5
MUL

// Compute 2*3 + 4*5
ADD

Let's take a closer look at this code. Assume the stack already holds some values; since they don't matter here, we show them as "...". In the diagrams, the stack grows from top to bottom, so the bottom line is the top of the stack.

The first two PUSHes put 2 and 3 on the stack, so by the time the first MUL executes, the stack looks like this:

...
2
3

MUL removes the two values at the top of the stack, 3 and 2, and pushes their product, 6, onto the stack. Therefore, after MUL executes, the stack looks like this:

...
6

The next two PUSHes put 4 and 5 on the stack, so just before the second MUL runs, the stack looks like this:

...
6
4
5

Running MUL here removes 5 and 4 and replaces them with their product, 20. So after MUL runs:

...
6
20

Notice how the results of 2*3 and 4*5 are now sitting neatly on the stack. Running ADD in this state computes 6+20 and pushes the result onto the stack, so the stack ends up in the following state:

...
26

If we assume that the stack machine's calculation result is the value remaining at the top of the stack, then 26 is the result of 2*3+4*5, which means that the formula has been calculated correctly.

Stack machines can evaluate not only this expression but any expression with multiple intermediate results. As long as every subexpression keeps the promise that executing its code leaves exactly one element, its value, on the stack, subexpressions of any shape can be compiled in the way shown above.

Column: CISC and RISC

x86-64 is an instruction set that evolved gradually from the 8086, released in 1978, and is a typical "CISC" processor. The characteristics of CISC processors are that machine-language operations can take memory addresses as well as registers as operands, that instruction lengths vary, and that there are many complex instructions, convenient for assembly programmers, which accomplish in one instruction what would otherwise take several.

"RISC" was invented in the 1980s as a counterpoint to CISC. The characteristics of RISC processors are that operations are performed only between registers, that the only memory operations are loads into registers and stores from registers, that all machine-language instructions have the same length, and that instead of complex instructions convenient for assembly programmers there are only simple instructions meant to be generated by compilers.

x86-64 is one of the few survivors of CISC, and almost all major processors other than x86-64 are based on RISC. Specifically, ARM, PowerPC, SPARC, MIPS, RISC-V (Risk Five), etc. are all RISC processors.

RISC does not have operations between memory and registers like x86-64. There are also no register aliases. There are also no rules that say a particular integer register is used in a special way by a particular instruction. To modern eyes, where such instruction sets have become mainstream, the x86-64 instruction set looks old-fashioned.

RISC processors' simple design made them easy to speed up, and they took the processor industry by storm. So why did x86-64 survive? Because there was a huge market need for fast x86 processors that could leverage existing software assets, and because Intel and the makers of Intel-compatible chips kept innovating to meet that need. Internally, the CPU's instruction decoder converts x86 instructions into RISC-like instructions, making x86 effectively a RISC processor on the inside. This made it possible to apply the same techniques that made RISC fast to x86.

Compile to stack machine

This section describes how to convert an abstract syntax tree into stack machine code. If you can do that, you will be able to parse an expression consisting of four arithmetic operations, assemble an abstract syntax tree, compile it into a stack machine using x86-64 instructions, and execute it. In other words, it will be possible to write a compiler that can perform the four arithmetic operations.

In a stack machine, when you compute a subexpression, one value of the result, whatever it is, remains on the top of the stack. For example, consider the tree below.

(Figure: abstract syntax tree representing addition)

Here A and B are abstract stand-ins for subtrees; each is actually a node of some kind. However, the specific kind or shape of each subtree does not matter when compiling the tree as a whole. To compile this tree, do the following:

  1. Compile the left subtree
  2. Compile the right subtree
  3. Output code that replaces the two values on the stack with their sum

After executing the code from step 1, whatever it is, a single value representing the result of the left subtree should be on top of the stack. Similarly, after the code from step 2, a single value representing the right subtree's result should be on top. To compute the value of the whole tree, we then just need to replace those two values with their sum.

In this way, when compiling an abstract syntax tree for a stack machine, we think recursively and emit assembly while descending the tree. This may seem a bit difficult to readers unfamiliar with recursion, but it is the standard technique for working with self-similar data structures such as trees.

Let's consider the example below.

(Figure: abstract syntax tree representing addition and multiplication)

The function that performs the code generation receives the root node of the tree.

Following the steps above, the first thing the function does is compile the left subtree. In other words, we will be compiling the number 2. The result of calculating 2 is 2, so the compilation result of that subtree is PUSH 2.

The code generation function then moves on to the right subtree. Compiling its left side recursively outputs PUSH 3; compiling its right side outputs PUSH 4.

Returning from the recursive calls, the code generation function outputs code matching each subtree's operator: first the code that replaces the top two stack elements with their product, then the code that replaces the top two elements with their sum. The result is the assembly below.

PUSH 2
PUSH 3
PUSH 4
MUL
ADD

Using this method, abstract syntax trees can be mechanically reduced to assembly.

How to create a stack machine on x86-64

So far we have been talking about virtual stack machines. The actual x86-64 is a register machine, not a stack machine. x86-64 operations are usually defined to operate between two registers, not the two values ​​at the top of the stack. So in order to use stack machine techniques on x86-64, you need to emulate the stack machine in some sense with a register machine.

Emulating a stack machine with a register machine is relatively easy. What would be one instruction on a stack machine can be implemented using multiple instructions.

Let's explain a specific method for doing so.

First, prepare one register that points to the top element of the stack; this register is called the stack pointer. To pop the two values at the top of the stack, read the two elements the stack pointer points to, then adjust the stack pointer by the size of the elements removed. Conversely, to push, adjust the stack pointer and write the value to the memory it points to.

The x86-64 RSP register is designed to be used as a stack pointer: instructions such as push and pop implicitly use RSP as the stack pointer, modifying its value and accessing the memory it points to. So when using the x86-64 instruction set as a stack machine, it is natural to use RSP as the stack pointer. Now, let's compile the expression 1+2 using x86-64 as a stack machine. Below is the x86-64 assembly.

// Push the left-hand side and the right-hand side
push 1
push 2

// Pop the left and right sides into RAX and RDI and add them
pop rdi
pop rax
add rax, rdi

// Push the result of the addition onto the stack
push rax

Since x86-64 has no instruction to "add the two elements RSP points to," we pop the values into registers, add them there, and push the result back onto the stack. That is what the code around add above is doing.

Similarly, implementing 2*3+4*5 on x86-64 looks like this:

// Compute 2*3 and push the result onto the stack
push 2
push 3

pop rdi
pop rax
imul rax, rdi
push rax

// Compute 4*5 and push the result onto the stack
push 4
push 5

pop rdi
pop rax
imul rax, rdi
push rax

// Add the two values at the top of the stack,
// i.e. compute 2*3+4*5
pop rdi
pop rax
add rax, rdi
push rax

In this way, by using x86-64 stack manipulation instructions, you can run code that is quite similar to a stack machine even on x86-64.

The following genfunction implements this method as a C function.

void gen(Node *node) {
  if (node->kind == ND_NUM) {
    printf("  push %d\n", node->val);
    return;
  }

  gen(node->lhs);
  gen(node->rhs);

  printf("  pop rdi\n");
  printf("  pop rax\n");

  switch (node->kind) {
  case ND_ADD:
    printf("  add rax, rdi\n");
    break;
  case ND_SUB:
    printf("  sub rax, rdi\n");
    break;
  case ND_MUL:
    printf("  imul rax, rdi\n");
    break;
  case ND_DIV:
    printf("  cqo\n");
    printf("  idiv rdi\n");
    break;
  }

  printf("  push rax\n");
}

Although this point matters little for parsing or code generation, the idiv instruction used above has a tricky specification, so let's explain it.

idiv performs signed division. If x86-64's idiv had a straightforward specification, the code above could simply have been written idiv rax, rdi, but no such two-register division instruction exists in x86-64. Instead, idiv implicitly treats RDX and RAX together as a single 128-bit integer, divides it by the 64-bit value in the argument register, puts the quotient in RAX, and puts the remainder in RDX. The cqo instruction sign-extends the 64-bit value in RAX to 128 bits spread across RDX and RAX, which is why the code above executes cqo before idiv.

Well, this concludes the explanation of stack machines. By now, you should be able to perform complex parsing and translate the resulting abstract syntax tree into machine code. Let's go back to writing a compiler to put that knowledge to use!

Column: Optimizing compiler

The x86-64 assembly used for illustration in this chapter may look quite inefficient. For example, a push that puts a number on the stack immediately followed by a pop of that value into a register could be a single mov instruction. Some readers will feel the urge to optimize away such redundancy, but resist the temptation: at the very beginning of code generation, it is better to emit redundant code in exchange for a simple compiler implementation.

Optimization passes can be added to 9cc later if necessary. It is not difficult to rescan the generated assembly and replace instructions that appear in particular patterns with better ones. For example, rules such as "replace a push immediately followed by a pop with a mov" or "fold consecutive adds of immediate values to the same register into a single add of their sum" can be applied mechanically, replacing redundant code with more efficient code without changing its meaning.

Mixing code generation and optimization complicates the compiler; it is actually easier to add optimization passes afterwards than to emit clever code from the start. As Donald Knuth said, "premature optimization is the root of all evil." In the compiler you write, consider only ease of implementation, and don't worry about obvious verbosity in the output; it can be removed later.

Step 5: Create a language that can perform four arithmetic operations

In this chapter, we modify the compiler built in the previous chapters and extend it to handle four-arithmetic expressions including precedence parentheses. All the necessary pieces already exist, so only a small amount of new code is needed: modify the compiler's main function to use the newly created parser and code generator. The code should look like this.

int main(int argc, char **argv) {
  if (argc != 2) {
    error("wrong number of arguments");
    return 1;
  }

  // Tokenize and parse
  user_input = argv[1];
  token = tokenize(user_input);
  Node *node = expr();

  // Output the first half of the assembly
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // Generate code while descending the abstract syntax tree
  gen(node);

  // The value of the entire expression should be left on top of
  // the stack, so load it into RAX and return it from the function
  printf("  pop rax\n");
  printf("  ret\n");
  return 0;
}

At this point, expressions consisting of addition, subtraction, multiplication, division, and precedence parentheses should compile correctly. Let's add some tests.

assert 47 '5+6*7'
assert 15 '5*(9-6)'
assert 4 '(3+5)/2'

For convenience of explanation, the story so far has proceeded as if *, /, and () were implemented all at once, but in practice it is better not to implement them in one step. Since we already had a compiler that could do addition and subtraction, first introduce the abstract syntax tree and a code generator based on it without breaking that functionality; since no new features are added at that point, no new tests are needed. After that, implement *, /, and (), including tests.

Reference implementation

Column: Memory management in 9cc

If you have read this far in this book, you may be wondering how memory management works in this compiler. The code we've seen so far uses calloc (a variant of malloc), but it doesn't call free. In other words, the allocated memory will not be freed. Isn't this a bit of an oversight?

In reality, this design of ``setting the memory management policy to not perform memory management'' was intentionally chosen by the author after considering various trade-offs.

The advantage of this design is that by not freeing memory, you can write code as if it were a language with a garbage collector. This not only eliminates the need to write memory management code, but also eliminates the mysterious bugs associated with manual memory management.

On the other hand, running on an ordinary PC-class machine, not freeing causes virtually no problems in practice. A compiler is a short-lived program that simply reads a single C file and outputs assembly, and all allocated memory is automatically released by the OS when the program exits. The only real question is how much memory is allocated in total, and by my measurements, even compiling a fairly large C file uses only about 100MiB. Not freeing is therefore a realistically effective strategy. For example, the D language compiler DMD adopts the same policy of only mallocing and never freeing.

Step 6: Unary plus and unary minus

The subtraction operator - can be written not only between two terms, as in 5-3, but also before a single term, as in -3. Similarly, + can be written with its left-hand side omitted, as in +3. An operator that takes only one operand like this is called a "unary operator," while an operator that takes two operands is called a "binary operator."

Besides + and -, C has other unary operators, such as & for taking an address and * for dereferencing a pointer, but in this step we implement only + and -.

Unary +and unary have the same symbol as -binomial +and -, but have different definitions. Binary -is defined as an operation that subtracts the right side from the left side, but -since unary has no left side in the first place, the binary -definition does not make sense as it is. In C, a unary -is defined as an operation that reverses the sign of the right-hand side. A unary +operator is an operator that returns the right-hand side as is. This is an operator that doesn't really need to be used, but it -exists when there is a unary.

It is best to think of + and - as pairs of operators with the same name but similar-yet-different definitions: the unary and binary versions. Whether an occurrence is unary or binary depends on the context. The new grammar including unary + and - looks like this:

expr    = mul ("+" mul | "-" mul)*
mul     = unary ("*" unary | "/" unary)*
unary   = ("+" | "-")? primary
primary = num | "(" expr ")"

The new grammar above adds a new nonterminal symbol, unary, and mul now uses unary instead of primary. X? is EBNF syntax for an optional element, i.e. X occurs zero or one time. The rule unary = ("+" | "-")? primary thus says that the nonterminal unary is a primary optionally preceded by a single + or -.

Check that expressions such as -3 and -(3+5) match the new grammar. The syntax tree for -3*+5 is shown below.

-3*+5 syntax tree

Let's change the parser to follow this new grammar. As usual, the parser change comes down to mapping the grammar directly onto function calls. The function that parses unary is shown below.

Node *unary() {
  if (consume('+'))
    return primary();
  if (consume('-'))
    return new_node(ND_SUB, new_node_num(0), primary());
  return primary();
}

Here, I decided to replace +x with x and -x with 0-x at the parsing stage. As a result, no code generator changes are required for this step.

This step is completed by writing a few tests and checking them in along with the code that adds unary + and -. When writing tests, keep the results in the range 0-255. Use expressions like -10+20, which contain a unary minus but whose overall value is positive.

Reference implementation

Column: Unary plus and the pros and cons of grammar

The unary + operator was not present in the original C compiler; it was officially added to the language when C was standardized by ANSI (the American National Standards Institute) in 1989. Since unary - exists, having unary + as well certainly gives the grammar more symmetry, but in practice unary + has no particular use.

On the other hand, adding unary + to the grammar has a side effect. Suppose someone new to C mistypes the += operator and writes i =+ 3. Without unary +, this is simply an invalid expression, but because of unary +, it is interpreted the same as i = +3, and the compiler silently accepts it as a legal assignment expression that assigns 3 to i.

The ANSI C standardization committee understood this issue and still decided to add unary + to the language. What do you think? If you had been on the C standards committee at the time, would you have supported it or opposed it?

Step 7: Comparison operators

In this section, we implement the comparison operators <, <=, >, >=, == and !=. These operators may appear to have special meaning, but they are actually ordinary binary operators that take two integers and return one, just like + and -. Just as + returns the result of adding its two sides, == returns 1 if its two sides are equal and 0 if they differ.

Tokenizer changes

The symbol tokens we have handled so far were all one character long, and our code assumed as much, but to handle comparison operators such as ==, we need to generalize the code. Let's add a member len to the Token structure so that a token can record the length of its string. The new Token structure type is shown below.

struct Token {
  TokenKind kind; // Token kind
  Token *next;    // Next input token
  int val;        // If kind is TK_NUM, its value
  char *str;      // Token string
  int len;        // Token length
};

This change requires updating functions such as consume and expect so that they take strings rather than characters. Here is consume after the change:

bool consume(char *op) {
  if (token->kind != TK_RESERVED ||
      strlen(op) != token->len ||
      memcmp(token->str, op, token->len))
    return false;
  token = token->next;
  return true;
}

When tokenizing, symbols consisting of multiple characters must be recognized before shorter ones. For example, if the rest of the input starts with >=, you must first check for >= (for example with strncmp(p, ">=", 2)) before checking for >; otherwise >= would be tokenized as > followed by =.
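That ordering can be sketched as follows (startswith and punct_len are hypothetical helpers for illustration, not code from this chapter):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

// Hypothetical helper: does p start with the string q?
static bool startswith(char *p, char *q) {
  return strncmp(p, q, strlen(q)) == 0;
}

// Return the length of the punctuator at the start of p (0 if none).
// Two-character operators are tested before one-character ones;
// otherwise ">=" would be split into ">" followed by "=".
static int punct_len(char *p) {
  if (!*p)
    return 0;
  if (startswith(p, "==") || startswith(p, "!=") ||
      startswith(p, "<=") || startswith(p, ">="))
    return 2;
  if (strchr("+-*/()<>", *p))
    return 1;
  return 0;
}
```

The tokenizer would then create the token with this length and advance the input cursor by the same amount.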

New grammar

To add support for comparison operators to the parser, consider what the grammar looks like with them added. Listing the operators that have appeared so far from lowest to highest precedence gives:

  1. == !=
  2. < <= > >=
  3. + -
  4. * /
  5. unary + unary -
  6. ()

Precedence can be expressed in a generative grammar by mapping operators of different precedence to different nonterminals. Thinking in the same way as we did for expr and mul, the new grammar with comparison operators added is as follows.

expr       = equality
equality   = relational ("==" relational | "!=" relational)*
relational = add ("<" add | "<=" add | ">" add | ">=" add)*
add        = mul ("+" mul | "-" mul)*
mul        = unary ("*" unary | "/" unary)*
unary      = ("+" | "-")? primary
primary    = num | "(" expr ")"

equality represents == and !=, and relational represents <, <=, > and >=. These nonterminals can be mapped directly onto functions using the same pattern we have been using to parse left-associative operators.

Note that in the above grammar, expr and equality are separated in order to express that an expression as a whole is an equality. I could have written the right-hand side of equality directly as the right-hand side of expr, but I think the grammar above is easier to read.
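To see the shared left-associative loop shape outside the compiler, here is a self-contained toy (not this book's parser: it evaluates values directly instead of building a syntax tree) with one function per precedence level:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Toy evaluator over a grammar with two precedence levels:
//   relational = add ("<" add | "<=" add | ">" add | ">=" add)*
//   add        = num ("+" num | "-" num)*
static char *p; // cursor into the input string

static int consume(char *op) {
  int len = strlen(op);
  if (strncmp(p, op, len))
    return 0;
  p += len;
  return 1;
}

static long add(void) {
  long v = strtol(p, &p, 10);
  for (;;) {
    if (consume("+"))
      v += strtol(p, &p, 10);
    else if (consume("-"))
      v -= strtol(p, &p, 10);
    else
      return v;
  }
}

static long relational(void) {
  long v = add();
  for (;;) {
    // Two-character operators must be tried before one-character ones.
    if (consume("<="))
      v = (v <= add());
    else if (consume(">="))
      v = (v >= add());
    else if (consume("<"))
      v = (v < add());
    else if (consume(">"))
      v = (v > add());
    else
      return v;
  }
}

static long eval(char *src) {
  p = src;
  return relational();
}
```

Note that relational tries the two-character operators before the one-character ones, just as the tokenizer must.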

Column: Simple and verbose code vs. advanced and concise code

In recursive descent parsing, you write code that corresponds almost directly to the production rules, so functions that parse similar rules end up looking similar. The equality, relational, add, and mul functions we have written so far should likewise resemble one another.

It is natural to wonder whether the common pattern in these functions could be abstracted using metaprogramming techniques such as C macros, C++ templates, higher-order functions, or code generation. In fact it is possible. However, this book deliberately does not do so, for the following reasons.

Simple code is easy to understand, even if it is somewhat verbose. If you end up making similar changes to similar functions later on, it's actually not that big of a deal. Highly abstracted code, on the other hand, can be difficult to understand because you first need to understand the abstraction mechanism and then how to use it. For example, if I had started this book by writing a function that uses metaprogramming to generate functions for recursive descent parsing, this book would have been much more difficult.

You don't always have to aim for elegant, concise code. When you do, you tend to end up making the code hard to understand, to the point where it could not be made any harder.

The person who writes the code becomes an expert on that code, and so tends to think that code that looks concise and lean from an expert's point of view is good code. But most readers of the code are not as familiar with it as the author, and there is no reason they should have to be, so you need to question your own sense as the code's author to some extent. One important technique for writing programs that are easy to understand and maintain is to deliberately write plain code, even where something cleverer could be written.

Assembly code generation

On x86-64, comparisons are done using the cmp instruction. The code that pops two integers from the stack, performs a comparison, and sets RAX to 1 if they are identical, and 0 otherwise, looks like this:

pop rdi
pop rax
cmp rax, rdi
sete al
movzb rax, al

Although short, this assembly is rather dense, so let's walk through the code step by step.

The first two lines pop values off the stack. The third line compares the popped values. Where does the comparison result go? On x86-64, the result of a comparison instruction is set in a special "flags register". The flags register is updated every time an integer arithmetic or comparison instruction executes, and it contains a bit indicating whether the result was 0, a bit indicating whether an overflow occurred, a bit indicating whether the result was negative, and so on.

The flags register is not an ordinary integer register, so if you want to put the comparison result into RAX, you need to copy the relevant bit of the flags register into RAX. The sete instruction does that: it sets the specified register (here AL) to 1 if the two values examined by the preceding cmp instruction were equal, and to 0 otherwise.

AL is a register name that has not appeared in this book so far, but AL is simply another name for the lower 8 bits of RAX. Therefore, when sete stores a value in AL, RAX is updated automatically as well. However, when RAX is updated via AL, its upper 56 bits keep their previous value, so if you want all of RAX to become 0 or 1, the upper 56 bits must be cleared to zero. The movzb instruction does that. It would be convenient if sete could write directly to RAX, but sete is specified to take only an 8-bit register as its operand, so the comparison sequence uses two instructions like this to set a value in RAX.

You can implement the other comparison operators by using other setcc instructions in place of sete: setl for <, setle for <=, and setne for !=.

> and >= do not need to be supported by the code generator. Simply swap the two sides in the parser and treat them as < and <=.
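Putting the pieces together, the comparison case of the code generator might be sketched like this (setcc and gen_cmp are illustrative names, not this book's reference implementation; > and >= are absent because the parser swaps them into < and <=):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Map a comparison operator to the setcc mnemonic used after cmp.
static const char *setcc(const char *op) {
  if (!strcmp(op, "==")) return "sete";
  if (!strcmp(op, "!=")) return "setne";
  if (!strcmp(op, "<"))  return "setl";
  if (!strcmp(op, "<=")) return "setle";
  return NULL; // ">" and ">=" were already swapped away in the parser
}

// Emit the three-instruction sequence: compare RAX with RDI,
// set AL from the flags, then zero-extend AL into RAX.
static void gen_cmp(const char *op) {
  printf("  cmp rax, rdi\n");
  printf("  %s al\n", setcc(op));
  printf("  movzb rax, al\n");
}
```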

Reference implementation

Column: Flag registers and hardware

This x86-64 design, in which the result of a value comparison is implicitly stored in a special register distinct from the ordinary integer registers, may seem confusing at first. Indeed, some RISC processors deliberately avoid having a flags register and instead use instruction sets that place the result of a comparison in an ordinary register. RISC-V, for example, is such an instruction set.

However, from a hardware implementer's point of view, a flags register is very easy to build, at least in a simple implementation. When performing an integer operation, you simply branch the wires carrying the result and connect them to a little extra logic: check whether the result is zero (all lines are 0) or negative (the line for the top bit is 1), and set the corresponding bits in the flags register. That is exactly how a CPU with a flags register is implemented, and every time an integer operation executes, the flags register is updated along with it.

In such a design, the flags register is updated not only by cmp but also by add, sub, and so on. In fact, cmp is essentially a special sub that updates only the flags register. If you executed sub rax, rdi and then looked at the flags register, you could learn the relationship between RAX and RDI, but that would also overwrite RAX, so cmp is provided as a variant of sub that does not write to the integer register.

In software, ``computing something extra at the same time'' always costs extra time, but in hardware, branching wires and using a few extra transistors does not in itself incur a time penalty. So in a naive hardware implementation, updating the flags register on every operation costs essentially nothing.


Separate compilation and linking

Up until this stage, development had proceeded with a file structure consisting of only one C file and one test shell script. There's nothing wrong with this structure, but since the source code is getting longer and longer, I decided to split it into multiple C files to make it easier to read. In this step, we will split one file, 9cc.c, into the following five files.

The main function is small, so I could have put it into one of the other C files, but since semantically it belongs to neither parse.c nor codegen.c, I decided to give it a separate file.

This chapter explains the concept of separate compilation and its significance, and then explains the specific steps.

What is separate compilation?

Separate compilation and its necessity

Separate compilation means writing a program split across multiple source files and compiling them separately. With separate compilation, the compiler reads a fragment of the program and outputs a corresponding fragment, rather than processing the entire program at once. A file containing a program fragment that cannot be executed on its own is called an "object file" (extension .o). In separate compilation, the object files are finally joined together into a single executable file. The program that combines object files into one executable is called a linker.

Let's understand why we need separate compilation. Actually, technically there is no necessity to split the source. If you give the compiler all the source code at once, it is theoretically possible for the compiler to output a complete executable without the aid of a linker.

However, such an approach requires the compiler to see literally all the code the program uses. For example, standard library functions such as printf are usually ordinary functions written in C by the standard library's authors, but to avoid the linking step you would have to feed the source code of such functions to the compiler every single time. Compiling the same functions over and over is usually just a waste of time. For that reason, standard libraries are normally distributed as precompiled object files, so that you don't have to recompile them each time. In other words, even a program consisting of a single source file is in fact using separate compilation, as long as it uses the standard library.

Without separate compilation, changing a single line would require recompiling all the code. Compiling tens of thousands of lines of code can take tens of seconds. A large project can exceed 10 million lines of source code, so compiling it as a single unit could take more than a day and require memory on the order of 100 GiB. Such a build procedure is unrealistic.

Another problem is that simply writing all functions and variables in one file makes it difficult for humans to manage.

Separate compilation is necessary for the reasons mentioned above.

Column: Linker history

The linker's job of stringing together multiple fragments of machine language into a single program has been needed since the dawn of computing. In 1947, John Mauchly (project leader of ENIAC, the first digital computer) wrote a program that relocated subprograms read from tape and combined them into a single program.

Even on the earliest computers it was desirable to write a general-purpose subroutine once and use it from many programs, and that required a linker to combine program fragments into an executable program. Since in 1947 assemblers were not yet in use and code was written directly in machine language, programmers actually wanted a linker before they wanted an assembler.

The necessity of header files and their contents

With separate compilation, the compiler sees only part of the program's code. That said, the compiler cannot compile just any fragment of a program. For example, consider the following code.

void print_bar(struct Foo *obj) {
  printf("%d\n", obj->bar);
}

In the above code, if the type of struct Foo is known, the compiler can output the assembly corresponding to this code; otherwise, it cannot compile this function.

With separate compilation, each C file must contain enough information for that file to be compiled on its own. However, if you copy in all the code from the other files, it is no longer separate compilation, so the information has to be pared down to some extent.

As an example, consider what information needs to be included in order to output code that calls a function in another C file. The compiler requires the following information:

Putting these requirements together: if we take the function and omit its body { ... }, we have exactly enough information left to call it. A function with its body omitted is called a "declaration" of the function. A declaration only tells the compiler the type and the name; it does not contain the function's code. For example, below is the declaration of strncmp:

int strncmp(const char *s1, const char *s2, size_t n);

By looking at the above line, the compiler can learn of strncmp's existence and its type. A declaration that also includes the function's code is called a "definition".

You can also attach the extern keyword, which marks a declaration, to a function declaration, as in

extern int strncmp(const char *s1, const char *s2, size_t n);

but in the case of functions, declarations and definitions can already be distinguished by whether the function body is omitted, so extern is not required.

Note that only the types of the parameters matter, so parameter names can be omitted in a declaration, but it is common to write them anyway to make the declaration easier for humans to read.

Consider struct types as another example. If there are two or more C files that use the same structure, you must write a declaration for the same structure in each C file. If a structure is used only in one C file, other C files do not need to know about its existence.

In C, the declarations needed when compiling other C files are collected in a header file (extension .h). If you write a declaration in foo.h and write #include "foo.h" in another C file that needs it, that line will be replaced with the contents of foo.h.
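As a minimal sketch with hypothetical names (shown as a single listing for brevity; in a real project the first part would live in foo.h and the second in foo.c), the division of labor looks like this:

```c
#include <assert.h>

// --- what would go in foo.h: declarations only, no code, no storage ---
int add2(int a, int b);   // function declaration: name and type only
extern int call_count;    // global variable declaration

// --- what would go in foo.c: the one place holding the definitions ---
int call_count = 0;

int add2(int a, int b) {
  call_count++;
  return a + b;
}
```

Any C file that includes the header can call add2 and read call_count; only the linker needs to see the definitions.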

typedefs and the like are also used to give the compiler type information. If they are used in multiple C files, they too must be written in the header file.

The compiler does not output any assembly when it reads the declaration. A declaration is the information necessary to use a function or variable contained in another file, and does not itself define the function or variable.

Based on this account of separate compilation, you can now see what is actually going on behind the common advice ``when using printf, write #include <stdio.h> as an incantation''. The C standard library is implicitly passed to the linker, so the linker can link object files containing printf calls into an executable. The compiler, on the other hand, knows nothing about printf by default: printf is not a built-in function, and there is no rule that standard library headers are loaded automatically, so immediately after startup the compiler knows nothing about printf. Starting from that state, including the header file that ships with the C standard library teaches the compiler of printf's existence and its type, and the compiler becomes able to compile calls to printf.

Column: One-pass compiler and forward declaration

In C, declarations are sometimes required even when all functions are written in a single file. The C language specification allows the compiler to compile each function one at a time from the top of the file, without reading the whole file first. Consequently, every function must be compilable using only the information that appears before it in the file. As a result, if you want to use a function defined later in the file, you must write a declaration for it in advance. Such a declaration is called a "forward declaration".

You can avoid writing most forward declarations by adjusting the order in which functions are written in files, but forward declarations are essential if you want to write functions that are mutually recursive.
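For instance, two mutually recursive functions cannot both appear first in the file, so whichever is defined later needs a forward declaration (hypothetical example):

```c
#include <assert.h>

int is_odd(int n); // forward declaration: is_even calls it before its definition

int is_even(int n) {
  if (n == 0)
    return 1;
  return is_odd(n - 1);
}

int is_odd(int n) {
  if (n == 0)
    return 0;
  return is_even(n - 1);
}
```

Without the first line, a one-pass compiler would reach the call to is_odd inside is_even knowing nothing about it.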

The specification that allows C to be compiled without reading the entire file made sense when main memory was very small, but by now it is an outdated design. If the compiler were a little smarter, writing declarations for definitions in the same file could be avoided entirely. Still, this behavior is part of the language specification, so keep it in mind.

Link errors

When the object files are finally assembled and passed to the linker, they must contain just enough information to construct the program as a whole.

If a program contains only a declaration of a function foo and no definition, then the individual C files, including those containing calls to foo, compile normally. However, when the linker finally tries to assemble the complete program, the address of foo that should be filled in is nowhere to be found, and the link fails with an error.

Errors that occur when linking are called link errors .

Link errors also occur when multiple object files contain the same function or variable. The linker has no way to know which one to choose when there are duplicates. Duplicate-definition errors often arise from mistakenly writing a definition in a header file: header files are included from multiple C files, so a definition in a header puts you in the same state as having identical definitions written in multiple C files. To resolve this kind of error, write only the declaration in the header file and move the definition itself into a single C file.

Column: Duplicate definitions and link errors

When there are duplicate definitions, some linkers choose one and ignore the rest. With such a linker, duplicate definitions do not result in an error.

In real object files, it is in fact possible to choose per definition whether duplication is allowed, and inline functions and the results of C++ template expansion are placed in object files in a form that permits duplication. Object file formats and linker behavior are surprisingly complex and full of exceptions, but these behaviors are just that: exceptions. By default, duplicate definitions result in an error.

Declaration and definition of global variables

Our compiler doesn't yet have global variables, so we don't have assembly examples for them yet, but global variables are almost the same as functions at the assembly level. Therefore, like functions, global variables also have a distinction between definitions and declarations. If the body of a variable is duplicated in multiple C files, it usually results in a linking error.

Global variables are allocated in non-executable memory areas by default, so jumping there will cause your program to crash with a segmentation fault, but other than that there is essentially no difference between data and code. At runtime, you can read the function as data like a global variable, or you can execute the data as code by changing the memory attributes to allow execution and jumping to the data.

Let's verify with some actual code that both functions and global variables are essentially just data residing in memory. In the code below, the identifier main is defined as a global variable, and its contents are x86-64 machine language.

char main[] = "\x48\xc7\xc0\x2a\x00\x00\x00\xc3";

Save the above C code in a file called foo.c, compile it, and examine the result with objdump. By default, objdump displays the contents of global variables only in hexadecimal, but you can pass the -D option to force data to be disassembled as code.

$ cc -c foo.c
$ objdump -D -M intel foo.o
Disassembly of section .data:

0000000000000000 <main>:
   0:   48 c7 c0 2a 00 00 00    mov    rax,0x2a
   7:   c3                      ret

The default behavior of mapping data to non-executable areas can be changed by passing the -Wl,--omagic option at compile time. Let's generate an executable file using this option.

$ cc -static -Wl,--omagic -o foo foo.o

Functions and variables are just labels in assembly and share a single namespace, so when combining multiple object files the linker does not care which labels are functions and which are data. So even though main is defined as data at the C level, the link succeeds just as if main were a function.

Let's run the generated file.

$ ./foo
$ echo $?
42

As shown above, the value 42 is correctly returned. The contents of the global variable main were executed as code.

In C syntax, a global variable becomes a declaration if you add extern. Below is the declaration of a global variable foo of type int.

extern int foo;

When writing a program that uses foo, the above line should be written in the header file. Then define foo in exactly one of the C files. Below is the definition of foo.

int foo;

Note that in C, global variables given no initializer are initialized to 0, so such variables are semantically the same as ones explicitly initialized to 0, {0, 0, ...}, "\0\0\0\0...", and so on.

When writing an initializer, as in int foo = 3, write the initializer only in the definition. A declaration only tells the compiler the type of a variable, so no particular initializer is needed. Since the compiler outputs no assembly when it sees the declaration of a global variable, it has no need to know how the variable's contents are initialized.

When the initializer is omitted, the declaration and definition of a global variable differ only in the presence or absence of extern, so they look similar, but declaration and definition are different things. Make sure the distinction is clear to you here.

Column: Intel CPU F00F bug

Intel Pentium processors from before 1997 had a serious bug: executing the 4-byte instruction F0 0F C7 C8 caused the CPU to hang completely.

Officially, no assembly instruction corresponds to this 4-byte sequence, but written out as assembly it would be something like lock cmpxchg8b eax. 0F C7 C8 is cmpxchg8b eax, where cmpxchg8b is an instruction that atomically exchanges an 8-byte value between registers and memory (atomically meaning that, even on a multi-core system, no intermediate state can be observed by other cores). F0 is a prefix called lock, which has the effect of making the immediately following instruction atomic. However, since cmpxchg8b is atomic to begin with, lock cmpxchg8b eax is a redundant and illegal way to write the instruction. Because such an assembly instruction is not supposed to exist grammatically, the byte sequence F0 0F C7 C8 should never appear in a normal program, and Intel failed to detect this bug before mass-producing the processor.

Using the hack of writing the main function as data, the code that reproduces the F00F bug can be written in C in one line:

char main[] = "\xf0\x0f\xc7\xc8";

This program is harmless on modern x86, but on a Pentium back in 1997, anyone could easily hang the entire system with this one-liner.

The F00F bug is not a big problem if an individual has exclusive use of the PC, but when the CPU is shared, as in what we now call the cloud, this bug is fatal. At first it was thought that the F00F bug could not be fixed and that recalling and replacing the CPUs was the only option, but a tricky method of working around the bug at the level of the OS kernel's exception handler was later devised, and fortunately for Intel, a product recall was avoided.

C standard library and archive files

Step 8: File splitting and Makefile changes

Split files

Try splitting the file using the layout shown at the beginning of this chapter. 9cc.h is the header file. Depending on the structure of the program, you might prepare one .h file per .c file, but there is no particular harm in having extra declarations, so there is no need to manage dependencies in such fine detail here. Prepare a single 9cc.h and #include "9cc.h" it from all the C files.

Makefile changes

Now that the program consists of multiple files, let's update the Makefile as well. The Makefile below compiles and links all the .c files in the current directory to create an executable called 9cc. It assumes that there is exactly one project header file, 9cc.h, and that every .c file includes it.

CFLAGS=-std=c11 -g -static
SRCS=$(wildcard *.c)
OBJS=$(SRCS:.c=.o)

9cc: $(OBJS)
        $(CC) -o 9cc $(OBJS) $(LDFLAGS)

$(OBJS): 9cc.h

test: 9cc
        ./test.sh

clean:
        rm -f 9cc *.o *~ tmp*

.PHONY: test clean

Note that the indentation in a Makefile must be a tab character.

make is a highly capable tool, and while you don't necessarily need to be proficient at writing Makefiles, being able to read the one above will be useful in many situations. So in this section we will walk through the Makefile above.

In a Makefile, a line containing a colon, followed by zero or more command lines indented with tabs, constitutes a rule. The name before the colon is called the "target". The zero or more file names after the colon are called the dependencies.

When you run make foo, make tries to create a file called foo. If the target file already exists, make reruns the target's rule only if the target file is older than its dependencies. This enables behavior such as regenerating the binary only when the source code has changed.

.PHONY is a special name used for dummy targets. make test and make clean are not run in order to create files named test or clean, but make does not normally know that, so if a file named test or clean happened to exist, make test or make clean would do nothing. By listing such dummy targets under .PHONY, you tell make that the rule's commands should be run regardless of whether a file with the target's name exists.

CFLAGS, SRCS, and OBJS are variables.

CFLAGSis a variable recognized by make's built-in rules, and describes the command line options to be passed to the C compiler. Here we are passing the following flags:

The wildcard used on the right-hand side of SRCS is a function provided by make; it expands to the file names matching its argument. $(wildcard *.c) currently expands to main.c parse.c codegen.c.

The right-hand side of OBJS uses make's substitution reference syntax to produce a value in which the .c in SRCS is replaced with .o. Since SRCS is main.c parse.c codegen.c, OBJS becomes main.o parse.o codegen.o.

With these things in mind, let's trace what happens when we run make 9cc. make tries to generate the target given as its argument, so creating the file 9cc is the final goal of the command (if no argument is given, the first rule is chosen, so in this case specifying 9cc is actually optional). make does this by walking through the dependencies and building files that are missing or out of date.

The dependencies of 9cc are the .o files corresponding to the .c files in the current directory. If a .o file remains from a previous run of make and has a newer timestamp than its corresponding .c file, make does not bother running the same command again. It runs the compiler to generate a .o file only when the .o file does not exist or the .c file is newer than the .o file.

The rule $(OBJS): 9cc.h says that every .o file depends on 9cc.h. Therefore, if you change 9cc.h, all .o files will be recompiled.

Column: Various meanings of the static keyword

C's static keyword is mainly used for two purposes:

  1. Added to a local variable, static preserves the variable's value even after the function returns
  2. Added to a global variable or function, static limits the scope of that variable or function to the file
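A small sketch showing both uses side by side (hypothetical names):

```c
#include <assert.h>

// Use 1: a static local variable keeps its value across calls.
int next_id(void) {
  static int counter = 0; // initialized once; persists between calls
  return ++counter;
}

// Use 2: a static global function is visible only within this file,
// so another C file defining its own "helper" would not clash at link time.
static int helper(int x) {
  return x * 2;
}
```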

Although these two uses have nothing in common, they share the same keyword, which is one of the confusing points when learning C. Ideally, different keywords would have been used, say persistent for use 1 and private for use 2. More ideally still, for use 2 it might have been better to make private the default and attach public to variables and functions intended for global scope.

The reason C reuses keywords is compatibility with the body of code written in the past. If you add a new keyword such as private to the language, existing programs that use that word as the name of a variable or function no longer compile. C's designers did not want that, so instead of adding keywords they chose to reuse existing ones in different contexts.

If, at some point in the 1970s, the decision had been made to add new keywords instead of reusing static, a huge amount of later confusion could probably have been avoided, but it would have broken existing code at the time, and I wonder what I would have done in that position. It is quite a difficult problem.
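The two uses are easy to see side by side. The sketch below is plain standard C with invented names (call_count, helper), not code from 9cc:

```c
// Usage 1: a static local variable keeps its value across calls.
// call_count() returns 1, 2, 3, ... on successive invocations.
int call_count(void) {
  static int n = 0; // initialized once, persists between calls
  return ++n;
}

// Usage 2: a static function has file scope; it is invisible to other
// translation units, so another .c file could define its own "helper"
// without a link-time name clash.
static int helper(int x) {
  return x * 2;
}

int use_helper(int x) {
  return helper(x);
}
```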


Functions and local variables

In this chapter, we will implement functions and local variables. We also implement a simple control structure. After completing this chapter, you will be able to compile code like this:

// add up the numbers from m to n
sum(m, n) {
  acc = 0;
  for (i = m; i <= n; i = i + 1)
    acc = acc + i;
  return acc;
}

main() {
  return sum(1, 10); // returns 55
}

Although there is still a gap between the above code and C, I think it can be said that it has come quite close to C.

Step 9: Single-character local variables

Up to the previous chapter, we were able to create a compiler for a language that can perform four arithmetic operations. In this section, we'll add functionality to the language to allow the use of variables. Specifically, the goal is to be able to compile multiple statements containing variables as follows.

a = 3;
b = 5 * 6 - 8;
a + b / 2;

Let's use the result of the last expression as the result of the whole program. I think you could say this language feels much more like a ``real language'' than one that can only do arithmetic.

In this chapter, we first explain how variables are implemented, and then implement them step by step.

Variable area on the stack

Variables in C exist in memory. You could say that a variable is a memory address with a name. By naming addresses, you can say things like ``access the variable a'' instead of ``access address 0x6080 in memory.''

However, a function's local variables must exist separately for each function call. Considering only ease of implementation, it might seem easiest to fix the address, for example ``place function f's local variable a at address 0x6080,'' but then f would not work when called recursively. In C, local variables are placed on the stack so that every function call gets its own set of local variables.
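To see concretely why a fixed address fails, compare an ordinary recursive function with one whose ``local'' lives in a single shared slot. In this illustrative sketch a static variable stands in for a fixed memory address (fact and fact_fixed are invented names):

```c
// With a true local, each recursive call gets its own copy of n and r.
int fact(int n) {
  if (n <= 1)
    return 1;
  int r = fact(n - 1);
  return n * r;
}

// Simulating a fixed-address "local" with static: every call shares one
// memory slot, so the recursive call clobbers the caller's value.
int fact_fixed(int n) {
  static int saved;
  saved = n;
  if (n <= 1)
    return 1;
  int r = fact_fixed(n - 1); // overwrites saved
  return saved * r;          // saved is now 1, not the original n
}
```

fact(5) computes 120 as expected, while fact_fixed(5) collapses to 1 because every level of the recursion shares the same slot.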

Let's look at the contents of the stack with a concrete example. Suppose a function f has local variables a and b, and some other function calls f. The call instruction puts the return address on the stack, so when f is entered the top of the stack contains that return address. Assume the stack also already holds some other values; their contents are not important here, so we represent them with "...". As a diagram, it looks like this:

...
return address ← RSP

Here, the notation "← RSP" indicates that the current value of the RSP register points to this address. Assume a and b are each 8 bytes in size.

The stack grows downwards, so to secure space for a and b from this state, RSP must be lowered by two variables' worth, 16 bytes in total. Doing that gives:

...
return address
a
b ← RSP

With a layout like the one above, you can access a at RSP+8 and b at RSP. The memory area secured this way for each function call is called a "function frame" or "activation record."

How many bytes to subtract from RSP, and the order in which variables are placed in the secured area, are not visible to other functions, so the compiler implementation is free to decide them however it likes.

Basically, local variables are implemented as simple things like this.

However, this method has one drawback, so the actual implementation uses one more register. Recall that in our compiler (and in other compilers as well) RSP may change while the function is running: 9cc pushes and pops intermediate results of expressions onto the stack through RSP, so the RSP value changes frequently. Therefore a and b cannot be accessed at fixed offsets from RSP.

A common way to solve this is to have a register separate from RSP that always points to the start of the current function frame. Such a register is called a ``base register,'' and the value contained therein is called a ``base pointer.'' By convention, x86-64 uses the RBP register as the base register.

The base pointer must not change while the function is running (that is the whole point of having one). Calling another function must not leave RBP with a different value on return, so each function saves the original base pointer on every call and writes it back before returning.

The figure below shows the state of the stack when calling functions using a base pointer. Suppose that a function g with local variables x and y has been called from somewhere. While g is running, the stack looks like this:

...
return address of g
RBP at the time of calling g ← RBP
x
y ← RSP

Calling f from here results in the following state.

...
return address of g
RBP at the time of calling g
x
y
return address of f
RBP at the time of calling f ← RBP
a
b ← RSP

This way, a can always be accessed at RBP-8 and b at RBP-16. Considering concretely the assembly that creates this stack state, the compiler should output the following at the beginning of each function.

push rbp
mov rbp, rsp
sub rsp, 16

The standard instructions that the compiler outputs at the beginning of a function are called a "prologue." Note that the 16 actually needs to be a value that matches the number and size of each function's variables.

Let's confirm that when we execute the above code with RSP pointing to the return address, the expected function frame is created. The stack status for each instruction is shown below.

  1. The stack immediately after calling f with call

    ...
    return address of g
    RBP at the time of calling g ← RBP
    x
    y
    return address of f ← RSP
  2. The stack after running push rbp

    ...
    return address of g
    RBP at the time of calling g ← RBP
    x
    y
    return address of f
    RBP at the time of calling f ← RSP
  3. The stack after running mov rbp, rsp

    ...
    return address of g
    RBP at the time of calling g
    x
    y
    return address of f
    RBP at the time of calling f ← RSP, RBP
  4. The stack after running sub rsp, 16

    ...
    return address of g
    RBP at the time of calling g
    x
    y
    return address of f
    RBP at the time of calling f ← RBP
    a
    b ← RSP

When returning from a function, write the original value back to RBP, make RSP point at the return address, and execute the ret instruction (ret pops an address off the stack and jumps to it). The code that does this can be written quite simply:

mov rsp, rbp
pop rbp
ret

The standard instructions that the compiler outputs at the end of a function are called an "epilogue."

The state of the stack while running the epilogue is shown below. The stack area beyond the address pointed to by RSP should be regarded as invalid data, so it is omitted from the diagrams.

  1. The stack before running mov rsp, rbp

    ...
    return address of g
    RBP at the time of calling g
    x
    y
    return address of f
    RBP at the time of calling f ← RBP
    a
    b ← RSP
  2. The stack after running mov rsp, rbp

    ...
    return address of g
    RBP at the time of calling g
    x
    y
    return address of f
    RBP at the time of calling f ← RSP, RBP
  3. The stack after running pop rbp

    ...
    return address of g
    RBP at the time of calling g ← RBP
    x
    y
    return address of f ← RSP
  4. The stack after running ret

    ...
    return address of g
    RBP at the time of calling g ← RBP
    x
    y ← RSP

Thus, executing the epilogue restores the stack to the state of the calling function g. The call instruction pushes the address of the instruction following itself onto the stack, and the ret in the epilogue pops that address and jumps to it, so execution resumes in g at the instruction after the call. This behavior is completely consistent with the behavior of functions as we know them.

This is how function calls and function local variables are implemented.

Column: Direction of stack growth

The x86-64 stack grows from higher addresses to lower ones, as explained above. It might feel more natural for the stack to grow upward, so why are stacks designed to grow downward?

In fact, there is no technical necessity for the stack to grow downwards. On most real CPUs and ABIs the stack starts at a high address and grows downwards, but there are a few architectures, admittedly quite minor, where the stack grows the other way. For example, on 8051 microcontrollers, in PA-RISC's ABI 3 , and on Multics 4 , the stack grows toward higher addresses.

However, the design of the stack growing downwards is not particularly unnatural.

Immediately after power-on, when the CPU starts executing from a blank state, the address at which execution begins is usually fixed by the CPU's specification, and a common design is to start at a low address such as 0. In that case the program's code is usually placed together at low addresses. To keep the stack from growing into the program's code, the two are placed as far apart as possible: the stack is put at high addresses and grows toward the middle of the address space. That way the stack grows downwards.

Of course, you can imagine a CPU designed differently, and then extending the stack upward would be the more natural layout. Honestly, it could have gone either way; the reality is simply that the general industry convention is for the machine stack to grow down.

Tokenizer changes

Now that you know how to implement variables, let's start implementing them. However, supporting an arbitrary number of variables right away would be too difficult, so in this step we limit variables to single lowercase letters: variable a lives at RBP-8, b at RBP-16, c at RBP-24, and so on, and all of them are assumed to always exist. Since there are 26 letters in the alphabet, if we lower RSP by 26 × 8 = 208 bytes when the function is entered, space is reserved for every single-character variable.

Let's implement it now. First, modify the tokenizer so that it can tokenize single-character variables in addition to the existing grammar elements. To do this we need a new token kind. Since the variable name can be read from the str member, there is no need to add a new member to the Token type. The resulting token kinds are:

typedef enum {
  TK_RESERVED, // symbols
  TK_IDENT,    // identifiers
  TK_NUM,      // integer tokens
  TK_EOF,      // token marking the end of input
} TokenKind;

Modify the tokenizer to create a token of kind TK_IDENT when it sees a lowercase letter. You should be able to do this by adding an if statement like the following to your tokenizer.

if ('a' <= *p && *p <= 'z') {
  cur = new_token(TK_IDENT, cur, p++);
  cur->len = 1;
  continue;
}

Parser changes

With recursive descent parsing, once you know the grammar you can map it mechanically to function calls. So, to figure out what changes the parser needs, we have to consider what the grammar looks like once variable names (identifiers) are added.

Let's call an identifier ident. Like num, it is a terminal symbol. Variables can be used anywhere a number can be used, so if you replace num in the grammar with num | ident, you get a grammar that allows variables in the same places as numbers.

In addition to that, we need to add assignment expressions to the grammar. Variables would be useless if they could not be assigned to, so we want a grammar that allows expressions like a=1. Here, let's adopt a syntax compatible with C, which also allows expressions like a=b=1.

Furthermore, I would like to be able to write multiple statements separated by semicolons, so the resulting new grammar would be as follows.

program    = stmt*
stmt       = expr ";"
expr       = assign
assign     = equality ("=" assign)?
equality   = relational ("==" relational | "!=" relational)*
relational = add ("<" add | "<=" add | ">" add | ">=" add)*
add        = mul ("+" mul | "-" mul)*
mul        = unary ("*" unary | "/" unary)*
unary      = ("+" | "-")? primary
primary    = num | ident | "(" expr ")"

First, check that programs such as 42; and a=b=2; a+b; conform to this grammar. After that, modify the parser you have written so far so that it can parse the grammar above. At this stage an expression like a+1=5 can also be parsed, and that is correct behavior: eliminating such semantically invalid expressions is done in a later pass. There is nothing particularly tricky about modifying the parser; you should be able to do it just by mapping the grammar elements to function calls as before.

Since the program is now multiple statements separated by semicolons, we need to save the multiple nodes produced by parsing somewhere. For now, prepare the following global array and store the parsed nodes there in order. If you terminate the array with NULL, you can tell where it ends. Part of the new code to add is shown below.

Node *code[100];

Node *assign() {
  Node *node = equality();
  if (consume("="))
    node = new_node(ND_ASSIGN, node, assign());
  return node;
}

Node *expr() {
  return assign();
}

Node *stmt() {
  Node *node = expr();
  expect(";");
  return node;
}

void program() {
  int i = 0;
  while (!at_eof())
    code[i++] = stmt();
  code[i] = NULL;
}

The abstract syntax tree needs to be able to express a new node representing a local variable. To do this, let's add a new node kind for local variables and a new member to the node. For example, it should look like the following. With this data structure, the parser creates and returns nodes of kind ND_LVAR for identifier tokens.

typedef enum {
  ND_ADD,    // +
  ND_SUB,    // -
  ND_MUL,    // *
  ND_DIV,    // /
  ND_ASSIGN, // =
  ND_LVAR,   // local variable
  ND_NUM,    // integer
} NodeKind;

typedef struct Node Node;

// AST node
struct Node {
  NodeKind kind; // node kind
  Node *lhs;     // left-hand side
  Node *rhs;     // right-hand side
  int val;       // used only when kind is ND_NUM
  int offset;    // used only when kind is ND_LVAR
};

offset is a member that represents the variable's offset from the base pointer. Currently variable a is at RBP-8, b at RBP-16, and so on: local variables sit at fixed positions determined by their names, so the offset can be determined at the parsing stage. Below is the code that reads an identifier and returns an ND_LVAR node.

Node *primary() {
  ...

  Token *tok = consume_ident();
  if (tok) {
    Node *node = calloc(1, sizeof(Node));
    node->kind = ND_LVAR;
    node->offset = (tok->str[0] - 'a' + 1) * 8;
    return node;
  }

  ...

Column: ASCII code

In ASCII, characters are assigned to the numbers 0 through 127. The ASCII character assignment table is shown below.

0   NUL SOH STX ETX EOT ENQ ACK BEL
8   BS  HT  NL  VT  NP  CR  SO  SI
16  DLE DC1 DC2 DC3 DC4 NAK SYN ETB
24  CAN EM  SUB ESC FS  GS  RS  US
32  sp  !   "   #   $   %   &   '
40  (   )   *   +   ,   -   .   /
48  0   1   2   3   4   5   6   7
56  8   9   :   ;   <   =   >   ?
64  @   A   B   C   D   E   F   G
72  H   I   J   K   L   M   N   O
80  P   Q   R   S   T   U   V   W
88  X   Y   Z   [   \   ]   ^   _
96  `   a   b   c   d   e   f   g
104 h   i   j   k   l   m   n   o
112 p   q   r   s   t   u   v   w
120 x   y   z   {   |   }   ~   DEL

The codes 0-31 are control characters. Nowadays, apart from NUL and the newline character, there are almost no occasions to use these control characters, and most of them are just a waste of prime real estate in the character code. When ASCII was created in 1963, however, these control characters were in common use. During the development of the ASCII standard there was even a proposal to include more control characters in place of the lowercase letters of the alphabet 5 .

Digits are assigned to 48-57, uppercase letters to 65-90, and lowercase letters to 97-122. Notice that these characters are assigned consecutive codes: 0123456789 and abcdefg... are contiguous in the character code. It may seem natural to place characters that have a defined order in consecutive positions like this, but in EBCDIC, which was a major character code at the time, the letters of the alphabet were not contiguous in the code, a consequence of its punched-card heritage.

In C, a character is just a small integer value, and writing a character literal means the same thing as writing the corresponding code as a number. In other words, assuming ASCII, 'a' is equivalent to 97 and '0' to 48. In the code above there was an expression that subtracted 'a' from a letter treated as a number; doing so computes how many positions a given letter is from a. This technique works because the letters are contiguous in ASCII.
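Assuming ASCII, the subtraction trick can be checked directly in C; the helpers below are invented names, and lvar_offset mirrors the offset computation used in primary above:

```c
// Because 'a'..'z' are contiguous in ASCII, subtracting 'a' yields a
// letter's position in the alphabet.
int alphabet_index(char c) {
  return c - 'a'; // 'a' -> 0, 'b' -> 1, ...
}

// The same trick maps a one-letter variable name to its stack offset.
int lvar_offset(char name) {
  return (name - 'a' + 1) * 8; // 'a' -> 8, 'b' -> 16, 'c' -> 24
}
```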

lvalue and rvalue

Unlike other binary operators, assignment expressions require special handling of the value on the left side, so let's explain that here.

Not just any expression is allowed on the left side of an assignment. For example, 1=2 is illegal: 1 cannot be changed to 2. Assignments like a=2 are allowed, but expressions like (a+1)=2 are not. Pointers and structures do not yet exist in 9cc, but if they did, an assignment through a pointer such as *p=2, or to a struct member such as a.b=2, would have to be accepted as legitimate. How can we distinguish valid expressions from invalid ones?

There is a simple answer to that question. In C, only expressions that specify memory addresses can appear on the left side of assignment expressions.

Variables exist in memory and have addresses, so variables can be written on the left side of an assignment. Similarly, a pointer dereference such as *p can be written on the left side, since the value of p is an address. Accessing a struct member as in a.b can also appear on the left side, because it denotes the memory address reached by advancing from the start of a by b's offset.

On the other hand, the result of an expression such as a+1 is not a variable, so it cannot serve as an expression denoting a memory address. Such temporary values may in fact exist only in registers, and even when they do exist in memory they normally cannot be accessed at a fixed offset from a known variable. For this reason, even writing &(a+1) to take the address of the result of a+1 is not allowed and causes a compilation error. Such expressions cannot be written on the left side of an assignment.

A value that can be written on the left side is called a "left value" and one that can appear only on the right a "right value"; they are also called lvalues and rvalues, respectively. In our language at this point, only variables are lvalues; every other value is an rvalue.

When generating code for variables, you can start from the lvalue. If a variable appears as the left-hand side of an assignment, the address of the variable is calculated as the value of the left-hand side, and the evaluation result of the right-hand side is stored at that address. This allows you to implement assignment expressions. If the variable appears in any other context, convert the lvalue to an rvalue by calculating the variable's address in the same way and then loading the value from that address. This allows you to get the value of the variable.
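This "compute the address first, then store or load through it" strategy can be imitated directly in C with pointers. The sketch below is purely illustrative (the function names are invented), not 9cc code:

```c
int a; // a global variable standing in for a local on the stack

// Assignment: evaluate the left side as an lvalue (an address),
// then store the right side's value through that address.
void assign_through_address(int value) {
  int *addr = &a; // lvalue: the variable's address
  *addr = value;  // store, analogous to "mov [rax], rdi"
}

// Rvalue use: compute the same address, then load from it.
int load_through_address(void) {
  int *addr = &a;
  return *addr;   // load, analogous to "mov rax, [rax]"
}
```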

How to load a value from any address

The code generation up to this point has only accessed memory at the top of the stack, but local variables require access to arbitrary locations on the stack. This section explains how to access memory.

The CPU can load and store values ​​from any address in memory, not just the top of the stack.

When loading a value from memory, use the syntax mov dst, [src]. This instruction means "treat the value of the src register as an address, load the value stored there, and put it in dst." For example, mov rdi, [rax] loads a value from the address in RAX and sets it in RDI.

When storing, use the syntax mov [dst], src. This instruction means "treat the value of the dst register as an address and store the value of the src register there." For example, mov [rdi], rax stores the value of RAX to the address contained in RDI.

push and pop are instructions that access memory implicitly through RSP as the address, so they can in fact be rewritten as sequences of ordinary memory access instructions. For example, pop rax

mov rax, [rsp]
add rsp, 8

is equivalent to the two instructions above, and push rax

sub rsp, 8
mov [rsp], rax

is equivalent to these two instructions.
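The equivalence can be modeled in C by letting an array stand in for stack memory and a pointer stand in for RSP. This is only an illustrative sketch (push, pop, and sp are invented names), not how 9cc itself works:

```c
long stack[16];

// sp plays the role of RSP: it starts past the end of the array and
// grows downward through it.
long *sp = stack + 16;

// push: "sub rsp, 8" then "mov [rsp], src"
void push(long v) {
  sp = sp - 1;
  *sp = v;
}

// pop: "mov dst, [rsp]" then "add rsp, 8"
long pop(void) {
  long v = *sp;
  sp = sp + 1;
  return v;
}
```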

Code generator changes

Using what we have learned so far, let's modify the code generator so that it can handle expressions containing variables. This change adds a function that evaluates an expression as an lvalue: gen_lval in the code below. gen_lval computes the address of a variable when the given node refers to one and pushes it onto the stack; otherwise it reports an error. This eliminates expressions like (a+1)=2.

When using a variable as an rvalue, first evaluate it as an lvalue, then treat the result of the calculation at the top of the stack as an address, and load the value from that address. The code is shown below.

void gen_lval(Node *node) {
  if (node->kind != ND_LVAR)
    error("left-hand side of assignment is not a variable");

  printf("  mov rax, rbp\n");
  printf("  sub rax, %d\n", node->offset);
  printf("  push rax\n");
}

void gen(Node *node) {
  switch (node->kind) {
  case ND_NUM:
    printf("  push %d\n", node->val);
    return;
  case ND_LVAR:
    gen_lval(node);
    printf("  pop rax\n");
    printf("  mov rax, [rax]\n");
    printf("  push rax\n");
    return;
  case ND_ASSIGN:
    gen_lval(node->lhs);
    gen(node->rhs);

    printf("  pop rdi\n");
    printf("  pop rax\n");
    printf("  mov [rax], rdi\n");
    printf("  push rdi\n");
    return;
  }

  gen(node->lhs);
  gen(node->rhs);

  printf("  pop rdi\n");
  printf("  pop rax\n");

  switch (node->kind) {
  case ND_ADD:
    printf("  add rax, rdi\n");
    break;
  case ND_SUB:
    printf("  sub rax, rdi\n");
    break;
  case ND_MUL:
    printf("  imul rax, rdi\n");
    break;
  case ND_DIV:
    printf("  cqo\n");
    printf("  idiv rdi\n");
    break;
  }

  printf("  push rax\n");
}

Main function changes

Now that all the parts are in place, let's update the main function and actually run the compiler.

int main(int argc, char **argv) {
  if (argc != 2) {
    error("wrong number of arguments");
    return 1;
  }

  // Tokenize and parse.
  // The result is stored in code.
  user_input = argv[1];
  tokenize();
  program();

  // Output the first half of the assembly
  printf(".intel_syntax noprefix\n");
  printf(".globl main\n");
  printf("main:\n");

  // Prologue
  // Reserve space for 26 variables
  printf("  push rbp\n");
  printf("  mov rbp, rsp\n");
  printf("  sub rsp, 208\n");

  // Generate code statement by statement, starting from the first
  for (int i = 0; code[i]; i++) {
    gen(code[i]);

    // Evaluating an expression should leave exactly one value
    // on the stack, so pop it to keep the stack from overflowing
    printf("  pop rax\n");
  }

  // Epilogue
  // The result of the last expression is in RAX, which becomes the return value
  printf("  mov rsp, rbp\n");
  printf("  pop rbp\n");
  printf("  ret\n");
  return 0;
}

Step 10: Multi-character local variables

In the previous step, we fixed variable names to one letter and acted as if the 26 local variables a through z always existed. This step adds support for identifiers longer than one character, allowing code such as the following to compile:

foo = 1;
bar = 2 + 3;
return foo + bar; // returns 6

In this language, variables can be used without being declared. The parser must therefore determine whether each identifier has been seen before, and automatically allocate stack space for a variable when it is new.

First, change your tokenizer to read multi-character identifiers as TK_IDENT tokens.

We will represent the variables as a linked list: one variable is represented by a struct called LVar, and a pointer called locals holds the first element. Expressed in code, it looks like this:

typedef struct LVar LVar;

// type for a local variable
struct LVar {
  LVar *next; // next variable, or NULL
  char *name; // variable name
  int len;    // length of the name
  int offset; // offset from RBP
};

// local variables
LVar *locals;

In the parser, when a TK_IDENT token appears, check whether the identifier has appeared before by searching locals by name. If the variable has appeared previously, use its offset as is. For a new variable, create a new LVar, assign it a new offset, and use that offset.

Below is a function that searches for variables by name.

// Search for a variable by name. Returns NULL if not found.
LVar *find_lvar(Token *tok) {
  for (LVar *var = locals; var; var = var->next)
    if (var->len == tok->len && !memcmp(tok->str, var->name, var->len))
      return var;
  return NULL;
}

In your parser, you should just add code like the following:

Token *tok = consume_ident();
if (tok) {
  Node *node = calloc(1, sizeof(Node));
  node->kind = ND_LVAR;

  LVar *lvar = find_lvar(tok);
  if (lvar) {
    node->offset = lvar->offset;
  } else {
    lvar = calloc(1, sizeof(LVar));
    lvar->next = locals;
    lvar->name = tok->str;
    lvar->len = tok->len;
    lvar->offset = locals ? locals->offset + 8 : 8; // first variable goes at RBP-8
    node->offset = lvar->offset;
    locals = lvar;
  }
  return node;
}

Column: Frequency of appearance of machine language instructions

If you look at the assembly output by 9cc, you will notice that there are many data movement instructions such as mov and push, and relatively few instructions that do "real computation" such as add and mul. One reason is that 9cc performs no optimization and emits useless data movement instructions, but in fact even optimizing compilers emit data movement instructions most of all. Below is a graph of the result of disassembling all the executable files in /bin on my machine and counting the instructions.

(Figure: frequency of each machine language instruction)

As you can see, mov instructions alone account for about 30% of all instructions. A computer is a data processing machine, and the most frequent operation in data processing is moving the data. If you consider that "moving data to the right place" is one of the essences of data processing, this abundance of mov instructions seems only natural, but I suspect many readers will have been surprised.

Step 11: return statement

In this chapter we add return statements so that we can compile code like this:

a = 3;
b = 5 * 6 - 8;
return a + b / 2;

We will allow a return statement to be written anywhere in the program. As in normal C, program execution stops at the return and the function returns. For example, the program below returns the value of the first return, namely 5.

return 5;
return 8;

To implement this feature, let's first consider what the grammar looks like after adding return. Previously a statement was just an expression; the new grammar also allows a statement of the form return <式>;. So the new grammar is:

program = stmt*
stmt    = expr ";"
        | "return" expr ";"
...

To implement this, the tokenizer, parser, and code generator all need to be tweaked a bit.

First, let's make the tokenizer recognize return and represent it with a token of kind TK_RETURN. Tokens with special grammatical meaning, such as return, while, and int (so-called keywords), exist only in limited numbers, so it is easier to give each one its own token kind like this.

It might seem that to decide whether the next token is return you only have to check whether the rest of the input starts with return, but then input like returnx would be incorrectly tokenized as return followed by x. So in addition to checking that the input starts with return, we must make sure that the character after it is not one that can form part of a token.

Below is a function that determines whether a given character is a token character, i.e. an alphanumeric character or an underscore.

int is_alnum(char c) {
  return ('a' <= c && c <= 'z') ||
         ('A' <= c && c <= 'Z') ||
         ('0' <= c && c <= '9') ||
         (c == '_');
}

Using this function, you can add code like the following to tokenize so that return is tokenized as TK_RETURN.

if (strncmp(p, "return", 6) == 0 && !is_alnum(p[6])) {
  cur = new_token(TK_RETURN, cur, p);
  cur->len = 6;
  p += 6;
  continue;
}
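This check can be exercised on its own. The sketch below repeats is_alnum so that it is self-contained and wraps the test in a helper; starts_with_return is an invented name for illustration, not part of 9cc:

```c
#include <string.h>

// Repeated from above so this sketch stands alone.
int is_alnum(char c) {
  return ('a' <= c && c <= 'z') ||
         ('A' <= c && c <= 'Z') ||
         ('0' <= c && c <= '9') ||
         (c == '_');
}

// Returns 1 if the input at p starts with the keyword "return" and the
// keyword is not merely a prefix of a longer identifier such as "returnx".
int starts_with_return(char *p) {
  return strncmp(p, "return", 6) == 0 && !is_alnum(p[6]);
}
```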

Next, let's modify the parser so that it can parse token sequences containing TK_RETURN. To do this, first add ND_RETURN, a node kind representing a return statement. Then modify the function that parses statements so that it can handle return. As usual, you can parse the grammar by mapping it directly to function calls. The new stmt function is shown below.

Node *stmt() {
  Node *node;

  if (consume(TK_RETURN)) {
    node = calloc(1, sizeof(Node));
    node->kind = ND_RETURN;
    node->lhs = expr();
  } else {
    node = expr();
  }

  expect(";");
  return node;
}

Since ND_RETURN nodes are created only here, I decided to allocate the node and set its fields on the spot rather than write a dedicated helper.

Finally, modify the code generator so that it outputs appropriate assembly for ND_RETURN nodes. A portion of the new gen function is shown below.

void gen(Node *node) {
  if (node->kind == ND_RETURN) {
    gen(node->lhs);
    printf("  pop rax\n");
    printf("  mov rsp, rbp\n");
    printf("  pop rbp\n");
    printf("  ret\n");
    return;
  }
  ...

gen(node->lhs) in the code above outputs the code for the expression that becomes the return value. That code should leave exactly one value on the top of the stack. The assembly that follows pops that value into RAX and returns from the function.

The compiler as implemented up to the previous chapter always output a single ret at the end of a function. Implementing return statements as described in this chapter results in one extra ret per return statement. It is possible to merge these rets into one, but to keep the implementation simple we decided to output multiple rets. There is no point worrying about such details at this stage, so prioritize implementation simplicity. Being able to write difficult code is a useful skill, but sometimes an even more useful skill is not making the code difficult in the first place.

Column: Grammar hierarchy

"Regular expressions" are often used to determine whether input matches some rule, but more complex grammars cannot be expressed with regular expressions. For example, it is impossible in principle to write a regular expression that decides whether the parentheses in a string are balanced.

Context-free grammars (grammars that can be expressed in BNF) are more powerful than regular expressions; for example, the strings with balanced parentheses can be expressed in BNF as S → SS | "(" S ")" | ε. However, like regular expressions, context-free grammars have their limits, and they cannot express all the rules that appear in ordinary programming languages. For example, the rule "variables must be declared before use" is part of the C grammar, but such a rule cannot be expressed with a context-free grammar.
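For example, a recognizer for this parenthesis language is easy to write with an unbounded counter, which is exactly what regular expressions lack. The sketch below (balanced is an invented name) is only an illustration:

```c
// Recognizer for the language S -> SS | "(" S ")" | ε.
// A single integer counter suffices, but because the nesting depth is
// unbounded, no finite-state regular expression can do the same job.
int balanced(const char *s) {
  int depth = 0;
  for (; *s; s++) {
    if (*s == '(') {
      depth++;
    } else if (*s == ')') {
      if (depth == 0)
        return 0; // closing paren with no matching open
      depth--;
    } else {
      return 0; // not a parenthesis string at all
    }
  }
  return depth == 0;
}
```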

If you write a C compiler, then, barring bugs in the compiler, the inputs that the compiler accepts are valid C programs and the inputs it rejects are invalid C programs. In other words, given the power of an ordinary computer, it is possible to determine whether an input matches the C grammar, so the compiler as a whole can be seen as a recognizer for a class of grammars more powerful than context-free grammars. A grammar for which membership can always be decided with a YES/NO answer like this is called decidable.

You can also consider grammars that are not decidable. For example, take the question "when a computer program is given as input and executed, does it eventually call the exit function and terminate, or does it keep running forever?" It has been proven that, in general, it is impossible to decide this with a YES/NO answer (assuming the program runs on an idealized computer with infinite memory). In other words, for the question of whether a program halts, you can answer YES if it halts, but you cannot answer NO, because if it does not halt it simply keeps running forever. A class of grammars like this, where the decision machine returns YES when the answer is YES but may itself never finish executing otherwise, is called Turing-recognizable.

In other words, there is a hierarchy of grammars: regular expression < context-free grammar < Decidable < Turing-recognizable. These grammatical hierarchies are widely studied as part of computer science. The famous unsolved problem P≟NP is also a problem related to the hierarchy of grammars.

1973 C compiler

So far we have been building the compiler incrementally. In a sense, this development process can be said to follow the history of C.

If you look at the current version of C, you'll find parts that don't make much sense or are unnecessarily complex, but these things can't be understood without looking at history. Many of the puzzling aspects of current C begin to make sense when you read the early C code and look at the early form of C and the subsequent development of the language and compilers.

C was developed in 1972 as a language for Unix. The source code from 1972 and 1973, a very early period in the history of C, survives on tape, and the files read from it are published on the Internet. Let's take a look at the code of the C compiler from that time. Below is a function that takes a message in printf format and prints it as a compilation error message.

error(s, p1, p2) {
  extern printf, line, fout, flush, putchar, nerror;
  int f;

  nerror++;
  flush();
  f = fout;
  fout = 1;
  printf("%d: ", line);
  printf(s, p1, p2);
  putchar('\n');
  fout = f;
}

It looks like a somewhat strange, C-like, not-quite-C language. At that time, C really was such a language. The first thing you notice when reading this code is that, just as in the early stages of the compiler we are creating, there are no types for function return values or arguments. Here, s is supposed to be a pointer to a string, and p1 and p2 integers, but on the machines of the time everything was the same size, so variables were simply untyped like this.

The second line contains the declarations of the global variables and functions referenced by error. At the time, C had no header files or C preprocessor, so programmers had to tell the compiler about the existence of variables and functions in this way.

Like our current compiler, functions are only checked to see if the name exists, not to see if the types or number of arguments match. After putting the expected number of arguments on the stack, they could just jump to the function body and the function call would be successful, so they probably thought that was fine.

fout is a global variable that holds the number of the file descriptor to write to. Back then, fprintf did not exist yet, and in order to write a string to standard error instead of standard output, you had to switch the output destination via a global variable.

printf is called twice in error. The second printf passes two values in addition to the format string. So what happens when displaying an error message whose format takes only one value?

In fact, this error function works correctly even if you simply call it with fewer arguments. Recall that argument checking did not exist at this point. Arguments such as s, p1, and p2 simply refer to the first, second, and third words from the stack pointer; the compiler does not care whether you actually pass a value corresponding to p2. printf accesses only as many arguments as there are specifiers such as %d or %s in the format string, so if the message contains only one %d, p2 is never accessed at all. Therefore, there is no problem even if the number of arguments does not match.

In this way, early C compilers have many similarities with the current 9cc.

Let's look at another code example. The code below is a function that copies the passed string into a statically allocated area and returns a pointer to the beginning of that area. In other words, it is like strdup, except that it uses a static buffer.

copy(s)
char s[]; {
  extern tsp;
  char tsp[], otsp[];

  otsp = tsp;
  while(*tsp++ = *s++);
  return(otsp);
}

At that time, the syntax for declarations of the form int *p had not yet been devised. Instead, pointer types were declared as int p[]. There is something like a variable definition between the function argument list and the function body; this declares s as a pointer type.

There's something else worth mentioning about this early C compiler.

In addition to the above, C in the early 1970s lacked various other features. Nevertheless, as you can see from the source code above, this C compiler was itself written in C. C was already self-hosted before it even had structures.

By looking at old source code, you can also deduce why some of C's confusing syntax ended up in its current form. Parsing variable definitions is easy if the grammar is such that the variable name always comes right after a keyword such as extern or auto. Pointers are also easy to parse if something like char[] simply comes right after the variable name. However, it seems natural that, evolving along the lines seen in this early compiler, the syntax would end up in its current, unnecessarily complex form.

Well, what Unix and C co-inventor Dennis Ritchie was doing around 1973 was truly incremental development. He was writing a compiler in C while developing C itself. The current version of C is not some finished product that reached a special end point as features kept being added; it simply became complete as a language at the point when Dennis Ritchie decided that it had enough features. That is just how it happened.

Our compiler, likewise, does not pursue a complete version from the beginning. The finished form of C has no special significance, so there is probably no point in pursuing it specifically. Continuing to develop a language that has a reasonable set of features at every point in time, and eventually growing it into C, is the time-honored development method used by the original C compiler. Let's proceed with development with confidence!

Column: Rob Pike's 5 Rules of Programming

9cc is influenced by Rob Pike's programming philosophy. Rob Pike is a former colleague of Dennis Ritchie, the creator of C; he is one of the creators of the Go language and designed UTF-8, the Unicode encoding, together with Ken Thompson, the creator of Unix.

I quote Rob Pike's 5 Rules of Programming.

  1. It is impossible to predict which parts of a program will take up time. Bottlenecks appear in surprising places, so don't try to guess where they are and add performance hacks until you know where they are.
  2. Measure it. Don't try to optimize before you measure. Also, even if you measure it, don't try to optimize anything other than extremely slow parts of your code.
  3. Fancy algorithms are slow when n is small, and n is usually small. Elaborate algorithms have large constant parts. Don't get complicated unless you know that n is usually large. (Even if n is large, apply rule 2 first.)
  4. Elaborate algorithms are more prone to bugs and harder to implement than simple ones. Use simple algorithms and simple data structures.
  5. Data is what matters. If you choose the right data structure and represent your data well, your algorithm will almost always be trivial. Data structures, not algorithms, are at the heart of programming.

Step 12: Add control constructs

Further chapters are currently being written. I think I wrote the chapters up to this point carefully, but honestly the chapters from here on have not yet reached a publishable level. However, readers who have made it this far can probably fill in the missing information themselves, and some people may want guidance on how to proceed, so I am publishing them for what they are worth.

In this section, we will add control structures such as if, if ... else, while, and for to the language. Although these control structures may seem complex at first glance, they are relatively easy to implement when compiling straight to assembly.

Since assembly has no direct counterpart to C's control structures, they are represented in assembly using branch instructions and labels. In a sense, this is the same as rewriting the control structures using goto. Just as a human can mechanically rewrite control structures into goto statements, control structures can be implemented easily by generating code according to a pattern.

There are various other control constructs, such as do ... while, goto, continue, and break, but you do not need to implement them yet.

The new grammar with if, while, and for added is shown below.

program = stmt*
stmt    = expr ";"
        | "if" "(" expr ")" stmt ("else" stmt)?
        | "while" "(" expr ")" stmt
        | "for" "(" expr? ";" expr? ";" expr? ")" stmt
        | ...
...

When reading expr? ";", look one token ahead: if the next token is ";", assume the expr is absent; otherwise, parse an expr.

if (A) B compiles into assembly like this:

  code that compiles A    // the result should be at the top of the stack
  pop rax
  cmp rax, 0
  je  .LendXXX
  code that compiles B
.LendXXX:

In other words, if (A) B is expanded like the following:

  if (A == 0)
    goto end;
  B;
end:

Use serial numbers for the XXX part so that all labels are unique.
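To sketch how the serial numbering might look in the code generator, here is a minimal illustration (the names labelseq and gen_if are my own placeholders, not necessarily the book's code): it writes the if pattern into a buffer, drawing a fresh number for every statement so that labels never collide.

```c
#include <stdio.h>

// Emit the "if (A) B" skeleton with unique labels. The code for the
// condition A and the body B would be generated where the comments are.
static int labelseq;   // incremented for every label-using statement

void gen_if(char *out, size_t outsize) {
  int seq = labelseq++;
  snprintf(out, outsize,
           "  pop rax\n"            // result of A from the stack top
           "  cmp rax, 0\n"
           "  je  .Lend%d\n"        // skip the body if A == 0
           "  # ...code for B...\n"
           ".Lend%d:\n",
           seq, seq);
}
```

Calling gen_if twice yields .Lend0 in the first output and .Lend1 in the second, so consecutive or nested if statements can never share a label.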

if (A) B else C compiles into assembly like this:

  code that compiles A    // the result should be at the top of the stack
  pop rax
  cmp rax, 0
  je  .LelseXXX
  code that compiles B
  jmp .LendXXX
.LelseXXX:
  code that compiles C
.LendXXX:

That is, if (A) B else C is expanded as follows.

  if (A == 0)
    goto els;
  B;
  goto end;
els:
  C;
end:

When parsing an if statement, after reading the first stmt, look one token ahead to check whether else is present: if it is, compile as if ... else, and if not, as a plain if.

while (A) B compiles like this:

.LbeginXXX:
  code that compiles A
  pop rax
  cmp rax, 0
  je  .LendXXX
  code that compiles B
  jmp .LbeginXXX
.LendXXX:

In other words, while (A) B is expanded like the following code:

begin:
  if (A == 0)
    goto end;
  B;
  goto begin;
end:

for (A; B; C) D compiles like this:

  code that compiles A
.LbeginXXX:
  code that compiles B
  pop rax
  cmp rax, 0
  je  .LendXXX
  code that compiles D
  code that compiles C
  jmp .LbeginXXX
.LendXXX:

The C code corresponding to for (A; B; C) D is shown below.

  A;
begin:
  if (B == 0)
    goto end;
  D;
  C;
  goto begin;
end:

Note that labels starting with .L are specially recognized by the assembler and automatically become file-scoped. File-scoped labels can be referenced from within the same file, but not from other files. Therefore, if the labels the compiler generates for if and for start with .L, you do not have to worry about them colliding with labels in other files.

Compile a small loop with cc and use the assembly as a reference.

Column: Detecting run-time errors by the compiler

When writing programs in C, it is common to write data past the end of an array, or to corrupt unrelated data structures due to pointer bugs. These bugs can also become security holes, so the idea is to proactively detect bugs during runtime with the help of a compiler.

For example, if you pass the -fstack-protector option to GCC, the compiled function writes a random pointer-sized integer called a "canary" into the function frame in the prologue, and checks in the epilogue that the canary's value has not changed. This way, if the contents of the stack are unknowingly overwritten due to an array buffer overflow, the canary's value will almost certainly have changed, so the error can be detected when the function returns. If an error is detected, the program usually terminates immediately.

LLVM has something called TSan (ThreadSanitizer) that can output code that detects at runtime when multiple threads are accessing shared data structures without properly securing locks. Additionally, LLVM's UBSan (UndefinedBehaviorSanitizer) can output code that detects at runtime whether you have inadvertently stepped on undefined behavior in C. For example, signed integer overflow is undefined behavior in C, so UBSan will report an error if a signed integer overflow occurs.

Since TSan and the like slow a program down severalfold, it is unreasonable to add them to the compile options of frequently used programs, but features such as the stack canary, which have relatively low runtime cost, may be enabled by default depending on the environment.

Dynamic error detection with the help of the compiler like this has been actively researched in recent years, and has contributed significantly to writing reasonably secure programs in languages such as C and C++ that are not memory safe.

Step 13: Block

In this step we support blocks, { ... }, inside which you can write multiple statements. A block is formally called a compound statement, but because that is such a long term, it is usually just called a block.

Blocks have the effect of combining multiple statements into a single statement. With the implementation up to the previous step, if and while allowed only a single statement to be executed when the condition was true, but once blocks are implemented in this step, you will be able to write multiple statements grouped with {} there, just as in C.

The function body is actually also a block. Grammatically, the function body is required to be a block. The { ... } written in a function definition is syntactically the same thing as the { ... } written after if or while.

The grammar with blocks added is shown below.

program = stmt*
stmt    = expr ";"
        | "{" stmt* "}"
        | ...
...

In this grammar, if a stmt starts with "{", zero or more stmts can appear until "}". To parse stmt* "}", call stmt repeatedly in a loop until "}" appears, and return the results as a vector.

To implement blocks, add a node type ND_BLOCK that represents a block. The Node structure must be able to hold a vector containing the statements in the block. Given a node of type ND_BLOCK, the code generator should generate code for the contained statements in order. Note that each statement leaves one value on the stack, so don't forget to pop it each time.
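As an illustration of the shape this can take, here is a simplified sketch with stand-in Node definitions (field names such as kind and body are assumptions of mine, not necessarily the book's exact code); the output is collected in a buffer so that it is easy to inspect:

```c
#include <stdio.h>
#include <string.h>

enum { ND_NUM, ND_BLOCK };   // just enough node kinds for the sketch

typedef struct Node Node;
struct Node {
  int kind;
  int val;      // ND_NUM: the literal value
  Node **body;  // ND_BLOCK: NULL-terminated vector of statements
};

static char out[1024];

void gen(Node *node) {
  char buf[64];
  switch (node->kind) {
  case ND_NUM:
    snprintf(buf, sizeof buf, "  push %d\n", node->val);
    strcat(out, buf);
    return;
  case ND_BLOCK:
    for (Node **s = node->body; *s; s++) {
      gen(*s);                    // each statement leaves one value...
      strcat(out, "  pop rax\n"); // ...so pop it to keep the stack balanced
    }
    return;
  }
}
```

A block containing the statements 1; 2; thus produces matching push/pop pairs and leaves the stack exactly as it found it.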

Step 14: Handle function calls

In this step, our goal is to recognize argumentless function calls such as foo() and compile them to call foo.

The new syntax with function calls is shown below.

...
primary = num
        | ident ("(" ")")?
        | "(" expr ")"

By looking one token ahead after reading an ident, you can tell whether the ident is a variable name or a function name.

For the test, prepare a C file with contents like int foo() { printf("OK\n"); }, compile it into an object file with cc -c, and link it with the output of your compiler. If you do, the whole thing should link properly, and you should be able to confirm that the function you wanted to call is actually called.

Once that works, the next step is to support function calls with arguments, such as foo(3, 4). There is no need to check the number or types of arguments. Simply evaluating the arguments in order leaves the values to be passed to the function on the stack, so we can copy them into registers in the order specified by the x86-64 ABI and call the function. More than six arguments need not be supported.

For testing, prepare a function such as int foo(int x, int y) { printf("%d\n", x + y); } and link it as before to check that it works.

The x86-64 function call ABI is easy (as long as you follow the method above), but there is one caveat: RSP must be a multiple of 16 at the time of a function call. Since push and pop change RSP in 8-byte units, RSP is not necessarily a multiple of 16 when a call instruction is issued. If this promise is broken, functions that assume RSP is a multiple of 16 will fail in mysterious ways, seemingly only half of the time. Make sure to adjust RSP to a multiple of 16 before calling a function.
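Putting the pieces of this step together, here is a hedged sketch of what the call-emitting code might look like (gen_call, the emit helper, and the runtime alignment check are my own illustration of one possible approach, not the book's definitive code; the register order is the System V x86-64 one):

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static const char *argreg[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
static int labelseq;
static char out[2048];

static void emit(const char *fmt, ...) {
  char buf[128];
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(buf, sizeof buf, fmt, ap);
  va_end(ap);
  strcat(out, buf);
}

// Emit a call to `name` whose `nargs` argument values are already on
// the stack, evaluated left to right.
void gen_call(const char *name, int nargs) {
  int seq = labelseq++;
  for (int i = nargs - 1; i >= 0; i--)  // last value pushed = last argument
    emit("  pop %s\n", argreg[i]);
  // RSP must be a multiple of 16 at the call; one way is to test the
  // alignment at runtime and insert a fix-up when it is off by 8.
  emit("  mov rax, rsp\n");
  emit("  and rax, 15\n");
  emit("  jnz .Lcall%d\n", seq);
  emit("  mov rax, 0\n");       // AL = 0: no vector args (for variadics)
  emit("  call %s\n", name);
  emit("  jmp .Lend%d\n", seq);
  emit(".Lcall%d:\n", seq);
  emit("  sub rsp, 8\n");       // realign RSP to 16
  emit("  mov rax, 0\n");
  emit("  call %s\n", name);
  emit("  add rsp, 8\n");
  emit(".Lend%d:\n", seq);
  emit("  push rax\n");         // the return value becomes the result
}
```

For example, gen_call("foo", 2) pops the second argument into RSI and the first into RDI before the call.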

Step 15: Handle function definitions

Once this is done, we can move on to defining functions. However, C function definitions are difficult to parse, so we will not implement them in full at once. Currently our language only has the int type, so instead of the syntax int foo(int x, int y) { ... } we will implement a syntax that omits the type names: foo(x, y) { ... }.

The callee needs to be able to access the arguments by the names x and y, but at the moment values passed in registers cannot be accessed by name. The trick is to compile the function as if local variables named x and y existed, and to write the register values into the stack areas for those local variables in the function prologue. After that, arguments and local variables can be handled without any distinction.

Until now, the behavior was as if the entire code were implicitly wrapped in main() { ... }, but we now abolish that, and all code will be written inside some function. Then, when parsing the top level, the first token read should always be a function name, followed by the argument list and then the function body, which makes it easy to parse.

After completing this step, you will be able to compute and display Fibonacci numbers using recursion, which should make things a lot more interesting.

Binary level interface

The C language specification defines specifications at the source code level. For example, language specifications specify how functions can be defined, and which files should be included to declare which functions. On the other hand, the language specifications do not specify what kind of machine language the source code written to comply with the standard will be converted into. This makes sense, since the C language standard was not determined with any particular instruction set in mind.

Therefore, at first glance it may seem that there is no need to clearly determine machine language level specifications, but in reality, specifications are determined to some extent for each platform. This specification is called ABI ( Application Binary Interface ).

The way functions have been called so far in this book placed the arguments in registers in a particular order. The return value, likewise, was promised to be set in RAX. Such rules about how to call functions are called "function calling conventions." Function calling conventions are part of the ABI.

In addition to how arguments and return values are passed, the C language ABI covers a number of other rules as well.

The ABI is just a software-level convention, so it would be possible to devise something different from what is explained in this book, but code that is not ABI-compatible cannot call and be called across the boundary, so in practice CPU vendors and OS vendors define a standard ABI for each platform. For x86-64, two ABIs are in wide use: the System V ABI, used on Unix and macOS, and the Microsoft ABI, used on Windows. Note that these two calling conventions do not differ out of necessity; they were simply created by different people.

So far in this book, we have used our own compiler to call functions compiled with another compiler. This was possible because our C compiler and another compiler had the same ABI.


Representation of integers in computers

At this point, let's understand how integers, especially negative integers, are represented on computers. This chapter explains how to represent unsigned numbers and how to represent signed numbers using two's complement representation .

In this book, binary bit patterns are written like 0b0001_1010, with a 0b prefix and an underscore every four digits to make them easier to read. The 0b prefix is actually a compiler-specific extension that many C compilers accept as-is (although they usually do not allow the underscores).

unsigned integer

The representation of an unsigned integer is the same as an ordinary binary number. Just as a decimal number has, from the lowest digit, the 1s place, 10s place, 100s place, 1000s place, ... (that is, the 10^0, 10^1, 10^2, 10^3, ... places), a binary number has, from the lowest digit, the 1s place, 2s place, 4s place, 8s place, ... (that is, the 2^0, 2^1, 2^2, 2^3, ... places).

For example, the value of the unsigned integer represented by the bit pattern 0b1110 can be found by looking at the positions of the bits that are 1. In this case the second, third, and fourth digits, that is, the 2s, 4s, and 8s places, are 1, so 0b1110 represents 2 + 4 + 8 = 14. Some example diagrams are shown below.

When you add 1 to an unsigned integer, the values cycle as shown in the following graph. This is an example with a 4-bit integer.

When the result of an operation needs more digits than are available, producing a different result than if there were infinitely many bits, this is called "overflow". For example, with 8-bit unsigned integers, 1+3 does not overflow, but 200+100 and 20-30 do, yielding 44 and 246 respectively. Mathematically speaking, the result is the remainder after dividing by 2^8 = 256.
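The wrap-around behavior is easy to observe in C with the fixed-width uint8_t type (a small demonstration I am adding, not from the original text):

```c
#include <stdint.h>

// 8-bit unsigned arithmetic is performed modulo 2^8 = 256:
// the result is truncated to the low 8 bits.
uint8_t add8(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }
uint8_t sub8(uint8_t a, uint8_t b) { return (uint8_t)(a - b); }
```

add8(200, 100) yields 44 (300 - 256), and sub8(20, 30) yields 246 (-10 + 256), matching the remainders modulo 256.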

Column: Interesting bugs caused by overflow

Numerical overflow can sometimes cause unexpected bugs. Here we introduce a bug that was present in the first version of the game "Civilization".

Civilization is a strategy simulation game that pits civilizations against each other, where you choose a character like Genghis Khan or Queen Elizabeth to conquer the world or win the space race.

A bug in the original Civilization was that Gandhi, a believer in non-violence, would suddenly launch nuclear attacks. The cause was the logic that when a civilization adopts democracy, its aggressiveness decreases by 2. In the original Civilization, Gandhi's aggressiveness was 1, the lowest of all leaders, but as the game progressed and the Indian civilization adopted democracy, his aggressiveness was decreased by 2 and, wrapping around, became 255, suddenly turning Gandhi into an extremely aggressive player. By that time the game has usually progressed to the point where every civilization has nuclear weapons, so as a result Gandhi would abruptly start a nuclear war on some turn. This "Nuclear Gandhi" was rather amusing and became a standard feature of the later Civilization series, but in the first game it was an unintended bug.

signed integer

Signed integers use two's complement representation, which treats the most significant bit specially. An n-bit integer in two's complement representation follows the same rules as an unsigned number except in the topmost digit: the nth digit represents -2^(n-1) instead of 2^(n-1).

Specifically, if we consider a 4-digit binary number, each digit and the number it represents are as shown in the table below.

Digit          4   3   2   1
If unsigned    8   4   2   1
With sign     -8   4   2   1

As with unsigned values, the signed value represented by a bit pattern can be found by looking at the positions of the 1 bits. For example, if we read 0b1110 as a 4-digit signed integer, the second, third, and fourth digits, that is, the 2s, 4s, and -8s places, are 1, so 0b1110 represents 2 + 4 + (-8) = -2. Some example diagrams are shown below.
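The digit-weight rule can be written down directly in C (a small helper I am adding for illustration): add up the low bits with their usual weights and count the top bit as negative.

```c
// Value of a 4-bit two's complement pattern, computed from the digit
// weights in the table above: bits 0-2 weigh 1, 2, 4; bit 3 weighs -8.
int signed4(unsigned bits) {
  int val = (int)(bits & 0x7);
  if (bits & 0x8)
    val -= 8;
  return val;
}
```

signed4(0xe), i.e. 0b1110, evaluates to 2 + 4 - 8 = -2, as in the example above.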

Under this rule, as long as the most significant bit is off, a signed integer represents the same number as if it were interpreted as an unsigned integer. For a 4-bit integer, 0 through 7 have the same bit patterns whether signed or unsigned. On the other hand, if the fourth bit is on, the bit pattern represents one of the numbers from -8 to -1 (0b1000 to 0b1111). Since the number is negative whenever the most significant bit is on, the most significant bit is sometimes called the "sign bit".

When you add 1 to a signed integer, the values cycle as shown in the following graph. This is an example with a 4-bit integer.

Understanding the above rules will help explain a variety of seemingly strange behaviors of signed integers commonly encountered in programming.

Readers have probably experienced that when you add 1 to a signed integer, the number turns from a large number to an extremely small number when it overflows. If you consider 2's complement representation, you can understand exactly what is happening. For example, for an 8-bit signed integer, the largest number is 0b0111_1111 or 127. Adding 1 to this gives 0b1000_0000, which is -128 in two's complement representation. This is the largest negative number in absolute value.

For example, in the test for unary minus, if main returns -3, the exit code for the whole program becomes 253. While main sets RAX to -3, that is, 0b1111_1111_1111_1111_1111_1111_1111_1101, the receiving side considers only the lower 8 bits of RAX to be meaningful and treats them as an unsigned integer, so the result is as if 0b1111_1101, or 253, were the return value.

In this way, what number a given bit pattern represents depends on the assumptions of whoever reads it. The characters in a paper book, for example, are physically nothing but ink stains; they have meaning only because people read them as sentences. In the same way, what is in a computer's memory is just a string of on/off bits and has no inherent meaning in itself. For a number to be exchanged, the interpretation must match between the side that writes the value and the side that reads it.

Note that in two's complement representation, one more negative number can be represented than positive numbers. For example, with an 8-bit integer, -128 is representable but +128 is just outside the representable range. This imbalance between the positive and negative ranges is unavoidable by construction: n bits can represent 2^n patterns, which is always an even number, but once one bit pattern is allocated to 0, an odd number of patterns remain, so either the positives or the negatives necessarily get one more.

sign extension

In computers, the operation of widening the bit width of a number often occurs. For example, if you want to read an 8-bit number from memory and set it in a 64-bit register, you need to stretch the 8-bit value to 64 bits.

When you're working with unsigned integers, extending the value is easy, just fill the high-order bits with 0s. For example, expanding the 4-bit value 0b1110 = 14 to 8 bits becomes 0b0000_1110 = 14.

On the other hand, when dealing with signed integers, filling the upper bits with 0 can change the number. For example, if you extend the 4-bit value 0b1110 = -2 to 8 bits by making it 0b0000_1110, it becomes 14, which is not even negative, since the sign bit is no longer set.

When extending a signed integer, if the sign bit is 1, the new high-order bits must be filled with all 1s, and if the sign bit is 0, the new high-order bits must be filled with all 0s. This operation is called " sign extension ." For example, if we sign-extend the 4-bit value 0b1110 = -2 to 8 bits, we get 0b1111_1110 = -2, which shows that the bit width has been successfully expanded.

You can think of an unsigned integer as having an infinite run of 0s to the left of the number, from which you bring digits in when widening it.

Similarly, you can think of a signed integer as having an infinite run of copies of the sign bit to the left of the number, from which you bring digits in when widening it.

In this way, when trying to fit a number into a wider bit width, you need to be aware of whether the value you are currently dealing with is unsigned or signed.
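As a small concrete check (my own illustration, not from the original text), here is sign extension from 4 to 8 bits written out in C:

```c
// Widen a 4-bit two's complement value (in the low bits of v) to 8 bits.
// If the sign bit (bit 3) is 1, fill the new high bits with 1s;
// otherwise leave them 0.
unsigned sign_extend_4_to_8(unsigned v) {
  v &= 0xf;
  if (v & 0x8)
    v |= 0xf0;
  return v;
}
```

sign_extend_4_to_8(0xe) gives 0xfe (0b1111_1110, still -2 when read as a signed byte), while sign_extend_4_to_8(0x6) stays 0x06.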

Column: Representation of negative numbers that do not require sign extension

Two's complement is the commonly used way to represent signed integers in computers, but it is not the only way to map positive and negative integers onto bit patterns. For example, consider base -2 ("negabinary"), in which the digits represent, from the lowest, (-2)^0, (-2)^1, (-2)^2, and so on. Below is a comparison table of what each digit represents in the 4-bit case.

Digit              4   3   2   1
unsigned           8   4   2   1
two's complement  -8   4   2   1
negabinary        -8   4  -2   1

A 4-bit negabinary number can represent a total of 16 integers from -10 to 5, as shown below.

  5   0b0101
  4   0b0100
  3   0b0111
  2   0b0110
  1   0b0001
  0   0b0000
 -1   0b0011
 -2   0b0010
 -3   0b1101
 -4   0b1100
 -5   0b1111
 -6   0b1110
 -7   0b1001
 -8   0b1000
 -9   0b1011
-10   0b1010

Negabinary numbers have the disadvantages that operations such as carrying are awkward and that 0 does not fall near the center of the representable range, but on the other hand they have the interesting property that they need no sign bit. Consequently, when widening a negabinary number to more digits, the new high-order bits can always be filled with 0.
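For the curious, the table above can be reproduced with a short conversion routine (my own sketch, not from the original text); each step peels off the lowest digit and divides by -2:

```c
// Encode an integer in base -2 ("negabinary"). Works for positive and
// negative n alike, with no sign bit in the result.
unsigned to_negabinary(int n) {
  unsigned bits = 0;
  for (int shift = 0; n != 0; shift++) {
    int rem = n & 1;               // lowest digit, always 0 or 1
    bits |= (unsigned)rem << shift;
    n = (n - rem) / -2;            // shift out that digit in base -2
  }
  return bits;
}
```

For example, to_negabinary(-3) is 0b1101 and to_negabinary(-10) is 0b1010, matching the table.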

In this way, there are various ways to represent integers on a computer, not just two's complement. Among them, two's complement is the easiest to handle in hardware, and it is used in almost all computers in existence.

Reversal of sign

The details of two's complement representation are not strictly necessary knowledge for writing a compiler, but remembering a few techniques related to it can be useful in various situations. Here we explain a simple method for reversing the sign of a number.

In two's complement representation, if you invert all bits and add 1, the sign of the number is reversed. For example, the steps to find the bit pattern of -3 from 3 for an 8-bit signed integer are as follows.

  1. Represent a number in binary. For 3, 0b0000_0011.
  2. Invert all bits. In this case, it will be 0b1111_1100.
  3. Add 1. In this case, it will be 0b1111_1101. This is the -3 bit pattern.

By remembering the above method, you can easily find the bit pattern of negative numbers.
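The three steps above translate directly into one line of C (an illustration I am adding; the mask keeps the result within 8 bits):

```c
// Negate an 8-bit two's complement pattern: invert all bits, add 1,
// and truncate to 8 bits.
unsigned neg8(unsigned bits) {
  return (~bits + 1) & 0xff;
}
```

neg8(3) yields 0xfd (0b1111_1101, the pattern for -3), and applying neg8 again gets 3 back.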

Also, by performing the same operation on a bit pattern whose sign bit is set, turning it into a positive number, you can easily find the value that bit pattern represents. For example, it is tedious to work out what 0b1111_1101 represents by adding up the digit weights, but if you invert the bits and add 1, you get 0b0000_0011, so you can see that it represents the negation of 3, namely -3.

The reason why the above trick works is relatively simple. We have not defined two's complement representation mathematically up to this point, so the explanation is necessarily a little informal, but the idea is as follows.

Inverting all bits is the same as subtracting the number from the bit pattern whose bits are all 1, that is, from -1. For example, the bit pattern 0b0011_0011 can be inverted as follows:

  1111 1111
- 0011 0011
= 1100 1100

In other words, inverting the bit pattern representing the number n is equivalent to computing -1 - n. Adding 1 to that computes (-1 - n) + 1 = -n, which means we have obtained -n from n.
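The trick can be checked in ordinary C. Below is a minimal sketch, where negate8 is a hypothetical helper name and int8_t makes the 8-bit width explicit.

```c
#include <assert.h>
#include <stdint.h>

// Negate an 8-bit value using the two's complement trick:
// invert all bits, then add 1.
int8_t negate8(int8_t x) {
  return (int8_t)(~x + 1);
}
```

Viewing the result as unsigned lets you inspect the raw bit pattern: the bits of negate8(3) are 0b1111_1101, i.e. 0xfd.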

Column: Literal number base

The C standard allows numbers to be written in octal, decimal, or hexadecimal. A number written plainly, like 123, is decimal; written with a leading 0x, like 0x8040, it is hexadecimal; and written with a leading 0, like 0737, it is octal.
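As a quick illustration of the three notations (the variable names here are just for illustration), the literals above have the following values:

```c
#include <assert.h>

int dec = 123;    // decimal
int hex = 0x8040; // hexadecimal: 8*4096 + 4*16 = 32832
int oct = 0737;   // octal: 7*64 + 3*8 + 7 = 479
```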

Many readers may think they have never used C's octal notation, but under this syntax even a plain 0 is written in octal, so every C programmer actually writes octal numbers very often. This is a bit of trivia, but if you think about it, there is a reason behind it that may or may not be deep.

To begin with, 0 is a somewhat special notation among numbers. Normally a number like 1 is not written as 01 or 001 just because its tens or hundreds digit is 0, but if you applied that rule to 0 itself, you would end up with an empty string. Writing nothing when you mean 0 would be a practical problem, so there is a special rule for writing 0, and that is why 0 ends up in a somewhat special category in the C grammar.


Pointers and string literals

In the previous chapters, we developed a language that can perform meaningful calculations, but our language cannot even display Hello world yet. It's time to add strings so that programs can output meaningful messages.

C string literals are closely related to char types, global variables, and arrays. Consider the following function as an example.

int main(int argc, char **argv) {
  printf("Hello, world!\n");
}

The above code compiles the same way as the code below, where msg is a unique identifier that does not collide with any other identifier.

char msg[15] = "Hello, world!\n";

int main(int argc, char **argv) {
  printf(msg);
}

Our compiler still lacks several features needed to support string literals. In this chapter, we will implement the following features in order, so that we can support string literals and display messages with printf:

  1. unary & and unary *
  2. pointer
  3. array
  4. global variables
  5. char type
  6. string literal

We will also add the functions necessary to test the above functions in this chapter.

Step 16: Unary & and unary *

In this step, as the first step toward implementing pointers, we implement the unary & operator, which returns an address, and the unary * operator, which dereferences an address.

These operators normally return or take values of pointer type, but since our compiler does not yet have any type other than integers, we will substitute the integer type for pointer types for now. That is, &x returns the address of the variable x as a plain integer, and *x treats the value of x as an address and reads the value stored at that address.

Implementing such an operator allows code like the following to work:

x = 3;
y = &x;
return *y; // returns 3

Also, by taking advantage of the fact that local variables are allocated contiguously in memory, it is possible to forcibly access variables on the stack indirectly via a pointer. The code below assumes that the variable y is located 8 bytes above the variable x on the stack.

x = 3;
y = 5;
z = &y + 8;
return *z; // returns 3

In such an implementation, which does not distinguish pointer types from integer types, an expression such as *4 will read the value from address 4, but let's just accept that for now.

Implementation is relatively easy. The grammar with unary & and unary * added is shown below. Modify the parser to follow this grammar, reading unary & and unary * as nodes of kind ND_ADDR and ND_DEREF respectively.

unary = "+"? primary
      | "-"? primary
      | "*" unary
      | "&" unary
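To sketch how the parser change might look, here is a self-contained toy version of the unary rule. It uses simplified hypothetical helpers (new_node, primary, kind_at, parse_unary) rather than the book's actual ones, and only parses integer literals as primaries.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { ND_NUM, ND_NEG, ND_ADDR, ND_DEREF } NodeKind;

typedef struct Node Node;
struct Node {
  NodeKind kind;
  Node *lhs; // operand for unary nodes
  int val;   // value for ND_NUM
};

static char *p; // current parse position

static Node *new_node(NodeKind kind, Node *lhs) {
  Node *n = calloc(1, sizeof(Node));
  n->kind = kind;
  n->lhs = lhs;
  return n;
}

static Node *primary(void) {
  Node *n = new_node(ND_NUM, NULL);
  n->val = strtol(p, &p, 10);
  return n;
}

// unary = ("+" | "-")? primary | "*" unary | "&" unary
static Node *unary(void) {
  if (*p == '+') { p++; return primary(); }
  if (*p == '-') { p++; return new_node(ND_NEG, primary()); }
  if (*p == '*') { p++; return new_node(ND_DEREF, unary()); }
  if (*p == '&') { p++; return new_node(ND_ADDR, unary()); }
  return primary();
}

Node *parse_unary(char *src) { p = src; return unary(); }

// Walk depth levels down the lhs chain and return the node kind.
NodeKind kind_at(Node *n, int depth) {
  while (depth--)
    n = n->lhs;
  return n->kind;
}
```

Note that "*" and "&" recurse into unary itself, so inputs like **&5 nest naturally.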

There are very few changes to the code generator. The changes are listed below.

  case ND_ADDR:
    gen_lval(node->lhs);
    return;
  case ND_DEREF:
    gen(node->lhs);
    printf("  pop rax\n");
    printf("  mov rax, [rax]\n");
    printf("  push rax\n");
    return;

Step 17: Eliminate implicit variable definitions and introduce the int keyword

Until now, all variables and function return values were implicitly of type int. Therefore, we did not bother to define variables with a type name like int x;, and assumed that every new identifier was a new variable name. Once we introduce types, we can no longer make that assumption, so let's fix this point first: require every variable definition to start with the int keyword.

Step 18: Introducing pointer types

Define a type that represents a pointer

In this step, we allow int, previously the only permitted type name, to be followed by zero or more *s. In other words, this lets definitions such as int *x and int ***x be parsed.

Of course, a type such as ``pointer to int'' must be handled by the compiler. For example, if a variable x is a pointer to an int, the compiler must know that the expression *x has type int. Since types can be arbitrarily complex, such as ``pointer to pointer to pointer to int'', this cannot be expressed with only a fixed-size type.

So what we're going to do is, again, use a pointer. Until now, the only information associated with a variable through the map was its offset from the base pointer (RBP) on the stack. Change this so that a variable can also carry a type. Roughly speaking, the variable type should be a structure like the following.

struct Type {
  enum { INT, PTR } ty;
  struct Type *ptr_to;
};

Here, ty holds one of two values: INT or PTR (``pointer to''). ptr_to is a meaningful member only when ty is PTR, in which case it holds a pointer to the Type object representing the pointed-to type. For example, for a ``pointer to int'', the data structure representing that type looks like this internally.

a data structure representing a pointer to an int

If it is a "pointer to a pointer to an int", it will look like this:

A data structure representing a pointer to a pointer to an int

In this way, arbitrarily complex types can be expressed within the compiler.
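As a sketch, such Type chains could be built in code as follows; int_type and pointer_to are hypothetical constructor helpers, not functions from the text.

```c
#include <assert.h>
#include <stdlib.h>

// The Type struct from the text.
struct Type {
  enum { INT, PTR } ty;
  struct Type *ptr_to;
};

// Allocate a Type representing int.
struct Type *int_type(void) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = INT;
  return t;
}

// Wrap a type in one level of "pointer to".
struct Type *pointer_to(struct Type *base) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = PTR;
  t->ptr_to = base;
  return t;
}
```

With these, ``pointer to pointer to int'' is simply pointer_to(pointer_to(int_type())).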

Assign to the value pointed to by the pointer

How do we compile an assignment such as *p=3, where the left-hand side is not a simple variable name? The basic idea is the same as when the left-hand side is a simple variable: compile *p as an lvalue, generating the address that *p designates, namely the value of p.

When compiling a syntax tree representing *p=3, we recursively descend the tree to generate code, and the first thing called is the code that compiles *p as an lvalue.

The code generator branches depending on the kind of syntax tree it is given. For a simple variable, as mentioned above, we output code that computes that variable's address; but here we are given a dereference operator, so we must do something different. When given a dereference operator, compile the syntax tree inside it as an "rvalue". That should compile into code that computes some address (otherwise the result could not be dereferenced). Then simply leave that address on the stack.

Once you've completed this step, you should be able to compile statements like the following:

int x;
int *y;
y = &x;
*y = 3;
return x; // → 3

Step 19: Implement pointer addition and subtraction

In this step, we make expressions such as p+1 and p-5 work for values p of pointer type. Although p+1 looks like simple integer addition, it is actually quite a different operation: p+1 does not mean adding 1 to the address held by p, but rather yields a pointer to the element after the one p points to, so we must add the width of the pointed-to data type. For example, if p points to an int, our ABI adds 4 bytes to the address for p+1. If p is a pointer to a pointer to an int, then p+1 adds 8.

Therefore, when adding to or subtracting from a pointer, you need a way to know the size of a type. For now it is simply 4 for int and 8 for pointers, so write code along those lines.
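A sketch of this sizing rule in code, reusing the Type struct from the previous step; size_of and pointer_add_offset are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>

struct Type {
  enum { INT, PTR } ty;
  struct Type *ptr_to;
};

struct Type *int_type(void) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = INT;
  return t;
}

struct Type *pointer_to(struct Type *base) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = PTR;
  t->ptr_to = base;
  return t;
}

// 4 bytes for int, 8 for pointers, as the text prescribes.
int size_of(struct Type *ty) {
  return ty->ty == INT ? 4 : 8;
}

// Byte offset the code generator should add when compiling p + n
// for a pointer p: n scaled by the size of the pointed-to type.
int pointer_add_offset(struct Type *ptr, int n) {
  return n * size_of(ptr->ptr_to);
}
```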

Since we don't yet have a way to allocate contiguous memory at this stage (our compiler doesn't have arrays yet), writing tests is a bit difficult. You can enlist the help of an external compiler: do the malloc there, and write tests that call the helper function from your compiler's output. For example, you could test it like this.

int *p;
alloc4(&p, 1, 2, 4, 8);
int *q;
q = p + 2;
*q;  // → 4
q = p + 3;
return *q;  // → 8
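One possible definition of the alloc4 helper, written in ordinary C to be compiled with an external compiler and linked into the tests; alloc4_check is a hypothetical wrapper that mirrors the test program above.

```c
#include <assert.h>
#include <stdlib.h>

// Allocate an array of four ints and store its address in *p.
void alloc4(int **p, int a, int b, int c, int d) {
  int *q = malloc(4 * sizeof(int));
  q[0] = a; q[1] = b; q[2] = c; q[3] = d;
  *p = q;
}

// Mirrors the test in the text: reads *(p+2) and *(p+3).
int alloc4_check(void) {
  int *p;
  alloc4(&p, 1, 2, 4, 8);
  int *q = p + 2;
  int r = *q; // 4
  q = p + 3;
  return r + *q; // 4 + 8
}
```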

Column: size of int or long

A data model in which ints are 32 bits and longs and pointers are 64 bits, such as the x86-64 System V ABI, is called LP64; the name means that longs and pointers are 64 bits. On the same x86-64 architecture, Windows uses LLP64, a data model in which ints and longs are 32 bits and long longs and pointers are 64 bits.

LP64 and LLP64 are not ABI-compatible because the size of long differs. For example, suppose you create a structure containing a long member, write the entire structure to a file as-is, and read it back by casting the file's bytes directly to that structure. Such a file cannot be exchanged between Unix and Windows.

The C specification states that a ``plain'' int object "has the natural size suggested by the architecture of the execution environment". Put that way, it sounds as if int must be 64 bits on a 64-bit machine, but what is "natural" is subjective, and even 64-bit machines can usually handle 32-bit operations perfectly naturally, so making int 32 bits on a 64-bit machine is not necessarily wrong. Realistically, making int 64 bits causes problems such as doubling the memory used by int data and leaving no basic type with a natural 32-bit size.

For these reasons, int is 32 bits on most existing 64-bit machines. However, ILP64 environments, in which int is 64 bits, have also existed; for example, old Cray supercomputers were ILP64.
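These sizes can be confirmed in C on an LP64 platform such as x86-64 Linux; the helper names are just for illustration, and the expected values assume LP64 (on LLP64 Windows, long_size() would return 4 instead).

```c
#include <assert.h>

int int_size(void)  { return (int)sizeof(int); }
int long_size(void) { return (int)sizeof(long); }
int ptr_size(void)  { return (int)sizeof(void *); }
```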

Step 20: sizeof operator

Although sizeof looks like a function, it is grammatically a unary operator. In C, most operators are symbols, but there is no grammatical reason an operator must be a symbol; sizeof is in fact the exception.

Let's review the behavior of the sizeof operator for a moment. sizeof is an operator that returns the size in bytes of the type of its argument expression. For example, in our ABI, sizeof(x) returns 4 if x is an int and 8 if x is a pointer. Any expression can be written as the argument of sizeof; for example, sizeof(x+3) returns 4 if the type of the expression x+3 is int, and 8 if it is a pointer.

Our compiler doesn't have arrays yet, but if x is an array, sizeof(x) returns the total size of x in bytes. For example, if x is defined as int x[10], sizeof(x) returns 40. If x is defined as int x[5][10], then sizeof(x) is 200, sizeof(x[0]) is 40, and sizeof(x[0][0]) is 4.

The argument of sizeof is written only so that its type can be determined; it is not an expression that actually executes. For example, if you write sizeof(x[3]), no access to x[3] actually occurs. The type of the expression x[3] is known at compile time, so sizeof(x[3]) is replaced by the size of that type at compile time. Therefore, the concrete expression given to sizeof, such as x[3], no longer exists at runtime.

The behavior of sizeof is shown below.

int x;
int *y;

sizeof(x); // 4
sizeof(y); // 8

sizeof(x + 3); // 4
sizeof(y + 3); // 8
sizeof(*y);    // 4

// any expression can be passed to sizeof
sizeof(1); // 4

// the result of sizeof is currently of type int, so this is the same as sizeof(int)
sizeof(sizeof(1)); // 4

Now let's implement the sizeof operator. To do so, you will need to modify both the tokenizer and the parser.

First, modify the tokenizer to recognize the sizeof keyword as a token of type TK_SIZEOF.

Next, modify the parser to replace sizeof with an int-typed constant. The grammar with the sizeof operator added is shown below. In this grammar, sizeof is a unary operator defined with the same precedence as unary plus and unary minus. This is the same as C's syntax.

unary = "sizeof" unary
      | ("+" | "-")? primary

In this grammar, both sizeof(x) and sizeof x are allowed, and the same is true in actual C.

In the parser, when a sizeof operator appears, parse its argument expression as usual, then replace it with the number 4 if the type attached to the resulting syntax tree is int, or 8 if it is a pointer. Since the parser replaces it with a constant, no changes to the code generator are needed.

Step 21: Implement the array

Define an array type

In this step we will implement arrays. Up until this stage, we had only handled data small enough to fit in a register; this is the first time we deal with data larger than that.

However, C's syntax is restrictive when it comes to arrays. You cannot pass an array as a function argument or return one as a function return value; if you write code with that intent, the array itself is not passed by value, but a pointer to the array is automatically created and passed instead. Directly assigning one array to another is not supported either (you must use memcpy).

Therefore, there is no need to pass data that does not fit in a register into or out of functions. It is sufficient to be able to allocate a memory area larger than one word on the stack.

Please make it possible to read variable definitions such as the following.

int a[10];

The type of a above is an array; the array has length 10 and element type int. Like pointer types, array types can be arbitrarily complex, so, just as with pointers, use ptr_to to point to the element type of the array. The struct representing a type becomes:

struct Type {
  enum { INT, PTR, ARRAY } ty;
  struct Type *ptr_to;
  size_t array_size;
};

Here, array_size is a field that is meaningful only for array types; it holds the number of elements in the array.

Once you've done this, you should be able to allocate space for arrays on the stack easily. The size of an array in bytes is simply the size of an element in bytes multiplied by the number of elements. Until now, every variable was allocated one word in the stack area; change this so that an array gets the size it requires.
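A sketch of the size computation with the ARRAY case added; int_type, array_of, and size_of are hypothetical helpers consistent with the struct above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct Type {
  enum { INT, PTR, ARRAY } ty;
  struct Type *ptr_to;
  size_t array_size;
};

struct Type *int_type(void) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = INT;
  return t;
}

// An array of n elements whose element type is elem.
struct Type *array_of(struct Type *elem, size_t n) {
  struct Type *t = calloc(1, sizeof(struct Type));
  t->ty = ARRAY;
  t->ptr_to = elem;
  t->array_size = n;
  return t;
}

// Element size times element count; recursion handles nested arrays.
int size_of(struct Type *ty) {
  switch (ty->ty) {
  case INT:
    return 4;
  case PTR:
    return 8;
  case ARRAY:
    return size_of(ty->ptr_to) * (int)ty->array_size;
  }
  return 0;
}
```

The results match the sizeof examples from the previous step: int x[10] occupies 40 bytes and int x[5][10] occupies 200.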

Implement implicit type conversion from array to pointer

Arrays and pointers are often used in combination, and C's syntax lets them work without much visible distinction between the two, but this backfires: it seems to make the relationship between arrays and pointers hard for programmers to understand. So here we will explain that relationship.

First, in C, arrays and pointers are completely different types.

A pointer (on x86-64) is an 8-byte value type. Just as operators such as + and - are defined for int, + and - are defined (in slightly different forms) for pointers. In addition, the unary * operator is defined for pointers, allowing you to refer to what the pointer points to. Apart from unary *, there is nothing special about pointers; a pointer is an ordinary type, just like int.

On the other hand, an array is a type that can occupy any number of bytes. Unlike pointers, few operators are defined for arrays. The only ones are the sizeof operator, which returns the size of the array, and the & operator, which returns a pointer to the array. No other operator can be applied to an array.

So why does an expression like a[3] compile? Because in C, a[3] is defined to be equivalent to *(a+3). But wasn't the + operator undefined for arrays?

This is where the rule that implicitly converts arrays to pointers comes into play. Except when used as the operand of sizeof or unary &, an array is implicitly converted to a pointer to its first element. Therefore *(a+3) is an expression that adds 3 to a pointer to the first element of array a and dereferences the result, which has the same meaning as accessing the element at index 3 of the array.

In C, there is no [] operator for array access. C's [] is just a shorthand notation for accessing array elements via pointers.

Similarly, when you pass an array as a function argument, it becomes a pointer to the first element of the array; and the reason you can write code that appears to assign an array directly to a pointer is the same conversion described above.

Therefore, the compiler must convert arrays to pointers when implementing most operators. This shouldn't be too difficult to implement: except for sizeof and unary &, whenever you parse the operand of an operator and it has type ``array of T'', treat it as ``pointer to T''. For array-typed values, the code generator should generate code that pushes the address of the value onto the stack.

Once this is complete, you should be able to run code like the following.

int a[2];
*a = 1;
*(a + 1) = 2;
int *p;
p = a;
return *p + *(p + 1);  // → 3
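The program above is also valid C, so an existing compiler can confirm the expected result; decay_demo is just a hypothetical wrapper name for testing.

```c
#include <assert.h>

int decay_demo(void) {
  int a[2];
  *a = 1;        // a decays to a pointer to its first element
  *(a + 1) = 2;
  int *p;
  p = a;         // same conversion: array to pointer
  return *p + *(p + 1);
}
```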

Column: Language lawyers

A person with a thorough knowledge of formal language specifications is sometimes called a ``language lawyer'', likening language specifications to laws. The programmers' slang dictionary, The Jargon File, has an entry describing the term language lawyer.

The word language lawyer can also be used as a verb: language lawyering.

Skilled language lawyers are often respected by other programmers. When I worked on the C++ compiler team at Google, there was a person on the team who could be considered the ultimate language lawyer, and when there was something about C++ I didn't understand, I would often just ask him (there are many things in the C++ specification that even people who write C++ compilers don't understand). In fact, he had implemented major parts of Clang, a leading C++ compiler, and was a lead author of the C++ specification, making him one of the most knowledgeable people in the world about C++. It made me appreciate just how large the C++ language specification is and how complex its details are.

This book intentionally avoids delving into the details of the C language specification until the compiler has matured. There's a reason for that. When implementing a programming language that has a specification, you need to become a language lawyer to some extent, but paying too much attention to details from the beginning is not a desirable development style. Just as when drawing a picture you first complete a rough sketch of the whole rather than finishing one part in detail, when implementing a programming language you should keep a balance and avoid becoming too much of a language lawyer at first.

Step 22: Implement array subscripting

In C, x[y] is defined to be equivalent to *(x+y). Therefore, implementing subscripts is relatively simple: simply have your parser read x[y] as *(x+y). For example, a[3] is *(a+3).

Under this rule, 3[a] expands to *(3+a), so if a[3] works, 3[a] should also work. In fact, expressions like 3[a] are legal in C. Try it.
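Indeed, ordinary C compilers accept this; three_subscript is just a hypothetical wrapper name for testing.

```c
#include <assert.h>

int three_subscript(void) {
  int a[5];
  a[3] = 42;
  return 3[a]; // same as *(3+a), i.e. a[3]
}
```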

Step 23: Implement global variables

I would like to be able to write string literals in programs soon. In C, a string literal is an array of char. Since we have already implemented arrays, that part is fine; the difference is that string literals are not values that live on the stack. String literals reside at fixed locations in memory, not on the stack. Therefore, to implement string literals, we will first add global variables.

Until now, only function definitions were allowed at the top level. Let's change the syntax so that global variables can also be written at the top level.

Variable definitions are somewhat tricky to parse because they look similar to function definitions. For example, compare the following four definitions.

int *foo;
int foo[10];
int *foo() {}
int foo() {}

The top two foos are variable definitions and the bottom two are function definitions, but you cannot tell them apart until you reach the identifier that will become the function or variable name and read the token after it. Therefore, first call a function that reads the leading type, then read the identifier that should follow it, and then look one token ahead. If the token read ahead is "(", you have been reading a function definition; otherwise, a variable definition.
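A minimal sketch of this one-token lookahead, simplifying tokenization to character scanning and assuming the input starts with int; is_function_def is a hypothetical helper name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

// Read the leading type and the identifier, then peek one token
// ahead: "(" means a function definition, anything else a variable.
int is_function_def(const char *p) {
  p += strlen("int");                             // leading base type
  while (*p == ' ' || *p == '*')                  // zero or more '*'s
    p++;
  while (isalnum((unsigned char)*p) || *p == '_') // identifier
    p++;
  while (*p == ' ')
    p++;
  return *p == '(';                               // the lookahead
}
```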

Put the names of parsed global variables in a map so you can look them up by name. Only when a variable name cannot be resolved as a local variable should you attempt to resolve it as a global variable. This naturally allows a local variable to shadow a global variable of the same name.

The parser converts local variable references and global variable references to separate nodes in the abstract syntax tree. Since the names can be resolved at the parsing stage, the types can also be separated at that stage.

Until now, all variables lived on the stack, so variables were read and written relative to RBP (the base pointer). A global variable is not a value on the stack but a fixed location in memory, so compile accesses to it to address that location directly. Refer to the output of an actual compiler such as gcc.

Once you implement this, you may be surprised at how different local and global variables are. The reason they can be written without visual distinction is that the C language abstracts them well; internally, local variables and global variables are implemented quite differently.

Step 24: Implement the character type

Arrays were a type that can be larger than one word; characters are a type smaller than one word. To reach this step, you probably already wrote a function that takes an object representing a type and returns that type's size in bytes. First add the character type, then modify that function to return 1 for it.

There is no need to implement character literals (characters enclosed in single quotes) in this step. Resist the urge to implement everything at once, and keep each change as small as possible.

So in this step, a character is really just a small integer type. With movsx ecx, BYTE PTR [rax], you can read one byte from the address pointed to by RAX, sign-extend it, and put it in ECX. If sign extension is not required, use movzx ecx, BYTE PTR [rax] instead. When writing, as in mov [rax], cl, an 8-bit register is used as the source register.
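What movsx and movzx do can be expressed in C as follows; the function names are just for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Load one byte and sign-extend (like movsx) or zero-extend
// (like movzx) it to 32 bits.
int32_t load_signed(const int8_t *p)    { return *p; }
int32_t load_unsigned(const uint8_t *p) { return *p; }

int32_t sext_demo(void) { int8_t b = -1;   return load_signed(&b); }
int32_t zext_demo(void) { uint8_t b = 0xff; return load_unsigned(&b); }
```

The same byte 0xff becomes -1 when sign-extended and 255 when zero-extended.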

Please refer to the actual compiler output .

Once you have implemented this step, you should be able to run code like this:

char x[3];
x[0] = -1;
x[1] = 2;
int y;
y = 4;
return x[0] + y;  // → 3

Column: Difference between 8-bit registers and 32-bit registers

Why do we need movsx or movzx when reading a 1-byte value? When reading a 4-byte value, simply loading it into a lower-32-bit alias register such as EAX with a normal mov was sufficient, so it seems it should likewise suffice to load a char into AL with a normal mov. But that doesn't work properly. The answer to this mystery lies in the x86-64 specification.

On x86-64, loading into a lower-32-bit alias register resets the upper 32 bits to 0. However, loading into a lower-8-bit alias register leaves the upper 56 bits at their previous values. This is an inconsistent specification, but x86-64 is an instruction set with a long history, and such inconsistencies exist in many places.

x86-64 evolved from a 16-bit processor called the 8086, first to 32 bits and then to 64 bits, so first there was AL, then EAX, and then RAX. The original specification was that loading into AL did not reset the upper 24 bits of EAX (they stayed as they were); but when the architecture was extended to 64 bits, it was decided that loading into EAX would reset the upper 32 bits of RAX. There is a good reason a specification that compromises consistency like this was chosen.

Modern processors examine dependencies between instructions and increasingly execute unrelated instructions (instructions that do not use the result of a previous instruction) in parallel. Suppose the instruction set did not reset the upper 32 bits, and the upper 32 bits simply held garbage that we ignore while continuing to use only the lower 32 bits. That would create a false dependency between the instruction that produced the ignored upper 32 bits and subsequent instructions using the same register. By zero-extending and resetting the upper 32 bits, the previous value is completely overwritten, which breaks the dependency. For this reason, when x86 was extended to 64 bits, a specification that is easier to speed up was chosen, even though it was understood that consistency would suffer.

Step 25: Implement string literals

In this step we parse strings enclosed in double quotes so that they can be compiled. Now that we have the necessary parts (arrays, global variables, and the character type), it should be relatively easy to implement.

First, modify the tokenizer so that when it finds a double quote, it reads up to the next double quote to create a string token. There is no need to implement backslash escapes and the like in this step. It's important to go step by step, so even if something seems easy to implement, hold off.

Assembly code representing string literal data cannot be emitted in the middle of the machine code being generated for the CPU to execute; in the output assembly, global data and code must not be mixed. In other words, when emitting code we want to emit all string literals that appear in the code first, but walking the syntax tree just for that is a pain. An easy approach is to keep a vector containing every string literal seen so far, and simply append to it each time the parser sees a string.

Please refer to the actual compiler output .

By now you should be able to output strings with printf. This is a good opportunity to use your own programming language to write something a little more sophisticated than obvious test code. For example, how about writing a solver for the eight queens problem in your own language? It took humanity many decades to develop programming languages in which code at this level can be written easily. It is a great advance, for humanity and for you, that it can now be implemented in a matter of weeks.

(When calling a function that takes variable-length arguments, the number of floating-point arguments is passed in AL. Our compiler does not yet have floating-point numbers, so always set AL to 0 before calling a function.)

Step 26: Read input from file

Up until now, we have been passing C code directly as an argument string, but since the inputs are gradually getting longer, it's time to modify the compiler to take a file name as a command-line argument, like a normal C compiler. A function that opens a given file, reads its contents, and returns a string terminated with '\0' can be written concisely as follows:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Returns the contents of the specified file
char *read_file(char *path) {
  // Open the file
  FILE *fp = fopen(path, "r");
  if (!fp)
    error("cannot open %s: %s", path, strerror(errno));

  // Find the length of the file
  if (fseek(fp, 0, SEEK_END) == -1)
    error("%s: fseek: %s", path, strerror(errno));
  size_t size = ftell(fp);
  if (fseek(fp, 0, SEEK_SET) == -1)
    error("%s: fseek: %s", path, strerror(errno));

  // Read the file contents
  char *buf = calloc(1, size + 2);
  fread(buf, size, 1, fp);

  // Make sure the file always ends with "\n\0"
  if (size == 0 || buf[size - 1] != '\n')
    buf[size++] = '\n';
  buf[size] = '\0';
  fclose(fp);
  return buf;
}

For compiler-implementation reasons, input that always ends with a newline character is easier to handle than input that may end without one, so if the last byte of the file is not '\n', we append one.

Strictly speaking, this function will not work with special files that cannot be accessed randomly. For example, if you pass /dev/stdin (the device file representing standard input) or a named pipe as the file name, you should see an error message like /dev/stdin: fseek: Illegal seek. In practice, however, this function should be fine. Modify your code to read the file contents with this function and treat them as the input.

Since input files usually contain multiple lines, you should also enhance the error-reporting function. If, when an error occurs, you display the input file name, the line number of the offending line, and the contents of that line, the error message will look like this.

foo.c:10: x = y + + 5;
                  ^ not an expression

A function that displays such an error message would look like this:

// Input file name
char *filename;

// Reports an error location and exits.
// Prints an error message in the following format:
//
// foo.c:10: x = y + + 5;
//                   ^ not an expression
void error_at(char *loc, char *msg) {
  // Find the start and end of the line containing loc
  char *line = loc;
  while (user_input < line && line[-1] != '\n')
    line--;

  char *end = loc;
  while (*end != '\n')
    end++;

  // Find out which line number the found line is in the whole input
  int line_num = 1;
  for (char *p = user_input; p < line; p++)
    if (*p == '\n')
      line_num++;

  // Print the found line, along with the file name and line number
  int indent = fprintf(stderr, "%s:%d: ", filename, line_num);
  fprintf(stderr, "%.*s\n", (int)(end - line), line);

  // Point at the error location with "^" and print the error message
  int pos = loc - line + indent;
  fprintf(stderr, "%*s", pos, ""); // print pos spaces
  fprintf(stderr, "^ %s\n", msg);
  exit(1);
}

Although this error-reporting routine is quite simple, it outputs errors in a fairly professional-looking format.

Column: Error recovery

If the input code is grammatically incorrect, many compilers try to skip past the error and continue parsing the rest of the code, the goal being to find as many errors as possible in one run rather than just the first. The ability to recover from a parse error and continue parsing is called ``error recovery''.

Error recovery was a very important feature in old compilers. In the 1960s and 1970s, programmers time-shared large computers at computing centers; they had to bring in the code to be compiled and wait, sometimes overnight, before they could get the compiled result. In such an environment, one of the compiler's important jobs was to point out as many likely errors as possible. In older compiler textbooks, error recovery is one of the main topics of parsing.

Nowadays, development with compilers is more interactive, so error recovery is not as important a topic. The compiler we develop only prints the first error message. In modern times, this is sufficient in many cases.

Step 27: Line and block comments

Our compiler has gradually evolved, and it is now possible to write full-fledged code in it. At this point, what you want is comments. This chapter implements comments.

There are two types of comments in C. One is called a line comment: everything from // to the end of the line is a comment. The other is called a block comment: /* is the start marker and */ is the end marker. All characters inside a block comment are skipped; nothing except the two-character sequence */ has any special meaning there.

Grammatically, a comment is treated the same as a single space character. Therefore it is natural for the tokenizer to skip comments just as it skips whitespace. The code to skip comments is shown below.

void tokenize() {
  char *p = user_input;

  while (*p) {
    // Skip whitespace
    if (isspace(*p)) {
      p++;
      continue;
    }

    // Skip line comments
    if (strncmp(p, "//", 2) == 0) {
      p += 2;
      while (*p != '\n')
        p++;
      continue;
    }

    // Skip block comments
    if (strncmp(p, "/*", 2) == 0) {
      char *q = strstr(p + 2, "*/");
      if (!q)
        error_at(p, "コメントが閉じられていません");
      p = q + 2;
      continue;
    }
    ...

Here we used strstr, a function in the C standard library, to find the end of the block comment. strstr searches for one string inside another and returns a pointer to the first occurrence of the given substring, or NULL if it is not found.

Column: Line comment

Only block comments existed in the original C; line comments were officially added to the specification in 1999, almost 30 years after C was developed. This was supposed to be a non-breaking change, but in some subtle cases, code that previously worked could end up with a different meaning.

Specifically, the code below reads as a/b if only block comments are supported, but as just a if line comments are also supported.

a//*
// */ b

Column: Block comments and nesting

Block comments cannot be nested. /* has no special meaning inside a comment, so if you try to comment out code that already contains a block comment, as in

/* /* ... */ */

the first */ ends the comment, and the second */ causes a syntax error.

If you want to comment out a range of lines that may contain block comments, there is a way to use the C preprocessor:

#if 0
...
#endif

Wrapping the code in #if 0 ... #endif like this disables it.

Step 28: Rewrite the test in C

In this step we will rewrite the tests run by make test to make them faster. By the time you get to this step, you will probably already have over 100 tests written in a shell script. The shell script launches multiple processes for each test: for every test it starts our compiler, the assembler, the linker, and the test program itself.

Starting a process is not that fast, even for a small program, so doing it hundreds of times takes a considerable amount of time in total. Your test script is probably already taking a few seconds to run.

The reason I wrote the tests as shell scripts in the first place was that proper tests were impossible otherwise. When the language was at the calculator level, there was no such thing as if or ==, so the calculation results could not be verified within the language itself. But now they can: we can compare results to see if they are correct, and if they are not, print an error message (as a string) and exit.

Therefore, in this step, we will rewrite the tests that were written in shell script as C files.

Program execution image and initialization expressions

Now, with these steps behind us, our compiler supports all the major programming elements: functions, global variables, and local variables. Also, by learning about separate compilation and linking, you now understand how to compile a program in pieces and then combine them into a single file at the end.

This chapter describes how the OS executes executable files. By reading this chapter, you will be able to understand what kind of data is contained in the executable file and what happens before the main function is called.

This chapter also explains variable initialization expressions, that is, how code like the one below is compiled, and adds support for initializers to our compiler.

int x = 3;
int y[3] = {1, 2, 3};
char *msg1 = "foo";
char msg2[] = "bar";

It may be surprising, but in order to support initialization expressions, knowledge of how the program runs before it reaches main is essential.

This chapter describes a simple executable format in which all code and data are contained in one executable file. Such a file is called a "statically linked" executable. In contrast, an executable format called "dynamic linking" is also widely used, in which a program is split into fragments across multiple files that are combined in memory at execution time. I will explain dynamic linking in a separate chapter. First, let's understand the basic model: static linking.

Executable file structure

An executable file consists of a file header and one or more regions called segments. An executable usually has at least two segments, one containing executable code and one containing data. The segment that contains executable code is called the "text segment," and the segment that contains other data is called the "data segment." Real executable files contain other segments as well, but they are not needed to understand the mechanism, so they are omitted here.

Regarding terminology, note that although "text" here is the same word as the text in a text file, it has a different meaning. Traditionally, in the lower layers, data representing machine code is called "text." Machine code is just a string of bytes, and text is also a kind of data, but when we speak of "text and data," "data" usually means "data other than text." In this chapter, "data" means data other than text.

The object file that is input to the linker contains text and data separately. The linker concatenates text read from multiple object files and places it into one text segment, and similarly concatenates data read from multiple object files and places it into one data segment.

The file header of an executable file contains the memory address where each segment should be placed during execution. When you run an executable file, a program called the OS's " program loader " or simply " loader " copies text and data from the executable file into memory according to that information.

The following diagram shows an executable file and how it is loaded into memory by the loader.

Executable file and memory image

In the executable file shown in this figure, we assumed that the file header contained information to load the text segment at 0x41000 and the data segment at 0x50000.

The file header also contains information about the address at which execution should start. For example, if information is written that execution should start from 0x41040, the loader loads the executable file into memory as shown in the above figure, sets the stack pointer to 0x7fff_ffff_ffff_ffff, and then jumps to 0x41040. This will start execution of the user program.

Data segment contents

The content of the text segment is obviously machine code, but what's in the data segment? The answer is that the data segment contains global variables and literal strings.

Local variables are not directly contained in either the text or data segment. Local variables are created dynamically by the program in the stack area, so they do not exist immediately after the executable file is loaded into memory.

In the C execution model, a program can start executing the main function after the executable file is loaded into memory essentially unchanged. Therefore, global variables must end up with their proper initial values simply by being copied from the executable's data segment into memory.

Due to this restriction, in C, for example, the following initialization expression that uses a function call cannot be used for global variables.

int foo = bar();

If you had a global variable requiring dynamic initialization like the one above, someone would need to evaluate that expression before main is executed. However, C has no initialization mechanism that runs before main, so such initialization is not possible.

In other words, the value of a global variable must be finalized at link time and stored as a byte sequence in the executable file. Such values are limited to the following kinds of expressions:

It is obvious that constant expressions such as literal numbers or strings can be laid out as fixed values in the data segment.

The addresses of global variables and functions are normally not determined at compile time; they are determined when the linker produces the executable file. Nevertheless, it is legal to initialize a pointer-type global variable with the address of another global variable, as in int *x = &y;. The linker decides the layout of the program's segments itself, and therefore knows the addresses at which functions and global variables will be loaded, so it can fill in the contents of x at link time.

Also, adding a constant to the address of a label is supported by the linker, so a definition like int *x = &y + 3; is also legal.

Expressions other than the above patterns cannot be written in a global variable initializer. For example, you cannot use the value (rather than the address) of another global variable, and an expression that computes the difference between the addresses of two global variables, such as ptrdiff_t x = &y - &z;, cannot be written either. Only the limited patterns listed above are allowed in global variable initializers.

An example of an expression that can be written as a global variable initialization expression in C is as follows.

int a = 3;
char b[] = "foobar";
int *c = &a;
char *d = b + 3;

The assembly corresponding to each expression looks like this:

a:
  .long 3
b:
  .byte 0x66 // 'f'
  .byte 0x6f // 'o'
  .byte 0x6f // 'o'
  .byte 0x62 // 'b'
  .byte 0x61 // 'a'
  .byte 0x72 // 'r'
  .byte 0    // '\0'
c:
  .quad a
d:
  .quad b + 3

Instead of a sequence of .byte directives, you can also write the string using the .ascii directive, as in .ascii "foobar\0".

Column: Dynamic initialization of global variables

In C, the contents of global variables must be statically determined, but in C++, global variables can be initialized with arbitrary expressions. In other words, in C++, global variable initializers are executed before the main function is called. Roughly, the compiler compiles each dynamic initializer into a small function, the linker gathers pointers to those functions into a dedicated section of the executable, and the startup code runs every function in that section before calling main.

In this way, the dynamic initialization of global variables is possible through the cooperation of the compiler, linker, and program loader.

C could support dynamic initialization of global variables by using the same mechanisms as C++, but such functionality has been intentionally left out of the C language specification.

C's design choice is more restrictive for those who write programs, but it means the language specification can be fully implemented even in environments with a poor loader or no loader at all (such as code executed directly from ROM when the computer starts). So this is a question of where to draw the line, not of one design being strictly better than the other.

Initialization expression syntax

At first glance, an initialization expression appears to be just an assignment, but syntactically initializers and assignment expressions are quite different, and some special forms are allowed only in initializers. Let's make sure we understand these special forms.

First, an initializer can initialize an array. For example, the following declaration initializes x[0], x[1], and x[2] to 0, 1, and 2, respectively.

int x[3] = {0, 1, 2};

If an initializer is given, the array length can be omitted, because it can be determined from the number of elements on the right-hand side. For example, the declaration above and the one below have the same meaning.

int x[] = {0, 1, 2};

If the array length is given explicitly and the initializer supplies only some of the elements, the remaining elements are initialized to zero. Therefore, the following two declarations have the same meaning.

int x[5] = {1, 2, 3, 0, 0};
int x[5] = {1, 2, 3};

In addition, as special syntax for initializing char arrays, you may use a string literal as the initializer, as follows.

char msg[] = "foo";

The above declaration has the same meaning as the following one.

char msg[4] = {'f', 'o', 'o', '\0'};

Global variable initialization expression

Initialization expressions for global variables must be computed at compile time. The result of that computation is either a plain sequence of bytes or a pointer to a function or global variable; a pointer may additionally carry a single integer representing an offset from it.

A global variable that is given no initializer must be initialized with all bits set to 0. This is mandated by the C specification.

If an initializer cannot be reduced to one of the results above, treat it as a compile error.

Local variable initialization expression

Local variable initialization expressions look the same as global variable initialization expressions, but their meaning is very different. A local variable initialization expression is an expression that is executed on the spot. Therefore, its contents need not be determined at compile time.

Basically, a statement like int x = 5; compiles the same way as if you had written it as two separate statements: int x; x = 5;.

A statement like int x[] = {1, 2, foo()}; compiles the same as:

int x[3];
x[0] = 1;
x[1] = 2;
x[2] = foo();

The contents of local variables for which no initialization expression is given are undefined. Therefore such variables do not need to be initialized.

Column: word size

On x86-64, the term "word" can mean both 16-bit and 64-bit data. This is a confusing situation, but there are historical reasons for it.

The term "word" originally refers to the size of the largest integer or address that a computer can handle naturally. In that sense, 64 bits is one word on x86-64, a 64-bit processor.

On the other hand, the 16-bit meaning comes from the 8086, a 16-bit processor. When Intel's engineers extended the 8086 to 32 bits to create the 386, they decided to call 32 bits a "double word" or dword rather than change the size of a word. Similarly, x86-64 extended the 386 to 64 bits and called 64 bits a "quad word" or qword. This consideration for compatibility is what left "word" with two different meanings.

After step 29: [Addition required]

Static and dynamic linking

So far in this book, we have only used a feature called static linking. Static linking is a straightforward execution model, so by focusing on it I was able to explain things like assembly code and the memory image of an executable in an easy-to-understand way. However, static linking is not actually the common way executables are built; in practice, a feature called dynamic linking is widely used instead.

This chapter explains static linking and dynamic linking.

By default, compilers and linkers try to output executables that use dynamic linking. Readers may have seen the following error when forgetting to pass the -static option to cc (if you haven't, try removing -static from the Makefile and then running make).

$ cc -o tmp tmp.s
/usr/bin/ld: /tmp/ccaRuuub.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output

The linker attempts to link dynamically by default, but for that to work, the compiler must output assembly code that supports it. 9cc currently does not output such code, so if you forget to add -static, an error like the one above is displayed. After reading this chapter, you should be able to understand what the error means and what is needed to resolve it.

Static linking

A statically linked executable is a self-contained executable that does not require any other files to run. For example, printf is not a function written by the user; it is included in the standard library libc. When you create a statically linked executable, the code for printf is copied from libc into the executable. libc itself is not needed when running a statically linked program, because the necessary code and data inside libc have already been copied into the executable.

Let's take a look at how the simple program below, hello.c, becomes a statically linked executable.

#include <stdio.h>

int main() {
  printf("Hello world!\n");
}

To compile hello.c and link it into an executable named hello, enter the following commands:

$ cc -c hello.c
$ cc -o hello hello.o

In the above commands, the first line compiles hello.c into the object file hello.o, and the second line links it into an executable. It is possible to combine the two into one command, cc -o hello hello.c, but starting the compiler that way does the same thing internally as the two commands above.

However, as we have seen so far in this book, the header file stdio.h does not contain the code for function bodies. So when creating hello.o, the compiler knows about the existence and type of printf declared in stdio.h, but has no knowledge of printf's actual code, and hello.o cannot contain that code. In fact, hello.o contains only the definition of main. It is the linker's role to complete the executable by combining hello.o with the object file that contains printf.

When the linker is started via cc in the second line, not only hello.o from the command line but also /usr/lib/x86_64-linux-gnu/libc.a is passed to the linker. The printf function is contained in this libc.a. A .a file is an archive file, similar to .tar or .zip. Let's take a peek inside.

$ ar t /usr/lib/x86_64-linux-gnu/libc.a
...
printf_size.o
fprintf.o
printf.o
snprintf.o
sprintf.o
...

musl printf.c

As you can see, the archive file contains many object files; among them, printf.o contains the code for printf.

The execution model for statically linked executables is simple. At runtime, only the executable itself needs to be in memory, so each of its segments can be loaded at the fixed address decided at link time. That load cannot fail, because nothing else occupies memory before the executable is loaded. Therefore, with static linking, the addresses of all global variables and functions can be determined at link time.

Static links have the following advantages:

Static links have the following disadvantages:

Dynamically linked executables, on the other hand, require other files in order to run: .so files (on Unix) or .dll files (on Windows). These files contain the code for functions such as printf and global variables such as errno. Files such as .so and .dll are called dynamic libraries, or simply libraries, or DSOs (dynamic shared objects).

C type syntax

C's type syntax is notorious for being unnecessarily complex. Dennis Ritchie, C's creator, co-authored the book "The C Programming Language" (commonly known as "K&R"), in which he admits that "the syntax of declarations in the C language, especially those involving pointers to functions, is often criticized."

In this way, C's type syntax is a bad design that even its author implicitly admits to, but even so, the syntax is not that difficult once you understand the rules.

This chapter explains how to read C's type syntax. By following a step-by-step approach, by the end of this chapter you should be able to decipher complex types such as void (*signal(int, void (*)(int)))(int).

Diagram representing the type

The types that can be expressed in C are relatively simple. In order to separate the complexity of a type's syntax from the complexity of the type itself, let's put syntax aside for a moment and just think about types.

Complex types such as pointers and arrays can be represented by diagrams that connect simple types with arrows. For example, the following diagram shows the type "pointer to pointer to int".

In Japanese, the type is read from the end point of the arrows back to the start point. In English, it is read the other way, following the direction of the arrows: a pointer to a pointer to an int.

Suppose the variable x has the type shown in the diagram above. The simplest answer to the question "what is the type of x?" is "it's a pointer," because the type the first arrow points to is a pointer type. Note that x is first of all a pointer, not an int or anything else. The answer to "what type does that pointer point to?" is again "a pointer," because following one more arrow leads to another pointer type. Finally, the answer to "what type does that pointer point to?" is "int."

The following figure shows an "array of pointers to int." The length of the array is 20. In an actual compiler, the length of an array is stored as a member of the type object representing the array, as in the figure below.

If a variable x has the type shown in the diagram above, then x is an array of length 20, the elements of that array are pointers, and each pointer points to int.

Function types can also be represented graphically. The following figure shows the type of function that takes two arguments, an int and a pointer to an int, and returns a pointer to void.

Finally, let's look at a more complex example. The following figure shows the type "pointer to a function that takes an int as an argument and returns a pointer to a function that returns an int." It sounds complicated in words, but the diagram shows that it is merely long; the structure is simple.

If a variable x has the type shown in the diagram above, then x is a pointer; the pointer points to a function; that function's argument type is int and its return type is a pointer; that pointer points to a function; and that function's return type is int.

Internally, the compiler represents types the same way as in these diagrams. That is, complex types involving pointers, arrays, and functions are represented inside the compiler as structures for simple types connected by pointers, in the same order as in the diagrams. It is no exaggeration to say that the diagram is the true form of a type.

Notation for representing types

Although it is easier to understand the meaning of a type when it is represented in a diagram as shown above, it is tedious to draw a diagram every time to understand the type. In this section, we will consider a notation that can be written more compactly without compromising the clarity of the diagram.

Unless a function type is involved, all the boxes in the diagram are arranged in a chain without branching. Therefore, if a type contains only pointers and arrays, you can express the diagram as text by writing the names of the boxes from left to right.

Let's think about a specific notation. A box representing a pointer will be written as the symbol *. A box representing an array of length n will be written [n], and boxes representing built-in types such as int will be written with the type's name. Then the figure below can be represented by the string * * int.

Pointer, pointer, and int appear in order from the starting point of the arrows, so the notation is * * int. Conversely, given the notation * * int, we can reconstruct the diagram above. In other words, this textual representation compactly writes down the same information as the diagram.

The figure below can be expressed by the string [20] * int.

For functions, let's write func(argument types, ...) return-type. For example, the type shown in the figure below is written func(int, * int) * void. Readers are encouraged to check that this notation matches the diagram.

Finally, the type represented by the diagram below is * func(int) * func() int.

The notation described so far is probably the most straightforward and simple textual representation of a type. In fact, the type syntax of the Go programming language is essentially the same as the notation described here. Go is a language whose development involves some of the people who created C, and Go's type syntax can be seen as a subtle improvement that takes the lessons of C into account.

How to read C types

In this section, we will learn how to decipher C types by relating C's type syntax to the notation we just learned.

A C type can be decomposed into four parts starting from the beginning:

  1. base type
  2. Asterisk representing a pointer
  3. Identifier or nested type in parentheses
  4. Parentheses to represent functions and arrays

For example, in int x, the base type is int, there are no asterisks representing pointers, the identifier is x, and there are no function or array parentheses. In unsigned int *x(), the base type is unsigned int, the asterisk representing a pointer is *, the identifier is x, and the parentheses () represent a function. In void **(*x)(), the base type is void, the asterisks representing pointers are **, the nested type in parentheses is (*x), and the () represents a function.

How to read non-nested types

If what follows the pointer is just an identifier, the type is relatively easy to decipher.

When there are no function or array parentheses, the C type notation and its meaning are as follows. The meaning column uses the Go-like notation explained above.

C type notation meaning
int x int
int *x * int
int **x * * int

When there are function parentheses, the base type and the asterisks representing pointers together form the return type. Examples are shown below.

C type notation meaning
int x() func() int
int *x() func() * int
int **x(int) func(int) * * int

You can see that the function's type is just func(...) followed by the type of the declaration parsed without the parentheses, that is, the type of int x, int *x, or int **x.

Similarly, when you see array parentheses, it means "array of." An example is shown below.

C type notation meaning
int x[5] [5] int
int *x[5] [5] * int
int **x[4][5] [4] [5] * * int

For example, in the type int *x[5], x is an array of length 5, the elements of that array are pointers, and each pointer points to int. As with functions, you can see that the array's type is just [...] followed by the type of the declaration without the brackets.

How to read nested types

When a pointer is followed by parentheses rather than an identifier, the parentheses represent a nested type. If you have nested types, you can parse the types inside and outside the parentheses separately and combine them later to get the overall type.

As an example, let's parse the declaration int (*x)():

If you interpret int (*x)() by treating the entire first parenthesized part as a single identifier (let's use the arbitrary name y), you get int y(), so the type outside the parentheses is func() int. On the other hand, the part inside the parentheses, *x, represents the type * ___. The base type is not written inside the parentheses, so it is not a complete type; here we use ___ as notation for the missing base part.

By parsing the inside and outside of the parentheses of int (*x)() separately, we obtained two types: func() int and * ___. The overall type is made by fitting the outer type into the missing ___ part of the inner type. In this case, the overall type is * func() int. In other words, in the declaration int (*x)(), x is a pointer; the type it points to is a function; and that function's return type is int.

Let's take another example: void (*x[20])(int). If we treat the first parenthesized part as an identifier, we get the same pattern as void y(int), so the part outside the parentheses represents the type func(int) void. The part inside the parentheses, *x[20], represents [20] * ___. Combining the two gives [20] * func(int) void. That is, x is an array of length 20; its elements are pointers; each pointer points to a function that takes an int; and the function's return type is void.

You can read any complex type using the method described here. As an extreme example, let's look at the type of the Unix signal function, which is famous for a type you cannot decipher at a glance. The declaration of signal is shown below.

void (*signal(int, void (*)(int)))(int);

Even a complex type like this can be deciphered by breaking it down. First, consider the inside and outside of the first parenthesized part. Treating that part as an identifier gives the same pattern as void y(int), so the outside of the parentheses represents the type func(int) void.

The part inside the parentheses is *signal(int, void (*)(int)). It has no base type; it has an asterisk * representing a pointer, the identifier signal, parentheses representing a function, and two arguments. Therefore its rough shape is func(arg1, arg2) * ___. As for the argument types, the first is int, and the second, void (*)(int), is * func(int) void.

Putting those argument types into func gives the type func(int, * func(int) void) * ___. Finally, combining the type inside the parentheses with the type outside gives the final result: func(int, * func(int) void) * func(int) void.

That is, signal is a function that takes two arguments:

  1. int
  2. a pointer to a function that takes an int and returns void

The return type of signal is a pointer, and it points to a function that takes an int and returns void.

Column: Intent of C type syntax

C's type syntax was designed on the idea that types are easier to understand if a declaration is written the same way the variable is used. Under this policy, the declaration int *x[20], for example, means that when you write the expression *x[20], its type is int. Reading a declaration amounts to solving the puzzle: what type must x have so that int foo = *x[20] type-checks?

Likewise, think of the declaration int *(*x[20])() as the problem of determining the type of x so that int foo = *(*x[20])() has no type error: x must be an array; its elements must be pointers, because they are dereferenced with *; the pointed-to thing is called as a function; and since the function's return value is dereferenced to yield an int, the function must return a pointer to int.

C's type syntax may seem nonsensical, but it does make sense as a design principle. Even so, I have to say that in the end it turned out to be a bad design.

Exercises

C type notation meaning
int x int
int *x * int
int x[] [] int
int x() func() int
int **x * * int
int (*x)[] * [] int
int (*x)() * func() int
int *x[] [] * int
int x[][] [] [] int
int *x() func() * int
int ***x * * * int
int (**x)[] * * [] int
int (**x)() * * func() int
int *(*x)[] * [] * int
int (*x)[][] * [] [] int
int *(*x)() * func() * int
int **x[] [] * * int
int (*x[])[] [] * [] int
int (*x[])() [] * func() int
int *x[][] [] [] * int
int x[][][] [] [] [] int
int **x() func() * * int
int (*x())[] func() * [] int
int (*x())() func() * func() int

In conclusion

The main text of this book was written in Markdown format. I used Pandoc to convert Markdown to HTML , Graphviz to create the syntax tree diagram , and draw.io to create the other diagrams .


Appendix 1: x86-64 Instruction Set Cheat Sheet

This chapter summarizes the x86-64 instruction set features used by the compiler created in this book. In this chapter, we have used the following abbreviations for brevity.

List of integer registers

The table below lists the 64-bit integer registers and their alias names at each width.

64 32 16 8
RAX EAX AX AL
RDI EDI DI DIL
RSI ESI SI SIL
RDX EDX DX DL
RCX ECX CX CL
RBP EBP BP BPL
RSP ESP SP SPL
RBX EBX BX BL
R8 R8D R8W R8B
R9 R9D R9W R9B
R10 R10D R10W R10B
R11 R11D R11W R11B
R12 R12D R12W R12B
R13 R13D R13W R13B
R14 R14D R14W R14B
R15 R15D R15W R15B

The register usage in the ABI is as follows. Registers that do not need to be restored to their original values when returning from a function are marked with ✔.

register   Usage                            ✔
RAX        Return value/number of arguments ✔
RDI        1st argument                     ✔
RSI        2nd argument                     ✔
RDX        3rd argument                     ✔
RCX        4th argument                     ✔
RBP        Base pointer
RSP        Stack pointer
RBX        (nothing special)
R8         5th argument                     ✔
R9         6th argument                     ✔
R10        (nothing special)                ✔
R11        (nothing special)                ✔
R12        (nothing special)
R13        (nothing special)
R14        (nothing special)
R15        (nothing special)

When calling a function with the call instruction, RSP must be a multiple of 16 (aligned to 16 bytes). Function calls that do not meet this condition are not ABI compliant and may cause some functions to crash.

memory access

mov dst, [r64]   Load into dst the value at the address held in r64
mov [r64], src   Store the value of src at the address held in r64
push r64/imm     Decrease RSP by 8, then store r64/imm at the address RSP points to
pop r64          Load into r64 from the address RSP points to, then increase RSP by 8

function call

call label Push RIP onto stack and jump to label
call r64 Push RIP onto stack and jump to r64 address
ret Pop the stack and jump to that address
leave       Equivalent to mov rsp, rbp followed by pop rbp

conditional branch

cmp reg1, reg2/imm
je label
If reg1 == reg2/imm, jump to label
cmp reg1, reg2/imm
jne label
If reg1 != reg2/imm, jump to label
cmp reg1, reg2/imm
jl label
If reg1 < reg2, jump to label
(signed comparison)
cmp reg1, reg2/imm
jle label
If reg1 <= reg2, jump to label
(signed comparison)

conditional assignment

cmp reg1, reg2/imm
sete al
movzb eax, al
RAX = (reg1 == reg2) ? 1 : 0
cmp reg1, reg2/imm
setne al
movzb eax, al
RAX = (reg1 != reg2) ? 1 : 0
cmp reg1, reg2/imm
setl al
movzb eax, al
RAX = (reg1 < reg2) ? 1 : 0
(signed comparison)
cmp reg1, reg2/imm
setle al
movzb eax, al
RAX = (reg1 <= reg2) ? 1 : 0
(signed comparison)

Integer/logical operations

add dst, src/imm dst = dst + src/imm
sub dst, src/imm dst = dst - src/imm
mul src RDX:RAX = RAX * src
imul dst, src dst = dst * src
div r32 EAX = EDX:EAX / r32
EDX = EDX:EAX % r32
div r64 RAX = RDX:RAX / r64
RDX = RDX:RAX % r64
idiv r32/r64 signed version of div
cqo Sign extend RAX to 128 bits and
store in RDX:RAX
and dst, src dst = dst & src
or dst, src dst = dst | src
xor dst, src dst = dst ^ src
neg dst dst = -dst
not dst dst = ~dst
shl dst, imm/CL Shift dst left by imm or by the value in the CL register (when the shift amount comes from a register, only CL can be used)
shr dst, imm/CL Logically shift dst right by imm or CL; the vacated upper bits are filled with zero
sar dst, imm/CL Arithmetically shift dst right by imm or CL; the vacated upper bits are copies of the original sign bit of dst
lea dst, [src] Compute the address [src] and store the result of the address calculation itself in dst, without accessing memory
movsb dst, r8 Sign extend r8 and store to dst
movzb dst, r8 Store r8 to dst without sign extension
movsw dst, r16 Sign extend r16 and store to dst
movzw dst, r16 Store r16 to dst without sign extension

Appendix 2: Version control with Git

Git was originally developed for version control of the Linux kernel. The Linux kernel is a huge project with thousands of developers, and Git has a rich feature set to support the complex workflows involved. These features are useful, but in solo development, where you are the only committer, there is no need to use Git's features to the fullest. There is a lot to learn to master Git, but this book covers only the few things you need to remember. For readers who are new to development with Git, a cheat sheet is provided below.

Workflow using Git

For those who are new to version control systems, let's explain a little about Git and version control system concepts.

Git is a database management tool that stores file change history. This database is called a repository. When you clone a repository from GitHub, etc., the repository is downloaded, and then a default, up-to-date directory tree is extracted from the repository, all under the current directory.

The directory tree extracted from the repository is called the "working tree." Although you use an editor to edit and compile source files in the working tree, the working tree itself is not part of the repository. The working tree is like a file extracted from a zip file, and no matter how many changes you make to it, the original repository remains intact.

Changes made to the working tree are written back to the repository, at convenient points, in units called "commits". This updates the database, after which you can go on to make further changes. Development with Git proceeds by repeatedly changing files and committing them.

Points to note when committing

Commit messages may be written in Japanese, so write them properly. For example, a one-line message such as "Add * and / as operators (operator precedence not handled yet)" is fine.

Divide commit units into as small a size as possible. For example, if you are changing code and want to do a little refactoring, commit the refactoring as a separate commit. It is undesirable to combine two or more separate features into one commit.

You don't need to use Git's advanced features. For example, you shouldn't need to use branches.

Be sure to commit code that adds a feature together with the code that tests it. Also, before committing, run the tests to make sure that existing features are not broken and that the new feature works. In other words, the goal is for the repository to compile and pass its tests at whatever point you check out. However, if you accidentally commit something that fails the tests, you do not have to rewrite the Git history to fix it. Simply put the fix in the next commit.

Internal structure of Git

When reading the Git documentation you will encounter many intricate features, but if you build a mental model of how Git stores data in principle, those features become much easier to understand. This section therefore explains the internal structure of Git.

Git is a type of file system implemented as a user program. The structure of a Git database closely resembles a file system, with one difference: whereas a normal file system accesses files by name, Git uses the hash value of a file's contents as its name.

A system in which a name is determined by the contents of the file is called content-addressable. In a content-addressable file system, files with the same name necessarily have the same contents, and files with different contents cannot have the same name (the same hash value); this is guaranteed by using a cryptographically secure hash function. Such a file system has the property that files never need to be named individually, and once a name is fixed, the contents of the file are uniquely determined as well.

A commit is also a file within Git. In addition to the commit message, that file contains the hash values ​​of the files belonging to that commit and the hash value of the previous commit.

In order to retrieve a file from the Git file system, you must know the hash value of the file you want.

It may seem like a chicken-and-egg problem that you cannot retrieve a commit file without already knowing its hash value. In reality, in addition to the file system, the repository stores a list of names mapped to commit hashes, and you can use that list to find commits. For example, a repository records that the commit named "master" (the history extracted into the working tree by default) has a hash value such as da39a3ee5e.... Using this information, Git can extract the files belonging to master into the working tree.

Internally, "committing" means adding the changed files to Git's internal file system, then adding a commit file that contains the hash values of those files and the hash value of the previous commit, and finally updating the name list with the hash value of the new commit file.

For example, to make master's last commit disappear (though it is better not to do this), you look at the commit file that master points to, take the hash value of the previous commit recorded there, and overwrite master's entry in the name list with that hash value. A "branch" simply means that two or more commits share the same previous commit, and both of them appear in the name list.

Content-addressable version control systems also have security benefits. The name of a commit (the hash value of its commit file) covers the hash values of all files belonging to that commit and the hash value of the commit file before it. That previous commit file in turn contains the hash of the commit before it, so in the end the hash values of all the commits leading up to the latest commit feed into the calculation of the latest commit's hash value. Therefore, it is theoretically impossible to secretly modify the commit contents or history without changing that hash value. That is an interesting property.

When learning about Git's features, always keep this content-addressable file system in mind. I'm sure it will make things easier to understand.


Appendix 3: Creating a development environment using Docker

This appendix explains how to use Docker to build and run Linux applications on macOS.

Docker is software that provides a virtual Linux environment. Whereas a full-featured VM emulates PC hardware and runs an ordinary OS on top of it, Docker provides only the Linux system call interface inside the virtual environment. This difference gives Docker the advantage of faster startup and lighter overall weight compared to full virtualization. Docker is often used when deploying applications to machines in the cloud.

When developing Linux applications on macOS using Docker, you can consider the following two system configurations.

  1. Treat the environment inside Docker and the environment outside as completely separate, and do all development work inside Docker.
  2. Do ordinary platform-independent work, such as editing source code and running git, outside Docker, and execute only the build and test commands inside Docker.

The former configuration is conceptually simple, since it is the same as preparing a separate Linux machine for development, but the setup is somewhat involved because you need to prepare a full development environment on Linux, including an editor. With the latter configuration, on the other hand, you do not "live" inside Docker, so there is little environment to set up, but switching between the inside and outside of Docker can be a bit cumbersome.

Although you can choose either of the above two configurations, in this book we will choose the latter configuration to avoid describing the setup steps for a Linux environment. Therefore, only the commands you want to run in the Linux environment will be explicitly executed in Docker.

Setup steps

To set up a Linux development environment using Docker, first download and install Docker Desktop for Mac. You can then run the following command to create a Docker image named compilerbook:

$ docker build -t compilerbook https://www.sigbus.info/compilerbook/Dockerfile

A "Docker image", or simply "image", is a collection of all the files and settings required for a Linux environment. A running instance of a Docker image is called a "Docker container", or simply "container" (the same relationship as between an executable file and a process).

To create a container and run a command inside it, supply the image name and the command as arguments to docker run. The following is an example of running the ls / command inside a compilerbook container:

$ docker run --rm compilerbook ls /
bin
root
dev
etc
...

It is possible to keep a container running in the background, but since our usage is purely interactive, we passed the --rm option, which makes the container exit and be removed as soon as the command finishes. Therefore, each time you enter the above command, a container is created and destroyed.

Build using containers

In order to run make and compile source files inside a container, you need to make the source files you are editing outside Docker visible to the container.

You can make a path in the host environment visible inside Docker by giving docker run an option of the form -v <source>:<dest>. You can also specify the current directory for the command with the -w option. Using these options, you can run make with the directory containing your source files as the current directory.

Let's assume that the source files are located in a subdirectory 9cc of your home directory. To run make test in that directory from within the container, run the following command:

$ docker run --rm -v $HOME/9cc:/9cc -w /9cc compilerbook make test

Please run the build and test commands as described above.

If you want to start a shell inside the container and use it interactively, run docker run with the -it option:

$ docker run --rm -it -v $HOME/9cc:/9cc compilerbook

Add new application to container

The Docker container you created above comes with a set of development tools installed, but you may want to install additional applications. Here's how.

Docker containers are ephemeral objects. Even if you create a container and install an application from a shell inside it, those changes are not written back to the original image. This property of Docker ensures that your application starts from the same fresh state every time, but it is an obstacle when you do want to make changes to the image.

If you have made changes to a container and want to write them back to the image, you must explicitly run the docker commit command.

As an example, let's say you want to install the curl command. In that case, first create a container as follows:

$ docker run -it compilerbook

Note that we are not passing the --rm option. Then install curl with apt from the shell inside the container as follows, and exit the container with the exit command:

$ sudo apt update
$ sudo apt install -y curl
$ exit

Since we did not specify the --rm option when running docker run, the container remains in a stopped state even after we exit its shell. You can display stopped containers with the docker container ls -a command:

$ docker container ls -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
a377e570d1da        compilerbook        "/bin/bash"         7 seconds ago       Exited (0) 5 seconds ago                       pedantic_noyce

You should see a container with ID a377e570d1da that ran the compilerbook image. You can write the container back to an image with the docker commit command:

$ docker commit a377e570d1da compilerbook

You can make changes to the image using the steps above.

Stopped containers and old images cause no particular problem beyond consuming some disk space, but if they bother you, you can delete them by running docker system prune.


Reference materials


Index


  1. http://www.drdobbs.com/cpp/increasing-compiler-speed-by-over-75/240158941

    DMD does memory allocation in a bit of a sneaky way. Since compilers are short-lived programs, and speed is of the essence, DMD just mallocs away, and never frees.

    ↩︎
  2. Linkers and Loaders, ISBN 978-1558604964, John R. Levine (1999) Chapter 1.2

    Programmers were using libraries of subprograms even before they used assemblers. By 1947, John Mauchly, who led the ENIAC project, wrote about loading programs along with subprograms selected from a catalog of programs stored on tapes, and of the need to relocate the subprograms' code to reflect the addresses at which they were loaded. Perhaps surprisingly, these two basic linker functions, relocation and library search, appear to predate even assemblers, as Mauchly expected both the program and subprograms to be written in machine language. The relocating loader allowed the authors and users of the subprograms to write each subprogram as though it would start at location zero, and to defer the actual address binding until the subprograms were linked with a particular main program.

    ↩︎
  3. https://parisc.wiki.kernel.org/images-parisc/b/b2/Rad_11_0_32.pdf The 32-bit PA-RISC run-time architecture document, v. 1.0 for HP-UX 11.0, Chapter 2.2.3

    When a process is initiated by the operating system, a virtual address range is allocated for that process to be used for the call stack, and the stack pointer (GR 30) is initialized to point to the low end of this range. As procedures are called, the stack pointer is incremented to allow the called procedure frame to exist at the address below the stack pointer. When procedures are exited, the stack pointer is decremented by the same amount.

    ↩︎
  4. https://www.acsac.org/2002/papers/classic-multics.pdf

    Thirty Years Later: Lessons from the Multics Security Evaluation, "Third, stacks on the Multics processors grew in the positive direction, rather than the negative direction. This meant that if you actually accomplished a buffer overflow, you would be overwriting unused stack frames, rather than your own return pointer, making exploitation much more difficult.

    ↩︎
  5. https://en.wikipedia.org/wiki/ASCII#cite_note-Mackenzie_1980-1

    There was some debate at the time whether there should be more control characters rather than the lowercase alphabet.

    ↩︎
  6. http://catb.org/~esr/jargon/html/L/language-lawyer.html

    language lawyer: n. A person, usually an experienced or senior software engineer, who is intimately familiar with many or most of the numerous restrictions and features (both useful and esoteric) applicable to one or more computer programming languages. A language lawyer is distinguished by the ability to show you the five sentences scattered through a 200-plus-page manual that together imply the answer to your question “if only you had thought to look there”. Compare wizard, legal, legalese.

    ↩︎
  7. C Programming Language, 2nd Edition (ISBN 978-0131103627), 5.12, p.122

    C is sometimes castigated for the syntax of its declarations, particularly ones that involve pointers to functions.

    ↩︎
  8. Go's Declaration Syntax by Rob Pike ↩︎
