The Four Stages of a Modern Compiler

Jacob Ide
3 min read · Feb 6, 2020
Compiling a simple Hello World program

Compiled languages involve two kinds of files: source code files (.c for C) and executable files (.exe on Windows; on Linux they usually have no extension). Source code is the human-readable, human-writable text that we think of as computer code. Executable files are meant for the computer: they hold the binary machine instructions that the processor executes. To convert source code into an executable, the software engineer invokes a special kind of program called a compiler. The standard compiler suite on Linux is the GNU Compiler Collection, or gcc. gcc operates in four distinct stages: pre-processing, compiling, assembly, and linking.
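
The stages can be observed on a small file. The sketch below assumes a hypothetical file name, greet.c, and hypothetical function names; it leaves out main so that it can be built with -c alone. gcc's -save-temps option keeps the intermediate files that would normally be thrown away.

```c
/* greet.c (hypothetical file name, for illustration only)
 *
 * Build with:   gcc -save-temps -c greet.c
 *
 * Files left behind, one per stage:
 *   greet.i   output of the pre-processor
 *   greet.s   assembly written by the compiler stage
 *   greet.o   object file written by the assembler
 *
 * Linking greet.o with an object file that defines main would be
 * the fourth stage.
 */
#include <stdio.h>

/* Returns the text a Hello World program would print. */
const char *greeting(void)
{
    return "Hello, World!";
}

/* Prints the greeting; returns 0 on success and 1 on a write error. */
int say_hello(void)
{
    return puts(greeting()) == EOF ? 1 : 0;
}
```

Opening greet.i and greet.s in a text editor is a quick way to see what each stage hands to the next; greet.o is binary and will not be readable.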

Pre-processing

Pre-processing a C file is similar to the way a shell prepares a Linux command before running it: both get the text ready to be turned into computer instructions. Rather than performing expansions and searching the PATH for command names, the pre-processor strips out comments and replaces each #include directive with the contents of the file it names. This step serves both the programmer and the compiler. The later stages have no use for comments, yet writing without them would instantly make every source file a mystery to be decoded. Additionally, having to paste the contents of header files into every program file would be cumbersome to work with, storage intensive, and totally contrary to DRY principles. The pre-processor takes care of this problem by removing and adding what is needed at compile time.
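
A minimal sketch of what the pre-processor changes, assuming a hypothetical file area.c. The macro names SIDE and SQUARE and the function square_area are made up for the example. It also shows macro expansion, which the pre-processor performs alongside comment removal and #include handling.

```c
/* area.c (hypothetical). View the pre-processor's output with:
 *   gcc -E area.c
 */
#include <stddef.h>   /* replaced by the header's contents; provides size_t */

#define SIDE 4                  /* object-like macro */
#define SQUARE(x) ((x) * (x))   /* function-like macro */

/* Comments like this one do not survive pre-processing. */
size_t square_area(void)
{
    return SQUARE(SIDE);   /* the compiler stage sees ((4) * (4)) */
}
```

In the gcc -E output the #define lines and comments are gone, the text of stddef.h appears where the #include was, and the return statement already contains the expanded expression.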

Compiler

One might think that the compiler stage is the point at which the code is converted to binary machine code, but that is not the case. This confusion is compounded by the fact that the entire process being discussed here is also called compilation. Within that larger process, the compiler stage converts the pre-processed high-level code into assembly language. Each assembly language is designed for a specific processor architecture (x86-64, ARM, etc.). Assembly language is still human readable, but not nearly to the degree of a high-level language like C.
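
As a rough illustration, the hypothetical file below can be stopped after the compiler stage with gcc -S. The function names are invented for the example; __x86_64__ and __aarch64__ are macros that gcc predefines for those targets, used here only to show that the compiler knows which architecture it is generating assembly for.

```c
/* add.c (hypothetical). Stop after the compiler stage with:
 *   gcc -S add.c
 * which writes assembly text to add.s instead of producing a binary.
 */

/* A function small enough that its assembly is easy to find in add.s. */
int add(int a, int b)
{
    return a + b;
}

/* Reports the architecture this file is being compiled for. */
const char *target_arch(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__aarch64__)
    return "AArch64";
#else
    return "other";
#endif
}
```

The instructions in add.s depend on the target and on the optimization level, so the same C function produces different assembly on an x86-64 machine than on an ARM one.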

Assembly into Machine Code

The assembler then translates the assembly code into machine code, that is, the binary that the processor can understand and act on. The file created in this stage is called the object file. An object file is not yet a runnable program; the linker, described next, turns it into an executable file. Though not readable to humans, the executable file carries out the commands of the source code. If changes are made to the source code, the program will have to be recompiled. However, once the program is stable and compiled, it can be copied to and run on other computers with the same processor architecture and a compatible operating system.
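
One way to see what an object file contains is to list its symbols. The sketch below assumes a hypothetical file counter.c with made-up function names; gcc -c and the nm tool from GNU Binutils are real, but the exact listing varies with platform and optimization level.

```c
/* counter.c (hypothetical). Assemble without linking:
 *   gcc -c counter.c
 * then list the symbols in the object file:
 *   nm counter.o
 */
#include <stdio.h>

static int count = 0;   /* internal linkage: private to this object file */

/* Local helper: not visible to other object files. */
static void log_count(void)
{
    /* printf is only declared here; its definition lives in the C library. */
    printf("count = %d\n", count);
}

/* External linkage: other files may call this once everything is linked. */
int increment(void)
{
    count += 1;
    log_count();
    return count;
}
```

In an unoptimized build, nm typically marks increment with T (defined in this file), log_count with a lowercase t (local), and printf with U (undefined). The U entries are what the object file still needs before it can become an executable.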

Linker

The final stage of compilation is the linker stage. When compiling a large C project, there can be many source files, each assembled into its own object file and each calling functions from various libraries. The linker draws these disparate pieces together into a single cohesive executable file, matching every function call to the definition it refers to.
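
The matching of calls to definitions can be sketched in a single listing. The file names shape.h, room.c and shape.c and both function names are hypothetical; in a real project the three parts would be separate files, compiled separately and joined by the linker.

```c
/* --- shape.h: the declaration that both source files would include --- */
int rect_area(int width, int height);

/* --- room.c: calls rect_area knowing only its declaration.
 * Compiled alone (gcc -c room.c), room.o would carry an unresolved
 * reference to rect_area. --- */
int room_area(void)
{
    return rect_area(3, 4);
}

/* --- shape.c: the definition that resolves the reference.
 * Compiled alone (gcc -c shape.c), shape.o would provide rect_area. --- */
int rect_area(int width, int height)
{
    return width * height;
}
```

If shape.o were left out of the link step, the linker would stop with an undefined reference error for rect_area, because nothing would supply the definition that room.o asks for.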
