Introduction to Compiler Principles

Preface

When you press the "Run" button, how does your code become the result on screen? The computer actually can't "understand" any line of code you write — it only recognizes 0s and 1s. The compiler is the "translator" that converts human language into machine language. Understanding compiler principles helps you understand where error messages come from, why some languages are faster than others, and the underlying logic of code optimization.

What will you learn from this article?

After completing this chapter, you will gain:

  • Big picture view: Master the complete compilation pipeline from source code to executable program
  • Lexical analysis: Understand how compilers break code into tokens
  • Syntax analysis: Understand the construction of AST (Abstract Syntax Tree)
  • AST visualization: Intuitively see the tree structure of code
  • Semantic analysis and optimization: Understand the principles of type checking and code optimization
  • Optimization techniques in practice: Master core optimizations like constant folding and dead code elimination
  • Execution models: Distinguish between compiled, interpreted, and JIT execution approaches

| Chapter | Content | Core Concepts |
| --- | --- | --- |
| Chapter 1 | What Is a Compiler | Translator analogy, compilation pipeline |
| Chapter 2 | Lexical Analysis | Tokens, lexical rules |
| Chapter 3 | Syntax Analysis | AST, syntax trees, precedence |
| Chapter 4 | AST Visualization | Interactive syntax tree, node types |
| Chapter 5 | Semantic Analysis and Optimization | Type checking, constant folding, dead code elimination |
| Chapter 6 | Optimization Techniques in Practice | Function inlining, loop hoisting, constant propagation |
| Chapter 7 | Compiled vs Interpreted vs JIT | Comparison of three execution models |

0. Big Picture: The "Translation Journey" of Code

Imagine you're a translator tasked with translating a Chinese novel into English. You wouldn't translate word by word literally. Instead, you would:

  1. Identify words — Break sentences into individual words (lexical analysis)
  2. Understand syntax — Determine if sentence structure is correct (syntax analysis)
  3. Understand semantics — Ensure the meaning is coherent and contradiction-free (semantic analysis)
  4. Polish and refine — Make the translation more natural and fluent (code optimization)
  5. Output the translation — Write the final English version (code generation)

A compiler does exactly the same thing, except it translates programming languages.

Compiler Principles: The Art of Translation (how code becomes machine instructions)

A compiler is like a translator, turning human-readable code into machine-readable instructions.

The Complete Code Translation Pipeline

  1. Lexical analysis: Break code into individual words called tokens, for example int age = 25 → [int, age, =, 25]
  2. Syntax analysis: Check grammar rules and build a syntax tree, validating whether statement structure is correct
  3. Semantic analysis: Check whether the meaning of the code is valid, including variable definitions and type compatibility
  4. Intermediate code generation: Generate a machine-independent intermediate representation, such as bytecode
  5. Optimization: Improve code so it runs more efficiently, for example constant folding and dead-code elimination
  6. Target code generation: Generate machine code or target code, such as x86 or ARM machine instructions

Lexical analysis: tokenization of int age = 25;

| Token | Type |
| --- | --- |
| int | Keyword |
| age | Identifier |
| = | Operator |
| 25 | Number |
| ; | Separator |

Syntax analysis: build a tree

Assignment statement
├── Variable: age
├── Operator: =
└── Number: 25

Compilation vs Interpretation

| Aspect | Compiled languages | Interpreted languages |
| --- | --- | --- |
| Flow | Source code → Compiler → Machine code | Source code → Interpreter → Line-by-line execution |
| Examples | C, Go, Rust | Python, JavaScript, PHP |
| Strengths | Fast execution; compile once, run many times | Fast development; cross-platform |
| Weakness | Slow compile step | Slower execution |

Compiler Optimization

Before: x = 5 + 3 + 2
After:  x = 10

The compiler can optimize code automatically and improve runtime efficiency.

1. The Compiler's Six-Stage Pipeline

A compiler's work can be divided into six stages, like a factory assembly line where each stage hands off to the next.

How a Compiler Works: a six-step journey from source code to machine code

  1. Lexical analysis → Token stream
  2. Syntax analysis → AST syntax tree
  3. Semantic analysis → Typed AST
  4. Intermediate code generation → IR (intermediate representation)
  5. Code optimization → Optimized IR
  6. Target code generation → Machine code

Stage 1 in detail: Lexical analysis (output: token stream)

Split source code into individual words called tokens, like recognizing each word in a sentence. The lexer recognizes keywords, identifiers, numbers, and operators, and filters out whitespace.

int x = 10 + 5;
→ [int] [x] [=] [10] [+] [5] [;]
    keyword identifier operator number operator number separator

Three Execution Models Compared

| Model | Flow | Strength | Weakness | Examples |
| --- | --- | --- | --- | --- |
| Compiled | Source → Compiler → Machine code → CPU execution | Fast execution | Must wait for compilation | C, C++, Rust, Go |
| Interpreted | Source → Interpreter → Line-by-line execution | Run immediately while writing | Slower execution | Python, Ruby, PHP |
| JIT | Source → Bytecode → JIT hot path compilation → Execution | Balances performance and flexibility | Slower startup | Java, JavaScript (V8) |

Core idea: A compiler is like a translator: it gradually turns human-readable code into instructions the machine can run. The six stages each do one job: identify words → understand syntax → check meaning → generate IR → optimize → generate machine code.

Compilation Pipeline

  1. Lexical Analysis: Break source code into tokens (words)
  2. Syntax Analysis: Organize tokens into a syntax tree (AST)
  3. Semantic Analysis: Check if types are correct and variables are declared
  4. Intermediate Code Generation (IR Generation): Generate platform-independent intermediate representation
  5. Code Optimization: Make the intermediate code more efficient
  6. Code Generation: Generate machine code for the target platform

| Stage | Input | Output | Analogy |
| --- | --- | --- | --- |
| Lexical Analysis | Source code character stream | Token stream | Break sentences into words |
| Syntax Analysis | Token stream | AST (syntax tree) | Analyze sentence structure |
| Semantic Analysis | AST | Typed AST | Check if the meaning makes sense |
| Intermediate Code | Typed AST | IR | Write a first draft |
| Code Optimization | IR | Optimized IR | Polish and trim |
| Code Generation | Optimized IR | Machine code | Output the final version |
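
To make the hand-off between stages concrete, here is a minimal sketch of all six stages for a toy language that only knows sums such as 10 + 5 + 2. Every function name, the stack-style IR, and the PUSH/ADD output format are assumptions made up for this illustration; real compilers are far more elaborate.

```javascript
// Hypothetical six-stage pipeline for sums such as "10 + 5 + 2".

// 1. Lexical analysis: characters -> tokens
function lex(source) {
  return (source.match(/\d+|\+/g) || []).map((text) =>
    text === "+" ? { type: "plus" } : { type: "number", value: Number(text) }
  );
}

// 2. Syntax analysis: tokens -> AST (additions group left to right)
function parse(tokens) {
  let pos = 0;
  const number = () => {
    const tok = tokens[pos++];
    if (!tok || tok.type !== "number") throw new SyntaxError("expected a number");
    return { kind: "Num", value: tok.value };
  };
  let node = number();
  while (pos < tokens.length) {
    if (tokens[pos++].type !== "plus") throw new SyntaxError("expected +");
    node = { kind: "Add", left: node, right: number() };
  }
  return node;
}

// 3. Semantic analysis: annotate every node with a type (always int here)
function check(node) {
  if (node.kind === "Num") return { ...node, type: "int" };
  return { ...node, left: check(node.left), right: check(node.right), type: "int" };
}

// 4. IR generation: typed AST -> stack-machine instructions
function lower(node) {
  if (node.kind === "Num") return [{ op: "push", value: node.value }];
  return [...lower(node.left), ...lower(node.right), { op: "add" }];
}

// 5. Optimization: fold "push a, push b, add" into "push a+b"
function optimize(ir) {
  const out = [];
  for (const ins of ir) {
    const a = out[out.length - 2];
    const b = out[out.length - 1];
    if (ins.op === "add" && a && b && a.op === "push" && b.op === "push") {
      out.splice(-2, 2, { op: "push", value: a.value + b.value });
    } else {
      out.push(ins);
    }
  }
  return out;
}

// 6. Code generation: IR -> text in a made-up target language
function emit(ir) {
  return ir.map((ins) => (ins.op === "push" ? `PUSH ${ins.value}` : "ADD")).join("\n");
}

function compile(source) {
  return emit(optimize(lower(check(parse(lex(source))))));
}

console.log(compile("10 + 5 + 2")); // PUSH 17
```

Each function consumes exactly what the previous one produced, which is the property the table above describes.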

2. Lexical Analysis: Breaking Code into "Words"

Lexical analysis is the first step of compilation. The compiler scans each character of the source code from left to right, combining them into meaningful tokens.

Just as your brain automatically combines letters into words when reading an English sentence, the lexer combines characters into tokens:

Source code: let x = 10 + 5;

Token stream:
[let]   → Keyword (language reserved word)
[x]     → Identifier (variable name)
[=]     → Operator (assignment)
[10]    → Numeric literal
[+]     → Operator (addition)
[5]     → Numeric literal
[;]     → Separator (statement end)

Five Types of Tokens

  • Keywords: Special words reserved by the language, such as let, if, return, function
  • Identifiers: Names defined by programmers, such as variable names and function names
  • Literals: Values written directly in code, such as the number 42 and the string "hello"
  • Operators: Symbols that perform operations, such as +, -, =, ===
  • Separators: Symbols that separate code structures, such as ;, ,, (, )
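
The five token types can be sketched as a small rule-based lexer. This is only an illustration: the keyword list, the rule table, and the token type names are assumptions chosen for this example, and a production lexer handles many more cases (comments, escapes, more operators).

```javascript
// Minimal lexer sketch. KEYWORDS and RULES are illustrative, not taken
// from a real JavaScript engine.
const KEYWORDS = new Set(["let", "const", "if", "return", "function"]);

// Rules are tried in order; inside the operator rule, === comes before = so
// the longest operator wins.
const RULES = [
  ["whitespace", /^\s+/],
  ["literal", /^(?:\d+(?:\.\d+)?|"[^"]*")/],
  ["word", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  ["operator", /^(?:===|==|[+\-*\/=<>])/],
  ["separator", /^[;,(){}]/],
];

function tokenize(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    const rule = RULES.find(([, pattern]) => pattern.test(rest));
    if (!rule) throw new SyntaxError(`Unexpected character: ${rest[0]}`);
    const [type, pattern] = rule;
    const text = pattern.exec(rest)[0];
    rest = rest.slice(text.length);
    if (type === "whitespace") continue; // filtered out, never emitted
    if (type === "word") {
      // A word is a keyword if the language reserves it, else an identifier.
      tokens.push({ type: KEYWORDS.has(text) ? "keyword" : "identifier", text });
    } else {
      tokens.push({ type, text });
    }
  }
  return tokens;
}

console.log(tokenize("let x = 10 + 5;"));
```

For let x = 10 + 5; this yields seven tokens in the same order as the token stream listed above.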

3. Syntax Analysis: Building the Syntax Tree (AST)

Lexical analysis breaks code into tokens, but tokens are just isolated "words." The task of syntax analysis is to organize these tokens into an Abstract Syntax Tree (AST) according to grammar rules, so that the tree reflects both the structure of the code and operator precedence.

Expression: 1 + 2 * 3

Syntax tree:        Why this way?
       +       Because * has higher
      / \      precedence than +,
     1   *     so 2 * 3 groups
        / \    together first
       2   3
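
One common way to get this grouping is precedence climbing. The sketch below is a minimal version that handles only integers and the four arithmetic operators; the PRECEDENCE table and the function names are assumptions for illustration, and the node names follow the ones used later in this article.

```javascript
// Binding power per operator: a higher number binds tighter.
const PRECEDENCE = { "+": 1, "-": 1, "*": 2, "/": 2 };

function parseExpression(source) {
  const tokens = source.match(/\d+|[+\-*\/]/g) || [];
  let pos = 0;

  function parsePrimary() {
    const text = tokens[pos++];
    if (!/^\d+$/.test(text)) throw new SyntaxError(`Expected a number, got ${text}`);
    return { type: "NumericLiteral", value: Number(text) };
  }

  // Precedence climbing: only consume operators at least as strong as minPrec.
  function parseBinary(minPrec) {
    let left = parsePrimary();
    while (pos < tokens.length && PRECEDENCE[tokens[pos]] >= minPrec) {
      const operator = tokens[pos++];
      // The +1 makes operators of equal precedence group left to right.
      const right = parseBinary(PRECEDENCE[operator] + 1);
      left = { type: "BinaryExpression", operator, left, right };
    }
    return left;
  }

  const ast = parseBinary(1);
  if (pos < tokens.length) throw new SyntaxError(`Unexpected token ${tokens[pos]}`);
  return ast;
}

// The root is +, and its right child is the 2 * 3 subtree.
console.log(JSON.stringify(parseExpression("1 + 2 * 3"), null, 2));
```

Because * has a higher binding power than +, the recursive call for the right-hand side of + swallows 2 * 3 before returning, which is what puts * lower in the tree.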

The Importance of AST

The AST is the core data structure of a compiler: semantic analysis, optimization, and code generation all operate on it. Modern development tools also rely heavily on ASTs:

  • ESLint: Parses code into AST and checks for rule violations
  • Prettier: Parses into AST and reformats the output
  • Babel: Parses AST → transforms → generates compatible code
  • IDE refactoring: Performs safe variable renaming and function extraction based on AST

| Syntax Structure | Token Sequence | AST Node |
| --- | --- | --- |
| Variable declaration | let x = 10 | VariableDeclaration → Identifier + Literal |
| Function call | add ( 1 , 2 ) | CallExpression → Identifier + Arguments |
| Conditional statement | if ( a > b ) | IfStatement → BinaryExpression + Block |

4. AST Visualization: Seeing the "Skeleton" of Code

Above we described AST structure in text, but "seeing" is more intuitive than "reading." The interactive component below lets you select different expressions and observe their syntax trees in real time.

AST Visualizer: the syntax tree for 1 + 2 * 3

BinaryExpression (+)
├── NumericLiteral (1)
└── BinaryExpression (*)
    ├── NumericLiteral (2)
    └── NumericLiteral (3)

Parse notes:

  1. * has higher precedence than +, so 2 * 3 groups first
  2. 2 * 3 forms a BinaryExpression subtree
  3. 1 and that subtree become the left and right operands of +
  4. The final + node is the root, showing the evaluation order

Tip: Try AST Explorer to inspect ASTs for arbitrary code online.

Through visualization, you'll find that the core patterns of AST are actually quite simple:

| Code Structure | AST Root Node | Child Nodes |
| --- | --- | --- |
| 1 + 2 * 3 | BinaryExpression (+) | Left: NumericLiteral(1), Right: BinaryExpression(*) |
| let x = 10 | VariableDeclaration | VariableDeclarator → Identifier(x) + NumericLiteral(10) |
| add(a, b) | CallExpression | Identifier(add) + Arguments(a, b) |

AST in Daily Development

You may not have written a compiler directly, but you use AST-based tools every day:

  • ESLint / Prettier: Parse code into AST for rule checking or reformatting
  • Babel / SWC: Parse AST → transform syntax → generate compatible code
  • IDE refactoring: Safe renaming and function extraction based on AST
  • Tree-shaking: Analyze import/export in AST to remove unused code
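
Tools like these share one basic move: walk every node of the tree and react to the node types they care about. The sketch below shows that idea on a hand-built AST for eval(add(a, b)). The walk helper and the "no eval" rule are hypothetical and written for this example; they are not ESLint's actual API.

```javascript
// Hand-built AST for: eval(add(a, b))
// Node shapes loosely follow the node names used in this article.
const ast = {
  type: "CallExpression",
  callee: { type: "Identifier", name: "eval" },
  arguments: [
    {
      type: "CallExpression",
      callee: { type: "Identifier", name: "add" },
      arguments: [
        { type: "Identifier", name: "a" },
        { type: "Identifier", name: "b" },
      ],
    },
  ],
};

// Depth-first walk: call visit(node) on every object that has a "type" field.
function walk(node, visit) {
  if (Array.isArray(node)) {
    node.forEach((child) => walk(child, visit));
  } else if (node && typeof node === "object") {
    if (typeof node.type === "string") visit(node);
    Object.values(node).forEach((child) => walk(child, visit));
  }
}

// A hypothetical lint rule: report every direct call to eval.
function findEvalCalls(root) {
  const problems = [];
  walk(root, (node) => {
    const isEval =
      node.type === "CallExpression" &&
      node.callee.type === "Identifier" &&
      node.callee.name === "eval";
    if (isEval) problems.push("Avoid eval()");
  });
  return problems;
}

console.log(findEvalCalls(ast)); // [ 'Avoid eval()' ]
```

A formatter or a transpiler would use the same kind of walk, but print or rewrite nodes instead of collecting problems.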

5. Semantic Analysis and Code Optimization

Syntax analysis ensures code is "structurally correct," but structural correctness doesn't mean "semantically correct." Semantic analysis checks whether the meaning of the code is valid, while code optimization makes programs run faster.

Compilation Practice: from code to executable file

Compilation steps:

| Step | Command | What it does |
| --- | --- | --- |
| 1. Preprocess | gcc -E hello.c -o hello.i | Process #include and expand macros |
| 2. Compile | gcc -S hello.i -o hello.s | Generate assembly code |
| 3. Assemble | gcc -c hello.s -o hello.o | Generate object file |
| 4. Link | gcc hello.o -o hello | Generate executable file |

Generated files:

  • hello.c: Source code file
  • hello.i: Preprocessed file
  • hello.s: Assembly code file
  • hello.o: Object file
  • hello: Executable file

Common compiler tools:

  • GCC: GNU Compiler Collection
  • Clang: LLVM C/C++ compiler
  • MSVC: Microsoft Visual C++

5.1 Semantic Analysis: Checking if the "Meaning" Is Correct

| Check | Example | Result |
| --- | --- | --- |
| Type checking | int x = "hello" | Type mismatch |
| Scope checking | Using undeclared variable y | Variable does not exist |
| Type inference | 1 + 2.0 | Inferred result is float |
| Parameter checking | add(1, 2, 3) but function only accepts 2 parameters | Parameter count mismatch |
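
The four checks in the table can be sketched in a few lines. The statement shapes ("declare", "use", "call"), the single global scope, and the error wording below are all simplifying assumptions for this illustration; a real compiler walks the AST with nested scopes.

```javascript
// Hypothetical mini-checker over a hand-written statement list.
function analyze(statements, functionArity) {
  const declared = new Set(); // names visible in the single, global scope
  const errors = [];

  for (const stmt of statements) {
    switch (stmt.kind) {
      case "declare": // type checking, e.g. int x = "hello"
        if (stmt.declaredType !== stmt.valueType) {
          errors.push(`Type mismatch: ${stmt.name} is ${stmt.declaredType}, value is ${stmt.valueType}`);
        }
        declared.add(stmt.name);
        break;
      case "use": // scope checking
        if (!declared.has(stmt.name)) errors.push(`Variable does not exist: ${stmt.name}`);
        break;
      case "call": { // parameter checking
        const expected = functionArity[stmt.name];
        if (expected !== stmt.argCount) {
          errors.push(`Parameter count mismatch: ${stmt.name} expects ${expected}, got ${stmt.argCount}`);
        }
        break;
      }
    }
  }
  return errors;
}

// Type inference for arithmetic: mixing int and float yields float.
function inferArithmeticType(left, right) {
  return left === "float" || right === "float" ? "float" : "int";
}

const errors = analyze(
  [
    { kind: "declare", name: "x", declaredType: "int", valueType: "string" },
    { kind: "use", name: "y" },
    { kind: "call", name: "add", argCount: 3 },
  ],
  { add: 2 } // add accepts 2 parameters
);
console.log(errors);
console.log(inferArithmeticType("int", "float")); // float
```

The three statements reproduce the type, scope, and parameter rows of the table, and inferArithmeticType covers the 1 + 2.0 row.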

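Many Errors You See Come from These Checks

  • TypeError: Cannot read properties of undefined (a type check; JavaScript performs it at runtime rather than at compile time)
  • ReferenceError: x is not defined (a scope check; JavaScript also raises it at runtime)
  • Expected 2 arguments, but got 3 (a parameter check, reported at compile time by TypeScript)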

5.2 Code Optimization: Making Programs Faster

Before generating the final code, the compiler applies various optimizations to the intermediate code. These optimizations are transparent to the programmer but can significantly improve performance.

| Optimization Technique | Before | After | Principle |
| --- | --- | --- | --- |
| Constant folding | x = 10 + 5 | x = 15 | Compute the result at compile time |
| Dead code elimination | if (false) { ... } | Removed entirely | Code that will never execute |
| Constant propagation | x = 15; y = x * 2 | y = 30 | Replace with known values directly |
| Loop-invariant code motion | Repeatedly computing len = arr.length inside a loop | Move outside the loop | Avoid redundant computation |
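
The first two rows of the table can be sketched as small tree rewrites. The node shapes (NumericLiteral, BinaryExpression, IfStatement, BooleanLiteral) and the helper names are assumptions for this example, and production compilers usually run such passes on IR rather than directly on the AST.

```javascript
// Operators this sketch knows how to evaluate at "compile time".
const OPS = {
  "+": (a, b) => a + b,
  "-": (a, b) => a - b,
  "*": (a, b) => a * b,
};

// Constant folding: rewrite bottom-up, replacing any BinaryExpression whose
// two operands are literals with a single literal.
function foldConstants(node) {
  if (node.type !== "BinaryExpression") return node;
  const left = foldConstants(node.left);
  const right = foldConstants(node.right);
  const apply = OPS[node.operator];
  if (apply && left.type === "NumericLiteral" && right.type === "NumericLiteral") {
    return { type: "NumericLiteral", value: apply(left.value, right.value) };
  }
  return { ...node, left, right };
}

// Dead code elimination: if (false) { ... } disappears entirely, and
// if (true) { ... } is replaced by its body.
function eliminateDeadCode(statements) {
  return statements.flatMap((stmt) => {
    if (stmt.type !== "IfStatement" || stmt.test.type !== "BooleanLiteral") return [stmt];
    return stmt.test.value ? eliminateDeadCode(stmt.body) : [];
  });
}

// Small helpers to build nodes for the demo.
const num = (value) => ({ type: "NumericLiteral", value });
const bin = (operator, left, right) => ({ type: "BinaryExpression", operator, left, right });

// 10 + 5 becomes the literal 15.
console.log(foldConstants(bin("+", num(10), num(5))));

// x + 2 * 3 becomes x + 6: only the constant subtree is folded.
console.log(foldConstants(bin("+", { type: "Identifier", name: "x" }, bin("*", num(2), num(3)))));
```

Folding works from the leaves upward, so a constant subtree is collapsed even when the surrounding expression still depends on a variable.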

6. Optimization Techniques in Practice: How Compilers Make Code Faster

Above we mentioned several optimization technique names. Now let's dive deeper into exactly how compilers do this. The interactive component below demonstrates 5 of the most common compiler optimizations. You can intuitively compare the code before and after optimization.

Compiler Optimization Demo: Constant folding

Before optimization:

    const width = 10
    const height = 20
    const area = width * height  // computed at runtime
    console.log(area)

After optimization:

    const area = 200  // computed during compilation
    console.log(200)

How constant folding works: The compiler sees that width and height are constants, so it computes 10 * 20 = 200 during compilation. Runtime no longer needs a multiplication.

Performance gain shown in the demo: 30%

Modern compilers and JIT engines (such as V8, GCC, LLVM) automatically apply dozens of optimizations. As a developer, you don't need to perform these optimizations manually, but understanding them helps you:

  • Write code that's easier to optimize: For example, using const instead of let can make it easier for the compiler to see that a value never changes, which helps constant folding
  • Understand performance differences: Why can calling a small function cost almost nothing? Because the compiler can often inline it
  • Avoid "de-optimization": Certain coding patterns prevent compiler optimization, such as eval() and with

| Optimization Technique | Trigger Condition | Performance Impact | What Developers Can Do |
| --- | --- | --- | --- |
| Constant folding | All constants in an expression | Eliminates runtime computation | Use const declarations more |
| Dead code elimination | Unreachable code or unused results | Reduces code size | Clean up unused code promptly |
| Loop-invariant code motion | Invariant computation inside a loop | Reduces redundant computation | Manual extraction is also a good habit |
| Function inlining | Small functions called frequently | Eliminates call overhead | Keep functions small and focused |
| Constant propagation | Variable values known at compile time | Entire computation chain eliminated | Use constants instead of magic numbers |
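
Constant propagation is easiest to see on straight-line code. The sketch below runs it over a tiny list of assignments. The instruction format ({ target, expr }) and the operand shapes are invented for this example, and it deliberately ignores branches and loops, which real compilers handle with data-flow analysis.

```javascript
// Arithmetic this sketch can evaluate once both operands are known.
const ARITHMETIC = {
  "+": (a, b) => a + b,
  "*": (a, b) => a * b,
};

// Propagate known constants through a straight-line list of assignments.
function propagateConstants(instructions) {
  const known = new Map(); // variable name -> constant value known so far

  // Replace a variable operand with its value when that value is known.
  const resolve = (operand) =>
    operand.kind === "var" && known.has(operand.name)
      ? { kind: "const", value: known.get(operand.name) }
      : operand;

  return instructions.map(({ target, expr }) => {
    let result;
    if (expr.kind === "binary") {
      const left = resolve(expr.left);
      const right = resolve(expr.right);
      const apply = ARITHMETIC[expr.op];
      result =
        apply && left.kind === "const" && right.kind === "const"
          ? { kind: "const", value: apply(left.value, right.value) }
          : { ...expr, left, right };
    } else {
      result = resolve(expr);
    }
    // Remember the target if it is now a constant; forget it otherwise.
    if (result.kind === "const") known.set(target, result.value);
    else known.delete(target);
    return { target, expr: result };
  });
}

// x = 15; y = x * 2; z = y + n   (n is unknown at compile time)
const program = [
  { target: "x", expr: { kind: "const", value: 15 } },
  { target: "y", expr: { kind: "binary", op: "*", left: { kind: "var", name: "x" }, right: { kind: "const", value: 2 } } },
  { target: "z", expr: { kind: "binary", op: "+", left: { kind: "var", name: "y" }, right: { kind: "var", name: "n" } } },
];

console.log(JSON.stringify(propagateConstants(program), null, 2));
```

After the pass, y is the constant 30 and z has become 30 + n, so the chain through x and y needs no work at runtime.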

7. Compiled vs Interpreted vs JIT

After writing code, there are three "translation methods" to make it run. Each has its own strengths and weaknesses, directly determining the performance characteristics and use cases of the language.

Execution mode shown: Compiled

Source code (main.c) → Compiler (full compilation) → Machine code (binary executable) → Run directly (the CPU runs it directly)

  • Run speed: Very fast
  • Startup: Slow; compile first
  • Portability: Recompile required
  • Representative languages: C, C++, Rust, Go

| Dimension | Compiled | Interpreted | JIT (Just-In-Time) |
| --- | --- | --- | --- |
| Process | Fully compile to machine code first, then execute | Translate and execute line by line | Interpret first, then compile hot code |
| Execution speed | Fastest | Slowest | Medium (hot code approaches compiled speed) |
| Startup speed | Slow (requires compilation) | Fast (runs directly) | Medium (requires warm-up) |
| Cross-platform | Requires recompilation | Naturally cross-platform | Cross-platform |
| Representative languages | C, Rust, Go | Python, Ruby | JavaScript (V8), Java |

Why Is JavaScript So Fast?

V8's JIT compiler monitors which code is executed frequently (hot code) and compiles it into highly optimized machine code. So although JavaScript is an "interpreted language," its performance in V8 can approach that of compiled languages. This is also the foundation that enables Node.js to be used on the server side.
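
The "interpret first, compile what is hot" idea can be imitated in a few lines. This is a toy model and not how V8 is implemented: the call-count threshold, the function names, and the use of nested closures as a stand-in for machine code are all assumptions made for this illustration.

```javascript
const OPERATORS = {
  "+": (a, b) => a + b,
  "*": (a, b) => a * b,
};

// Tier 1: interpret the AST, inspecting node types again on every call.
function interpret(node, env) {
  switch (node.type) {
    case "NumericLiteral":
      return node.value;
    case "Identifier":
      return env[node.name];
    case "BinaryExpression":
      return OPERATORS[node.operator](interpret(node.left, env), interpret(node.right, env));
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

// Tier 2: translate the AST once into nested closures, so later calls
// no longer inspect node types at all.
function compileToClosure(node) {
  switch (node.type) {
    case "NumericLiteral": {
      const value = node.value;
      return () => value;
    }
    case "Identifier": {
      const name = node.name;
      return (env) => env[name];
    }
    case "BinaryExpression": {
      const apply = OPERATORS[node.operator];
      const left = compileToClosure(node.left);
      const right = compileToClosure(node.right);
      return (env) => apply(left(env), right(env));
    }
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const HOT_THRESHOLD = 3; // hypothetical: calls allowed before code counts as hot

function makeTieredFunction(ast) {
  let calls = 0;
  let compiled = null;
  const run = (env) => {
    calls += 1;
    if (!compiled && calls > HOT_THRESHOLD) compiled = compileToClosure(ast);
    return compiled ? compiled(env) : interpret(ast, env);
  };
  run.tier = () => (compiled ? "compiled" : "interpreted");
  return run;
}

// width * height + 1
const areaPlusOne = makeTieredFunction({
  type: "BinaryExpression",
  operator: "+",
  left: {
    type: "BinaryExpression",
    operator: "*",
    left: { type: "Identifier", name: "width" },
    right: { type: "Identifier", name: "height" },
  },
  right: { type: "NumericLiteral", value: 1 },
});

for (let i = 1; i <= 5; i++) {
  const result = areaPlusOne({ width: 10, height: 20 });
  console.log(i, result, areaPlusOne.tier());
}
```

The first three calls are interpreted, the fourth triggers the one-time translation, and every call returns 201 either way. A real JIT emits machine code and can fall back to the interpreter when its assumptions stop holding.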


Summary

Compiler principles aren't just knowledge for compiler developers. Understanding the compilation process helps you better understand error messages, choose appropriate languages, and write more efficient code.

Review the key points of this chapter:

  1. A compiler is a translator: Converts human-readable code into machine-executable instructions
  2. Six-stage pipeline: Lexical analysis → Syntax analysis → Semantic analysis → Intermediate code → Optimization → Code generation
  3. Lexical analysis breaks tokens: Breaks character streams into meaningful units like keywords, identifiers, and operators
  4. Syntax analysis builds AST: Organizes tokens into a tree structure according to grammar rules, reflecting operator precedence
  5. Semantic analysis ensures correctness: Type checking and scope checking catch many of the errors you encounter
  6. Compilers optimize automatically: Techniques like constant folding, dead code elimination, and function inlining make code automatically faster
  7. Three execution models: Compiled is fastest, interpreted is most flexible, JIT combines the best of both

Further Reading