Skip to content

Transformer and Attention Mechanism: The Core Engine of Large Language Models

In 2017, Google introduced the Transformer architecture in their paper "Attention Is All You Need," fundamentally changing the landscape of natural language processing. It abandoned traditional recurrent neural networks (RNNs) and relied solely on the attention mechanism to achieve stronger performance and higher training efficiency. Today, nearly all large language models — GPT, BERT, T5, LLaMA — are built upon the Transformer.

🔄
RNN 的困境
顺序处理,长距离依赖衰减
Transformer 突破
并行计算,全局注意力
🎯
注意力机制
动态关注重要信息
🚀
大模型基石
GPT、BERT 的核心架构

1. The Struggles of RNNs and the Breakthrough of Transformers

Before Transformers, the dominant approach for processing sequential data (such as text and speech) was recurrent neural networks (RNNs) and their variants like LSTM and GRU. These models used recurrent structures to process elements in a sequence one by one, maintaining a hidden state to remember historical information.

1.1 Three Fatal Flaws of RNNs

Sequential dependency, no parallelism: RNNs must wait for the previous time step to finish before processing the next word. This leads to extremely slow training and prevents full utilization of modern GPU parallel computing capabilities.

Long-range dependency decay: Even improved LSTMs gradually "forget" early information when processing long texts. For example, in a 500-word article, the model struggles to remember key information mentioned at the beginning.

Vanishing/exploding gradients: During backpropagation, gradients must pass through time steps layer by layer, making them prone to vanishing or exploding, leading to unstable training.

1.2 The Revolutionary Breakthrough of Transformers

Through the Self-Attention mechanism, Transformers allow the model to "see the entire sequence at a glance," directly computing relationships between any two positions without passing information step by step.

RNN / LSTM
顺序处理:词1 → 词2 → 词3
❌ 长距离依赖衰减
❌ 无法并行训练
Transformer
并行处理:所有词同时计算
✅ 全局注意力
✅ 高效并行

Core Advantages of Transformers

  • Parallel computation: Attention for all positions can be computed simultaneously, increasing training speed by tens of times
  • Global perspective: Directly captures long-range dependencies without sequence length limitations
  • Scalability: Clean, unified architecture that is easy to stack into deeper networks

2. Complete Transformer Architecture: From Overall to Detail

The complete Transformer architecture consists of an Encoder and a Decoder, responsible for understanding input and generating output respectively.

Encoder(编码器)
× N 层
Multi-Head Self-Attention
捕获输入序列内部依赖
Add & Norm
Feed Forward Network
位置独立的非线性变换
Add & Norm
输入
Token Embedding + Positional Encoding
Decoder(解码器)
输出
Linear + Softmax → 概率分布
× N 层
Masked Self-Attention
只看当前位置之前的词
Add & Norm
Cross-Attention
关注 Encoder 的输出
Add & Norm
Feed Forward Network
位置独立的非线性变换
Add & Norm
输出(移位)
Token Embedding + Positional Encoding

2.1 Encoder

Take the sentence "The balance in the bank account is insufficient" as an example. When the model processes the word "balance," it automatically computes relevance with other words:

  • "balance" and "account" are highly relevant (0.35)
  • "balance" and "bank" are moderately relevant (0.20)
  • "balance" and function words like "the," "in" have low relevance (0.05-0.10)

This relevance is not manually specified but automatically learned by the model from large amounts of data.

自注意力示例:「他」关注「小明」
小明 把 苹果 给了 的 母亲
小明
65%
5%
苹果
10%
给了
10%
5%
3%
母亲
2%
「他」把 65% 注意力投向「小明」,识别代词指代关系

2.2 The Attention Computation Process

The self-attention mechanism is implemented through three key steps:

  1. Generate Q, K, V vectors: Each word passes through three different linear transformations to produce Query, Key, and Value vectors
  2. Compute attention weights: Use Query to compute dot products with all Keys to obtain similarity scores
  3. Weighted sum: Use attention weights to compute a weighted sum of Value vectors to produce the final output

3. Query, Key, Value: The Three Musketeers of Attention

The Transformer attention mechanism draws inspiration from information retrieval, mapping each word to three different vector spaces.

3.1 The Roles of the Three Vectors

Query: Represents "what am I looking for." The current word's query intent, used to match against other words' Keys.

Key: Represents "what am I." Each word's feature identifier, used to be retrieved by Queries.

Value: Represents "what is my content." The actual information to be passed, weighted and summed according to attention weights.

The ingenuity of this design lies in the fact that similarity computation (Q·K) and information transfer (V) are decoupled. The model can learn that "which words to attend to" and "what information to extract after attending" are two independent problems.

🔍
Query
我想找什么
🔑
Key
我是什么
💎
Value
我的内容
Attention(Q, K, V) = softmax(QKT / √dk) V

3.2 Attention Computation Formula

The complete attention computation formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

  • QK^T: Computes the dot product of Query and Key to obtain a similarity matrix
  • √d_k: Scaling factor to prevent dot product values from becoming too large, which would cause softmax gradient vanishing
  • softmax: Converts similarities into a probability distribution (attention weights)
  • Finally multiplied with V: Uses attention weights to compute a weighted sum of Values

4. Multi-Head Attention: Understanding Semantics from Multiple Perspectives

A single attention head can only capture one type of dependency. To allow the model to understand sentences from multiple perspectives, Transformers introduced Multi-Head Attention.

4.1 How Multi-Head Works

Multi-head attention projects the input into multiple different subspaces, with each "head" independently computing attention, then concatenating all head outputs together.

Typical Transformers use 8 or 16 attention heads, with each head potentially focusing on different linguistic phenomena:

  • Syntax heads: Identify grammatical relationships like subject-verb-object, modifiers
  • Semantic heads: Capture word meaning correlations (e.g., "bank" and "account")
  • Positional heads: Focus on local dependencies between adjacent words
  • Coreference heads: Resolve pronoun references (e.g., "he" pointing to "John")
  • Sentiment heads: Identify positive/negative connotations and emotional tendencies
  • Entity heads: Recognize named entities like person names and place names
语法头
主谓宾关系
语义头
词义关联
位置头
距离关系
指代头
代词消解
情感头
情绪倾向
实体头
命名实体
修饰头
定状补
全局头
整体语境
8 个头从不同角度理解语义,最后拼接融合

4.2 Advantages of Multiple Heads

Stronger expressiveness: Different heads can capture different types of dependencies, avoiding the limitations of a single perspective.

Parallel computation: Multiple heads can compute simultaneously without increasing computation time.

Better robustness: Even if some heads fail to learn effectively, others can still provide useful information.

Mathematical Expression of Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Each head has independent weight matrices W^Q, W^K, W^V, and finally all head outputs are fused through W^O.


5. Complete Transformer Architecture: Encoder and Decoder

The complete Transformer architecture consists of an Encoder and a Decoder, responsible for understanding input and generating output respectively.

5.1 Encoder

The encoder is composed of multiple layers (typically 6-12) of identical structure stacked together, with each layer containing two sublayers:

  1. Multi-head self-attention layer: Captures dependencies within the input sequence
  2. Feed-forward neural network: Performs independent non-linear transformations on each position

Each sublayer is followed by a residual connection and layer normalization, ensuring training stability for deep networks.

5.2 Decoder

The decoder is also composed of multiple stacked layers, but each layer has three sublayers:

  1. Masked multi-head self-attention: Can only see words before the current position, preventing "cheating"
  2. Cross-attention: Connects the encoder and decoder, allowing the decoder to attend to the input sequence
  3. Feed-forward neural network: Same as in the encoder
Encoder(编码器)
× N 层
Multi-Head Self-Attention
捕获输入序列内部依赖
Add & Norm
Feed Forward Network
位置独立的非线性变换
Add & Norm
输入
Token Embedding + Positional Encoding
Decoder(解码器)
输出
Linear + Softmax → 概率分布
× N 层
Masked Self-Attention
只看当前位置之前的词
Add & Norm
Cross-Attention
关注 Encoder 的输出
Add & Norm
Feed Forward Network
位置独立的非线性变换
Add & Norm
输出(移位)
Token Embedding + Positional Encoding

5.3 Modern Variants: Encoder-Only vs Decoder-Only

Although the original Transformer includes both encoder and decoder, modern large language models typically use only one of them:

Architecture TypeRepresentative ModelsSuitable Tasks
Encoder-OnlyBERT, RoBERTaText classification, named entity recognition, question answering
Decoder-OnlyGPT, LLaMA, ClaudeText generation, dialogue, code completion
Encoder-DecoderT5, BARTTranslation, summarization, text rewriting

Why Does GPT Only Use a Decoder?

The GPT model family uses an autoregressive generation approach, predicting the next word one at a time. The decoder-only architecture is naturally suited for such generation tasks and offers a simpler structure that is easier to scale to hundreds of billions of parameters.


6. Positional Encoding: Telling the Model Word Order

The self-attention mechanism of Transformers is inherently position-agnostic — it treats a sentence as a set of words without caring about word order. But word order is crucial for semantics: "I love you" and "You love me" mean completely different things!

6.1 The Necessity of Positional Encoding

To allow the model to perceive positional information, Transformers add Positional Encoding to the input embeddings. Positional encoding is a vector with the same dimension as word embeddings, directly added to them.

问题:词序很重要
我爱你你爱我
解决:位置编码
Token Embedding + Positional Encoding
正弦余弦(Transformer 原始)
可学习(BERT、GPT)
旋转编码 RoPE(LLaMA)

6.2 Sinusoidal Positional Encoding

The original Transformer uses fixed sine and cosine functions to generate positional encodings:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Advantages of this design:

  • Uniqueness: Each position has a unique encoding
  • Relative position: The model can learn relative distance relationships
  • Extrapolation: Can handle sequences longer than those seen during training

6.3 Modern Positional Encoding Schemes

As research has deepened, more positional encoding schemes have emerged:

Learnable positional encoding: BERT and GPT treat positional encodings as trainable parameters rather than fixed functions.

Relative positional encoding: T5 and DeBERTa encode relative distances between words rather than absolute positions.

Rotary Position Embedding (RoPE): Used by LLaMA and GPT-NeoX, injects positional information by rotating Q and K vectors, offering better extrapolation performance.

ALiBi: Achieves position awareness by adding a bias term to attention scores, requiring no additional parameters.


7. The Impact and Future of Transformers

The emergence of Transformers is not just the birth of a new architecture, but a paradigm shift in AI research as a whole.

7.1 A Unified Pre-training Paradigm

Transformers have made "pre-training + fine-tuning" the standard workflow in NLP. By pre-training on massive amounts of unlabeled text, models learn universal language representations and can then adapt to various downstream tasks with only a small amount of labeled data.

7.2 A Universal Cross-Modal Architecture

The success of Transformers is not limited to text. They have been successfully applied to:

  • Computer vision: Vision Transformer (ViT) surpasses CNNs in image classification
  • Speech recognition: Whisper uses Transformers for multilingual speech-to-text
  • Protein structure prediction: AlphaFold 2 uses Transformers to predict protein 3D structures
  • Reinforcement learning: Decision Transformer transforms RL problems into sequence modeling

7.3 The Cornerstone of the Large Model Era

From GPT-3's 175 billion parameters to GPT-4's trillions of parameters, Transformers have demonstrated astonishing scalability. Their parallel computation characteristics allow us to train unprecedentedly large models and observe emergent abilities — when models become large enough, they spontaneously "grasp" capabilities like reasoning, coding, and multilingualism.

7.4 Future Challenges and Directions

Despite the tremendous success of Transformers, challenges remain:

Computational complexity: Self-attention has O(n²) complexity, resulting in enormous computation for long texts.

Long-text modeling: Although theoretically capable of handling arbitrary lengths, it is practically constrained by memory and computational resources.

Interpretability: While attention weights provide some interpretability, the decision process of deep networks remains a black box.

Current research directions include:

  • Efficient Transformers: Linformer, Performer, Flash Attention, etc., reducing complexity
  • Long-context modeling: Sparse Attention, Sliding Window, Memory mechanisms
  • Multimodal fusion: Native multimodal architectures that uniformly process text, images, and audio

8. Conclusion

The introduction of Transformers and attention mechanisms marks a complete shift in deep learning from "handcrafted features" to "end-to-end learning." It not only resolved the technical bottlenecks of RNNs but, more importantly, provided a clean, universal, and scalable architecture that has become the cornerstone of the large model era.

Understanding Transformers is understanding the core of modern AI. From BERT's bidirectional encoding to GPT's autoregressive generation to unified multimodal representations, all these breakthroughs stand on the shoulders of the Transformer.

As computing power advances and algorithms improve, Transformers will continue to evolve, driving AI toward ever more powerful and general capabilities.