Transformer and Attention Mechanism: The Core Engine of Large Language Models

In 2017, Google introduced the Transformer architecture in their paper "Attention Is All You Need," fundamentally changing the landscape of natural language processing. It abandoned traditional recurrent neural networks (RNNs) and relied solely on the attention mechanism to achieve stronger performance and higher training efficiency. Today, nearly all large language models — GPT, BERT, T5, LLaMA — are built upon the Transformer.

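RNN limitations: Sequential processing, long-range dependency decay
Transformer breakthrough: Parallel computation, global attention
Attention mechanism: Dynamically focuses on the most relevant information
Foundation of large models: The core architecture of GPT and BERT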

1. The Struggles of RNNs and the Breakthrough of Transformers

Before Transformers, the dominant approach for processing sequential data (such as text and speech) was recurrent neural networks (RNNs) and their variants like LSTM and GRU. These models used recurrent structures to process elements in a sequence one by one, maintaining a hidden state to remember historical information.

1.1 Three Fatal Flaws of RNNs

Sequential dependency, no parallelism: RNNs must wait for the previous time step to finish before processing the next word. This leads to extremely slow training and prevents full utilization of modern GPU parallel computing capabilities.

Long-range dependency decay: Even improved LSTMs gradually "forget" early information when processing long texts. For example, in a 500-word article, the model struggles to remember key information mentioned at the beginning.

Vanishing/exploding gradients: During backpropagation through time, gradients are multiplied through every time step in turn, so they tend to shrink toward zero or blow up, which destabilizes training.
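The third flaw can be made concrete with a little arithmetic. The sketch below assumes, as a simplification, that every time step scales the gradient by the same constant; a real RNN multiplies by a Jacobian matrix at each step, so the `factor` values and the `gradient_scale` helper are hypothetical.

```python
def gradient_scale(factor: float, steps: int) -> float:
    """Scale of a gradient after being multiplied by `factor` at every time step."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

# A factor slightly below 1 vanishes, a factor slightly above 1 explodes.
for factor in (0.9, 1.0, 1.1):
    print(f"factor={factor}: scaled by {gradient_scale(factor, 100):.3e} after 100 steps")
```

Even a per-step factor as mild as 0.9 or 1.1 changes the gradient by several orders of magnitude over 100 steps, which is why long sequences are hard for recurrent models.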

1.2 The Revolutionary Breakthrough of Transformers

Through the Self-Attention mechanism, Transformers allow the model to "see the entire sequence at a glance," directly computing relationships between any two positions without passing information step by step.

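RNN / LSTM: Sequential processing (word 1 → word 2 → word 3); long-range dependencies decay; training cannot be parallelized
Transformer: Parallel processing (all words computed at once); global attention; efficient parallelism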

Core Advantages of Transformers

Parallel computation: Attention for all positions can be computed simultaneously, which substantially speeds up training on GPUs
Global perspective: Any two positions are connected in a single attention step, so long-range dependencies are captured directly, regardless of distance
Scalability: Clean, unified architecture that is easy to stack into deeper networks

2. Complete Transformer Architecture: From Overall to Detail

The complete Transformer architecture consists of an Encoder and a Decoder, responsible for understanding input and generating output respectively.

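Encoder (× N layers)

Input: Token Embedding + Positional Encoding
Multi-Head Self-Attention: Captures dependencies within the input sequence
Add & Norm
Feed Forward Network: Position-wise non-linear transformation
Add & Norm

Decoder (× N layers)

Input: Output sequence (shifted), Token Embedding + Positional Encoding
Masked Self-Attention: Sees only earlier words, never later ones
Add & Norm
Cross-Attention: Attends to the encoder output
Add & Norm
Feed Forward Network: Position-wise non-linear transformation
Add & Norm
Output: Linear + Softmax → probability distribution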

2.1 Encoder

Take the sentence "The balance in the bank account is insufficient" as an example. When the model processes the word "balance," it computes an attention weight for every other word in the sentence (the numbers below are illustrative):

"balance" and "account" are highly relevant (0.35)
"balance" and "bank" are moderately relevant (0.20)
"balance" and function words like "the," "in" have low relevance (0.05-0.10)

This relevance is not manually specified but automatically learned by the model from large amounts of data.

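Self-attention example: 他 ("his") attends to 小明 ("Xiao Ming")

Sentence: 小明把苹果给了他的母亲 ("Xiao Ming gave the apple to his mother")
Attention from 他: 小明 (Xiao Ming) 65%, 苹果 (apple) 10%, 给了 (gave) 10%

The pronoun 他 places 65% of its attention on 小明, which identifies the coreference relationship.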

2.2 The Attention Computation Process

The self-attention mechanism is implemented through three key steps:

Generate Q, K, V vectors: Each word passes through three different linear transformations to produce Query, Key, and Value vectors
Compute attention weights: Take the dot product of the Query with every Key to get similarity scores, then scale them and apply softmax to turn them into weights
Weighted sum: Use attention weights to compute a weighted sum of Value vectors to produce the final output
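The three steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the helper names (`matmul`, `softmax`, `self_attention`), the toy 3-token input, and the identity projection matrices are all assumptions chosen so the numbers are easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Step 1: project every token embedding into Query, Key, and Value vectors.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Step 2: dot each Query with all Keys, scale by sqrt(d_k), normalize with softmax.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Step 3: each output row is the attention-weighted sum of the Value vectors.
    return matmul(weights, v), weights

# Hypothetical toy input: 3 tokens with 2-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, including itself.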

3. Query, Key, Value: The Three Musketeers of Attention

The Transformer attention mechanism draws inspiration from information retrieval, mapping each word to three different vector spaces.

3.1 The Roles of the Three Vectors

Query: Represents "what am I looking for." The current word's query intent, used to match against other words' Keys.

Key: Represents "what am I." Each word's feature identifier, which Queries are matched against.

Value: Represents "what is my content." The actual information to be passed, weighted and summed according to attention weights.

The ingenuity of this design is that similarity computation (Q·K) and information transfer (V) are decoupled: the model learns "which words to attend to" and "what information to extract from them" as two separate problems.

Figure: Query ("what am I looking for"), Key ("what am I"), Value ("what is my content").

3.2 Attention Computation Formula

The complete attention computation formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:

QK^T: Computes the dot product of Query and Key to obtain a similarity matrix
√d_k: Scaling factor that keeps dot products from growing with the vector dimension; without it, large scores push softmax into saturated regions where gradients vanish
softmax: Converts similarities into a probability distribution (attention weights)
Finally multiplied with V: Uses attention weights to compute a weighted sum of Values
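The role of the √d_k term is easier to see numerically. The sketch below assumes query and key components drawn from a standard normal distribution (an illustrative assumption, as are the seed, `d_k = 64`, and the four keys) and compares the softmax output with and without scaling.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # arbitrary seed so the run is repeatable
d_k = 64
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]

# Raw dot products have a spread that grows with d_k; dividing by sqrt(d_k) undoes that.
raw_scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

unscaled_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)
print("unscaled:", [round(w, 3) for w in unscaled_weights])
print("scaled:  ", [round(w, 3) for w in scaled_weights])
```

Dividing the scores by a constant greater than 1 can only flatten the distribution, so the scaled weights are never more peaked than the unscaled ones. A flatter softmax sits further from saturation, which keeps gradients usable.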

4. Multi-Head Attention: Understanding Semantics from Multiple Perspectives

A single attention head produces only one set of attention weights per position, which limits how many kinds of dependency it can capture at once. To allow the model to understand sentences from multiple perspectives, Transformers introduced Multi-Head Attention.

4.1 How Multi-Head Works

Multi-head attention projects the input into multiple different subspaces, with each "head" independently computing attention, then concatenating all head outputs together.

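The original Transformer uses 8 attention heads in its base configuration and 16 in its large one. Each head can specialize in a different linguistic phenomenon. The roles below are illustrative; in trained models, head behavior is learned and rarely this cleanly separated: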

Syntax heads: Identify grammatical relationships like subject-verb-object, modifiers
Semantic heads: Capture word meaning correlations (e.g., "bank" and "account")
Positional heads: Focus on local dependencies between adjacent words
Coreference heads: Resolve pronoun references (e.g., "he" pointing to "John")
Sentiment heads: Identify positive/negative connotations and emotional tendencies
Entity heads: Recognize named entities like person names and place names

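Modifier heads: Track attributive, adverbial, and complement modifiers
Global heads: Summarize the overall context

In this illustration, 8 heads view the sentence from different angles, and their outputs are concatenated and fused.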

4.2 Advantages of Multiple Heads

Stronger expressiveness: Different heads can capture different types of dependencies, avoiding the limitations of a single perspective.

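Parallel computation: All heads are computed at the same time, and because each head works in a smaller subspace (dimension d_model / h), the total cost is close to that of a single full-dimension head.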

Better robustness: Even if some heads fail to learn effectively, others can still provide useful information.

Mathematical Expression of Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Each head i has its own weight matrices W_i^Q, W_i^K, W_i^V, and W^O fuses the concatenated head outputs.
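The formula above can be traced with a small pure-Python sketch. It is an illustration under stated assumptions: the helper names, the two hand-picked heads (each simply selecting half of a 4-dimensional input), and the identity matrix used for W^O are hypothetical choices, not learned weights.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """`heads` is a list of (W_i^Q, W_i^K, W_i^V) triples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concat(head_1, ..., head_h): join the per-head outputs token by token.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)

# Hypothetical setup: d_model = 4, h = 2, so each head works in a 2-dimensional subspace.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head(x, heads, w_o)
print(len(out), "tokens x", len(out[0]), "features")
```

Each head attends over the same tokens but through its own projection, so the two heads here produce different attention patterns before their outputs are concatenated back to the model dimension.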

5. Complete Transformer Architecture: Encoder and Decoder


5.1 Encoder

The encoder is a stack of identical layers (6 in the original paper, 12 in BERT-base), with each layer containing two sublayers:

Multi-head self-attention layer: Captures dependencies within the input sequence
Feed-forward neural network: Performs independent non-linear transformations on each position

Each sublayer is wrapped in a residual connection followed by layer normalization, which keeps training stable in deep networks.
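The "Add & Norm" step can be sketched as follows. This is a simplified illustration: the layer normalization here omits the learned gain and bias that real implementations include, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one position's features to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return [layer_norm([a + b for a, b in zip(row, sub_row)])
            for row, sub_row in zip(x, sublayer(x))]

def toy_sublayer(x):
    """Hypothetical stand-in for a sublayer: a ReLU applied to each feature."""
    return [[max(0.0, v) for v in row] for row in x]

x = [[1.0, -2.0, 3.0, 0.5], [0.0, 4.0, -1.0, 2.0]]
out = add_and_norm(x, toy_sublayer)
for row in out:
    print([round(v, 3) for v in row])
```

The residual path lets the input pass through unchanged alongside the sublayer output, and the normalization keeps each position's activations on a comparable scale from layer to layer.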

5.2 Decoder

The decoder is also composed of multiple stacked layers, but each layer has three sublayers:

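Masked multi-head self-attention: Each position can attend only to itself and earlier positions, so the model cannot peek at the words it is supposed to predict
Cross-attention: Connects the encoder and decoder; Queries come from the decoder, while Keys and Values come from the encoder output, allowing the decoder to attend to the input sequence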
Feed-forward neural network: Same as in the encoder
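The masking in the first sublayer can be sketched in a few lines. This is a minimal illustration assuming a square matrix of raw scores where `scores[i][j]` is query position i against key position j; the function name and the all-zero toy scores are hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Positions j > i lie in the future; a score of -inf gives them zero weight.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With identical scores, each position spreads its attention evenly over what it may see.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in weights:
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, the second splits its attention over two positions, and so on, which is what allows all positions to be trained in parallel without leaking future words.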


5.3 Modern Variants: Encoder-Only vs Decoder-Only

Although the original Transformer includes both encoder and decoder, many modern large language models use only one of them, while some keep both:

Architecture Type	Representative Models	Suitable Tasks
Encoder-Only	BERT, RoBERTa	Text classification, named entity recognition, question answering
Decoder-Only	GPT, LLaMA, Claude	Text generation, dialogue, code completion
Encoder-Decoder	T5, BART	Translation, summarization, text rewriting

Why Does GPT Only Use a Decoder?

The GPT model family uses an autoregressive generation approach, predicting one token at a time and feeding each prediction back in as input. The decoder-only architecture is naturally suited for such generation tasks and offers a simpler structure that is easier to scale to hundreds of billions of parameters.
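The autoregressive loop itself is short. The sketch below uses greedy decoding and a hypothetical `toy_model` backed by a hand-written bigram table in place of a real decoder-only network; all names and scores are made up for illustration.

```python
def greedy_generate(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Autoregressive loop: score the next token, append the best one, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)  # dict: candidate token -> score
        best = max(scores, key=scores.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens

# Hypothetical stand-in for a decoder-only model: it only looks at the last token.
BIGRAMS = {
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "sat": {"<eos>": 1.0},
}

def toy_model(tokens):
    return BIGRAMS.get(tokens[-1], {"<eos>": 1.0})

print(greedy_generate(toy_model, ["the"], max_new_tokens=5))  # ['the', 'cat', 'sat']
```

A real model would condition on the whole prefix through masked self-attention and would usually sample from the distribution rather than always taking the top token, but the loop structure is the same.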

6. Positional Encoding: Telling the Model Word Order

The self-attention mechanism of Transformers is inherently position-agnostic — it treats a sentence as a set of words without caring about word order. But word order is crucial for semantics: "I love you" and "You love me" mean completely different things!

6.1 The Necessity of Positional Encoding

To allow the model to perceive positional information, Transformers add Positional Encoding to the input embeddings. Positional encoding is a vector with the same dimension as word embeddings, directly added to them.

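Figure: Word order matters: 我爱你 ("I love you") ≠ 你爱我 ("You love me"). The solution is Token Embedding + Positional Encoding, with three common schemes: sinusoidal (original Transformer), learnable (BERT, GPT), and rotary RoPE (LLaMA).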

6.2 Sinusoidal Positional Encoding

The original Transformer uses fixed sine and cosine functions to generate positional encodings, where pos is the token position, i indexes pairs of dimensions, and d is the embedding dimension:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Advantages of this design:

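Uniqueness: Each position has a unique encoding
Relative position: For a fixed offset k, PE(pos + k) is a linear function of PE(pos), which makes relative distances easy for the model to learn
Extrapolation: The functions are defined for any position, so in principle they extend to sequences longer than those seen during training, although quality often degrades in practice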
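The two formulas translate directly into code. This is a minimal sketch assuming an even embedding dimension; the function names `positional_encoding` and `dot` are illustrative.

```python
import math

def positional_encoding(pos: int, d_model: int):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])  # dimensions 2i and 2i+1
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]

# The dot product of two encodings depends only on the offset between positions.
near = dot(positional_encoding(3, 8), positional_encoding(5, 8))
far = dot(positional_encoding(10, 8), positional_encoding(12, 8))
print(abs(near - far) < 1e-9)  # True
```

The second check illustrates the relative-position property: positions 3 and 5 relate to each other exactly as positions 10 and 12 do, because each sine and cosine pair contributes the cosine of the offset times its frequency.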

6.3 Modern Positional Encoding Schemes

As research has deepened, more positional encoding schemes have emerged:

Learnable positional encoding: BERT and GPT treat positional encodings as trainable parameters rather than fixed functions.

Relative positional encoding: T5 and DeBERTa encode relative distances between words rather than absolute positions.

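Rotary Position Embedding (RoPE): Used by LLaMA and GPT-NeoX, it injects positional information by rotating the Q and K vectors so that their dot product depends on the relative offset between positions. It often extrapolates better than learned absolute encodings.

ALiBi: Adds a fixed bias to each attention score that grows linearly with the distance between query and key, giving position awareness without any additional learned parameters.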
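The last two schemes, RoPE and ALiBi, can each be sketched in a few lines. This is a simplified illustration: `rope` assumes an even vector length and rotates adjacent pairs of dimensions, and the `slope` passed to `alibi_bias` is a hypothetical per-head constant. Neither function is taken from a particular library.

```python
import math

def rope(vector, pos, base=10000.0):
    """Rotate each (even, odd) pair of `vector` by an angle proportional to `pos`."""
    d = len(vector)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def alibi_bias(query_pos, key_pos, slope):
    """ALiBi-style penalty added to a causal attention score; grows linearly with distance."""
    return -slope * (query_pos - key_pos)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3 positions apart) at two different absolute locations.
score_a = dot(rope(q, 2), rope(k, 5))
score_b = dot(rope(q, 10), rope(k, 13))
print(abs(score_a - score_b) < 1e-9)  # True

print(alibi_bias(5, 3, slope=0.5))  # -1.0
```

In both cases the attention score ends up depending on how far apart two tokens are rather than on where they sit in the sequence, which is the property that helps with longer inputs.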

7. The Impact and Future of Transformers

The emergence of Transformers is not just the birth of a new architecture, but a paradigm shift in AI research as a whole.

7.1 A Unified Pre-training Paradigm

Transformers have made "pre-training + fine-tuning" the standard workflow in NLP. By pre-training on massive amounts of unlabeled text, models learn universal language representations and can then adapt to various downstream tasks with only a small amount of labeled data.

7.2 A Universal Cross-Modal Architecture

The success of Transformers is not limited to text. They have been successfully applied to:

Computer vision: Vision Transformer (ViT) matches or surpasses strong CNNs on image classification when pre-trained on large datasets
Speech recognition: Whisper uses Transformers for multilingual speech-to-text
Protein structure prediction: AlphaFold 2 uses attention-based modules to predict protein 3D structures
Reinforcement learning: Decision Transformer transforms RL problems into sequence modeling

7.3 The Cornerstone of the Large Model Era

From GPT-3's 175 billion parameters to later frontier models such as GPT-4, whose parameter count has not been officially disclosed, Transformers have demonstrated remarkable scalability. Their parallel computation characteristics allow us to train unprecedentedly large models and observe what are often described as emergent abilities: once models become large enough, capabilities like reasoning, coding, and multilingualism appear without being explicitly trained for.

7.4 Future Challenges and Directions

Despite the tremendous success of Transformers, challenges remain:

Computational complexity: Self-attention has O(n²) time and memory cost in the sequence length n, so computation becomes enormous for long texts.

Long-text modeling: Although theoretically capable of handling arbitrary lengths, it is practically constrained by memory and computational resources.

Interpretability: While attention weights provide some interpretability, the decision process of deep networks remains a black box.
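To put the quadratic cost from the first challenge in numbers, the sketch below estimates the memory needed to store one attention score matrix. It assumes the matrix is fully materialized with 2 bytes per value (as with 16-bit floats) for a single head, layer, and sequence; the function name is hypothetical.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Memory for one n x n attention score matrix (one head, one layer, one sequence)."""
    return seq_len * seq_len * bytes_per_value

for n in (1_000, 10_000, 100_000):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>7}: {mib:,.1f} MiB")
```

Every tenfold increase in sequence length multiplies this figure by one hundred, which is why long-context work focuses on avoiding the full matrix.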

Current research directions include:

Efficient Transformers: Linformer and Performer approximate attention to reduce its complexity, while FlashAttention computes exact attention with far less memory traffic
Long-context modeling: Sparse Attention, Sliding Window, Memory mechanisms
Multimodal fusion: Native multimodal architectures that uniformly process text, images, and audio

8. Conclusion

The introduction of Transformers and attention mechanisms marks a shift in deep learning from task-specific, recurrence-based designs to a single general-purpose architecture. It not only resolved the technical bottlenecks of RNNs but, more importantly, provided a clean, universal, and scalable architecture that has become the cornerstone of the large model era.

Understanding Transformers is understanding the core of modern AI. From BERT's bidirectional encoding to GPT's autoregressive generation to unified multimodal representations, all these breakthroughs stand on the shoulders of the Transformer.

As computing power advances and algorithms improve, Transformers will continue to evolve, driving AI toward ever more powerful and general capabilities.
