Chapter 15: The Encoder-Decoder Stack (Transformer)
Mr. Pallas's Cat's Warm Welcome
Before we knew it, we have journeyed through five chapters: from the simplest neuron, through backpropagation's learning, LSTM's memory, and attention's focus, to the core of modern AI, the Transformer encoder-decoder stack. Today, we answer a crucial question: how do we organize simple components into powerful systems? What wisdom emerges when attention layers are stacked one upon another, when feedforward networks join in, when residual connections run throughout? Take your time, and let's explore the mysteries of the encoder-decoder stack together.
Core Question: From Components to Systems
Piglet stared at the complex architecture diagram on the screen, brow slightly furrowed. "Professor, we've learned about attention mechanisms, feedforward networks, normalization... these components are all interesting, but how do they combine into a complete Transformer? It's like I have a pile of Lego bricks but don't know how to assemble them into a castle."
It was a spring morning in Kangle Garden at Sun Yat-sen University. Morning light streamed through the glass windows into the Black Stone House study, casting warm patches of light on the red-brick floor. Outside, kapok blossoms were in full bloom, their vivid red flowers swaying gently in the morning breeze. Inside the study, a wisp of steam rose from the gongfu tea set; the wall clock ticked steadily, as if timing the final sprint of their learning.
By the window, Little Seal looked up and adjusted his glasses: "This is really a system design problem. Historically, many complex systems are built from simple components organized through specific structures. The Transformer's breakthrough lies largely in its modular design and layered stacking."
Mr. Pallas's Cat gently set down his teacup and smiled. "You've raised an excellent question. A single attention layer is like a powerful 'eye,' but true intelligence requires organization. Today, let's explore how to organize these components into a complete Transformer system."
The Birth of Transformer: Attention Is All You Need
Piglet walked to the whiteboard and casually sketched attention, feedforward, and normalization diagrams.
"Professor, I remember the famous 2017 paper 'Attention Is All You Need.' The title says attention is all you need. But a Transformer has more than just attention — there are feedforward networks and normalization too."
Little Seal set down his book and gently added: "The title is actually rhetorical. The paper's actual contribution was showing that an attention-based encoder-decoder architecture could surpass the RNN and CNN models of the time. The key innovation was relying entirely on attention mechanisms to process sequences — no more recurrent or convolutional structures needed."
Mr. Pallas's Cat nodded: "Right. The core idea of the Transformer is: completely replace recurrence and convolution with attention mechanisms. But this isn't just a simple statement — it requires carefully designing the entire system architecture."
He drew the Transformer's overall architecture on the whiteboard:
Encoder Stack (N×):
Input → Positional Encoding → [Multi-Head Attention → Add & Norm → FFN → Add & Norm] × N → Output
Decoder Stack (N×):
Input → Positional Encoding → [Masked Multi-Head Attention → Add & Norm → Encoder-Decoder Attention → Add & Norm → FFN → Add & Norm] × N → Output
"Look at this architecture," Mr. Pallas's Cat said, pointing at the whiteboard. "The Transformer is not a single algorithm, but an organized stacking of components. Each component has a clear function, connected in specific ways."
Piglet leaned in to study the architecture diagram closely: "Both the encoder and decoder are 'stacks'? Like stacking one layer on top of another, tier by tier?"
"Exactly," Mr. Pallas's Cat smiled. "The Transformer's 'stack' design embodies the core philosophy of deep learning: through deep hierarchical processing, extracting complex patterns from simple features."
The Encoder Stack: The Art of Understanding
Outside, the sunlight grew stronger, casting dappled shadows through the kapok leaves onto the red-brick floor.
Piglet rested her chin on her hand and asked: "Professor, what exactly does the encoder do? How does it 'understand' the input sequence?"
Mr. Pallas's Cat walked to the whiteboard and began explaining the encoder's design in detail.
"The encoder's task is to create rich representations of the input sequence," he explained. "Through multiple layers of processing, it gradually extracts and integrates information — somewhat like how we progress from words to sentences to paragraphs when reading."
He listed the three core components of an encoder layer on the whiteboard:
- Multi-head attention: lets every position attend to all positions, building global relationships
- Feedforward network: applies independent nonlinear transformations to each position, increasing model expressiveness
- Add & Norm: residual connections maintain information flow; layer normalization stabilizes training
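The three components listed above can be wired together in a few lines. The following is a minimal NumPy sketch rather than the chapter's full implementation: it assumes a single attention head, omits the learnable scale and shift of layer normalization, and uses hypothetical randomly initialised weights (`Wq`, `Wk`, `Wv`, `W1`, `W2`).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Learnable scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Hypothetical, randomly initialised weights (single attention head).
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def self_attention(x):
    """Every position attends to all positions of the same sequence."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d_model))
    return weights @ V

def feed_forward(x):
    """Applied to each position separately."""
    return np.maximum(0, x @ W1) @ W2

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

def encoder_layer(x):
    x = add_and_norm(x, self_attention)   # sublayer 1
    x = add_and_norm(x, feed_forward)     # sublayer 2
    return x

x = rng.normal(size=(6, d_model))         # 6 positions
y = encoder_layer(x)
print(y.shape)  # (6, 8)
```

Because the output has the same shape as the input, layers of this kind can be applied one after another, each with its own weights, which is what makes the stack possible.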
Residual Connections: Information Highways
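Mr. Pallas's Cat highlighted the "Add" symbol with red pen.
"Residual connections are a key innovation in deep learning," he explained. "The formula is simple:
$$\text{output} = x + \text{Sublayer}(x)$$
The sublayer's result is added to its input instead of replacing it."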
Piglet tilted her head: "So residual connections let information 'skip' certain transformations? Even if this layer doesn't learn well, it can at least pass the original information through?"
"Well understood," Mr. Pallas's Cat nodded approvingly. "Residual connections mitigate the vanishing gradient problem in deep networks, which makes very deep stacking practical. More importantly, they provide an information highway: low-level features can pass directly to high layers without being completely altered by intermediate transformations."
Little Seal added: "This is somewhat like the 'shortcut connections' in the brain. Neuroscience has discovered that the brain also has direct pathways connecting distant regions, not necessarily going through all intermediate processing."
"Well said," Mr. Pallas's Cat said. "It is largely thanks to residual connections that Transformers can be stacked dozens or even hundreds of layers deep and still remain trainable."
Layer Normalization: Stable Training
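Mr. Pallas's Cat wrote the layer normalization formula on the whiteboard:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension, $\gamma$ and $\beta$ are learnable scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.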
"Layer normalization normalizes across the feature dimension for each sample," he explained. "This stabilizes the distribution of activation values, allowing training to converge faster."
Piglet understood: "So each layer's output gets normalized, ensuring the data distribution entering the next layer is relatively stable?"
"Exactly. Layer normalization and residual connections together form the 'stabilizer' of Transformer training, enabling deep networks to train smoothly."
Feedforward Network: Position-Independent Processing
Mr. Pallas's Cat drew the feedforward network structure:
Input → Linear Transform → ReLU Activation → Linear Transform → Output
"The feedforward network operates independently on each position," he explained. "It provides nonlinear transformation capability, increasing the model's expressiveness. You can think of it as each position having its own 'mini-processor.'"
Little Seal thought: "The feedforward network is like each position's 'micro-brain'? Independently processing that position's information?"
"That's a vivid analogy," Mr. Pallas's Cat smiled. "The feedforward network handles internal processing for each position, while the attention mechanism handles external communication between positions. One manages 'internal affairs,' the other manages 'external exchange.'"
The Decoder Stack: The Art of Generation
Outside, kapok petals drifted down in the wind, like red snowflakes under the sunlight.
Piglet asked curiously: "Professor, why is the decoder more complex than the encoder? I see it has an extra attention layer?"
Mr. Pallas's Cat walked to the whiteboard and began comparing the encoder and decoder.
"The decoder's task is indeed more complex," he explained. "It must predict the next element based on the encoder's understanding and what has already been generated. This requires two attention mechanisms working together, followed by a feedforward network."
He listed the decoder's three sublayers on the whiteboard:
- Masked multi-head attention: causal self-attention — can only see what has already been generated
- Encoder-decoder attention: cross-attention — attends to the encoder's output
- Feedforward network: position-independent processing, same as in the encoder
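A minimal sketch of how these three sublayers might fit together is shown below. It is a simplification under stated assumptions: one attention head, no learnable layer-norm parameters, unbatched inputs, hypothetical random weights, and a mask convention in which 1 marks a blocked position. The masking and cross-attention steps are discussed in the sections that follow.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(query_src, key_value_src, params, mask=None):
    """Single-head attention; returns (output, weights)."""
    Wq, Wk, Wv = params
    Q, K, V = query_src @ Wq, key_value_src @ Wk, key_value_src @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9     # 1 in the mask means "blocked"
    weights = softmax(scores)
    return weights @ V, weights

def init_attention():
    return [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3)]

self_attn_params = init_attention()
cross_attn_params = init_attention()
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d_model)) * 0.1

def decoder_layer(target, encoder_output):
    tgt_len = target.shape[0]
    causal_mask = np.triu(np.ones((tgt_len, tgt_len)), k=1)
    # 1. Masked self-attention over what has been generated so far.
    attn, self_w = attention(target, target, self_attn_params, causal_mask)
    x = layer_norm(target + attn)
    # 2. Cross-attention: queries from the decoder, keys/values from the encoder.
    attn, cross_w = attention(x, encoder_output, cross_attn_params)
    x = layer_norm(x + attn)
    # 3. Position-wise feedforward network.
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)
    return x, self_w, cross_w

encoder_output = rng.normal(size=(7, d_model))   # source sentence: 7 tokens
target = rng.normal(size=(4, d_model))           # target so far: 4 tokens
out, self_w, cross_w = decoder_layer(target, encoder_output)
print(out.shape, self_w.shape, cross_w.shape)    # (4, 16) (4, 4) (4, 7)
```

Note that the cross-attention weights have shape (target length, source length), so the source and target sequences do not need to have the same length.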
Masked Attention: The Wisdom of Causal Constraint
Mr. Pallas's Cat drew a triangular mask on the attention matrix.
"Masked attention ensures the autoregressive property," he explained. "When generating position $t$, the model can attend only to positions up to $t$; all later positions are masked out."
Piglet understood: "This guarantees sequential generation? No 'cheating' by looking ahead?"
"Exactly. Masked attention is the foundation of sequence generation. It's what enables Transformers to be used for machine translation, text generation, speech synthesis — tasks that require sequential generation."
Encoder-Decoder Attention: The Art of Alignment
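Mr. Pallas's Cat drew a diagram of cross-attention.
"Encoder-decoder attention realizes source-to-target alignment," he explained. "The decoder's query $Q$ comes from the decoder's previous sublayer, while the keys $K$ and values $V$ come from the encoder's output."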
Little Seal added: "This simulates the human translation process — looking at the source sentence and thinking about how to express it in the target language."
"Well said," Mr. Pallas's Cat said. "This attention mechanism allows the model to dynamically align different parts of the source and target languages, even when the sentence lengths differ — it can handle translation gracefully."
Orthogonal Computation Graph: Seeing the Transformer's Information Flow
Mr. Pallas's Cat turned on the projector, and a tidy computation graph appeared on the screen.
"This is the orthogonal computation graph of a Transformer encoder layer," Mr. Pallas's Cat said, pointing at the diagram. "We can see three paths of information flow: forward propagation, residual connections, and normalization stabilization."
Piglet studied the information flow in the diagram carefully: "Input
"Yes," Mr. Pallas's Cat explained. "The residual connection
Little Seal mused: "This computation flow seems to embody a 'transform-preserve-stabilize' cycle? Does every sublayer follow this pattern?"
"Very observant," Mr. Pallas's Cat said. "The Transformer's design philosophy can be summarized as: transform boldly, preserve carefully, stabilize always. The attention layer performs bold global information exchange, the residual connection carefully preserves the original information, and layer normalization consistently stabilizes the training process."
Positional Encoding: A Sense of Position in Sequences
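Mr. Pallas's Cat wrote the sinusoidal positional encoding formulas on the whiteboard:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
"Positional encoding gives each position absolute position information," he explained. "It is needed because the attention mechanism itself is position-agnostic: it looks only at content similarity, not at position."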
Piglet thought: "So we need to additionally tell the model 'which position is this'? Otherwise 'I like you' and 'You like I' might be seen as the same?"
"Exactly," Mr. Pallas's Cat smiled. "Positional encoding lets the model distinguish order. Interestingly, sinusoidal encoding also has a relative position property: the encoding for position $pos + k$ can be written as a linear function of the encoding for position $pos$."
Little Seal looked up from his mathematics book: "This provides the ability for positional extrapolation? The model can handle sequences longer than those seen during training?"
"Theoretically yes," Mr. Pallas's Cat said. "But in practice, long-sequence extrapolation remains a challenge. Modern research is exploring better positional encoding methods."
Mental Model: The Wisdom of Modular Systems
Little Seal took a software engineering book from the shelf. "Professor, this reminds me of the 'modular design' principle in software engineering."
"An excellent connection," Mr. Pallas's Cat said. "The Transformer embodies multiple principles of excellent system design."
He wrote on the whiteboard:
Mental Model: Transformer Design Principles
- Modularity: each component (attention, feedforward, normalization) has a clear function and interface
- Hierarchy: stacking enables feature extraction from simple to complex
- Information preservation: residual connections ensure information isn't lost and gradients can propagate
- Training stability: layer normalization and proper initialization make deep networks trainable
- Parallel efficiency: attention mechanisms support large-scale parallel computation
"These five principles," Mr. Pallas's Cat explained, "are not only the secret to the Transformer's success, but also universal wisdom for excellent system design."
Piglet pondered: "So the Transformer is not just an AI model, but also a paradigm for system design? Its ideas can be applied to other fields?"
"In a sense, yes," Mr. Pallas's Cat answered. "Modularity, hierarchy, and shortcut paths are principles the Transformer shares with computer architecture, compiler design, software engineering, and many other fields."
"Attention Is All You Need"? A Reconsideration
Mr. Pallas's Cat wrote the paper title on the whiteboard, then drew question marks beside it.
"The title is rhetorical," he said. "In reality, the Transformer needs much more: positional encoding, feedforward networks, residual connections, layer normalization, proper initialization, massive data, powerful computation..."
Little Seal added: "But the title captures the essence: the attention mechanism is the core innovation. The other components are the 'infrastructure' that enables attention to work effectively."
"Yes," Mr. Pallas's Cat said. "The Transformer's lesson is: core innovations need supporting infrastructure. Great ideas require carefully designed environments to realize their power."
Key Takeaways
Mr. Pallas's Cat's Summary: The Wisdom of the Encoder-Decoder Stack
- System design philosophy: the Transformer is not a single algorithm but an organized stacking of components — embodying the systems thinking that "the whole is greater than the sum of its parts"
- The encoder's path of understanding: through alternating layers of attention and feedforward networks, gradually extracting hierarchical representations — achieving the leap from local features to global semantics
- The decoder's way of generation: combining masked self-attention (causal constraint), encoder-decoder attention (source-target alignment), and feedforward networks — realizing autoregressive sequence generation
- Training stabilization design: residual connections maintain information and gradient flow; layer normalization stabilizes activation distributions — together enabling deep stacking
- Modular universal architecture: the Transformer demonstrates principles of modularity, hierarchy, and standardization that transcend AI — becoming universal wisdom for complex system design
Code Practice: Complete Transformer Implementation in Python
"Let's implement the Transformer in Python code," Mr. Pallas's Cat said. "We will go from basic components to a simplified, encoder-only model, and finally demonstrate it on a simple task."
Basic Transformer Component Implementation
import numpy as np
import matplotlib.pyplot as plt


class LayerNormalization:
    """Layer Normalization implementation"""

    def __init__(self, d_model, eps=1e-6):
        """Initialize layer normalization

        Args:
            d_model: feature dimension
            eps: numerical stability constant
        """
        self.gamma = np.ones((1, d_model))  # scaling parameter
        self.beta = np.zeros((1, d_model))  # shifting parameter
        self.eps = eps

    def forward(self, x):
        """Forward propagation

        Args:
            x: input (batch_size, seq_len, d_model)
        Returns:
            normalized output
        """
        # compute mean and variance (along the last dimension)
        mean = np.mean(x, axis=-1, keepdims=True)
        variance = np.var(x, axis=-1, keepdims=True)
        # normalize
        x_normalized = (x - mean) / np.sqrt(variance + self.eps)
        # scale and shift
        output = self.gamma * x_normalized + self.beta
        return output

class FeedForwardNetwork:
    """Feedforward Network (two linear transforms + activation)"""

    def __init__(self, d_model, d_ff):
        """Initialize feedforward network

        Args:
            d_model: input/output dimension
            d_ff: hidden layer dimension (typically 4*d_model)
        """
        self.W1 = np.random.randn(d_model, d_ff) * 0.01
        self.b1 = np.zeros((1, d_ff))
        self.W2 = np.random.randn(d_ff, d_model) * 0.01
        self.b2 = np.zeros((1, d_model))

    def relu(self, x):
        """ReLU activation function"""
        return np.maximum(0, x)

    def forward(self, x):
        """Forward propagation"""
        # first linear transform + ReLU
        h = np.matmul(x, self.W1) + self.b1
        h = self.relu(h)
        # second linear transform
        output = np.matmul(h, self.W2) + self.b2
        return output

class MultiHeadAttention:
    """Multi-Head Attention (simplified, based on Chapter 14 implementation)"""

    def __init__(self, d_model, num_heads):
        """Initialize multi-head attention"""
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads
        # weight matrices
        self.W_q = np.random.randn(d_model, d_model) * 0.01
        self.W_k = np.random.randn(d_model, d_model) * 0.01
        self.W_v = np.random.randn(d_model, d_model) * 0.01
        self.W_o = np.random.randn(d_model, d_model) * 0.01

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Scaled dot-product attention"""
        d_k = Q.shape[-1]
        # compute attention scores
        scores = np.matmul(Q, K.swapaxes(-1, -2))  # dot product
        scores = scores / np.sqrt(d_k)  # scale
        # apply mask (if provided)
        if mask is not None:
            scores = scores + (mask * -1e9)
        # softmax to get attention weights
        attention_weights = self.softmax(scores, axis=-1)
        # weighted value vectors
        output = np.matmul(attention_weights, V)
        return output, attention_weights

    def softmax(self, x, axis=-1):
        """Numerically stable softmax implementation"""
        x_exp = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return x_exp / np.sum(x_exp, axis=axis, keepdims=True)

    def split_heads(self, x, batch_size):
        """Split into multiple heads"""
        # reshape: (batch_size, seq_len, d_model) -> (batch_size, seq_len, num_heads, depth)
        x = x.reshape(batch_size, -1, self.num_heads, self.depth)
        # transpose: (batch_size, num_heads, seq_len, depth)
        return x.transpose(0, 2, 1, 3)

    def combine_heads(self, x, batch_size):
        """Combine multiple heads"""
        # transpose back: (batch_size, seq_len, num_heads, depth)
        x = x.transpose(0, 2, 1, 3)
        # reshape: (batch_size, seq_len, d_model)
        return x.reshape(batch_size, -1, self.d_model)

    def forward(self, Q, K, V, mask=None):
        """Forward propagation"""
        batch_size = Q.shape[0]
        # linear transforms
        Q = np.matmul(Q, self.W_q)
        K = np.matmul(K, self.W_k)
        V = np.matmul(V, self.W_v)
        # split heads
        Q = self.split_heads(Q, batch_size)
        K = self.split_heads(K, batch_size)
        V = self.split_heads(V, batch_size)
        # scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # combine heads
        scaled_attention = self.combine_heads(scaled_attention, batch_size)
        # output linear transform
        output = np.matmul(scaled_attention, self.W_o)
        return output, attention_weights

class PositionalEncoding:
    """Positional Encoding (sinusoidal)"""

    def __init__(self, d_model, max_seq_len=5000):
        """Initialize positional encoding"""
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        # pre-compute positional encoding matrix
        self.pe = self.create_positional_encoding(max_seq_len, d_model)

    def create_positional_encoding(self, max_seq_len, d_model):
        """Create positional encoding matrix"""
        pe = np.zeros((max_seq_len, d_model))
        for pos in range(max_seq_len):
            # i steps over the even column indices, so it already plays the
            # role of "2i" in the formula and the exponent is i / d_model
            for i in range(0, d_model, 2):
                angle = pos / (10000 ** (i / d_model))
                pe[pos, i] = np.sin(angle)
                if i + 1 < d_model:
                    pe[pos, i + 1] = np.cos(angle)
        return pe

    def forward(self, x):
        """Add positional encoding to input"""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len]

# Basic component demo
print("Transformer Basic Component Demo:")
print("=" * 60)
# Test data
batch_size = 2
seq_len = 10
d_model = 64
x_test = np.random.randn(batch_size, seq_len, d_model)
print(f"Test data shape: {x_test.shape}")
# Test layer normalization
print("\n1. Layer Normalization Test:")
layer_norm = LayerNormalization(d_model=d_model)
x_norm = layer_norm.forward(x_test)
print(f" Input range: [{x_test.min():.3f}, {x_test.max():.3f}]")
print(f" Normalized range: [{x_norm.min():.3f}, {x_norm.max():.3f}]")
print(f" Normalized mean: {x_norm.mean():.6f} (close to 0)")
print(f" Normalized variance: {x_norm.var():.6f} (close to 1)")
# Test feedforward network
print("\n2. Feedforward Network Test:")
d_ff = 4 * d_model # typical setting
ffn = FeedForwardNetwork(d_model=d_model, d_ff=d_ff)
x_ffn = ffn.forward(x_test)
print(f" FFN output shape: {x_ffn.shape}")
print(f" Parameter count: {ffn.W1.size + ffn.b1.size + ffn.W2.size + ffn.b2.size}")
# Test multi-head attention
print("\n3. Multi-Head Attention Test:")
num_heads = 8
mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
# Self-attention test
output_mha, attn_weights = mha.forward(x_test, x_test, x_test)
print(f" Multi-head attention output shape: {output_mha.shape}")
print(f" Attention weights shape: {attn_weights.shape}")
# Test positional encoding
print("\n4. Positional Encoding Test:")
pos_enc = PositionalEncoding(d_model=d_model, max_seq_len=100)
x_with_pos = pos_enc.forward(x_test)
print(f" Shape after adding positional encoding: {x_with_pos.shape}")
# Positional encoding visualization
plt.figure(figsize=(12, 6))
plt.imshow(pos_enc.pe[:50].T, cmap='RdBu', aspect='auto')
plt.colorbar(label='Positional Encoding Value')
plt.xlabel('Position Index')
plt.ylabel('Dimension')
plt.title('Sinusoidal Positional Encoding (first 50 positions)')
plt.savefig('/tmp/positional_encoding_visualization.png', dpi=150, bbox_inches='tight')
plt.close()
print(f" Positional encoding visualization saved to /tmp/positional_encoding_visualization.png")
Transformer Encoder Layer Implementation
class TransformerEncoderLayer:
    """Transformer Encoder Layer"""

    def __init__(self, d_model, num_heads, d_ff):
        """Initialize encoder layer"""
        # Sublayer 1: multi-head attention + residual + layer norm
        self.multi_head_attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = LayerNormalization(d_model)
        # Sublayer 2: feedforward network + residual + layer norm
        self.feed_forward = FeedForwardNetwork(d_model, d_ff)
        self.norm2 = LayerNormalization(d_model)

    def forward(self, x, mask=None):
        """Forward propagation

        Args:
            x: input (batch_size, seq_len, d_model)
            mask: attention mask (optional)
        Returns:
            encoder layer output
        """
        # Sublayer 1: multi-head attention + residual + layer norm
        attn_output, attn_weights = self.multi_head_attention.forward(x, x, x, mask)
        # residual connection + layer normalization
        x = self.norm1.forward(x + attn_output)
        # Sublayer 2: feedforward network + residual + layer norm
        ff_output = self.feed_forward.forward(x)
        x = self.norm2.forward(x + ff_output)
        return x, attn_weights


class TransformerEncoder:
    """Transformer Encoder (stack of multiple encoder layers)"""

    def __init__(self, num_layers, d_model, num_heads, d_ff):
        """Initialize encoder"""
        self.layers = []
        for _ in range(num_layers):
            layer = TransformerEncoderLayer(d_model, num_heads, d_ff)
            self.layers.append(layer)
        self.num_layers = num_layers

    def forward(self, x, mask=None):
        """Forward propagation"""
        all_attention_weights = []
        for layer in self.layers:
            x, attn_weights = layer.forward(x, mask)
            all_attention_weights.append(attn_weights)
        return x, all_attention_weights

# Encoder demo
print("\nTransformer Encoder Demo:")
print("=" * 60)
# Create encoder
num_layers = 3
d_model = 64
num_heads = 8
d_ff = 4 * d_model
encoder = TransformerEncoder(num_layers=num_layers, d_model=d_model,
num_heads=num_heads, d_ff=d_ff)
print(f"Encoder Configuration:")
print(f" Layers: {num_layers}")
print(f" Model dimension: {d_model}")
print(f" Attention heads: {num_heads}")
print(f" Feedforward dimension: {d_ff}")
# Forward pass test
encoder_output, all_attn_weights = encoder.forward(x_test)
print(f"\nEncoder output shape: {encoder_output.shape}")
print(f"Attention weights count (per layer): {len(all_attn_weights)}")
print(f"Per-layer attention weights shape: {all_attn_weights[0].shape}")
# Visualize attention patterns across different layers
def visualize_encoder_attention(all_attn_weights, sample_idx=0, head_idx=0):
    """Visualize attention patterns across encoder layers"""
    num_layers = len(all_attn_weights)
    fig, axes = plt.subplots(1, num_layers, figsize=(5*num_layers, 5))
    for layer_idx in range(num_layers):
        ax = axes[layer_idx] if num_layers > 1 else axes
        # get attention weights for this layer
        layer_weights = all_attn_weights[layer_idx][sample_idx, head_idx]
        im = ax.imshow(layer_weights, cmap='viridis', aspect='auto')
        ax.set_xlabel('Key Position')
        ax.set_ylabel('Query Position')
        ax.set_title(f'Layer {layer_idx+1}, Head {head_idx+1}')
        # add colorbar
        plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    plt.suptitle('Transformer Encoder Attention Patterns Across Layers', fontsize=14)
    plt.tight_layout()
    plt.savefig('/tmp/encoder_attention_patterns.png', dpi=150, bbox_inches='tight')
    plt.close()
    print(f"Encoder attention pattern visualization saved to /tmp/encoder_attention_patterns.png")

# Run visualization
visualize_encoder_attention(all_attn_weights, sample_idx=0, head_idx=0)
# Analyze information change across layers
print("\nInformation Change Analysis Across Encoder Layers:")
print("=" * 60)
# Compute output differences across layers
layer_outputs = []
# Simulate layer-by-layer processing (for demonstration)
current_x = x_test.copy()
for layer_idx, layer in enumerate(encoder.layers):
    current_x, _ = layer.forward(current_x)
    layer_outputs.append(current_x.copy())
    # compute difference from input
    diff_norm = np.linalg.norm(current_x - x_test)
    print(f" Layer {layer_idx+1}: output-input difference = {diff_norm:.4f}")
# Compute inter-layer change pattern
print(f"\nInter-layer change pattern:")
for i in range(num_layers-1):
    layer_diff = np.linalg.norm(layer_outputs[i+1] - layer_outputs[i])
    print(f" Layer {i+1} → Layer {i+2}: change = {layer_diff:.4f}")
Complete Transformer Implementation (Simplified)
class Transformer:
    """Complete Transformer model (simplified, encoder-only)"""

    def __init__(self, vocab_size, d_model, num_heads, d_ff, num_layers, max_seq_len):
        """Initialize Transformer"""
        self.vocab_size = vocab_size
        self.d_model = d_model
        # word embedding layer
        self.embedding = np.random.randn(vocab_size, d_model) * 0.01
        # positional encoding
        self.positional_encoding = PositionalEncoding(d_model, max_seq_len)
        # encoder
        self.encoder = TransformerEncoder(num_layers, d_model, num_heads, d_ff)
        # output layer (for classification tasks, etc.)
        self.output_layer = np.random.randn(d_model, vocab_size) * 0.01
        self.output_bias = np.zeros((1, vocab_size))

    def forward(self, input_ids, attention_mask=None):
        """Forward propagation"""
        # word embedding: look up one row per token id
        embedded = self.embedding[input_ids]  # (batch_size, seq_len, d_model)
        # add positional encoding
        embedded = self.positional_encoding.forward(embedded)
        # encoder processing
        encoded, all_attn_weights = self.encoder.forward(embedded, attention_mask)
        # output layer (using mean pooling for classification)
        pooled = np.mean(encoded, axis=1)  # (batch_size, d_model)
        logits = np.matmul(pooled, self.output_layer) + self.output_bias
        return logits, encoded, all_attn_weights

# Complete Transformer demo
print("\nComplete Transformer Model Demo:")
print("=" * 60)
# Create Transformer model
vocab_size = 1000
d_model = 128
num_heads = 8
d_ff = 4 * d_model
num_layers = 4
max_seq_len = 50
transformer = Transformer(
    vocab_size=vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    d_ff=d_ff,
    num_layers=num_layers,
    max_seq_len=max_seq_len
)
print(f"Transformer Configuration:")
print(f" Vocabulary size: {vocab_size}")
print(f" Model dimension: {d_model}")
print(f" Attention heads: {num_heads}")
print(f" Feedforward dimension: {d_ff}")
print(f" Encoder layers: {num_layers}")
print(f" Max sequence length: {max_seq_len}")
# Estimate parameter count (weight matrices only; biases and layer-norm parameters are omitted)
embedding_params = vocab_size * d_model
attention_params = 4 * d_model * d_model * num_layers  # W_q, W_k, W_v, W_o per layer
ffn_params = (d_model * d_ff + d_ff * d_model) * num_layers  # W1 and W2 per layer
output_params = d_model * vocab_size
total_params = embedding_params + attention_params + ffn_params + output_params
print(f"\nParameter Count Estimate:")
print(f" Embedding: {embedding_params:,}")
print(f" Attention layers: {attention_params:,}")
print(f" Feedforward networks: {ffn_params:,}")
print(f" Output layer: {output_params:,}")
print(f" Total: {total_params:,}")
# Simulate input
batch_size = 3
seq_len = 15
# Generate random token IDs (simulating text)
input_ids = np.random.randint(0, vocab_size-10, (batch_size, seq_len))
print(f"\nInput Data:")
print(f" input_ids shape: {input_ids.shape}")
print(f" Example input (first sample): {input_ids[0]}")
# Forward pass
logits, encoded, all_attn_weights = transformer.forward(input_ids)
print(f"\nOutput Results:")
print(f" logits shape: {logits.shape}")
print(f" Encoded output shape: {encoded.shape}")
print(f" Attention weights count: {len(all_attn_weights)}")
# Task demo: simple sequence classification
print(f"\nTransformer Demo on a Simple Task:")
print("=" * 60)
# Create a simple pattern recognition task
def create_pattern_task(num_samples, seq_len, vocab_size):
    """Create pattern recognition task data"""
    X = np.zeros((num_samples, seq_len), dtype=int)
    y = np.zeros((num_samples,), dtype=int)
    for i in range(num_samples):
        # generate random sequence
        sequence = np.random.randint(0, vocab_size-5, seq_len)
        # task: if sequence contains a specific pattern (e.g., three consecutive increasing numbers), classify as 1
        has_pattern = 0
        for j in range(seq_len - 2):
            if (sequence[j] + 1 == sequence[j+1] and
                    sequence[j+1] + 1 == sequence[j+2]):
                has_pattern = 1
                break
        X[i] = sequence
        y[i] = has_pattern
    return X, y

# Generate data
num_samples = 100
seq_len = 10
vocab_size = 50
X_task, y_task = create_pattern_task(num_samples, seq_len, vocab_size)
print(f"Task data: {num_samples} samples, sequence length {seq_len}, vocabulary size {vocab_size}")
print(f"Positive class ratio: {np.mean(y_task):.1%}")
# Create a Transformer suited for the task
task_transformer = Transformer(
    vocab_size=vocab_size,
    d_model=32,  # smaller dimension
    num_heads=4,
    d_ff=128,
    num_layers=2,
    max_seq_len=seq_len
)
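# Training demo (simplified)
def train_transformer_demo(model, X, y, epochs=20, lr=0.01):
    """Training demo for Transformer (simplified).

    Only the output layer is updated by gradient descent. The encoder keeps
    its random initial weights, because this NumPy sketch does not implement
    backpropagation through the attention stack. The first two logits are
    used as the scores for class 0 and class 1.
    """
    losses = []
    accuracies = []
    for epoch in range(epochs):
        batch_losses = []
        batch_accs = []
        # mini-batch training (simplified, should shuffle in practice)
        for i in range(0, len(X), 10):
            batch_X = X[i:i+10]
            batch_y = y[i:i+10]
            # forward pass
            logits, encoded, _ = model.forward(batch_X)
            class_logits = logits[:, :2]  # binary task: keep two output units
            # convert logits to probabilities (softmax over the two classes)
            exp_logits = np.exp(class_logits - np.max(class_logits, axis=1, keepdims=True))
            probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
            # binary cross-entropy loss
            loss = -np.mean(batch_y * np.log(probs[:, 1] + 1e-8) +
                            (1 - batch_y) * np.log(probs[:, 0] + 1e-8))
            # compute accuracy
            predictions = np.argmax(class_logits, axis=1)
            accuracy = np.mean(predictions == batch_y)
            # gradient step on the output layer only
            pooled = np.mean(encoded, axis=1)  # same pooling as in forward()
            grad_logits = probs.copy()
            grad_logits[np.arange(len(batch_y)), batch_y] -= 1.0
            grad_logits /= len(batch_y)
            model.output_layer[:, :2] -= lr * (pooled.T @ grad_logits)
            model.output_bias[:, :2] -= lr * grad_logits.sum(axis=0, keepdims=True)
            batch_losses.append(loss)
            batch_accs.append(accuracy)
        epoch_loss = np.mean(batch_losses)
        epoch_acc = np.mean(batch_accs)
        losses.append(epoch_loss)
        accuracies.append(epoch_acc)
        if epoch % 5 == 0:
            print(f" Epoch {epoch}: loss={epoch_loss:.4f}, accuracy={epoch_acc:.2%}")
    return losses, accuracies
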
print(f"\nStarting training demo (simplified)...")
losses, accuracies = train_transformer_demo(task_transformer, X_task, y_task, epochs=30)
# Visualize training process
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(losses, 'b-', linewidth=2)
plt.xlabel('Training Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.title('Transformer Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
plt.plot(accuracies, 'g-', linewidth=2)
plt.xlabel('Training Epoch')
plt.ylabel('Accuracy')
plt.title('Transformer Training Accuracy Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('/tmp/transformer_training_demo.png', dpi=150, bbox_inches='tight')
plt.close()
print(f"\nTraining demo complete!")
print(f" Final loss: {losses[-1]:.4f}")
print(f" Final accuracy: {accuracies[-1]:.2%}")
print(f" Training curves saved to /tmp/transformer_training_demo.png")
# Test model performance on unseen data
X_test, y_test = create_pattern_task(50, seq_len, vocab_size)
logits_test, _, _ = task_transformer.forward(X_test)
predictions_test = np.argmax(logits_test, axis=1)
test_accuracy = np.mean(predictions_test == y_test)
print(f"\nTest Set Performance:")
print(f" Test samples: {len(X_test)}")
print(f" Test accuracy: {test_accuracy:.2%}")
"Remember," Mr. Pallas's Cat summarized, "the Transformer is a model of deep learning system design: it organizes simple components (attention, feedforward, normalization) through carefully designed structures (residual connections, layer stacking, positional encoding) into a powerful system. The most important thing is not the innovation of any single component, but the wisdom of the overall architecture. In this architecture, we see the perfect embodiment of universal design principles: modularity, hierarchy, standardization, parallelization. It reminds us that true breakthroughs are often not leaps in individual technologies, but triumphs of systems thinking."
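One caveat about the training demo in this section: its loop computes the loss and accuracy but never updates any parameters, so its curves cannot improve on their own. The sketch below shows what a single gradient-descent update could look like for the simplest trainable piece, a linear softmax classification head. Everything here is an assumption made for illustration: `sgd_step`, `W`, `b`, and the random `feats` (a stand-in for pooled encoder outputs) are hypothetical names, not part of the chapter's `Transformer` class, and backpropagating through the attention and feedforward layers themselves is not shown.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def sgd_step(W, b, feats, labels, lr=0.1):
    """One gradient-descent step for a linear softmax head (hypothetical helper).

    feats:  (batch, d) float features
    labels: (batch,) integer class ids in {0, 1}
    Returns updated W, updated b, and the loss measured before the update.
    """
    n = len(labels)
    probs = softmax(feats @ W + b)                        # (batch, 2)
    loss = -np.mean(np.log(probs[np.arange(n), labels] + 1e-8))
    # Gradient of mean cross-entropy w.r.t. the logits: (probs - one_hot) / n
    grad_logits = probs.copy()
    grad_logits[np.arange(n), labels] -= 1.0
    grad_logits /= n
    W_new = W - lr * (feats.T @ grad_logits)
    b_new = b - lr * grad_logits.sum(axis=0)
    return W_new, b_new, loss

# Toy data: assumed stand-in for features produced by an encoder
rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 8))
labels = (feats[:, 0] > 0).astype(int)   # linearly separable toy labels

W, b = np.zeros((8, 2)), np.zeros(2)
history = []
for _ in range(50):
    W, b, loss = sgd_step(W, b, feats, labels)
    history.append(loss)

print(f"loss before: {history[0]:.4f}, after 50 steps: {history[-1]:.4f}")
```

The key difference from the demo loop is the pair of lines that subtract `lr` times the gradient from `W` and `b`; without an update of this kind, the learning rate has no effect.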
Mr. Pallas's Cat's Thought Questions
Practical Exploration (for Piglet)
- Transformer variant implementation: Implement different Transformer variants (e.g., Performer, Linformer, Reformer). Compare their computational efficiency and model performance.
- Positional encoding experiments: Implement different positional encoding methods (learnable positional encoding, relative positional encoding, rotary positional encoding). Compare their impact on model performance.
- Model compression experiments: Apply pruning, quantization, and distillation to a trained Transformer. How can we compress the model while maintaining performance?
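As a possible starting point for the positional encoding experiment, here is a minimal NumPy sketch of two of the schemes: the fixed sinusoidal encoding and a rotary-style encoding. The function names (`sinusoidal_encoding`, `apply_rotary`, `rotary_score`) are hypothetical, the base of 10000 follows the common convention, and `d_model` is assumed to be even. This is a sketch for experimentation, not the chapter's implementation.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Fixed sinusoidal positional encoding, shape (max_len, d_model)."""
    positions = np.arange(max_len)[:, None]                        # (max_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))  # (d_model/2,)
    angles = positions * freqs                                     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even feature indices
    pe[:, 1::2] = np.cos(angles)   # odd feature indices
    return pe

def apply_rotary(x, positions):
    """Rotate each consecutive feature pair of x by a position-dependent angle.

    x:         (seq_len, d) float array, d even
    positions: (seq_len,) array of position indices
    """
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    angles = positions[:, None] * freqs                # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rotary_score(q, k, m, n):
    """Dot product of q placed at position m and k placed at position n."""
    q_rot = apply_rotary(q, np.array([m]))
    k_rot = apply_rotary(k, np.array([n]))
    return np.sum(q_rot * k_rot)

pe = sinusoidal_encoding(max_len=10, d_model=32)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
s_near = rotary_score(q, k, 3, 1)   # offset of 2
s_far = rotary_score(q, k, 7, 5)    # same offset, shifted positions
print(np.isclose(s_near, s_far))
```

The sinusoidal table is added to the token embeddings, while the rotary variant is applied to queries and keys inside attention. The final check illustrates the property that makes rotary encoding a relative scheme: the query-key score depends only on the offset between the two positions, not on where they sit in the sequence.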
Historical Investigation (for Little Seal)
- Predecessors and evolution of the Transformer: Research the mainstream sequence models before the Transformer (RNN, LSTM, GRU, CNN-seq). How did the Transformer absorb their strengths?
- Cross-domain transfer of the Transformer: Investigate how the Transformer migrated from NLP to CV (Vision Transformer), speech, bioinformatics, and other fields. What does this transfer reveal?
- Impact of the open-source ecosystem: Research how open-source communities like Hugging Face accelerated the adoption and application of Transformers. What impact does open source have on AI development?
Integrated Thinking
- Philosophical reflection: Does the Transformer's "Attention Is All You Need" imply a certain epistemology — that understanding is about establishing relationships? How does this differ from the traditional "representation-reasoning" paradigm?
- Ethical challenges: Transformer models require massive data and computing power, which may exacerbate resource inequality. How can we make Transformer technology more accessible and equitable?
- Creative exercise: Design a "self-explaining Transformer" that can explain its own attention patterns and decision-making processes. How would you design it?
- Ultimate challenge: Prove that the Transformer is Turing-complete (theoretically capable of simulating any Turing machine). What conditions are required? What does this say about the Transformer's capabilities?
Part 3 Summary: The Emergence of Neural Networks
The fragrance of tea filled the Black Stone House; the spring sunlight was warm and tranquil.
"We spent six chapters completing the full journey of neural networks," said Mr. Pallas's Cat. "From the simplest neuron to the most complex Transformer. Let's review this journey."
Piglet opened his notebook: "We started with neurons in Chapter 10 — the simplest perceptual unit, learning weighted summation and activation functions."
Little Seal added: "Chapter 11 explored backpropagation — how error becomes the ladder of progress, how gradient descent guides the direction of learning."
"Chapter 12 introduced the temporal dimension," Piglet continued. "LSTM's chains of memory, achieving selective memory through gating mechanisms."
Little Seal flipped through his notes: "Chapter 13 compared two philosophies of memory — LSTM's incremental memory vs. attention's direct access. The contest between forgetting and causality."
"Chapter 14 dove deep into attention mechanisms," Piglet said. "In this noisy world, where should we look? Scaled dot-product, multi-head division of labor, positional encoding..."
Mr. Pallas's Cat smiled: "Finally, Chapter 15 organized all the components — the Transformer encoder-decoder stack. From components to systems, from simple to complex."
"These six chapters demonstrate the emergence of intelligence," Mr. Pallas's Cat summarized. "From simple computational units, through connection, learning, memory, attention, and organization, powerful intelligent systems ultimately emerge. The core insight of this process is:"
Core Insights of Part 3
- Simplicity generates complexity: complex intelligent behavior can emerge from simple computational units (neurons) through connection and organization
- Learning arises from error: backpropagation transforms error into the ladder of progress, embodying the fundamental principle of trial-and-error learning
- Memory requires choice: LSTM's gating mechanism demonstrates the wisdom of selective memory — remember what's important, forget what's irrelevant
- Attention is selection: attention mechanisms realize the mathematical form of information selection — the "eyes" of intelligence
- Organization creates capability: the Transformer shows how modular, hierarchical organization of simple components creates powerful systems
- Depth requires stability: residual connections, layer normalization and other techniques make deep networks trainable — embodying engineering wisdom
"Most importantly," Mr. Pallas's Cat said, "the journey of neural networks teaches us: intelligence is not a mysterious black box, but a complex system that is understandable, constructible, and improvable. From mathematical formulas to code implementations, from theoretical principles to practical applications — every step is a manifestation of human wisdom."
Coming Up Next: The Path to the Reasoning Kingdom
"But do neural networks truly 'reason'?" Piglet asked. "Or are they just imitating statistical patterns?"
"That is exactly what Part 4 will explore," Mr. Pallas's Cat explained. "Having completed the technical journey of neural networks, we must return to the fundamental question: what is true reasoning?"
Little Seal said eagerly: "Part 4: The Path to the Reasoning Kingdom. We'll explore the myths of LLMs, the reasoning scientist's toolbox, and a letter to you beyond age 20."
Mr. Pallas's Cat smiled. "We'll take it slow. See you in the next part."
Piglet's note: I implemented a complete Transformer! Though simplified, it includes all core components. Most striking was seeing how attention patterns genuinely differ across layers — lower layers attend locally, higher layers attend globally. The Transformer truly is like a hierarchical understanding system.
Little Seal's note: I researched the history and impact of the Transformer, struck by its cross-domain transfer capability. From NLP to CV, from speech to protein structure prediction — the Transformer demonstrates the potential of a unified architecture. Most profound is its design philosophy: modularity, hierarchy, standardization — principles that transcend the AI field.
Mr. Pallas's Cat's closing words: Part 3's journey teaches us a profound lesson about the construction of intelligence: complexity arises from simplicity, capability arises from organization, intelligence arises from design. From neurons to Transformers, we see the perfect fusion of mathematics, engineering, and cognitive science. Most importantly, it reminds us: technical tools themselves possess no intelligence — intelligence lies in how we use these tools to understand the world, solve problems, and create value. On this path, understanding matters more than mere use; thinking matters more than memorization. We'll take it slow — understanding is what matters most.
