Neural Networks and Deep Learning

Preface

Neural networks are the engine of the AI revolution. From ChatGPT's language understanding to image recognition in autonomous driving, neural networks are what's working behind the scenes. It's not magic — it's an elegant mathematical framework that "learns" the mapping from input to output through large amounts of data. Understanding the fundamentals will help you use and debug AI tools more effectively.

What will you learn in this article?

After reading this article, you'll gain:

  • Core concepts: Understand the basic principles of neurons, layers, forward propagation, and backpropagation
  • Network types: Learn the characteristics and suitable use cases of mainstream architectures like CNN, RNN, and Transformer
  • Training process: Understand how models "learn" from data
  • Key techniques: Master practical concepts like overfitting, learning rate, and regularization
  • Development history: Understand the evolution from the Perceptron to large language models

| Chapter | Content | Core Concepts |
|---|---|---|
| Chapter 1 | From Neuron to Network | Perceptron, activation functions, forward propagation |
| Chapter 2 | How Networks Learn | Loss functions, gradient descent, backpropagation |
| Chapter 3 | Mainstream Network Architectures | CNN, RNN, Transformer |
| Chapter 4 | The Art of Training | Overfitting, regularization, hyperparameter tuning |
| Chapter 5 | Development History and Frontiers | From Perceptron to GPT |

1. From Neuron to Network

A Single Neuron

The smallest unit of a neural network is the neuron. It is loosely modeled on how biological neurons work: it receives multiple input signals, computes a weighted sum, and produces an output through an activation function.

Input x1 ──→ ×w1 ──┐
Input x2 ──→ ×w2 ──┼──→ Σ(weighted sum) + b(bias) ──→ f(activation function) ──→ Output
Input x3 ──→ ×w3 ──┘

Mathematical expression: y = f(w₁x₁ + w₂x₂ + w₃x₃ + b)

How a Neuron Works

Worked example with three inputs, a bias of 0.1, and a Sigmoid activation:

| Input | Weight | Input × Weight |
|---|---|---|
| 0.5 | 0.8 | 0.40 |
| -0.3 | 1.2 | -0.36 |
| 0.7 | -0.5 | -0.35 |

Weighted sum + bias (0.1): 0.40 - 0.36 - 0.35 + 0.1 = -0.21
Activation (Sigmoid): 0.4477
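
The same computation can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function names `sigmoid` and `neuron` are chosen here for the example, and the numbers are the inputs, weights, and bias from the example above.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return z, sigmoid(z)

# Inputs, weights, and bias taken from the example above.
z, y = neuron([0.5, -0.3, 0.7], [0.8, 1.2, -0.5], bias=0.1)
print(round(z, 2), round(y, 4))  # prints: -0.21 0.4477
```

Changing any single weight and re-running shows how the output shifts, which is the knob that training turns.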

Activation Functions: Why Nonlinearity Matters

Without activation functions, no matter how many layers of neurons you stack, the result is always equivalent to a single linear transformation (one matrix multiplication plus a bias). Activation functions introduce nonlinearity, enabling the network to learn complex patterns.
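
To see why stacked linear layers collapse, consider two layers with no activation between them. The symbols W_1, b_1, W_2, b_2 (weight matrices and bias vectors of the two layers) are notation introduced here for illustration:

```latex
\begin{aligned}
h &= W_1 x + b_1 \\
y &= W_2 h + b_2 \\
  &= W_2 (W_1 x + b_1) + b_2 \\
  &= (W_2 W_1)\,x + (W_2 b_1 + b_2)
\end{aligned}
```

The last line has the same form as a single layer with weights W_2 W_1 and bias W_2 b_1 + b_2, so the extra layer adds no expressive power until a nonlinear f is applied to h.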

| Activation Function | Formula | Characteristics | Common Use Cases |
|---|---|---|---|
| ReLU | max(0, x) | Simple, efficient, fast training | Default choice for hidden layers |
| Sigmoid | 1/(1+e⁻ˣ) | Output range 0 to 1 | Binary classification output layer |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | Output range -1 to 1 | Commonly used in RNNs |
| Softmax | eˣᵢ/Σeˣⱼ | Outputs a probability distribution | Multi-class classification output layer |
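
The four formulas in the table translate almost directly into code. The sketch below uses only the standard library; the max-subtraction in `softmax` is a common numerical-stability trick added here and is not part of the formula in the table.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    # Subtracting the largest value avoids overflow in exp();
    # it cancels out in the ratio, so the result is unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs], round(sum(probs), 3))
```

The softmax outputs are all positive and sum to 1, which is why they can be read as class probabilities.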

From Neuron to Network

Organize multiple neurons into layers, and connect multiple layers in sequence to form a neural network:

Input Layer      Hidden Layer 1        Hidden Layer 2        Output Layer
(features)       (extracts low-level   (extracts high-level   (prediction)
                  features)             features)

 x1 ──→  [○ ○ ○ ○] ──→ [○ ○ ○] ──→  [○ ○]
 x2 ──→  [○ ○ ○ ○] ──→ [○ ○ ○] ──→  Cat/Dog
 x3 ──→  [○ ○ ○ ○] ──→ [○ ○ ○]

| Concept | Description |
|---|---|
| Input Layer | Receives raw data (image pixels, text vectors, etc.) |
| Hidden Layer | Intermediate processing layers; more layers means a "deeper" network (the "deep" in deep learning) |
| Output Layer | Produces the final prediction (classification probabilities, regression values, etc.) |
| Forward Propagation | The process of data flowing from the input layer to the output layer, layer by layer |
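
Forward propagation through a small network can be sketched as repeated application of one fully connected layer. Everything below is an illustrative assumption: the helper `dense`, the layer sizes (3 inputs, 2 hidden neurons, 1 output), and the hand-picked weights, which are not trained values.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked parameters (not trained).
W1 = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]]  # hidden layer: 2 neurons, 3 inputs each
b1 = [0.0, 0.1]
W2 = [[0.7, -0.6]]                          # output layer: 1 neuron, 2 inputs
b2 = [0.05]

def forward(x):
    hidden = dense(x, W1, b1, relu)        # input layer -> hidden layer
    return dense(hidden, W2, b2, sigmoid)  # hidden layer -> output layer

prediction = forward([1.0, 0.5, -1.0])
print(prediction)
```

The single sigmoid output can be read as the probability of one class (for example "cat") in a binary classifier.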

Why is it called "Deep" Learning?

Shallow neural networks typically have only one or two hidden layers. When the number of hidden layers grows to dozens or even hundreds, the approach is called "deep" learning. Deeper networks can learn increasingly abstract features: in an image model, for example, early layers tend to learn edges, later layers learn textures and then object parts, and the deepest layers learn concepts such as "this is a cat."


2. How Networks Learn

The "learning" in neural networks is essentially an optimization problem: find a set of weights (w) and biases (b) such that the network's predictions are as close as possible to the correct answers.

The Three-Step Training Loop

1. Forward propagation: Feed input data, get predictions
2. Compute loss: Use a loss function to measure the gap between predictions and true values
3. Backpropagation: Compute the gradient of the loss with respect to each weight, then update the weights

Repeat the above steps until the loss is sufficiently small or stops improving.
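
The loop above can be sketched end to end on the simplest possible model: one linear neuron with one weight and one bias. The toy data (points on the line y = 2x + 1), the learning rate, and the number of epochs are all assumptions made for this illustration.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0        # parameters to learn, starting from zero
learning_rate = 0.05
n = len(data)

for epoch in range(500):
    grad_w = grad_b = 0.0
    loss = 0.0
    for x, target in data:
        pred = w * x + b               # 1. forward propagation
        error = pred - target
        loss += error ** 2 / n         # 2. loss: mean squared error
        grad_w += 2 * error * x / n    # 3. gradient of the loss w.r.t. w
        grad_b += 2 * error / n        #    and w.r.t. b
    w -= learning_rate * grad_w        # update: step against the gradient
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3), loss)
```

After enough repetitions the learned `w` and `b` end up very close to 2 and 1, the values used to generate the data.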

Loss Functions: Measuring "How Wrong You Are"

A loss function quantifies the gap between predicted and true values. The goal of training is to minimize the loss.

| Loss Function | Formula Summary | Use Case |
|---|---|---|
| MSE (Mean Squared Error) | Mean of squared differences between predictions and true values | Regression problems |
| Cross-Entropy | -Σ y·log(ŷ) | Classification problems |
| Binary Cross-Entropy | Binary version of cross-entropy | Binary classification problems |
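
A minimal sketch of the three losses in the table, using only the standard library. The function names and the small `eps` constant (added to avoid taking log(0)) are choices made for this example.

```python
import math

def mse(preds, targets):
    """Mean of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, one_hot, eps=1e-12):
    """-sum(y * log(y_hat)) for one sample with a one-hot target."""
    return -sum(t * math.log(p + eps) for p, t in zip(probs, one_hot))

def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for a single predicted probability p and a label y in {0, 1}."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

print(mse([1.0, 2.0], [1.0, 4.0]))                  # prints: 2.0
print(cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))    # roughly -log(0.7)
print(binary_cross_entropy(0.9, 1), binary_cross_entropy(0.1, 1))
```

A confident correct prediction (0.9 for a true label of 1) gives a much smaller loss than a confident wrong one (0.1), which is the behavior training relies on.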

Gradient Descent: Finding the Lowest Point

Imagine you're standing on a mountain, blindfolded, and need to walk to the lowest point. All you can do is feel the slope under your feet, then take a step downhill. This is gradient descent.

Loss

  │    ╱╲
  │   ╱  ╲      ← Current position
  │  ╱    ╲    ↙ Descend along gradient direction
  │ ╱      ╲╱   ← Local minimum
  │╱            ╲╱  ← Global minimum
  └──────────────→ Weight value

| Concept | Description |
|---|---|
| Gradient | The partial derivative of the loss function with respect to each weight, indicating "which direction to adjust to reduce the loss" |
| Learning Rate | How far to step each time. Too large and you'll overshoot the minimum; too small and convergence is too slow |
| Batch Size | How many samples to use for each gradient calculation. Full batch is too slow, single sample is too noisy, mini-batch is the compromise |
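
The effect of the learning rate is easy to see on a one-dimensional toy loss. The sketch below assumes loss(w) = w², whose gradient is 2w and whose minimum is at w = 0; the three learning rates are arbitrary values picked to show the three regimes.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise loss(w) = w**2 by repeatedly stepping against its gradient 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_small = gradient_descent(0.001)   # barely moves away from the start
reasonable = gradient_descent(0.1)    # gets close to the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```

With a step size of 1.1 each update jumps past the minimum and lands further away than before, so the value grows instead of shrinking.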

Backpropagation: The Triumph of the Chain Rule

Backpropagation is an efficient algorithm for calculating gradients. It leverages the chain rule from calculus, starting from the output layer and working backward layer by layer to compute each weight's contribution to the loss.

Forward propagation: Input → Hidden Layer 1 → Hidden Layer 2 → Output → Loss
Backpropagation:    Loss → Output → Hidden Layer 2 → Hidden Layer 1 → Update all weights

Intuition for Backpropagation

Think of a neural network as an assembly line. When a product (prediction) has a problem (large loss), you need to trace back from the final step, checking how much each step (each layer's weights) contributed to the problem, then adjust proportionally. Those that contributed more get adjusted more; those that contributed less get adjusted less.
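
For a single sigmoid neuron with a squared-error loss, the chain rule can be written out by hand and checked numerically. This is a sketch under those assumptions (one input, one weight, one bias); the function names and the sample values are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    return (sigmoid(w * x + b) - target) ** 2

def gradients(w, b, x, target):
    """Chain rule: dL/dw = dL/dy * dy/dz * dz/dw, and likewise for b."""
    z = w * x + b
    y = sigmoid(z)
    dL_dy = 2 * (y - target)   # derivative of the squared error
    dy_dz = y * (1 - y)        # derivative of the sigmoid
    return dL_dy * dy_dz * x, dL_dy * dy_dz   # dz/dw = x, dz/db = 1

w, b, x, target = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, target)

# Finite-difference check: nudge w slightly and measure the change in loss.
h = 1e-6
numeric_w = (loss(w + h, b, x, target) - loss(w - h, b, x, target)) / (2 * h)
print(grad_w, numeric_w)
```

The two numbers agree to several decimal places. In a multi-layer network the same product of local derivatives is extended one layer at a time, reusing the factors already computed for the layers closer to the output.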


3. Mainstream Network Architectures

Different types of data call for different network architectures. Matching the architecture to the structure of the data (images, sequences, text) makes the problem far easier to learn.

Common Neural Network Layer Types

Dense layer: Each neuron connects to every neuron in the previous layer. This is the most basic layer type and learns combinations of input features.

  • Parameters: units (number of neurons), activation
  • Typical uses: output layers for classification or regression, and simple feature extraction
  • Example: Dense(128, activation="relu")

3.1 CNN (Convolutional Neural Network)

CNN is the king of image processing. The core idea: slide small convolution kernels across the image to extract local features.

Input image → [Convolution → Activation → Pooling] × N → Fully Connected Layer → Output
  28×28      Extract edges/textures/shapes                   Classification result

| Feature | Description |
|---|---|
| Local Connectivity | Each neuron only looks at a small patch, not the entire image |
| Parameter Sharing | The same convolution kernel is reused across the entire image, drastically reducing parameters |
| Translation Invariance | A cat on the left or right side of the image can still be recognized |
| Hierarchical Features | Shallow layers learn edges, deep layers learn semantics |

Representative models: LeNet, AlexNet, VGG, ResNet, EfficientNet
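
The sliding-kernel idea can be sketched in pure Python. The function below assumes no padding and a stride of 1, and the 2×2 vertical-edge kernel is written by hand for illustration; in a real CNN the kernel values are learned.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x4 toy image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d(image, kernel)
print(feature_map)  # prints: [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four kernel weights are reused at every position (parameter sharing), and the output is nonzero only where the edge is (local feature extraction).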

3.2 RNN (Recurrent Neural Network)

RNN is designed for sequential data. Its hidden state is passed to the next time step, giving the network a form of "memory."

Time step t1    Time step t2    Time step t3
  "I"   ──→    "like"  ──→    "cats"
   ↓             ↓             ↓
  [h1]  ──→    [h2]   ──→    [h3] ──→ Output
   ↑             ↑             ↑
 Hidden state passes between time steps (memory)

| Variant | Problem Solved | Core Mechanism |
|---|---|---|
| Vanilla RNN | Basic sequence modeling | Simple recurrent connections |
| LSTM | Vanishing gradients in long sequences | Forget gate, input gate, output gate |
| GRU | LSTM has too many parameters | Simplified to reset gate and update gate |
| Bidirectional RNN | Can only see the past | Processes forward and backward simultaneously |
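
A vanilla RNN step can be sketched with scalars instead of vectors. The weights `w_x`, `w_h`, and `b` below are hypothetical fixed values, not trained ones, and tanh is assumed as the activation.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One vanilla RNN step: mix the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                 # initial hidden state
    states = []
    for x in sequence:      # time steps must be processed in order
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single "pulse" at the first time step, followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
print([round(h, 4) for h in states])
```

The hidden state stays nonzero after the input goes quiet, which is the "memory", but it shrinks at every step. That fading is an informal picture of why plain RNNs struggle with long sequences.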

LSTM's Gating Mechanism

The elegance of LSTM lies in its three "gates": the forget gate decides which old memories to discard, the input gate decides which new information to store, and the output gate decides what to output. It's like reading a book — you selectively remember important plot points and forget irrelevant details.
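
The three gates can be sketched for a scalar LSTM cell. This follows the standard LSTM update (cell state c, hidden state h), but the parameter layout (a dict mapping each gate name to a `(w_x, w_h, b)` tuple) and all numeric values are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step; `params` maps a gate name to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))         # how much old memory to keep
    i = sigmoid(pre("input"))          # how much new information to store
    o = sigmoid(pre("output"))         # how much of the memory to expose
    g = math.tanh(pre("candidate"))    # candidate new memory content

    c = f * c_prev + i * g             # updated cell state (long-term memory)
    h = o * math.tanh(c)               # updated hidden state (output)
    return h, c

# Hypothetical parameters: forget gate wide open, input gate almost closed.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
print(h, c)
```

With these settings the cell state passes through almost unchanged (c stays near 0.7) regardless of the input, showing how the gates let an LSTM carry information across many steps.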

3.3 Transformer: Attention Is All You Need

In 2017, researchers at Google published the paper "Attention Is All You Need," which introduced the Transformer and reshaped the AI field. It replaces recurrent structures with the self-attention mechanism and serves as the foundation for large models like GPT, BERT, and Claude.

Input sequence → Embedding + Positional Encoding → [Multi-Head Attention → Feed-Forward Network] × N → Output

                                          Every token can "see" every other token

| Advantage | Description |
|---|---|
| Parallel Computation | Unlike RNNs that must process sequentially, Transformers can process the entire sequence in parallel |
| Long-Range Dependencies | Direct connections between any two positions, regardless of distance |
| Scalability | Performance tends to improve predictably as model size, data, and compute grow (scaling laws) |

Intuition for self-attention: When reading the sentence "The cat sat on the mat because it was tired," "it" needs to attend to "cat" to understand the meaning. Self-attention lets the model learn this kind of association — computing a "relevance score" for every pair of tokens in the sequence.
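
The "relevance score for every pair of tokens" can be sketched as scaled dot-product attention in pure Python. To keep the example short, the same toy vectors are used as queries, keys, and values; real Transformers first apply separate learned projections, which are omitted here. The 2-dimensional token vectors are made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]   # relevance to each token
        weights = softmax(scores)                    # normalise to sum to 1
        all_weights.append(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs, all_weights

# Hypothetical token vectors: tokens 0 and 2 are similar, token 1 is different.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = attention(tokens, tokens, tokens)
print([round(w, 3) for w in weights[0]])
```

Token 0 puts more weight on token 2 than on token 1 because their vectors point in nearly the same direction. Each query row is computed independently of the others, which is what allows the whole sequence to be processed in parallel.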

Common Neural Network Architectures

Feedforward neural network (FNN), 1958

The most basic neural network structure. Data flows one way from the input layer through hidden layers to the output layer, with no recurrence. Neurons in each layer connect to all neurons in the next layer.

  • Network structure: Input layer → Hidden layers × N → Output layer
  • Typical applications: classification, regression, function approximation
  • Key idea: Map inputs to outputs through multiple nonlinear transformations. More layers can represent more complex functions.

4. The Art of Training

Having a good architecture isn't enough — there are many pitfalls to avoid during training.

4.1 Overfitting vs. Underfitting

| Problem | Symptoms | Cause | Solutions |
|---|---|---|---|
| Overfitting | Good performance on training set, poor on test set | Model is too complex, "memorizing answers" rather than learning patterns | Regularization, Dropout, data augmentation, early stopping |
| Underfitting | Poor performance on both training and test sets | Model is too simple, unable to learn patterns | Increase model capacity, train longer, better features |

Error

  │ ╲  Training error          Test error  ╱
  │  ╲                                    ╱
  │   ╲─────────────────╱
  │    Underfitting ← Sweet spot → Overfitting
  └──────────────────────────→ Model complexity

4.2 Key Hyperparameters

Hyperparameters are parameters that must be set manually before training (not learned by the model):

| Hyperparameter | Role | Common Range | Tuning Advice |
|---|---|---|---|
| Learning Rate | Step size for each update | 1e-5 ~ 1e-1 | The most important hyperparameter; usually start from 1e-3 |
| Batch Size | Number of samples per training step | 16 ~ 512 | Larger batches are more stable but require more VRAM |
| Epochs | Number of passes through the entire dataset | 10 ~ 100+ | Use with early stopping; stop when validation performance no longer improves |
| Optimizer | Gradient update strategy | Adam, SGD | Adam is the default choice; SGD + momentum for fine-tuning |
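
The "stop when validation performance no longer improves" rule can be sketched as a small function over a list of per-epoch validation losses. The function name, the `patience` default, and the loss values are all made up for illustration; training frameworks ship their own versions of this logic.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss                       # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch                  # patience exhausted: stop here
    return len(val_losses) - 1                # never triggered: ran to the end

# Made-up validation losses: improvement stalls after epoch 2.
history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.65, 0.7]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)  # prints: 5
```

In practice the weights from the best epoch (epoch 2 here) are usually the ones kept, not the weights at the stopping epoch.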

4.3 Regularization Techniques

Common methods for preventing overfitting:

| Technique | Principle | Usage |
|---|---|---|
| Dropout | Randomly deactivate some neurons during training | Typically p=0.1~0.5 |
| Weight Decay | Add a penalty on weight magnitude to the loss function | L2 regularization, e.g. λ=1e-4 |
| Data Augmentation | Apply random transformations to training data (flip, crop, rotate) | Essential for image tasks |
| Early Stopping | Stop training when validation loss stops decreasing | patience=5~10 |
| Batch Normalization | Normalize the input distribution of each layer | Accelerates convergence, has a mild regularization effect |
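
Dropout is simple enough to sketch directly. The version below is the "inverted" variant commonly used in practice, which rescales the surviving activations during training so nothing needs to change at inference time. The function signature and the seeded random generator are choices made for this example.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(activations)      # inference: pass values through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                # seeded so the run is repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, p=0.5, rng=rng)
print(dropped)
print(dropout(hidden, p=0.5, training=False))
```

With p=0.5 every surviving value is doubled, so the expected sum of the layer's output stays the same as without dropout.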

Rules of Thumb for Training

  1. First, run through the entire pipeline on a small dataset to confirm there are no code bugs
  2. Start by fine-tuning an existing pre-trained model rather than training from scratch
  3. The learning rate is the hyperparameter most worth your time to tune
  4. If the training loss isn't decreasing, check your data and code first before doubting the model

5. Development History and Frontiers

The development of neural networks has gone through several "winters" and "renaissances," with each breakthrough driven by key technological innovations.

| Era | Milestone | Key Breakthrough |
|---|---|---|
| 1958 | Perceptron | One of the earliest trainable neural network models; could only solve linearly separable problems |
| 1986 | Backpropagation Algorithm | Made training multi-layer networks practical |
| 1998 | LeNet (CNN) | Convolutional networks achieved great success on handwritten digit recognition |
| 2012 | AlexNet | Deep CNNs crushed traditional methods on ImageNet; deep learning explosion |
| 2014 | GAN (Generative Adversarial Network) | Two networks trained adversarially, capable of generating realistic images |
| 2017 | Transformer | "Attention Is All You Need"; attention mechanism replaced RNNs |
| 2018 | BERT | Pre-training + fine-tuning paradigm; NLP breakthroughs across the board |
| 2020 | GPT-3 | 175 billion parameters, demonstrating emergent capabilities of large models |
| 2022 | ChatGPT | RLHF alignment technique; AI entered public consciousness |
| 2023+ | Multimodal Large Models | GPT-4V, Claude, etc.; understanding both text and images simultaneously |

| Direction | Description |
|---|---|
| Large Language Models (LLMs) | Parameter counts from hundreds of millions to trillions; emergent abilities in reasoning, coding, etc. |
| Multimodal | A single model handling text, images, audio, and video |
| Efficient Fine-Tuning | Techniques like LoRA and QLoRA enable ordinary developers to fine-tune large models |
| AI Agents | Enabling large models to use tools, plan tasks, and autonomously complete complex goals |
| Small Model Distillation | Using knowledge from large models to train smaller models for on-device deployment |

Takeaways for Developers

You don't need to train neural networks from scratch. Modern AI development is more about calling APIs (like OpenAI or Claude API) or fine-tuning pre-trained models (e.g., using Hugging Face). But understanding the underlying principles helps you choose models more wisely, design better prompts, and diagnose problems more effectively.


Summary

| Core Concept | One-Sentence Summary |
|---|---|
| Neuron | Weighted sum + activation function; the smallest computational unit of a network |
| Forward Propagation | Data flows from input layer to output layer, producing a prediction |
| Backpropagation | Starting from the loss, compute gradients layer by layer and update weights |
| CNN | Convolution kernels extract local features; the go-to for image processing |
| RNN/LSTM | Recurrent connections maintain memory; for processing sequential data |
| Transformer | Self-attention with parallel processing; the foundational architecture of large models |
| Overfitting | Model "memorizes answers"; prevent with regularization, Dropout, etc. |
| Transfer Learning | Stand on the shoulders of giants; fine-tune pre-trained models for new tasks |

Further Reading