Skip to content

Model Fine-tuning and Deployment

Preface

Large models are powerful, but they don't understand your business. GPT-4 can write poetry and code, but it doesn't know your company's product terminology or your industry's professional standards. Fine-tuning is the process of making a general-purpose large model "learn" your professional knowledge — like giving a knowledgeable generalist on-the-job training to become your domain expert.

What will you learn from this article?

After completing this chapter, you will gain:

  • Process understanding: Master the complete fine-tuning pipeline from data preparation to model deployment
  • Data engineering: Understand the format requirements and quality standards for fine-tuning data
  • Efficient fine-tuning: Understand the principles and advantages of parameter-efficient fine-tuning techniques like LoRA
  • Model compression: Master how quantization techniques enable large models to run on consumer hardware
  • Deployment practices: Understand mainstream architectures and selection strategies for model serving
ChapterContentCore Concepts
Chapter 1Fine-tuning PipelineData → Training → Evaluation → Deployment
Chapter 2Training DataData formats, quality control
Chapter 3LoRA Fine-tuningLow-rank adaptation, parameter efficiency
Chapter 4Model QuantizationFP16, INT8, INT4
Chapter 5Model DeploymentInference serving, API gateway

0. Overview: Why is Fine-tuning Needed?

Large language model training is divided into two phases: pre-training and fine-tuning. Pre-training learns language capabilities from massive general data, while fine-tuning learns specialized capabilities from task-specific data.

To use an analogy: pre-training is like going to college — learning general knowledge and understanding a bit of everything; fine-tuning is like onboarding training — learning professional skills for a specific position.

When Do You Need Fine-tuning?

  • Specific output formats: When you need the model to consistently output in a fixed JSON format
  • Professional domain knowledge: Terminology and standards in medical, legal, financial, and other domains
  • Language style transfer: Making the model respond in a specific tone or style (e.g., customer service scripts)
  • Niche language support: Improving model performance on specific languages
  • Cost optimization: Using a fine-tuned small model to replace large model API calls, reducing inference costs

1. Fine-tuning Pipeline: The Complete Journey from Data to Production

Fine-tuning is not just "throwing data at a model and calling it done." It's a rigorous engineering process where every step affects the final result.

微调流水线演示

点击每个阶段,了解微调的完整流程

🧠
选择基座模型
📊
准备训练数据
⚙️
执行微调训练
📈
评估与测试
🚀
部署上线
🧠 选择基座模型

微调的第一步是选择一个合适的预训练基座模型。基座模型已经在海量数据上学习了通用的语言能力,我们要做的是在此基础上进行"专业化训练"。

1根据任务需求选择模型规模(7B、13B、70B 等)
2考虑开源许可证(Apache 2.0、Llama 许可等)
3评估模型的基础能力是否匹配目标场景
4常见选择:Llama、Qwen、Mistral、DeepSeek 等
示例
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")
1 / 5

Five Stages of Fine-tuning

  1. Data Preparation: Collect, clean, and annotate training data — this is the most time-consuming and critical step
  2. Model Selection: Choose an appropriate base model, such as Llama 3, Qwen, or Mistral
  3. Training Configuration: Set hyperparameters like learning rate, batch size, and number of epochs
  4. Training Execution: Run training on GPUs, monitoring loss curves and evaluation metrics
  5. Evaluation and Deployment: Evaluate performance on a test set, then deploy as an API service if it passes
StageKey ActionsCommon Pitfalls
Data PreparationClean, deduplicate, formatPoor data quality leads to the model "learning bad habits"
Model SelectionEvaluate base model capabilitiesModel too large to train, or too small for good results
Training ConfigurationAdjust hyperparametersLearning rate too high causes catastrophic forgetting
Training ExecutionMonitor loss and metricsOverfitting, training not converging
Evaluation and DeploymentA/B testing, gradual rolloutTest set leakage leading to inflated evaluation metrics

2. Training Data: The Ceiling of Fine-tuning Performance

There's an old saying in fine-tuning: "Garbage in, garbage out." The quality of training data directly determines the upper limit of fine-tuning effectiveness. 100 high-quality data points often outperform 10,000 low-quality ones.

训练数据格式演示

切换不同格式,了解微调数据的组织方式

指令跟随

最常见的微调数据格式。每条数据包含一个指令(instruction)、可选的输入(input)和期望的输出(output)。适合训练通用助手类模型。

通用助手ChatGPT 风格最常用
数据样例
"instruction": "请将以下中文翻译成英文"
"input": "人工智能正在改变世界"
"output": "AI is changing the world"
数据质量要点
指令要清晰明确,避免歧义
输出要完整、准确、格式规范
覆盖多种任务类型(翻译、摘要、问答等)
数据量建议:1,000 ~ 50,000 条

Three Common Fine-tuning Data Formats

  1. Instruction Format: The most commonly used format, containing three fields: instruction, input, and expected output. Suitable for training models to follow instructions.
  2. Chat Format: Multi-turn conversation format containing message lists for system, user, and assistant roles. Suitable for training chatbots.
  3. Completion Format: Simple prompt-completion pairs, suitable for text generation, code completion, and similar scenarios.
Data Quality DimensionDescriptionVerification Method
AccuracyAnswers must be correctManual review, expert verification
ConsistencySimilar questions have consistent response stylesSample comparison checks
DiversityCover enough scenarios and variationsStatistical distribution of question types
DeduplicationAvoid duplicate samples causing overfittingText deduplication, semantic deduplication
Data VolumeUsually 500~5000 high-quality data points sufficeStart small, gradually increase

3. LoRA: Achieving 90% of Results with 1% of Parameters

Full fine-tuning requires updating all model parameters — for a 70B parameter model, this means needing hundreds of GB of VRAM and massive GPU computing power. For most teams, this is impractical.

LoRA (Low-Rank Adaptation) provides an elegant solution: freeze the original model parameters and only train a small set of newly added low-rank matrices. These matrices typically have only 0.1%~1% of the original model's parameters but can achieve results close to full fine-tuning.

LoRA 低秩适配原理演示

理解 LoRA 如何用极少参数实现高效微调

原始权重 W
4096x4096
16,777,216 参数
冻结不动
+
LoRA 适配器
A
4096x8
x
B
8x4096
65,536 参数
可训练
参数节省比例
节省 99.6% 参数
秩越小 = 参数越少、训练越快秩越大 = 表达力越强、效果越好

LoRA's Core Idea

The original model's weight matrix W is a huge matrix (e.g., 4096×4096). LoRA doesn't directly modify W but adds a "bypass" alongside it: W' = W + BA, where B and A are two small matrices (e.g., 4096×8 and 8×4096). During training, only B and A are updated while the original W remains unchanged.

  • Rank (r): Higher r values mean stronger expressiveness but more parameters. Usually r=8~64 is sufficient
  • Merge for deployment: After training, BA can be merged back into W for zero additional overhead during inference
Fine-tuning MethodTrainable ParametersVRAM RequirementTraining SpeedEffect
Full Fine-tuning100%Extremely highSlowBest
LoRA0.1%~1%LowFastClose to full
QLoRA0.1%~1%LowerMediumSlightly below LoRA
Prompt Tuning< 0.01%Extremely lowVery fastLimited

4. Model Quantization: Slimming Down Large Models

A 70B parameter model stored in FP32 (32-bit floating point) requires 280GB of VRAM — impossible to run without several top-tier GPUs. Quantization technology compresses model size by reducing numerical precision, enabling large models to run on consumer hardware.

模型量化演示

拖动滑块,直观感受不同精度下的模型体积、速度与质量变化

FP32
32 bit
FP16
16 bit
INT8
8 bit
INT4
4 bit
💾
模型体积
~28 GB (7B 模型)
推理速度
1x (基准)
🎯
输出质量
100% (无损)
🖥️
显存需求
~32 GB
FP32 详解

FP32(32位浮点数)是模型训练时的默认精度。每个参数用 32 位存储,精度最高但体积最大。通常只在训练阶段使用,推理时很少直接使用 FP32。

单个参数存储示意
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
0
1
每个参数占用 32 位 = 4 字节
适用场景:模型训练、科研实验、精度敏感的任务

The Core Trade-off of Quantization

Quantization is fundamentally a precision-for-space trade-off. FP32 → FP16 is nearly lossless, INT8 has minor loss, and INT4 has noticeable but usually acceptable quality degradation. The key is finding the optimal balance point for your scenario.

  • FP16 (half precision): Halves the size with almost no quality loss; the default choice for training and inference
  • INT8 (8-bit integer): Halves the size again with minimal quality loss; suitable for most inference scenarios
  • INT4 (4-bit integer): Only 1/8 of FP32 size with some quality loss; suitable for resource-constrained scenarios
PrecisionBytes Per Parameter70B Model SizeQuality LossApplicable Scenario
FP324 bytes~280 GBNoneTraining baseline
FP162 bytes~140 GBNearly noneStandard training and inference
INT81 byte~70 GBVery smallProduction inference
INT40.5 bytes~35 GBAcceptableEdge devices, local deployment

5. Model Deployment: From Lab to Production

The model is trained, quantized and compressed — the final step is deploying it as a callable service. Model deployment isn't just about "running the model"; it also involves engineering issues like concurrency handling, load balancing, and cost control.

模型服务架构演示

点击不同部署方案,对比其特点与适用场景

🌐
API 服务
最常见的在线部署方式
📱
边缘部署
在终端设备上本地运行
📦
批量处理
离线批量推理大量数据
🌐API 服务

将模型封装为 RESTful API 或 gRPC 服务,通过 HTTP 请求调用。适合需要实时响应的在线应用,如聊天机器人、智能客服、内容生成等。是目前最主流的部署方式。

架构流程
客户端请求
负载均衡
推理服务器
GPU 推理
返回结果
响应延迟
100ms - 2s
并发能力
高(可水平扩展)
部署成本
中高(需 GPU 服务器)
运维复杂度
中等
常用工具
vLLMTGITritonFastAPIOllama

Three Mainstream Deployment Solutions

  1. API Service Providers: Use APIs from OpenAI, Anthropic, and other providers directly. Zero operations, pay per token, suitable for rapid validation and small-to-medium scale usage.
  2. Self-hosted Inference: Deploy on your own GPU servers using frameworks like vLLM or TGI. Controllable costs, data stays on-premises, suitable for scenarios with privacy requirements or large-scale calls.
  3. Serverless Inference: Use platforms like AWS SageMaker or Replicate, pay per request with automatic scaling. Suitable for scenarios with fluctuating traffic.
Deployment SolutionCost ModelLatencyOperations ComplexityApplicable Scenario
API Service ProviderPay per tokenMediumZeroRapid prototyping, small-to-medium scale
vLLM Self-deploymentGPU rental costsLowHighLarge-scale, privacy-sensitive
ServerlessPay per requestHigher cold startLowFluctuating traffic
Edge DeploymentOne-time hardware costVery lowMediumOffline scenarios, IoT

Summary

Model fine-tuning and deployment are critical steps in transforming large models from "general-purpose tools" to "professional assistants." From data preparation to model deployment, every step requires engineering thinking and practice.

Key takeaways from this chapter:

  1. Fine-tuning is onboarding training: Making general-purpose models learn domain-specific knowledge and behavioral patterns
  2. Data quality determines the ceiling: 100 high-quality data points beat 10,000 low-quality ones
  3. LoRA is the efficiency champion: Achieving near full fine-tuning results with less than 1% of parameters
  4. Quantization is a deployment enabler: INT4 quantization makes running 70B models on a single GPU possible
  5. Deployment solutions vary by scenario: Use APIs for rapid validation, self-deployment for large scale, and serverless for fluctuating traffic

Further Reading