🚀 ROCm LLM Deployment in Practice

Get started with LLM deployment on AMD GPUs from scratch

Back to Home · 中文 (Chinese)

Introduction

  This module provides comprehensive tutorials for deploying large language models on AMD GPUs. Whether you are a beginner or an experienced developer, you can quickly learn how to deploy and run LLMs on the ROCm platform through these tutorials.

  Since version 7.10.0, ROCm can be installed seamlessly into a Python virtual environment, just like CUDA, which significantly lowers the barrier to deploying LLMs on AMD GPUs.
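
  For example, once a ROCm build of PyTorch has been installed into such a virtual environment (the exact pip command depends on the ROCm and PyTorch versions you target), a minimal sanity check looks like this:

# verify_rocm_torch.py -- sanity-check a ROCm PyTorch install inside the venv
import torch

# ROCm builds of PyTorch report the HIP runtime version here (None on CUDA/CPU builds)
print("HIP runtime:", torch.version.hip)

# For compatibility, AMD GPUs are exposed through the torch.cuda API on ROCm
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))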

  This module uses Google Gemma 4 (primarily gemma-4-E4B-it) as the default example model, with parallel Qwen3 tutorials provided for reference. The directory structure is as follows:

01-Deploy/
└── models/
    ├── Gemma4/           # Deployment tutorials with Gemma 4 as the primary model (recommended)
    └── Qwen3/            # Qwen3 series deployment tutorials (reference/comparison)

Tutorial List

Ubuntu 24.04 + ROCm 7 Environment Setup Tutorial

  This tutorial walks you through installing and verifying ROCm 7.1.0 on Ubuntu 24.04 step by step: removing old ROCm installations, running the official installation script, and using tools such as rocminfo / rocm-smi / amd-smi to confirm GPU and driver status (a scripted version of this check is sketched below). It is recommended to complete this tutorial before starting any of the deployment tutorials.

  • Target Audience: Users setting up a ROCm environment on an AMD GPU for the first time
  • Difficulty Level: ⭐⭐
  • Estimated Time: 1 hour

📖 Start the Environment Setup Tutorial (Gemma4)
📎 Reference: Qwen3 Version
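
  As a quick complement to the tutorial, a small script along these lines (a sketch; the tutorial itself covers interpreting the tools' full output) can confirm that the ROCm command-line tools are installed and on your PATH:

# check_rocm_tools.py -- confirm the ROCm CLI tools are present and runnable
import shutil
import subprocess

for tool in ("rocminfo", "rocm-smi", "amd-smi"):
    path = shutil.which(tool)
    if path is None:
        print(f"{tool}: NOT FOUND (is ROCm installed and on PATH?)")
        continue
    # Not every tool supports --version; fall back to just reporting its location
    try:
        out = subprocess.run([tool, "--version"], capture_output=True, text=True, timeout=10)
        banner = (out.stdout or out.stderr).strip().splitlines()
        print(f"{tool}: {path} -> {banner[0] if banner else '(no version output)'}")
    except Exception as exc:
        print(f"{tool}: {path} (version check failed: {exc})")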


Gemma 4 Model Introduction

  Before starting deployment, it is recommended to read the Gemma 4 model introduction to understand the architectural characteristics, capability differences, and hardware recommendations for the four variants (Gemma 4 E2B / E4B / 31B / 26B A4B), so you can choose the right model for your environment.

  • Target Audience: Users trying Gemma 4 for the first time
  • Difficulty Level: ⭐
  • Estimated Time: 15 minutes

📖 Read the Gemma 4 Model Introduction


LM Studio LLM Deployment from Scratch

  LM Studio is a user-friendly desktop application that supports running large language models locally. This tutorial uses Gemma 4 E4B-it Q4_K_M as an example to guide you through deploying and running LLMs on AMD GPUs using LM Studio with the ROCm version of the llama.cpp backend.

  • Target Audience: Beginners and users who want to quickly experience LLMs
  • Difficulty Level: ⭐
  • Estimated Time: 30 minutes

📖 Start the LM Studio Deployment Tutorial (Gemma4)
📎 Reference: Qwen3 Version
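
  Beyond the GUI chat, LM Studio can expose the loaded model through a local OpenAI-compatible server (by default on port 1234). Below is a minimal sketch of querying it from Python; the model identifier is illustrative and should match whatever LM Studio shows for your loaded model:

# query_lmstudio.py -- minimal chat request against LM Studio's local server
# Assumes LM Studio's OpenAI-compatible server is running (default: http://localhost:1234)
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gemma-4-E4B-it",  # illustrative; use the identifier shown in LM Studio
        "messages": [{"role": "user", "content": "Hello! What can you do?"}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])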


vLLM LLM Deployment from Scratch

  vLLM is a high-performance LLM inference and serving framework built around PagedAttention and continuous batching. This tutorial uses Gemma 4 E4B-it as an example and covers both a quick start using the official ROCm vLLM Docker image and an advanced path that compiles Triton / FlashAttention / vLLM from source; a client-side example is sketched after the links below.

  • Target Audience: Developers who need to set up inference services
  • Difficulty Level: ⭐⭐
  • Estimated Time: 1 hour

📖 Start the vLLM Deployment Tutorial (Gemma4)
📎 Reference: Qwen3 Version
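
  Once the server is up (e.g. started with vllm serve <model>), it speaks the OpenAI-compatible API on port 8000 by default. A minimal sketch using the openai Python client (pip install openai); the model name is illustrative and must match what the server was launched with:

# query_vllm.py -- talk to a running vLLM server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="google/gemma-4-E4B-it",  # illustrative; must match the served model
    messages=[{"role": "user", "content": "Briefly explain continuous batching."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)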


Ollama LLM Deployment from Scratch

  Ollama is a framework for quickly serving large language models and vision-language models with an efficient backend runtime. This tutorial uses Gemma 4 E4B-it Q4_K_M as an example to guide you through deploying LLMs on AMD GPUs with Ollama (using the ROCm build of the llama.cpp backend), including tokens/s benchmark examples; a minimal API client is sketched after the links below.

  • Target Audience: Developers who want to spin up a local inference service with a single command
  • Difficulty Level: ⭐⭐
  • Estimated Time: 1 hour

📖 Start the Ollama Deployment Tutorial (Gemma4)
📎 Reference: Qwen3 Version
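
  Once the Ollama server is running and the model has been pulled, a single POST returns both the completion and the raw counters needed for a rough tokens/s figure (the model tag below is illustrative; use the tag you actually pulled):

# ollama_benchmark.py -- one-off generation plus a rough tokens/s figure
# Assumes an Ollama server on the default port (11434) with the model already pulled
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma-4-E4B-it-q4_K_M",  # illustrative tag
        "prompt": "Write a haiku about GPUs.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
data = resp.json()
print(data["response"])

# Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds)
if data.get("eval_duration"):
    print(f"~{data['eval_count'] / data['eval_duration'] * 1e9:.1f} tokens/s")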


llama.cpp LLM Deployment from Scratch

  llama.cpp is a lightweight and high-performance inference backend that supports multiple model formats including GGUF, with optimized versions available for ROCm. This tutorial uses Gemma 4 E4B-it Q4_K_M (GGUF) as an example, showing how to deploy mainstream models on Ubuntu 24.04 + ROCm 7+ using both pre-built binaries and Docker.

  • Target Audience: Developers who want to freely orchestrate inference workflows via CLI / REST API
  • Difficulty Level: ⭐⭐⭐
  • Estimated Time: 1.5 hours

📖 Start the llama.cpp Deployment Tutorial (Gemma4)
📎 Reference: Qwen3 Version
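
  Besides its OpenAI-compatible endpoints, llama-server also exposes a native completion endpoint, which is handy for scripted workflows. A minimal sketch, assuming llama-server is running with your GGUF model on the default port 8080:

# llamacpp_completion.py -- call llama-server's native completion endpoint
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "The three main benefits of quantized GGUF models are",
        "n_predict": 128,     # llama.cpp's name for max new tokens
        "temperature": 0.7,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["content"])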


Requirements

Hardware Requirements

  • AMD GPU (ROCm-supported GPUs such as RX 7000 / 9000 series, Ryzen AI MAX / AI 300, Instinct MI series, etc.)
  • At least 8GB VRAM recommended (Gemma 4 E4B Q4_K_M quantized version can run with 8GB VRAM; for native bfloat16 inference or larger models, please refer to the VRAM recommendations in the corresponding tutorials)
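
  As a rough rule of thumb (an approximation, not an official sizing guide), model weights need about parameter-count × bits-per-weight / 8 bytes of VRAM, plus headroom for the KV cache and runtime buffers:

# vram_estimate.py -- back-of-the-envelope VRAM estimate for model weights
# Heuristic only; actual usage also depends on context length, KV cache, and runtime

def weight_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Estimate VRAM (GiB) for model weights, with a fudge factor for runtime overhead."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / (1024 ** 3)

# e.g. a ~4B-parameter model at ~4.8 bits/weight (roughly Q4_K_M) fits easily in 8GB:
print(f"{weight_vram_gb(4, 4.8):.1f} GiB")   # ~2.7 GiB of weights
# the same model in bfloat16 (16 bits/weight) needs considerably more:
print(f"{weight_vram_gb(4, 16):.1f} GiB")    # ~9 GiB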

Software Requirements

  • Operating System: Linux (Ubuntu 22.04+) or Windows 11
  • ROCm 7.10.0 or higher
  • Python 3.10+

FAQ

Q: How do I check if my AMD GPU supports ROCm?

Please refer to the ROCm official support list to check supported GPU models.

Q: What should I do if I encounter a "HIP error" during deployment?

  1. Confirm that ROCm is properly installed
  2. Check that the relevant environment variables are correctly set (see the sketch below)
  3. Try restarting the system and running again
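
  For step 2, a quick way to dump the environment variables that most commonly affect HIP device discovery (which of them matter depends on your setup):

# hip_env_check.py -- print environment variables that commonly affect HIP device discovery
import os

for var in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES",
            "HSA_OVERRIDE_GFX_VERSION", "PATH", "LD_LIBRARY_PATH"):
    print(f"{var} = {os.environ.get(var, '(unset)')}")
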
Q: Why do I get a "permission denied" error when downloading Gemma 4?

Gemma-series models are gated: you must first click Agree & Access on the corresponding Hugging Face model page (e.g., google/gemma-4-E4B-it), then authenticate with a Hugging Face token that has read permission, either by logging in or by passing the token to the container / process via the HF_TOKEN environment variable.
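
  A minimal sketch of downloading a gated model with the huggingface_hub library once access has been granted (the repo id follows the tutorials above; adjust it to your model):

# download_gemma.py -- fetch a gated model after accepting its license on Hugging Face
import os
from huggingface_hub import snapshot_download

path = snapshot_download(
    "google/gemma-4-E4B-it",           # repo id used in these tutorials; adjust as needed
    token=os.environ.get("HF_TOKEN"),  # or omit after `huggingface-cli login`
)
print("Model downloaded to:", path)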

Contributions for more deployment tutorials are welcome! 🎉

Submit an Issue | Submit a PR