
llama.cpp Deployment from Scratch (Ubuntu 24.04 + ROCm 7+)

This section explains how to use llama.cpp for inference on Ubuntu 24.04 + ROCm 7+, including:

  • Using pre-built executables (recommended)
  • Using Docker + official ROCm image to build from source

The example model is Qwen3-8B Q4_K_M (GGUF format).

Prerequisite: ROCm 7.1.0 system installation and verification is complete (see env-prepare-ubuntu24-rocm7.md).


Method 1: Pre-built Executables (Recommended)

1. Download the Pre-built Version

Use the pre-built binaries provided by Lemonade, where the numbers refer to the target Ryzen AI APU:

  • 370 (Ryzen AI 9 HX 370, "Strix Point") corresponds to the gfx1150 architecture
  • 395 (Ryzen AI MAX+ 395, "Strix Halo") corresponds to the gfx1151 architecture
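
A quick way to check which gfx target your own machine reports, using the stock rocminfo tool:

bash
# List the gfx targets visible to the ROCm runtime
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u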

Related links:


2. Verify ROCm 7+ Installation (Must Be System-level ROCm)

Use amd-smi to confirm GPU model, driver, and ROCm version:

bash
amd-smi

The output should show the GPU model, driver version, and ROCm version. If these look correct, the GPU is ready for inference.
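
As a secondary check, the installed ROCm release can be read from the version file that a standard /opt/rocm installation ships:

bash
# Should print the installed ROCm version, e.g. 7.x.x
cat /opt/rocm/.info/version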


3. Enter the llama Backend Directory and Set Permissions / Environment Variables

bash
# Enter the extracted pre-built directory
cd llama-*x64/
# Make the bundled binaries executable
sudo chmod +x *
# Point the dynamic loader at the system ROCm libraries
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
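
To confirm the binaries will resolve their HIP/ROCm libraries from the system installation rather than failing at load time, a quick ldd check (binary name assumed from the bundle layout):

bash
# HIP/ROCm entries should resolve to /opt/rocm/lib; watch for "not found"
ldd ./llama-server | grep -Ei 'hip|rocblas|not found'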

4. Download Qwen3-8B Q4_K_M GGUF Model

llama-server, like Ollama and other llama.cpp-based tools, uses the GGUF model format.

This example uses the Chinese Hugging Face mirror https://hf-mirror.com/. You will need a
Hugging Face account to obtain your username and access token.

Reference commands:

bash
# Create model storage directory
mkdir -p ~/models
cd ~/models

# Download the hfd download helper script from the mirror
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com

# hfd.sh uses aria2 for multi-connection downloads
sudo apt update
sudo apt install aria2

# Log in to Hugging Face to get your username and token, then download the GGUF model
./hfd.sh netrunnerllm/Qwen3-8B-Q4_K_M-GGUF --hf_username <USERNAME> --hf_token hf_***
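
Once hfd.sh finishes, confirm the GGUF file is in place before starting the server:

bash
# The quantized model should be a single multi-GB .gguf file
ls -lh ~/models/Qwen3-8B-Q4_K_M-GGUF/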

5. Start llama-server

bash
# ROCm driver library path
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
cd llama-*x64/
# -ngl 99 offloads all model layers to the GPU
./llama-server -m ~/models/Qwen3-8B-Q4_K_M-GGUF/qwen3-8b-q4_k_m.gguf -ngl 99
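
If you want to pin the 4096-token context used in the test results below, a variant with explicit flags looks like this (the host and port shown are llama-server's defaults):

bash
./llama-server -m ~/models/Qwen3-8B-Q4_K_M-GGUF/qwen3-8b-q4_k_m.gguf \
  -ngl 99 -c 4096 --host 127.0.0.1 --port 8080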

6. Test the API (curl + jq to Calculate tokens/s)

Use curl to call the local llama-server API and compute the generation throughput in tokens/s:

bash
curl -s -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen3-8b-q4_k_m",
  "prompt": "Explain large language models in one sentence",
  "max_tokens": 128
}' | jq -r '
# Print generated text
.choices[0].text as $txt |
# Calculate token/s
(.usage.completion_tokens / (.timings.predicted_ms / 1000)) as $tps |
"Generated text:\n\($txt)\n\ntokens/s: \($tps|tostring)"
'
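
If the request hangs or returns an error, first confirm the server has finished loading the model via its health endpoint:

bash
# Returns {"status":"ok"} once the model is loaded and ready
curl -s http://127.0.0.1:8080/health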

Test result example (Qwen3-8B Q4_K_M, ctx=4096):

  • Approximately 40.71 tokens/s

Method 2: Docker (Official ROCm llama.cpp Image)

If you prefer using Docker, the steps below follow the official ROCm documentation.

Note: when using Docker, the amdgpu-dkms kernel driver must be installed on the host:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html
This step is included in the installation script mentioned earlier; if you did not run the script, install it manually.
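
A minimal manual installation sketch, assuming the AMD apt repository was already configured during the ROCm setup:

bash
# Install the kernel driver on the host, then reboot to load it
sudo apt update
sudo apt install amdgpu-dkms
sudo reboot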


1. Pull and Start the Container

bash
# Use $HOME rather than a quoted "~", which would not expand
export MODEL_PATH="$HOME/models"

sudo docker run -it \
  --name=$(whoami)_llamacpp \
  --privileged --network=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --ipc=host --shm-size 16G \
  -v "$MODEL_PATH":/data \
  rocm/dev-ubuntu-24.04:7.0-complete
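
Before setting up the workspace, it is worth verifying that the devices passed through via --device are visible inside the container (the rocm/dev image ships the ROCm user-space tools):

bash
# Run inside the container; the APU should appear with its gfx target
rocminfo | grep gfx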

2. Prepare the Workspace Inside the Container

After entering the container, set up your working directory and dependencies:

bash
apt-get update && apt-get install -y nano libcurl4-openssl-dev cmake git
mkdir -p /workspace && cd /workspace

3. Clone the ROCm Official llama.cpp Repository Inside the Container

bash
git clone https://github.com/ROCm/llama.cpp
cd llama.cpp

4. Set the ROCm Architecture (Using AI MAX 395 as Example)

bash
export LLAMACPP_ROCM_ARCH=gfx1151

To compile for multiple micro-architectures simultaneously:

bash
export LLAMACPP_ROCM_ARCH=gfx803,gfx900,gfx906,gfx908,gfx90a,gfx942,gfx1010,gfx1030,gfx1032,gfx1100,gfx1101,gfx1102,gfx1150,gfx1151

5. Build and Install llama.cpp

bash
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON && \
cmake --build build --config Release -j$(nproc)
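
If the build succeeds, the binaries land in build/bin; a quick check before moving on:

bash
# Confirm the main binaries were produced
ls -lh build/bin/llama-cli build/bin/llama-server build/bin/llama-bench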

6. Test the Installation

bash
cd /workspace/llama.cpp
# Runs the ggml backend operator test suite against the HIP backend
./build/bin/test-backend-ops

7. Run Qwen3-8B Q4_K_M Test

bash
./build/bin/llama-cli -m /data/Qwen3-8B-Q4_K_M-GGUF/qwen3-8b-q4_k_m.gguf -ngl 99
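
For a repeatable throughput figure (rather than reading it off the llama-cli output), the llama-bench tool from the same build can be used:

bash
# Reports prompt-processing (pp) and token-generation (tg) rates in tokens/s
./build/bin/llama-bench -m /data/Qwen3-8B-Q4_K_M-GGUF/qwen3-8b-q4_k_m.gguf -ngl 99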

Test result (Qwen3-8B Q4_K_M, ctx=4096):

  • Approximately 39.60 tokens/s