Ollama Deployment from Scratch (Ubuntu 24.04 + ROCm 7+)

This section explains how to install and use Ollama (with its ROCm-enabled llama.cpp backend) on Ubuntu 24.04 with ROCm 7+, using Qwen3-8B Q4_K_M as the example model for performance testing.

Prerequisite: ROCm 7.1.0 environment setup is complete (refer to env-prepare-ubuntu24-rocm7.md).


1. Install Ollama (System Service)

The official install script sets Ollama up as a systemd service (managed via systemctl) listening on local port 11434.
For more information, refer to the official documentation: https://docs.ollama.com/linux

Installation command:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

2. Verify Service Connectivity

After installation, verify the service is running properly:

```bash
curl http://localhost:11434
```

If the service is running, this returns the plain-text message "Ollama is running". To get a JSON response with the version number, query `http://localhost:11434/api/version` instead.


3. Basic Usage Commands

Common basic commands:

```bash
# List all downloaded models
ollama list

# Download a model
ollama pull qwen3:8b-q4_K_M

# Run the model interactively
ollama run qwen3:8b-q4_K_M
```

4. Benchmarking with curl (Calculating tokens/s)

The following example calls Ollama's REST API, then uses jq and awk to extract the evaluation fields from the response and compute the inference speed:

```bash
curl -s -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
  "model": "qwen3:8b-q4_K_M",
  "prompt": "Explain what a large language model is in one sentence",
  "stream": false
}' | jq '.eval_count, .eval_duration' | \
awk 'NR==1{count=$1} NR==2{duration=$1/1e9} END{printf "tokens/s: %.2f\n", count/duration}'
```
- `eval_count`: number of tokens generated during inference
- `eval_duration`: inference time, in nanoseconds
- tokens/s = `eval_count / (eval_duration / 1e9)`
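The pipeline above needs a running server and jq; the arithmetic it performs can be sketched on its own with awk. The `eval_count` and `eval_duration` values below are hypothetical sample numbers, not real measurements:

```shell
# Hypothetical sample values standing in for the fields of a real
# /api/generate response (assumed numbers, not actual measurements):
eval_count=512            # tokens generated
eval_duration=13653333333 # inference time in nanoseconds (~13.65 s)

# tokens/s = eval_count / (eval_duration converted to seconds)
awk -v c="$eval_count" -v d="$eval_duration" \
    'BEGIN { printf "tokens/s: %.2f\n", c / (d / 1e9) }'
# prints: tokens/s: 37.50
```

With a real response, substitute the two values printed by the jq command.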

5. Qwen3-8B Q4_K_M Performance Example

Testing the Qwen3-8B Q4_K_M model in the environment above (context length 4096) produced, for example:

- Approximately 37.50 tokens/s
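As a quick sanity check, a throughput figure like this can be inverted into per-token latency (the 37.50 here is just the example number above, not a fresh measurement):

```shell
# Per-token latency in milliseconds, derived from a tokens/s figure:
awk 'BEGIN { printf "%.1f ms/token\n", 1000 / 37.50 }'
# prints: 26.7 ms/token
```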
