Ollama Deployment from Scratch (Ubuntu 24.04 + ROCm 7+)
This section explains how to install and use Ollama (ROCm build, llama.cpp backend) on Ubuntu 24.04 with ROCm 7+, using Gemma 4 E4B-it Q4_K_M as the example model for performance testing.
Prerequisite: ROCm 7.1.0 environment setup is complete (refer to
env-prepare-ubuntu24-rocm7.md).
1. Install Ollama (System Service)
The official one-line installer registers Ollama as a systemd service (managed via systemctl) that listens on local port 11434.
For more information, refer to the official documentation: https://docs.ollama.com/linux
Installation command:
curl -fsSL https://ollama.com/install.sh | sh
2. Verify Service Connectivity
After installation, verify the service is running properly:
curl http://localhost:11434
If a response such as "Ollama is running" is returned, the service has started successfully. For a JSON version string, query http://localhost:11434/api/version instead.
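For scripting, the check above can be wrapped in a small retry loop. The following is a minimal sketch assuming the default port 11434 and the GET /api/version endpoint, which returns the server version as a small JSON object:

```shell
#!/bin/sh
# Poll the Ollama API until it answers or we give up.
# Assumes the default port 11434; pass a different base URL as $1 if needed.
wait_for_ollama() {
  url="${1:-http://localhost:11434}"
  for _ in 1 2 3 4 5; do
    # /api/version returns a small JSON object like {"version":"0.x.y"}
    curl -sf "$url/api/version" >/dev/null && return 0
    sleep 1
  done
  return 1
}

# Usage:
# wait_for_ollama && echo "Ollama is up" || echo "Ollama is not reachable"
```

This is handy in provisioning scripts that need to wait for the systemd service to come up before pulling models.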
3. Basic Usage Commands
Common basic commands:
# List all models
ollama list
# Download a model (using Gemma 4 E4B-it Q4_K_M as an example)
ollama pull gemma4:e4b-it-q4_K_M
# Test model execution (interactive)
ollama run gemma4:e4b-it-q4_K_M
Note: Gemma 4 tag naming in the Ollama official library may change with upstream updates. Please check ollama.com/library/gemma4 for the latest tags before use. If you have sufficient VRAM, you can also switch to larger variants such as gemma4:31b or gemma4:26b-a4b.
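The CLI commands above also have REST equivalents; for example, `ollama list` roughly corresponds to querying the /api/tags endpoint. A sketch, assuming the default port and that jq is installed:

```shell
#!/bin/sh
# List locally installed model tags via the REST API (similar to `ollama list`).
# Assumes the default port 11434 and jq for JSON parsing.
list_models() {
  curl -sf "${1:-http://localhost:11434}/api/tags" | jq -r '.models[].name'
}

# Usage:
# list_models    # prints one tag per line, e.g. gemma4:e4b-it-q4_K_M
```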
4. Benchmarking with curl (Calculating tokens/s)
The following example command calls Ollama's REST API and uses jq to parse the evaluation information from the response to calculate inference speed:
curl -s -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:e4b-it-q4_K_M",
"prompt": "Explain what a large language model is in one sentence",
"stream": false
}' | jq '.eval_count, .eval_duration' | \
awk 'NR==1{count=$1} NR==2{duration=$1/1e9} END{printf "tokens/s: %.2f\n", count/duration}'
- eval_count: number of tokens generated during inference
- eval_duration: inference time (in nanoseconds)
- tokens/s: calculated as count / (duration / 1e9)
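The arithmetic can be verified without a live server. The values below are hypothetical, standing in for the eval_count and eval_duration fields of a real /api/generate response:

```shell
#!/bin/sh
# Worked example of the tokens/s calculation, using hypothetical values
# in place of a live API response.
eval_count=128            # .eval_count: tokens generated
eval_duration=4000000000  # .eval_duration: nanoseconds (4 s here)
# tokens/s = eval_count / (eval_duration / 1e9)
awk -v c="$eval_count" -v d="$eval_duration" \
  'BEGIN{printf "tokens/s: %.2f\n", c/(d/1e9)}'
# prints: tokens/s: 32.00
```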
5. Gemma 4 E4B-it Q4_K_M Performance Example
Testing the Gemma 4 E4B-it Q4_K_M model in the above environment (context length 4096):
- tokens/s depends on your actual hardware (Gemma 4 E4B activates only about 4.5B effective parameters during inference, so it is typically faster than a comparable 8B model at the same Q4_K_M quantization)
Screenshot example:

To experience Gemma 4's multimodal capabilities (image / video / audio input), select a Gemma 4 tag in Ollama that is labeled as supporting vision/multimodal, and pass images via the images field (Base64-encoded) of the /api/chat endpoint. Please refer to the latest Ollama documentation for specific parameters.
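A sketch of building such a request follows. The image file and placeholder bytes are illustrative only; in real use you would Base64-encode an actual image (e.g. `base64 -w0 photo.jpg` on Linux) and pick a tag that actually supports vision:

```shell
#!/bin/sh
# Build an /api/chat payload carrying a Base64-encoded image.
# The encoded bytes here are a stand-in for a real image file.
IMG_B64=$(printf 'fake-image-bytes' | base64)   # real use: base64 -w0 photo.jpg

payload=$(cat <<EOF
{
  "model": "gemma4:e4b-it-q4_K_M",
  "messages": [
    {"role": "user",
     "content": "Describe this image",
     "images": ["$IMG_B64"]}
  ],
  "stream": false
}
EOF
)
printf '%s\n' "$payload"
# Send with: curl -s http://localhost:11434/api/chat -d "$payload"
```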