Gemma 4 Model Introduction
Gemma 4 is Google DeepMind's next-generation multimodal open-source model family, built on the same research and technology as Gemini 3 and released under the Apache 2.0 license. It accepts text, image, video, and audio inputs and generates text outputs. Gemma 4 is a comprehensive upgrade over previous generations: stronger reasoning, more flexible multimodal support, and a more efficient architecture, offered in sizes that span edge devices to servers and designed for advanced reasoning and agentic workflows.
Since the first Gemma release, developers have downloaded the models over 400 million times and built more than 100,000 Gemma variants, forming the sprawling Gemmaverse ecosystem. Gemma 4 is Google's latest response to that community's needs: frontier capability achieved with parameter efficiency.
References: Hugging Face Blog - Welcome Gemma 4 · Google Blog - Gemma 4
1. Model Version Overview
Gemma 4 comes in four sizes, each with both base and IT (instruction-tuned) versions:
| Model | Parameters | Context Window (tokens) | Model Links |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective parameters, 5.1B total with embeddings | 128K | base / IT |
| Gemma 4 E4B | 4.5B effective parameters, 8B total with embeddings | 128K | base / IT |
| Gemma 4 31B | 31B dense model | 256K | base / IT |
| Gemma 4 26B A4B | MoE architecture, 4B active / 26B total parameters | 256K | base / IT |
Among these, E2B and E4B are small variants suitable for edge deployment, supporting image, video, text, and audio inputs; 31B and 26B A4B are large variants supporting image, video, and text inputs.
LMArena Rankings (as of April 2026): The 31B dense model ranks #3 among open-source models on the LMArena text leaderboard, and the 26B MoE ranks #6, beating models with 20x more parameters. This degree of intelligence per parameter means developers can reach frontier-level capability on significantly less hardware.
Large Models (26B / 31B)
Designed for researchers and developers. The unquantized bfloat16 weights run efficiently on a single 80 GB NVIDIA H100 GPU, and quantized builds run locally on consumer-grade GPUs, powering IDEs, coding assistants, and agentic workflows. The 26B MoE activates only 3.8B parameters per token for very low latency; the 31B dense model targets maximum raw quality and is an ideal base for fine-tuning.
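To sanity-check those hardware claims, a quick weights-only estimate (ignoring KV cache, activations, and runtime overhead) is enough:

```python
# Weight memory = parameter count * bytes per parameter.
# Ignores KV cache, activations, and framework overhead.
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for name, params in [("Gemma 4 31B", 31.0), ("Gemma 4 26B A4B", 26.0)]:
    for precision, nbytes in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_gib(params, nbytes):.0f} GiB")

# 31B @ bf16 is roughly 58 GiB, which is why a single 80 GB H100 suffices;
# at int4 (~14 GiB) the weights fit comfortably on consumer GPUs.
```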
Edge Models (E2B / E4B)
Designed from the ground up for mobile and IoT devices, the edge models activate only 2B / 4B effective parameters during inference to save memory and battery life. Google has worked closely with the Pixel team and mobile chipmakers such as Qualcomm and MediaTek, enabling these multimodal models to run fully offline with very low latency on phones, Raspberry Pi, NVIDIA Jetson Orin Nano, and other edge devices. Android developers can prototype via the AICore Developer Preview, with forward compatibility to the upcoming Gemini Nano 4.
2. Architecture Features
Gemma 4 incorporates multiple proven architectural innovations, achieving an excellent balance between compatibility, efficiency, and long-context support:
2.1 Alternating Attention Mechanism
Gemma 4 uses an alternating layer structure of local sliding window attention and global full-context attention. Smaller dense models use a 512-token sliding window, while larger models use 1024 tokens. This design controls computation while preserving long-range dependency modeling.
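A minimal sketch of how such an alternating mask schedule can be written. The 1024-token window matches the larger models described above, but the 5:1 local-to-global layer ratio is an assumption; the ratio is not specified here:

```python
import torch

def causal_mask(seq_len: int, window: int | None = None) -> torch.Tensor:
    """Boolean mask (True = may attend): full causal if window is None,
    otherwise causal attention restricted to the last `window` tokens."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, L)
    mask = k <= q                           # causal constraint
    if window is not None:
        mask &= (q - k) < window            # sliding-window constraint
    return mask

LOCAL_PER_GLOBAL = 5  # assumed ratio; not published in the notes above

def layer_mask(layer_idx: int, seq_len: int, window: int = 1024) -> torch.Tensor:
    is_global = (layer_idx + 1) % (LOCAL_PER_GLOBAL + 1) == 0
    return causal_mask(seq_len, None if is_global else window)
```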
2.2 Dual RoPE Configuration
The model uses two RoPE (Rotary Position Embedding) configurations: standard RoPE for sliding window layers and scaled RoPE for global layers, to better support extra-long contexts.
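For illustration, here is generic RoPE with optional linear position scaling, one common way to stretch a position encoding over longer contexts. The base frequency and scale values below are placeholders; the section above does not publish Gemma 4's actual numbers:

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               base: float = 10_000.0, scale: float = 1.0) -> torch.Tensor:
    """x: (..., seq, head_dim). scale > 1 compresses positions so the same
    rotation budget covers a longer context (linear RoPE scaling)."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = (positions.float() / scale)[..., None] * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

pos = torch.arange(16)
q = torch.randn(16, 64)
q_local = apply_rope(q, pos)              # standard RoPE (sliding-window layers)
q_global = apply_rope(q, pos, scale=8.0)  # scaled RoPE (global layers); 8.0 is illustrative
```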
2.3 Per-Layer Embeddings (PLE)
PLE is one of the most distinctive innovations in Gemma 4's small models (first introduced in Gemma-3n). In traditional Transformers, each token gets only one embedding vector at input, and all subsequent layers build on it. PLE adds a parallel, low-dimensional conditioning path alongside the main residual stream — generating a dedicated small vector for each token at each layer, combining token identity information and contextual information. This allows each layer to receive token-specific information without needing to pack everything into the initial embedding. Since PLE dimensions are much smaller than the main hidden dimension, the parameter cost is low, but it brings significant per-layer specialization effects.
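A minimal sketch of the token-identity half of such a path. The dimensions are illustrative, and how Gemma 4 mixes in contextual information is not detailed above:

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-token embedding table owned by a single layer; its
    output is projected up and added to the residual stream."""
    def __init__(self, vocab_size: int, ple_dim: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, ple_dim)        # ple_dim << hidden_dim
        self.up = nn.Linear(ple_dim, hidden_dim, bias=False)  # cheap projection

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim) residual stream entering this layer
        return hidden + self.up(self.table(token_ids))

# Each transformer layer holds its own PerLayerEmbedding, so every layer
# receives fresh token-specific signal without inflating the input embedding.
```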
2.4 Shared KV Cache
The last several layers of the model no longer independently compute their own Key and Value projections, but instead reuse the KV tensors from the last non-shared layer of the same attention type (sliding or global). This optimization significantly reduces memory usage and computation during long-text inference with minimal quality impact, making it ideal for edge deployment.
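In sketch form, the KV dispatch during decoding might look like this; the layer indices and cache layout are purely illustrative:

```python
# Layers listed in `share_from` reuse the K/V of an earlier donor layer of the
# same attention type instead of computing (and caching) their own projections.
kv_cache: dict[int, tuple] = {}   # layer index -> (K, V) tensors
share_from = {10: 8, 11: 9}       # illustrative: layer 10 reuses layer 8's KV

def get_kv(layer_idx: int, hidden, k_proj, v_proj):
    donor = share_from.get(layer_idx)
    if donor is not None:
        return kv_cache[donor]    # no new projection, no extra cache entry
    kv = (k_proj(hidden), v_proj(hidden))
    kv_cache[layer_idx] = kv
    return kv
```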
2.5 Vision Encoder
The vision encoder uses learned 2D positional encoding and multi-dimensional RoPE, supports variable aspect ratio inputs, and can be configured with different image token counts (70, 140, 280, 560, 1120), flexibly trading off between speed, memory, and quality.
2.6 Audio Encoder
The small variants (E2B, E4B) include a built-in USM-style Conformer audio encoder, using the same base architecture as Gemma-3n, supporting speech understanding and transcription tasks.
3. Core Capabilities
Gemma 4 goes beyond simple chat scenarios, demonstrating strong capabilities across multiple dimensions:
3.1 Advanced Reasoning
Capable of multi-step planning and deep logical reasoning, with significant improvements on math and instruction-following benchmarks.
3.2 Agentic Workflows
Natively supports Function Calling, structured JSON output, and system instructions, enabling the construction of autonomous agents that interact with external tools and APIs to reliably execute complex workflows.
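As an example, this is the standard transformers tool-use flow that such support plugs into. The checkpoint id is hypothetical, and it assumes the released chat template accepts a tools list, as transformers does for tool-capable models:

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...

tok = AutoTokenizer.from_pretrained("google/gemma-4-e4b-it")  # hypothetical repo id
messages = [{"role": "user", "content": "What's the weather in Lagos?"}]

# Passing Python functions as `tools` lets the chat template serialize
# their signatures and docstrings into the prompt as tool schemas.
prompt = tok.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, tokenize=False
)
```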
3.3 Code Generation
Supports high-quality offline code generation, turning workstations into local-first AI coding assistants.
3.4 Vision and Audio
All models natively process video and images with variable resolution support, excelling at OCR, chart understanding, and other visual tasks. E2B and E4B additionally support native audio input for speech recognition and understanding.
3.5 Long Context
Edge models support 128K context windows, and large models up to 256K, allowing complete code repositories or long documents to be passed in a single prompt.
3.6 Multilingual Support
Natively trained on 140+ languages, helping developers build high-performance applications for global users.
3.7 Multimodal Task Demonstrations
In Hugging Face's practical testing, Gemma 4 demonstrated out-of-the-box capabilities in the following multimodal scenarios:
- Object Detection and Localization: The model natively returns detected bounding-box coordinates as JSON, without special instructions or constrained generation (see the parsing sketch after this list).
- GUI Element Detection: Identifies and localizes UI elements such as buttons and menu items, which suits GUI automation scenarios.
- Image Captioning: All model sizes excel at capturing details in complex scenes.
- Video Understanding: Although not explicitly post-trained on video, the model can understand video content with or without audio. Small variants can process both video frames and audio simultaneously.
- Audio QA and Transcription: E2B and E4B support description and transcription of speech content, with training data primarily focused on speech.
- Multimodal Reasoning (Thinking): Supports chain-of-thought reasoning mode on multimodal inputs.
- Multimodal Function Calling: Can recognize information from image content and call external tools, e.g., identifying a location in an image and then calling a weather query API.
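For the object-detection case above, a defensive way to consume the reply is to extract and parse the JSON array. The box_2d key and [y0, x0, y1, x1] layout below are an assumed schema, not a confirmed Gemma 4 output format:

```python
import json
import re

def extract_boxes(model_output: str) -> list[dict]:
    """Pull the first JSON array out of a model reply; returns [] on failure."""
    match = re.search(r"\[.*\]", model_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

reply = '[{"label": "cat", "box_2d": [112, 80, 540, 630]}]'
print(extract_boxes(reply))  # [{'label': 'cat', 'box_2d': [112, 80, 540, 630]}]
```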
4. Benchmark Results
Gemma 4 achieves excellent results across multiple dimensions including reasoning, coding, vision, audio, and long context. The 31B dense model achieves an LMArena score (text-only) of 1452, while the 26B MoE achieves 1441 with only 4B active parameters.
Below are the main benchmark results for the instruction-tuned versions:
| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| Reasoning & Knowledge | |||||
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| Coding | |||||
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| Vision | |||||
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| Long Context | |||||
| MRCR v2 128K (avg) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |
Key Highlights:
- 26B A4B (MoE) activates only 4B parameters, achieving performance close to the 31B dense model, with extremely high efficiency.
- Compared to the previous-generation Gemma 3 27B, Gemma 4 improves AIME 2026 math reasoning more than fourfold (20.8% → 89.2% for 31B) and lifts the Codeforces ELO from 110 to 1718 (26B A4B) and 2150 (31B).
- Vision and long context capabilities significantly surpass the previous generation.
5. Ecosystem and Inference Framework Support
Gemma 4 has broad toolchain and platform support, allowing developers to choose flexibly based on their needs:
5.1 Inference Frameworks (Day-One Support)
| Framework | Description |
|---|---|
| transformers | Official integration, supports AutoModelForMultimodalLM and an any-to-any pipeline (see the loading sketch after this table) |
| llama.cpp | Supports text+image inference, available via OpenAI-compatible API, compatible with LM Studio, Jan, and other local apps |
| vLLM | High-performance inference engine |
| SGLang | High-performance inference engine, transformers backend |
| MLX | Native Apple Silicon support with TurboQuant quantization optimization |
| Ollama | One-click local deployment |
| LM Studio | GUI-based local inference |
| transformers.js | Browser-based execution (WebGPU) |
| mistral.rs | Rust-native inference engine, supports all modalities and built-in tool calling |
| ONNX | Provides ONNX format checkpoints for edge device and browser deployment |
| LiteRT-LM | Google's edge inference runtime |
| NVIDIA NIM / NeMo | NVIDIA inference and training platform |
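Taking the transformers row as an example, loading and prompting might look like the following. The checkpoint id is hypothetical, and AutoModelForMultimodalLM is the class named in the table above:

```python
import torch
from transformers import AutoProcessor, AutoModelForMultimodalLM

model_id = "google/gemma-4-e4b-it"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0])
```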
5.2 Hardware Platforms
Gemma 4 is optimized for multiple hardware platforms:
- NVIDIA GPU: Full coverage from Jetson Orin Nano to Blackwell series
- AMD GPU: Integration through the open-source ROCm software stack
- Google TPU: Supports Trillium and Ironwood TPU for large-scale deployment
- Apple Silicon: Native support via MLX
- Mobile Chips: Optimized in collaboration with Qualcomm and MediaTek
5.3 Development Platforms and Quick Start
| Platform | Purpose |
|---|---|
| Google AI Studio | Try 31B and 26B MoE online |
| Google AI Edge Gallery | Try edge models E4B and E2B |
| Android Studio Agent Mode | Android app development integration |
| Hugging Face | Model downloads and community |
| Kaggle | Model downloads and Gemma 4 Good Challenge |
| Google Colab | Free fine-tuning and experimentation |
| Vertex AI | Enterprise deployment (Cloud Run / GKE / TPU acceleration) |
6. Fine-tuning Support
Gemma 4's highly optimized architecture enables efficient fine-tuning across various hardware — from billions of Android devices to laptop GPUs, developer workstations, and accelerators. The community already has successful fine-tuning examples: INSAIT created BgGPT, a Bulgarian-first language model based on Gemma, and Yale University leveraged Cell2Sentence-Scale to discover new cancer treatment pathways.
Gemma 4 supports fine-tuning with various tools:
- TRL (Transformer Reinforcement Learning): Supports multimodal tool-response training and can consume image feedback from environments (e.g., teaching Gemma 4 to drive in the CARLA simulator); see the SFT sketch after this list.
- TRL on Vertex AI: Provides complete SFT fine-tuning examples on Google Cloud Vertex AI, including freezing the vision and audio towers while extending function-calling capability.
- Unsloth Studio: Supports fine-tuning via GUI locally or on Colab.
- Keras: Google's high-level deep learning API support.
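A minimal text-only TRL SFT sketch to show the starting point; the checkpoint id is hypothetical, the dataset is a public TRL demo set, and a real multimodal or tool-response run would add PEFT adapters and a custom collator:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Public demo dataset used in TRL's own examples.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="google/gemma-4-e2b-it",            # hypothetical repo id
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma4-sft"),  # defaults; tune for real runs
)
trainer.train()
```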
7. License and Safety
Gemma 4 is released under the Apache 2.0 open-source license, granting developers complete flexibility and digital sovereignty — full control over data, infrastructure, and models, with the freedom to build and deploy in any environment, locally or in the cloud.
In terms of safety, Gemma 4 undergoes the same rigorous infrastructure security protocols as Google's proprietary models, providing a trustworthy and transparent foundation for enterprises and sovereign organizations.
8. Summary
Gemma 4 is among the best open-source multimodal models available today. Its core advantages include:
- Truly Open Source: Apache 2.0 license, free for commercial use, with full control.
- Extreme Parameter Efficiency: Beats models with 20x more parameters, ranking #3 (31B) and #6 (26B A4B) among open-source models on the LMArena text leaderboard.
- Full Modality Support: Unified processing of text, image, video, and audio, covering 140+ languages.
- Full Size Coverage: From 2.3B edge models (mobile/IoT) to 31B server-grade models, fitting all scenarios.
- Efficient Architecture: PLE, shared KV cache, MoE, and other designs maintain high performance while significantly reducing resource requirements.
- Agent-Ready: Native function calling, structured output, and system instructions, suitable for building complex agentic workflows.
- Complete Ecosystem: Day-one support from mainstream inference frameworks, multi-hardware platform optimization (NVIDIA / AMD ROCm / Google TPU / Apple Silicon), with quantization and fine-tuning tools readily available.
For more information, see the reference links above (Hugging Face Blog - Welcome Gemma 4 and Google Blog - Gemma 4).