Gemma 4 Model Introduction
Gemma 4 is Google DeepMind's next-generation multimodal open-source model family, built on the same research and technology as Gemini 3 and released under the Apache 2.0 license. It accepts text, image, video, and audio inputs and generates text outputs. Gemma 4 is a comprehensive upgrade over previous generations: stronger reasoning, more flexible multimodal support, and a more efficient architecture, offered in sizes that span edge devices to servers and designed for advanced reasoning and agentic workflows.
Since the first Gemma release, developers have downloaded the models over 400 million times and built more than 100,000 Gemma variants, forming the sprawling Gemmaverse ecosystem. Gemma 4 is Google's latest response to that community's needs: frontier capability achieved with parameter efficiency.
References: Hugging Face Blog - Welcome Gemma 4 · Google Blog - Gemma 4
1. Model Version Overview
Gemma 4 comes in four sizes, each with both base and IT (instruction-tuned) versions:
| Model | Parameters | Context Window (tokens) | Model Links |
|---|---|---|---|
| Gemma 4 E2B | 2.3B effective parameters, 5.1B total with embeddings | 128K | base / IT |
| Gemma 4 E4B | 4.5B effective parameters, 8B total with embeddings | 128K | base / IT |
| Gemma 4 31B | 31B dense model | 256K | base / IT |
| Gemma 4 26B A4B | MoE architecture, 4B active / 26B total parameters | 256K | base / IT |
Among these, E2B and E4B are small variants suitable for edge deployment, supporting image, video, text, and audio inputs; 31B and 26B A4B are large variants supporting image, video, and text inputs.
LMArena Rankings (as of April 2026): The 31B dense model ranks #3 among open-source models on the LMArena text leaderboard, and the 26B MoE ranks #6, beating models with 20x more parameters. This degree of intelligence per parameter means developers can reach frontier-level capability on significantly less hardware.
Large Models (26B / 31B)
Designed for researchers and developers. The unquantized bfloat16 weights run efficiently on a single 80 GB NVIDIA H100 GPU, and quantized builds run locally on consumer-grade GPUs, powering IDEs, coding assistants, and agentic workflows. The 26B MoE activates only 3.8B parameters per token for very low latency; the 31B dense model targets maximum raw quality and is an ideal base for fine-tuning.
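To sanity-check those hardware claims, a quick weights-only estimate (ignoring KV cache, activations, and runtime overhead) is enough:

```python
# Weight memory = parameter count * bytes per parameter.
# Ignores KV cache, activations, and framework overhead.
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for name, params in [("Gemma 4 31B", 31.0), ("Gemma 4 26B A4B", 26.0)]:
    for precision, nbytes in [("bf16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        print(f"{name} @ {precision}: ~{weight_gib(params, nbytes):.0f} GiB")

# 31B @ bf16 is roughly 58 GiB, which is why a single 80 GB H100 suffices;
# at int4 (~14 GiB) the weights fit comfortably on consumer GPUs.
```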
Edge Models (E2B / E4B)
Designed from the ground up for mobile and IoT devices, the edge models activate only 2B / 4B effective parameters during inference to save memory and battery life. Google has worked closely with the Pixel team and mobile chipmakers such as Qualcomm and MediaTek, enabling these multimodal models to run fully offline with very low latency on phones, Raspberry Pi, NVIDIA Jetson Orin Nano, and other edge devices. Android developers can prototype via the AICore Developer Preview, with forward compatibility to the upcoming Gemini Nano 4.
2. Architecture Features
Gemma 4 incorporates multiple proven architectural innovations, achieving an excellent balance between compatibility, efficiency, and long-context support:
2.1 Alternating Attention Mechanism
Gemma 4 uses an alternating layer structure of local sliding window attention and global full-context attention. Smaller dense models use a 512-token sliding window, while larger models use 1024 tokens. This design controls computation while preserving long-range dependency modeling.
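A minimal sketch of how such an alternating mask schedule can be written. The 1024-token window matches the larger models described above, but the 5:1 local-to-global layer ratio is an assumption; the ratio is not specified here:

```python
import torch

def causal_mask(seq_len: int, window: int | None = None) -> torch.Tensor:
    """Boolean mask (True = may attend): full causal if window is None,
    otherwise causal attention restricted to the last `window` tokens."""
    q = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    k = torch.arange(seq_len).unsqueeze(0)  # key positions,   shape (1, L)
    mask = k <= q                           # causal constraint
    if window is not None:
        mask &= (q - k) < window            # sliding-window constraint
    return mask

LOCAL_PER_GLOBAL = 5  # assumed ratio; not published in the notes above

def layer_mask(layer_idx: int, seq_len: int, window: int = 1024) -> torch.Tensor:
    is_global = (layer_idx + 1) % (LOCAL_PER_GLOBAL + 1) == 0
    return causal_mask(seq_len, None if is_global else window)
```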
2.2 Dual RoPE Configuration
The model uses two RoPE (Rotary Position Embedding) configurations: standard RoPE for sliding window layers and scaled RoPE for global layers, to better support extra-long contexts.
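For illustration, here is generic RoPE with optional linear position scaling, one common way to stretch a position encoding over longer contexts. The base frequency and scale values below are placeholders; the section above does not publish Gemma 4's actual numbers:

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               base: float = 10_000.0, scale: float = 1.0) -> torch.Tensor:
    """x: (..., seq, head_dim). scale > 1 compresses positions so the same
    rotation budget covers a longer context (linear RoPE scaling)."""
    head_dim = x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    angles = (positions.float() / scale)[..., None] * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

pos = torch.arange(16)
q = torch.randn(16, 64)
q_local = apply_rope(q, pos)              # standard RoPE (sliding-window layers)
q_global = apply_rope(q, pos, scale=8.0)  # scaled RoPE (global layers); 8.0 is illustrative
```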
2.3 Per-Layer Embeddings (PLE)
PLE is one of the most distinctive innovations in Gemma 4's small models (first introduced in Gemma-3n). In traditional Transformers, each token gets only one embedding vector at input, and all subsequent layers build on it. PLE adds a parallel, low-dimensional conditioning path alongside the main residual stream — generating a dedicated small vector for each token at each layer, combining token identity information and contextual information. This allows each layer to receive token-specific information without needing to pack everything into the initial embedding. Since PLE dimensions are much smaller than the main hidden dimension, the parameter cost is low, but it brings significant per-layer specialization effects.
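A minimal sketch of the token-identity half of such a path. The dimensions are illustrative, and how Gemma 4 mixes in contextual information is not detailed above:

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-token embedding table owned by a single layer; its
    output is projected up and added to the residual stream."""
    def __init__(self, vocab_size: int, ple_dim: int, hidden_dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, ple_dim)        # ple_dim << hidden_dim
        self.up = nn.Linear(ple_dim, hidden_dim, bias=False)  # cheap projection

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim) residual stream entering this layer
        return hidden + self.up(self.table(token_ids))

# Each transformer layer holds its own PerLayerEmbedding, so every layer
# receives fresh token-specific signal without inflating the input embedding.
```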
2.4 Shared KV Cache
The last several layers of the model no longer independently compute their own Key and Value projections, but instead reuse the KV tensors from the last non-shared layer of the same attention type (sliding or global). This optimization significantly reduces memory usage and computation during long-text inference with minimal quality impact, making it ideal for edge deployment.
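In sketch form, the KV dispatch during decoding might look like this; the layer indices and cache layout are purely illustrative:

```python
# Layers listed in `share_from` reuse the K/V of an earlier donor layer of the
# same attention type instead of computing (and caching) their own projections.
kv_cache: dict[int, tuple] = {}   # layer index -> (K, V) tensors
share_from = {10: 8, 11: 9}       # illustrative: layer 10 reuses layer 8's KV

def get_kv(layer_idx: int, hidden, k_proj, v_proj):
    donor = share_from.get(layer_idx)
    if donor is not None:
        return kv_cache[donor]    # no new projection, no extra cache entry
    kv = (k_proj(hidden), v_proj(hidden))
    kv_cache[layer_idx] = kv
    return kv
```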
2.5 Vision Encoder
The vision encoder uses learned 2D positional encoding and multi-dimensional RoPE, supports variable aspect ratio inputs, and can be configured with different image token counts (70, 140, 280, 560, 1120), flexibly trading off between speed, memory, and quality.
2.6 Audio Encoder
The small variants (E2B, E4B) include a built-in USM-style Conformer audio encoder, using the same base architecture as Gemma-3n, supporting speech understanding and transcription tasks.
3. Core Capabilities
Gemma 4 goes beyond simple chat scenarios, demonstrating strong capabilities across multiple dimensions:
3.1 Advanced Reasoning
Capable of multi-step planning and deep logical reasoning, with significant improvements on math and instruction-following benchmarks.
3.2 Agentic Workflows
Natively supports Function Calling, structured JSON output, and system instructions, enabling the construction of autonomous agents that interact with external tools and APIs to reliably execute complex workflows.
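As an example, this is the standard transformers tool-use flow that such support plugs into. The checkpoint id is hypothetical, and it assumes the released chat template accepts a tools list, as transformers does for tool-capable models:

```python
from transformers import AutoTokenizer

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city.
    """
    ...

tok = AutoTokenizer.from_pretrained("google/gemma-4-e4b-it")  # hypothetical repo id
messages = [{"role": "user", "content": "What's the weather in Lagos?"}]

# Passing Python functions as `tools` lets the chat template serialize
# their signatures and docstrings into the prompt as tool schemas.
prompt = tok.apply_chat_template(
    messages, tools=[get_weather], add_generation_prompt=True, tokenize=False
)
```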
3.3 Code Generation
Supports high-quality offline code generation, turning workstations into local-first AI coding assistants.
3.4 Vision and Audio
All models natively process video and images with variable resolution support, excelling at OCR, chart understanding, and other visual tasks. E2B and E4B additionally support native audio input for speech recognition and understanding.
3.5 Long Context
Edge models support 128K context windows, and large models up to 256K, allowing complete code repositories or long documents to be passed in a single prompt.
3.6 Multilingual Support
Natively trained on 140+ languages, helping developers build high-performance applications for global users.
3.7 Multimodal Task Demonstrations
In Hugging Face's practical testing, Gemma 4 demonstrated out-of-the-box capabilities in the following multimodal scenarios:
- Object Detection and Localization: The model natively returns detected bounding-box coordinates as JSON, without special instructions or constrained generation (see the parsing sketch after this list).
- GUI Element Detection: Identifies and localizes UI elements such as buttons and menu items, which suits GUI automation scenarios.
- Image Captioning: All model sizes excel at capturing details in complex scenes.
- Video Understanding: Although not explicitly post-trained on video, the model can understand video content with or without audio. Small variants can process both video frames and audio simultaneously.
- Audio QA and Transcription: E2B and E4B support description and transcription of speech content, with training data primarily focused on speech.
- Multimodal Reasoning (Thinking): Supports chain-of-thought reasoning mode on multimodal inputs.
- Multimodal Function Calling: Can recognize information from image content and call external tools, e.g., identifying a location in an image and then calling a weather query API.
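For the object-detection case above, a defensive way to consume the reply is to extract and parse the JSON array. The box_2d key and [y0, x0, y1, x1] layout below are an assumed schema, not a confirmed Gemma 4 output format:

```python
import json
import re

def extract_boxes(model_output: str) -> list[dict]:
    """Pull the first JSON array out of a model reply; returns [] on failure."""
    match = re.search(r"\[.*\]", model_output, flags=re.DOTALL)
    if match is None:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

reply = '[{"label": "cat", "box_2d": [112, 80, 540, 630]}]'
print(extract_boxes(reply))  # [{'label': 'cat', 'box_2d': [112, 80, 540, 630]}]
```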
4. Benchmark Results
Gemma 4 achieves excellent results across multiple dimensions including reasoning, coding, vision, audio, and long context. The 31B dense model achieves an LMArena score (text-only) of 1452, while the 26B MoE achieves 1441 with only 4B active parameters.
Below are the main benchmark results for the instruction-tuned versions:
| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
|---|---|---|---|---|---|
| Reasoning & Knowledge | |||||
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| Coding | |||||
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| Vision | |||||
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| Long Context | |||||
| MRCR v2 128K (avg) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |
Key Highlights:
- 26B A4B (MoE) activates only 4B parameters, achieving performance close to the 31B dense model, with extremely high efficiency.
- Compared to the previous-generation Gemma 3 27B, Gemma 4 improves AIME 2026 math reasoning more than fourfold (20.8% → 89.2% for 31B) and lifts the Codeforces ELO from 110 to 1718 (26B A4B) and 2150 (31B).
- Vision and long context capabilities significantly surpass the previous generation.
5. Ecosystem and Inference Framework Support
Gemma 4 has broad toolchain and platform support, allowing developers to choose flexibly based on their needs:
5.1 Inference Frameworks (Day-One Support)
| Framework | Description |
|---|---|
| transformers | Official integration, supports AutoModelForMultimodalLM and an any-to-any pipeline (see the loading sketch after this table) |
| llama.cpp | Supports text+image inference, available via OpenAI-compatible API, compatible with LM Studio, Jan, and other local apps |
| vLLM | High-performance inference engine |
| SGLang | High-performance inference engine, transformers backend |
| MLX | Native Apple Silicon support with TurboQuant quantization optimization |
| Ollama | One-click local deployment |
| LM Studio | GUI-based local inference |
| transformers.js | Browser-based execution (WebGPU) |
| mistral.rs | Rust-native inference engine, supports all modalities and built-in tool calling |
| ONNX | Provides ONNX format checkpoints for edge device and browser deployment |
| LiteRT-LM | Google's edge inference runtime |
| NVIDIA NIM / NeMo | NVIDIA inference and training platform |
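Taking the transformers row as an example, loading and prompting might look like the following. The checkpoint id is hypothetical, and AutoModelForMultimodalLM is the class named in the table above:

```python
import torch
from transformers import AutoProcessor, AutoModelForMultimodalLM

model_id = "google/gemma-4-e4b-it"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},
        {"type": "text", "text": "Summarize this chart."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)[0])
```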
5.2 Hardware Platforms
Gemma 4 is optimized for multiple hardware platforms:
- NVIDIA GPU: Full coverage from Jetson Orin Nano to Blackwell series
- AMD GPU: Integration through the open-source ROCm software stack
- Google TPU: Supports Trillium and Ironwood TPU for large-scale deployment
- Apple Silicon: Native support via MLX
- Mobile Chips: Optimized in collaboration with Qualcomm and MediaTek
5.3 Development Platforms and Quick Start
| Platform | Purpose |
|---|---|
| Google AI Studio | Try 31B and 26B MoE online |
| Google AI Edge Gallery | Try edge models E4B and E2B |
| Android Studio Agent Mode | Android app development integration |
| Hugging Face | Model downloads and community |
| Kaggle | Model downloads and Gemma 4 Good Challenge |
| Google Colab | Free fine-tuning and experimentation |
| Vertex AI | Enterprise deployment (Cloud Run / GKE / TPU acceleration) |
6. Fine-tuning Support
Gemma 4's highly optimized architecture enables efficient fine-tuning across various hardware — from billions of Android devices to laptop GPUs, developer workstations, and accelerators. The community already has successful fine-tuning examples: INSAIT created BgGPT, a Bulgarian-first language model based on Gemma, and Yale University leveraged Cell2Sentence-Scale to discover new cancer treatment pathways.
Gemma 4 supports fine-tuning with various tools:
- TRL (Transformer Reinforcement Learning): Supports multimodal tool-response training and can consume image feedback from environments (e.g., teaching Gemma 4 to drive in the CARLA simulator); see the SFT sketch after this list.
- TRL on Vertex AI: Provides complete SFT fine-tuning examples on Google Cloud Vertex AI, including freezing the vision and audio towers while extending function-calling capability.
- Unsloth Studio: Supports fine-tuning via GUI locally or on Colab.
- Keras: Google's high-level deep learning API support.
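A minimal text-only TRL SFT sketch to show the starting point; the checkpoint id is hypothetical, the dataset is a public TRL demo set, and a real multimodal or tool-response run would add PEFT adapters and a custom collator:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Public demo dataset used in TRL's own examples.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="google/gemma-4-e2b-it",            # hypothetical repo id
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma4-sft"),  # defaults; tune for real runs
)
trainer.train()
```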
7. License and Safety
Gemma 4 is released under the Apache 2.0 open-source license, granting developers complete flexibility and digital sovereignty — full control over data, infrastructure, and models, with the freedom to build and deploy in any environment, locally or in the cloud.
In terms of safety, Gemma 4 undergoes the same rigorous infrastructure security protocols as Google's proprietary models, providing a trustworthy and transparent foundation for enterprises and sovereign organizations.
8. Summary
Gemma 4 is among the best open-source multimodal models available today. Its core advantages include:
- Truly Open Source: Apache 2.0 license, free for commercial use, with full control.
- Extreme Parameter Efficiency: Beats models with 20x more parameters, ranking #3 (31B) and #6 (26B A4B) among open-source models on the LMArena text leaderboard.
- Full Modality Support: Unified processing of text, image, video, and audio, covering 140+ languages.
- Full Size Coverage: From 2.3B edge models (mobile/IoT) to 31B server-grade models, fitting all scenarios.
- Efficient Architecture: PLE, shared KV cache, MoE, and other designs maintain high performance while significantly reducing resource requirements.
- Agent-Ready: Native function calling, structured output, and system instructions, suitable for building complex agentic workflows.
- Complete Ecosystem: Day-one support from mainstream inference frameworks, multi-hardware platform optimization (NVIDIA / AMD ROCm / Google TPU / Apple Silicon), with quantization and fine-tuning tools readily available.
For more information, see the reference links above (Hugging Face Blog - Welcome Gemma 4 and Google Blog - Gemma 4).