
Gemma 4 Model Introduction

Gemma 4 is Google DeepMind's next-generation multimodal open-source model family, built on the same research and technology as Gemini 3 and released under the Apache 2.0 license. It accepts text, image, video, and audio inputs and generates text outputs. Gemma 4 is a comprehensive upgrade over previous generations: stronger reasoning, more flexible multimodal support, and a more efficient architecture. It ships in multiple sizes spanning edge devices to servers, designed specifically for advanced reasoning and agentic workflows.

Since the first Gemma release, developers have collectively downloaded the models over 400 million times, and the community has built over 100,000 Gemma variants, forming a massive Gemmaverse ecosystem. Gemma 4 is Google's latest response to community needs — achieving frontier capabilities with parameter efficiency.

References: Hugging Face Blog - Welcome Gemma 4 · Google Blog - Gemma 4


1. Model Version Overview

Gemma 4 comes in four sizes, each with both base and IT (instruction-tuned) versions:

| Model | Parameter Scale | Context Window | Model Links |
| --- | --- | --- | --- |
| Gemma 4 E2B | 2.3B effective parameters, 5.1B total with embeddings | 128K | base / IT |
| Gemma 4 E4B | 4.5B effective parameters, 8B total with embeddings | 128K | base / IT |
| Gemma 4 31B | 31B dense model | 256K | base / IT |
| Gemma 4 26B A4B | MoE architecture, 4B active / 26B total parameters | 256K | base / IT |

Among these, E2B and E4B are small variants suitable for edge deployment, supporting image, video, text, and audio inputs; 31B and 26B A4B are large variants supporting image, video, and text inputs.

LMArena Rankings (as of April 2026): The 31B dense model ranks #3 among open models on the LMArena text leaderboard, and the 26B MoE ranks #6, beating some models with 20x more parameters. This level of intelligence per parameter means developers can reach frontier-level capability with significantly less hardware.

Large Models (26B / 31B)

Designed for researchers and developers, the unquantized bfloat16 weights run efficiently on a single 80GB NVIDIA H100 GPU; quantized versions can run locally on consumer-grade GPUs, powering IDEs, coding assistants, and agentic workflows. The 26B MoE activates only 3.8B parameters during inference with extremely low latency; the 31B Dense pursues maximum raw quality and is an ideal base for fine-tuning.
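As a back-of-the-envelope check on the single-H100 claim: bfloat16 stores two bytes per parameter, so the 31B weights alone come to roughly 58 GiB, leaving headroom on an 80 GB card for activations and KV cache. A minimal sketch:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GiB (bfloat16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_memory_gib(31), 1))   # ~57.7 GiB for the 31B dense model
print(round(weight_memory_gib(8), 1))    # E4B's 8B total parameters need far less
```

The same arithmetic explains why quantized variants (4 bits per parameter or less) fit on consumer GPUs.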

Edge Models (E2B / E4B)

Designed from the ground up for mobile and IoT devices, activating only 2B / 4B effective parameters during inference to save memory and battery life. Google has deeply collaborated with the Pixel team and mobile hardware manufacturers like Qualcomm and MediaTek, enabling these multimodal models to run fully offline with near-zero latency on phones, Raspberry Pi, NVIDIA Jetson Orin Nano, and other edge devices. Android developers can prototype via the AICore Developer Preview with forward compatibility to the upcoming Gemini Nano 4.


2. Architecture Features

Gemma 4 incorporates multiple proven architectural innovations, achieving an excellent balance between compatibility, efficiency, and long-context support:

2.1 Alternating Attention Mechanism

Gemma 4 uses an alternating layer structure of local sliding window attention and global full-context attention. Smaller dense models use a 512-token sliding window, while larger models use 1024 tokens. This design controls computation while preserving long-range dependency modeling.
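The effect of the alternating structure can be sketched as follows. The exact local-to-global ratio is not stated here, so the 5:1 interleaving below is a hypothetical choice (it matches earlier Gemma generations), as is the causal-mask helper:

```python
def layer_kind(layer_idx: int, locals_per_global: int = 5) -> str:
    """Hypothetical 5:1 interleaving: every sixth layer attends globally."""
    return "global" if (layer_idx + 1) % (locals_per_global + 1) == 0 else "local"

def visible_span(query_pos: int, kind: str, window: int = 512) -> range:
    """Key positions a query token may attend to (causal attention)."""
    if kind == "global":
        return range(0, query_pos + 1)                               # full prefix
    return range(max(0, query_pos - window + 1), query_pos + 1)      # sliding window

print([layer_kind(i) for i in range(12)])
print(len(visible_span(10_000, "local")))  # capped at the 512-token window
```

Local layers keep per-token attention cost constant regardless of sequence length; the occasional global layers preserve long-range dependencies.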

2.2 Dual RoPE Configuration

The model uses two RoPE (Rotary Position Embedding) configurations: standard RoPE for sliding window layers and scaled RoPE for global layers, to better support extra-long contexts.
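The usual mechanism behind "scaled RoPE" is positional interpolation: dividing positions by a scale factor stretches the effective wavelengths so the global layers can address much longer sequences. A minimal sketch, where the base and scale values are illustrative rather than Gemma 4's actual configuration:

```python
def rope_angles(pos: int, head_dim: int, base: float = 10_000.0,
                scale: float = 1.0) -> list:
    """Rotation angles for one position.

    scale > 1 compresses positions before the rotation, the standard trick
    behind scaled-RoPE global layers. Values here are illustrative only.
    """
    return [(pos / scale) / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# A layer scaled by 8 sees position 8 exactly as an unscaled layer sees position 1:
assert rope_angles(8, 64, scale=8.0) == rope_angles(1, 64)
```

Sliding-window layers can keep the standard, unscaled configuration because they never attend beyond a short local span.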

2.3 Per-Layer Embeddings (PLE)

PLE is one of the most distinctive innovations in Gemma 4's small models (first introduced in Gemma-3n). In traditional Transformers, each token gets only one embedding vector at input, and all subsequent layers build on it. PLE adds a parallel, low-dimensional conditioning path alongside the main residual stream — generating a dedicated small vector for each token at each layer, combining token identity information and contextual information. This allows each layer to receive token-specific information without needing to pack everything into the initial embedding. Since PLE dimensions are much smaller than the main hidden dimension, the parameter cost is low, but it brings significant per-layer specialization effects.
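The mechanism can be sketched numerically. All dimensions and weights below are toy values chosen for illustration; the point is that the per-layer table is low-dimensional, so the extra parameters are cheap, while the residual stream still receives a fresh token-specific signal at every layer:

```python
import random

random.seed(0)
VOCAB, HIDDEN, PLE_DIM, N_LAYERS = 50, 16, 4, 3

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

# One tiny embedding table plus an up-projection per layer (PLE_DIM << HIDDEN).
ple_tables = [rand_matrix(VOCAB, PLE_DIM) for _ in range(N_LAYERS)]
up_projs = [rand_matrix(PLE_DIM, HIDDEN) for _ in range(N_LAYERS)]

def apply_ple(hidden_state, token_id, layer):
    """Add the layer-specific, token-specific conditioning vector to the residual stream."""
    e = ple_tables[layer][token_id]                      # low-dim per-layer embedding
    delta = [sum(e[k] * up_projs[layer][k][j] for k in range(PLE_DIM))
             for j in range(HIDDEN)]
    return [h + d for h, d in zip(hidden_state, delta)]

out = apply_ple([0.0] * HIDDEN, token_id=7, layer=1)
print(len(out))  # residual width unchanged; content now carries layer-specific token info
```

An added practical benefit: the PLE tables can be streamed from slower storage on edge devices, since only one token's row per layer is needed at a time.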

2.4 Shared KV Cache

The last several layers of the model no longer independently compute their own Key and Value projections, but instead reuse the KV tensors from the last non-shared layer of the same attention type (sliding or global). This optimization significantly reduces memory usage and computation during long-text inference with minimal quality impact, making it ideal for edge deployment.
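The routing rule can be made concrete with a small sketch. The layer pattern and the number of sharing layers below are made up for illustration; only the lookup logic mirrors the description above:

```python
def kv_source(layer_idx: int, kinds: list, n_shared: int) -> int:
    """Index of the layer whose KV cache this layer reads.

    The final n_shared layers reuse the KV tensors of the last non-shared
    layer with the same attention kind ('local' or 'global'); all earlier
    layers compute and cache their own.
    """
    n = len(kinds)
    if layer_idx < n - n_shared:
        return layer_idx                      # computes its own K/V
    for j in range(n - n_shared - 1, -1, -1):
        if kinds[j] == kinds[layer_idx]:
            return j                          # reuse the matching-kind cache
    return layer_idx

kinds = ["local", "local", "global", "local", "local", "global"]
print([kv_source(i, kinds, n_shared=2) for i in range(6)])  # -> [0, 1, 2, 3, 3, 2]
```

Because the shared layers store no KV of their own, cache memory shrinks roughly in proportion to the number of sharing layers.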

2.5 Vision Encoder

The vision encoder uses learned 2D positional encoding and multi-dimensional RoPE, supports variable aspect ratio inputs, and can be configured with different image token counts (70, 140, 280, 560, 1120), flexibly trading off between speed, memory, and quality.
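The token-count setting trades quality against how much of the context window images consume. A quick budgeting helper (the option list comes from the text above; the budgeting itself is just arithmetic):

```python
IMAGE_TOKEN_OPTIONS = (70, 140, 280, 560, 1120)

def context_share(n_images: int, tokens_per_image: int, window: int = 256_000) -> float:
    """Fraction of the context window consumed by image tokens alone."""
    if tokens_per_image not in IMAGE_TOKEN_OPTIONS:
        raise ValueError(f"choose one of {IMAGE_TOKEN_OPTIONS}")
    return n_images * tokens_per_image / window

# 100 images at the highest-quality setting still use under half of a 256K window:
print(f"{context_share(100, 1120):.2%}")
# At the cheapest setting the same images cost 16x less:
print(f"{context_share(100, 70):.2%}")
```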

2.6 Audio Encoder

The small variants (E2B, E4B) include a built-in USM-style Conformer audio encoder, using the same base architecture as Gemma-3n, supporting speech understanding and transcription tasks.


3. Core Capabilities

Gemma 4 goes beyond simple chat scenarios, demonstrating strong capabilities across multiple dimensions:

3.1 Advanced Reasoning

Capable of multi-step planning and deep logical reasoning, with significant improvements on math and instruction-following benchmarks.

3.2 Agentic Workflows

Natively supports Function Calling, structured JSON output, and system instructions, enabling the construction of autonomous agents that interact with external tools and APIs to reliably execute complex workflows.
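On the application side, the agent loop reduces to parsing the model's tool call and dispatching it to a Python function. The exact wire format depends on the chat template and framework; the `{"name": ..., "arguments": {...}}` shape and the `get_weather` tool below are assumptions for illustration:

```python
import json

def get_weather(city: str) -> str:
    # Stub tool; a real agent would call a weather API here.
    return f"22°C and clear in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted tool call and run the matching function.

    Assumes the model emits {"name": ..., "arguments": {...}}; verify the
    schema your serving framework actually produces.
    """
    call = json.loads(model_output)
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}'))
```

The tool's return value would then be appended to the conversation so the model can compose its final answer.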

3.3 Code Generation

Supports high-quality offline code generation, turning workstations into local-first AI coding assistants.

3.4 Vision and Audio

All models natively process video and images with variable resolution support, excelling at OCR, chart understanding, and other visual tasks. E2B and E4B additionally support native audio input for speech recognition and understanding.

3.5 Long Context

Edge models support 128K context windows, and large models up to 256K, allowing complete code repositories or long documents to be passed in a single prompt.

3.6 Multilingual Support

Natively trained on 140+ languages, helping developers build high-performance applications for global users.

3.7 Multimodal Task Demonstrations

In Hugging Face's practical testing, Gemma 4 demonstrated out-of-the-box capabilities in the following multimodal scenarios:

  • Object Detection and Localization: The model can natively return detected bounding box coordinates in JSON format without specific instructions or constrained generation.
  • GUI Element Detection: Can identify and locate UI elements (such as buttons, menu items, etc.), suitable for GUI automation scenarios.
  • Image Captioning: All model sizes excel at capturing details in complex scenes.
  • Video Understanding: Although not explicitly post-trained on video, the model can understand video content with or without audio. Small variants can process both video frames and audio simultaneously.
  • Audio QA and Transcription: E2B and E4B support description and transcription of speech content, with training data primarily focused on speech.
  • Multimodal Reasoning (Thinking): Supports chain-of-thought reasoning mode on multimodal inputs.
  • Multimodal Function Calling: Can recognize information from image content and call external tools, e.g., identifying a location in an image and then calling a weather query API.
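For the detection and GUI cases, the model's JSON output still needs to be mapped back to pixel coordinates. The sketch below assumes each detection looks like `{"label": ..., "box_2d": [ymin, xmin, ymax, xmax]}` normalized to 0..1000, the convention earlier Gemma releases used; verify it against your model's actual output:

```python
import json

def boxes_to_pixels(detections: str, width: int, height: int, scale: int = 1000):
    """Convert model-emitted normalized boxes to pixel-space (x0, y0, x1, y1)."""
    out = []
    for det in json.loads(detections):
        y0, x0, y1, x1 = det["box_2d"]        # assumed [ymin, xmin, ymax, xmax] order
        out.append({"label": det["label"],
                    "xyxy": (x0 * width // scale, y0 * height // scale,
                             x1 * width // scale, y1 * height // scale)})
    return out

raw = '[{"label": "button", "box_2d": [100, 200, 300, 400]}]'
print(boxes_to_pixels(raw, width=1920, height=1080))
```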

4. Benchmark Results

Gemma 4 achieves excellent results across multiple dimensions including reasoning, coding, vision, audio, and long context. The 31B dense model achieves an LMArena score (text-only) of 1452, while the 26B MoE achieves 1441 with only 4B active parameters.

Below are the main benchmark results for the instruction-tuned versions:

| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B |
| --- | --- | --- | --- | --- | --- |
| Reasoning & Knowledge | | | | | |
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| Coding | | | | | |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| Vision | | | | | |
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| Long Context | | | | | |
| MRCR v2 128K (avg) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |

Key Highlights:

  • 26B A4B (MoE) activates only 4B parameters, achieving performance close to the 31B dense model, with extremely high efficiency.
  • Compared to the previous generation Gemma 3 27B, Gemma 4 improves AIME 2026 math reasoning by over 4x (20.8% → 89.2%), and Codeforces coding score by over 15x (110 → 1718+).
  • Vision and long context capabilities significantly surpass the previous generation.

5. Ecosystem and Inference Framework Support

Gemma 4 has broad toolchain and platform support, allowing developers to choose flexibly based on their needs:

5.1 Inference Frameworks (Day-One Support)

| Framework | Description |
| --- | --- |
| transformers | Official integration; supports AutoModelForMultimodalLM and the any-to-any pipeline |
| llama.cpp | Text + image inference via an OpenAI-compatible API; compatible with LM Studio, Jan, and other local apps |
| vLLM | High-performance inference engine |
| SGLang | High-performance inference engine (transformers backend) |
| MLX | Native Apple Silicon support with TurboQuant quantization optimization |
| Ollama | One-click local deployment |
| LM Studio | GUI-based local inference |
| transformers.js | Browser-based execution (WebGPU) |
| mistral.rs | Rust-native inference engine; supports all modalities and built-in tool calling |
| ONNX | ONNX-format checkpoints for edge-device and browser deployment |
| LiteRT-LM | Google's edge inference runtime |
| NVIDIA NIM / NeMo | NVIDIA inference and training platform |

5.2 Hardware Platforms

Gemma 4 is optimized for multiple hardware platforms:

  • NVIDIA GPU: Full coverage from Jetson Orin Nano to Blackwell series
  • AMD GPU: Integration through the open-source ROCm software stack
  • Google TPU: Supports Trillium and Ironwood TPU for large-scale deployment
  • Apple Silicon: Native support via MLX
  • Mobile Chips: Optimized in collaboration with Qualcomm and MediaTek

5.3 Development Platforms and Quick Start

| Platform | Purpose |
| --- | --- |
| Google AI Studio | Try 31B and 26B MoE online |
| Google AI Edge Gallery | Try edge models E4B and E2B |
| Android Studio Agent Mode | Android app development integration |
| Hugging Face | Model downloads and community |
| Kaggle | Model downloads and the Gemma 4 Good Challenge |
| Google Colab | Free fine-tuning and experimentation |
| Vertex AI | Enterprise deployment (Cloud Run / GKE / TPU acceleration) |

6. Fine-tuning Support

Gemma 4's highly optimized architecture enables efficient fine-tuning across various hardware — from billions of Android devices to laptop GPUs, developer workstations, and accelerators. The community already has successful fine-tuning examples: INSAIT created BgGPT, a Bulgarian-first language model based on Gemma, and Yale University leveraged Cell2Sentence-Scale to discover new cancer treatment pathways.

Gemma 4 supports fine-tuning with various tools:

  • TRL (Transformer Reinforcement Learning): Supports multimodal tool-response training, can receive image feedback from environments (e.g., training Gemma 4 to learn driving in the CARLA simulator).
  • TRL on Vertex AI: Provides complete examples for SFT fine-tuning on Google Cloud Vertex AI, with the ability to freeze vision and audio towers for function calling capability extension.
  • Unsloth Studio: Supports fine-tuning via GUI locally or on Colab.
  • Keras: Google's high-level deep learning API support.

7. License and Safety

Gemma 4 is released under the Apache 2.0 open-source license, granting developers complete flexibility and digital sovereignty — full control over data, infrastructure, and models, with the freedom to build and deploy in any environment, locally or in the cloud.

In terms of safety, Gemma 4 undergoes the same rigorous infrastructure security protocols as Google's proprietary models, providing a trustworthy and transparent foundation for enterprises and sovereign organizations.


8. Summary

Gemma 4 is among the best open-source multimodal models available today. Its core advantages include:

  1. Truly Open Source: Apache 2.0 license, free for commercial use, with full control.
  2. Extreme Parameter Efficiency: Beats some models with 20x more parameters; the 31B and 26B MoE rank #3 and #6 among open models on LMArena.
  3. Full Modality Support: Unified processing of text, image, video, and audio, covering 140+ languages.
  4. Full Size Coverage: From 2.3B edge models (mobile/IoT) to 31B server-grade models, fitting all scenarios.
  5. Efficient Architecture: PLE, shared KV cache, MoE, and other designs maintain high performance while significantly reducing resource requirements.
  6. Agent-Ready: Native function calling, structured output, and system instructions, suitable for building complex agentic workflows.
  7. Complete Ecosystem: Day-one support from mainstream inference frameworks, multi-hardware platform optimization (NVIDIA / AMD ROCm / Google TPU / Apple Silicon), with quantization and fine-tuning tools readily available.

For more information, see the references linked at the top of this article.