Google dropped Gemma 4 on April 2, 2026, and it immediately shook up the open-weight model rankings. Four models, Apache 2.0 license, multimodal from the ground up, and — the part that caught my attention — native function calling that actually works well enough to build real agents with.
I’ve been running Gemma 4 locally for the past week and a half. Here’s what you need to know if you want to do the same, from picking the right model size to building a simple agent that can call tools on your behalf.
What Makes Gemma 4 Different from Every Other Open Model
Gemma 4 is built from the same research that powers Gemini 3, Google’s flagship commercial model. That lineage shows. The benchmark jumps from Gemma 3 are frankly absurd: AIME 2026 math scores went from 20.8% to 89.2%, LiveCodeBench coding jumped from 29.1% to 80.0%, and GPQA science scores climbed from 42.4% to 84.3%.
But benchmarks are benchmarks. What actually matters for local use is the combination of three things: the Apache 2.0 license means you can do whatever you want commercially (no MAU caps like Llama’s 700M limit, no EU restrictions), the architecture is optimized for efficient inference on consumer hardware, and the function calling capability is baked into the model rather than bolted on as an afterthought.
The 256K context window across all model sizes is another quiet advantage. Most open models either cap out at 32K or charge you dearly in compute for longer contexts. Gemma 4 handles long documents natively, which matters a lot when you’re feeding tool outputs back into an agent loop.
The Four Model Sizes (and Which One You Actually Want)
Google released four Gemma 4 variants, each targeting a different use case. Here’s the honest breakdown:
Gemma 4 E2B — The Phone Model
2.3 billion parameters. This one runs on smartphones and edge devices. It’s impressively capable for its size, but unless you’re building a mobile app, skip it. You’ll be frustrated by its reasoning limitations for anything beyond simple Q&A.
Hardware: Runs on basically anything. 4GB of RAM is enough.
Gemma 4 E4B — The Sweet Spot for Getting Started
4.5 billion effective parameters. This is where I’d tell most people to start. The download is about 9.6GB, it needs roughly 6GB of VRAM minimum, and it supports 128K context out of the box.
E4B punches well above its weight class. For quick prototyping, code generation tasks, and simple agent workflows, it handles itself surprisingly well. The gap between E4B and the larger models only becomes obvious when you push it on complex multi-step reasoning or nuanced creative writing.
Hardware: Any modern GPU with 8GB+ VRAM, or a MacBook with 16GB unified memory. An RTX 3060 or M1 MacBook Air will do fine.
Gemma 4 26B-A4B — The Efficiency King
This is the one that gets hardware nerds excited. It’s a Mixture-of-Experts model with 26 billion total parameters, but only 3.8 billion fire on any given token. The architecture uses 128 small experts, activating 8 routed experts plus 1 shared expert on each forward pass.
The result: you get roughly 97% of the quality of the full 31B dense model at a fraction of the compute cost. On an RTX 4090 with 24GB VRAM, you can run this at Q4 quantization with the full 256K context window. That’s remarkable.
Hardware: 16-24GB VRAM for comfortable use. An RTX 3090, RTX 4090, or a Mac with 32GB unified memory. With 4-bit quantization, 16GB is feasible but tight.
Gemma 4 31B — Maximum Quality, Maximum Appetite
31 billion dense parameters. No MoE tricks — every parameter activates on every token. This is the ceiling of what Gemma 4 can do, and it earns the #3 ranking among all open models globally.
Unless you have a 48GB+ GPU or a well-specced Mac, though, this model will test your patience. At full FP16 precision the weights alone come to roughly 62GB, and the KV cache pushes total memory higher still at long contexts. Quantization makes it manageable on an RTX 5090 (32GB) or a 48GB Mac, but you’re still giving up your entire GPU to one model.
Hardware: RTX 5090, RTX 6000 Ada, or Mac with 48GB+ unified memory. The 26B MoE variant is the smarter choice for most people.
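All of the memory figures above come from the same back-of-envelope arithmetic: parameter count times bits per parameter. A rough sketch, where `est_weight_gb` is a hypothetical helper of my own and the 4.5-bits-per-parameter figure for Q4 is an assumption (real quantization schemes vary), with runtime overhead and the KV cache coming on top:

```python
def est_weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Rough size of the model weights alone, in decimal GB.
    Runtime overhead and the KV cache are not included."""
    return round(params_billion * 1e9 * bits_per_param / 8 / 1e9, 1)

# Q4 stores roughly 4.5 bits per parameter once scale factors
# are counted -- an assumption that varies by quantization scheme
print(est_weight_gb(31, 16))    # 31B dense at FP16
print(est_weight_gb(26, 4.5))   # 26B MoE at ~Q4
print(est_weight_gb(4.5, 4.5))  # E4B at ~Q4
```

The same arithmetic is why quantization is what makes the larger models fit on consumer cards at all.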
My Recommendation
Start with E4B to learn the ropes. Move to 26B-A4B when you want production-quality output without selling a kidney for GPU memory. Only bother with 31B if you’re chasing benchmark scores or have the hardware to spare.
Running Gemma 4 Locally with Ollama: The Actual Steps
Ollama remains the easiest path to running models locally. Here’s how to get Gemma 4 up and running.
Step 1: Install Ollama
If you don’t already have Ollama, grab it from ollama.com. On macOS:
brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

Start the Ollama service if it’s not already running:

ollama serve

Step 2: Pull the Model
For the E4B model (recommended starting point):
ollama pull gemma4:e4b

For the 26B MoE model:

ollama pull gemma4:26b

The download will take a few minutes depending on your connection. The E4B model is about 9.6GB; the 26B model is larger.
Step 3: Verify and Run
Check that the model downloaded correctly:
ollama list

Then start chatting:

ollama run gemma4:e4b

You should get an interactive prompt. Try asking it something to confirm it’s working. Type /bye to exit.
Step 4: Use the REST API
Ollama exposes a local API on port 11434 that’s compatible with the OpenAI format. This is what you’ll use to build applications:
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "stream": false,
  "messages": [
    {"role": "user", "content": "Explain quantum computing in two sentences."}
  ]
}'

If you get a single JSON object back with the model’s answer, you’re good to go. (Without "stream": false, Ollama streams the response as a series of JSON chunks, which is harder to eyeball.)
Troubleshooting Common Issues
Model loads slowly or crashes: You’re probably running out of VRAM. Try a smaller model or a more aggressive quantization. Ollama pulls Q4 quantizations by default, which is usually fine.
Responses are garbled or nonsensical: Make sure you pulled the right model tag. gemma4:e4b and gemma4:26b are the official tags.
API connection refused: Ollama service isn’t running. Start it with ollama serve in a separate terminal.
Building a Simple AI Agent with Gemma 4 Function Calling
This is where Gemma 4 gets interesting. Unlike older open models where function calling was a community hack, Gemma 4 has native tool use support trained directly into the model. The quality difference is noticeable — it follows tool schemas reliably and chains multiple calls without losing track of what it’s doing.
How Gemma 4 Function Calling Works
The flow follows four stages:
- Define Tools — You describe the available functions, their arguments, and what they do, via the API’s tools field (or directly in your system prompt)
- Model’s Turn — Gemma 4 returns a structured function call object instead of plain text
- Developer’s Turn — You parse the output, execute the actual function, and append the results back to the chat history
- Final Response — The model processes the tool results and generates a natural language answer
A Practical Example: Weather + Calendar Agent
Here’s a simple Python agent that gives Gemma 4 access to two tools — checking the weather and looking up calendar events. This uses the OpenAI-compatible API that Ollama provides.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "gemma4:e4b"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. 'Seoul' or 'San Francisco'"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_calendar_events",
            "description": "Get upcoming calendar events for a specific date",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {
                        "type": "string",
                        "description": "Date in YYYY-MM-DD format"
                    }
                },
                "required": ["date"]
            }
        }
    }
]

def get_weather(location):
    # Replace with a real weather API call
    return {"location": location, "temp": "18C", "condition": "Partly cloudy"}

def get_calendar_events(date):
    # Replace with a real calendar integration
    return {"date": date, "events": ["Team standup at 10am", "Lunch with Sarah at 12pm"]}

tool_handlers = {
    "get_weather": get_weather,
    "get_calendar_events": get_calendar_events,
}

def run_agent(user_message):
    messages = [{"role": "user", "content": user_message}]
    response = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "stream": False
    })
    result = response.json()
    message = result["message"]
    if "tool_calls" not in message:
        return message["content"]
    messages.append(message)
    for tool_call in message["tool_calls"]:
        fn_name = tool_call["function"]["name"]
        fn_args = tool_call["function"]["arguments"]
        if fn_name in tool_handlers:
            output = tool_handlers[fn_name](**fn_args)
        else:
            # Always append a tool result, even for an unknown tool,
            # so tool calls and results stay paired in the history
            output = {"error": f"unknown tool: {fn_name}"}
        messages.append({
            "role": "tool",
            "content": json.dumps(output)
        })
    followup = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "stream": False
    })
    return followup.json()["message"]["content"]

print(run_agent("What's the weather in Seoul and do I have any meetings today?"))

Run this and you’ll see Gemma 4 make two tool calls — one for weather, one for calendar — then synthesize both results into a coherent response. The fact that a 4.5B parameter model running on your laptop can do this reliably is kind of wild.
Scaling Up: Multi-Step Agent Loops
The example above handles a single round of tool calls. For more complex agents, you’ll want a loop that keeps running until the model decides it has enough information:
def run_agent_loop(user_message, max_rounds=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_rounds):
        response = requests.post(OLLAMA_URL, json={
            "model": MODEL,
            "messages": messages,
            "tools": tools,
            "stream": False
        })
        message = response.json()["message"]
        if "tool_calls" not in message:
            return message["content"]
        messages.append(message)
        for tool_call in message["tool_calls"]:
            fn_name = tool_call["function"]["name"]
            fn_args = tool_call["function"]["arguments"]
            if fn_name in tool_handlers:
                output = tool_handlers[fn_name](**fn_args)
            else:
                output = {"error": f"unknown tool: {fn_name}"}
            messages.append({
                "role": "tool",
                "content": json.dumps(output)
            })
    return "Agent reached maximum rounds without producing a final answer."

This pattern is the foundation of every agent framework out there. LangChain, CrewAI, OpenClaw — they all do some variation of this loop. Once you understand the basic cycle, the frameworks are just convenience wrappers.
Tips for Reliable Function Calling
After a week of building with Gemma 4’s tool use, a few things stand out:
Keep tool descriptions precise. Vague descriptions lead to hallucinated arguments. “Get weather for a city” works better than “Get weather information.”
Use the 26B model for complex chains. E4B handles simple single-tool calls well, but if your agent needs to reason about which of five tools to use and in what order, the 26B MoE model is noticeably more reliable.
Always validate tool call arguments. The model will occasionally pass arguments in unexpected formats, especially with E4B. A quick schema validation step before executing the function saves you from mysterious runtime errors.
Set a max rounds limit. Without it, a confused model can loop forever. Five rounds is generous for most use cases.
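The argument-validation tip is cheap to implement. Here’s a minimal sketch that checks a proposed call against a JSON-schema-style tool definition like the ones used earlier; `validate_args` is my own helper, not part of Ollama or any framework:

```python
def validate_args(tool_schema: dict, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call's arguments,
    checked against a JSON-schema-style parameter spec. Empty list = OK."""
    params = tool_schema["function"]["parameters"]
    problems = []
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected argument: {name}")
        elif spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} should be a string, got {type(value).__name__}")
    return problems

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

print(validate_args(weather_tool, {"location": "Seoul"}))  # well-formed call
print(validate_args(weather_tool, {"city": "Seoul"}))      # hallucinated key
```

Run the checks before dispatching to your handler, and feed any problems back to the model as a tool error instead of crashing.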
Gemma 4 vs Llama 4 vs Mistral Small 4: The Open Model Showdown
You can’t talk about Gemma 4 without addressing the elephant herd in the room. Here’s how it stacks up against the other major open models as of April 2026.
Architecture Philosophy
These three families take fundamentally different approaches. Gemma 4 optimizes across the full range from edge to server with four model sizes. Llama 4 goes big with Maverick (400B total, 128 experts) and Scout (109B, 16 experts, 10M token context). Mistral Small 4 splits the difference with 119B total but only 6B active per token via 128 experts.
Performance
Gemma 4 31B holds the #3 spot among open models globally. Llama 4 Maverick hits 1417 ELO on Chatbot Arena, outperforming GPT-4o. Mistral Small 4 leads on coding efficiency, producing 20% less output than competitors for equivalent results — which translates to real cost savings at scale.
For local use, though, raw benchmarks matter less than what fits on your hardware. Gemma 4’s 26B MoE variant running on a 24GB GPU gives you better quality-per-VRAM-dollar than anything else in this comparison.
Licensing — This Actually Matters
Gemma 4 and Mistral Small 4 both ship under Apache 2.0. Full commercial freedom, no strings. Llama 4 technically uses its own license with a 700M monthly active user cap and restrictions in the EU. If you’re building a commercial product, this difference alone might make your decision.
The Practical Verdict
Pick Gemma 4 if you want the best balance of quality, efficiency, and licensing freedom for local deployment. Pick Llama 4 Scout if you need that 10M token context window for processing massive documents. Pick Mistral Small 4 if coding efficiency is your top priority and you have the hardware for it.
I’ve been reaching for Gemma 4 26B-A4B as my daily driver. The MoE architecture means I can keep it running alongside other applications without my Mac grinding to a halt, and the output quality is good enough that I’ve stopped routing most tasks to cloud APIs.
When to Use Gemma 4 vs Cloud APIs
Running models locally isn’t always the right call. Here’s how I decide:
Go Local When:
- Privacy matters. Medical data, financial records, proprietary code — anything you don’t want leaving your machine.
- Latency matters more than throughput. Local inference on a good GPU eliminates network round-trips. For interactive applications, that 50-100ms savings per call adds up fast.
- You’re prototyping. No API keys, no rate limits, no billing surprises. Just pull a model and start building.
- Cost at scale. If you’re making thousands of API calls per day, even cheap APIs add up. A one-time GPU investment pays for itself within months.
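That break-even claim is easy to sanity-check with arithmetic. Every number below is a made-up illustration (not real pricing); swap in your own API rates and hardware costs:

```python
# All figures are illustrative assumptions, not real prices.
api_cost_per_call = 0.002    # dollars per call (hypothetical cloud pricing)
calls_per_day = 5000
gpu_cost = 1800.0            # one-time hardware cost, dollars (hypothetical)
power_cost_per_day = 0.50    # electricity for local inference (hypothetical)

daily_api_spend = api_cost_per_call * calls_per_day   # dollars per day
daily_savings = daily_api_spend - power_cost_per_day
breakeven_days = gpu_cost / daily_savings
print(f"GPU pays for itself in about {breakeven_days:.0f} days")
```

With these (invented) numbers the GPU pays for itself in roughly six months; at higher call volumes the break-even shrinks fast.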
Stick with Cloud APIs When:
- You need frontier-level reasoning. Claude, GPT-4.5, Gemini 3 Ultra — the cloud models are still meaningfully better than any open model for complex tasks. That gap is shrinking, but it’s real.
- You’re handling massive concurrency. Serving hundreds of simultaneous users from a single local GPU isn’t practical. Cloud APIs handle scaling for you.
- You don’t want to manage infrastructure. Running models locally means babysitting GPU drivers, managing model updates, and debugging CUDA errors at 2am.
The Hybrid Approach
What I actually do: Gemma 4 locally for high-volume, moderate-complexity tasks (summarization, classification, simple tool use), and cloud APIs for the hard stuff (long-form writing, complex code generation, nuanced reasoning). The local model handles 80% of my daily AI usage, and the cloud APIs handle the 20% where quality really matters.
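In practice, that hybrid split can start as something as simple as a routing function keyed on task type. This sketch is entirely my own convention; the task labels and backend strings are invented, not any framework’s API:

```python
# Task types cheap enough for the local model (my own labels)
LOCAL_TASKS = {"summarize", "classify", "extract", "simple_tool_use"}

def route_task(task_type: str) -> str:
    """Pick a backend: local Gemma 4 for high-volume, moderate-complexity
    work; a cloud API for anything that needs frontier-level reasoning."""
    return "local:gemma4-26b" if task_type in LOCAL_TASKS else "cloud:frontier-api"

print(route_task("summarize"))          # routes to the local model
print(route_task("long_form_writing"))  # routes to the cloud API
```

A real router might also consider input length or fall back to the cloud when the local model’s answer fails validation, but the shape stays the same.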
Getting Started This Weekend
If you’ve been putting off running models locally, Gemma 4 makes the barrier pretty low. Ollama setup takes five minutes, you probably already own hardware that can run E4B, and the function calling support means you can build agents that actually do things — not just chat interfaces.
Start with ollama pull gemma4:e4b. Run it. Try the agent code above. The part that surprised me wasn’t the quality (it’s good, not magic) — it was how quickly I stopped reaching for cloud APIs for routine tasks. Whether that trade-off works for you depends on your workload, but it costs nothing to find out.