
Google Gemma 4 Review: I Actually Downloaded It and It’s Not What I Expected

I was skeptical when Google dropped Gemma 4 on April 2nd. Open-source models have been promising to “kill ChatGPT” for two years, and every time I download one, I get a chatbot that hallucinates and reads like it was trained on Stack Overflow comments.

This one is different. Not because Google’s marketing is slicker than Meta’s or Anthropic’s, but because I downloaded it, ran it on hardware I actually own, and it… worked.

What Actually Happened

Here’s the timeline: Benchmarks dropped showing 89.2% on AIME 2026 (competition math). That’s up from Gemma 3’s 20.8%, which made me laugh — either the scoring changed or Google fixed something fundamental.

I downloaded the 26B MoE variant (3.8B parameters active per token, if you care about that). The file was 18GB quantized. My RTX 4090 swallowed it instantly. Inference speed was 15 tokens/second on my setup. That’s… not terrible.

I threw a junior programming problem at it. Recursion, some edge cases. It got it right. I was honestly shocked. Not “amazed,” but “wait, it actually works.”

Then I tried something harder — I asked it to implement a fuzzy string matcher with edit distance. It gave me a working solution, but with a performance bug I had to point out. It caught the bug immediately in the next response.
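For reference, here's the shape of solution I was grading it against: a minimal edit-distance fuzzy matcher in Python. This is my own sketch, not Gemma's output. The classic performance bug is the naive recursive formulation, which is exponential; the dynamic-programming version below is O(m*n) time and O(n) memory.

```python
# Levenshtein edit distance via dynamic programming, keeping only
# the previous row of the table instead of the full matrix.

def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from "" to prefixes of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def fuzzy_match(query: str, candidates: list[str], max_dist: int = 2) -> list[str]:
    """Return candidates within max_dist edits of query."""
    return [c for c in candidates if edit_distance(query, c) <= max_dist]
```

The naive recursive version recomputes the same subproblems over and over; pointing that out was exactly the kind of nudge Gemma needed before it fixed its own answer.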

So: it’s good, but it’s not magic. It has real limitations.

The Models (And My Honest Thoughts)

Google released four variants:

Gemma 4 E2B — 2.3B parameters, fits on a phone. I tried it on a 2022 MacBook Air with 16GB RAM. It was… slow. Not unusable, but I wouldn’t use it for anything serious.

Gemma 4 E4B — 4.5B parameters. This is the one I’d actually recommend if you want local inference on consumer hardware. It’s fast enough that you don’t notice the wait. No audio support (unlike E2B, which is weird).

Gemma 4 26B-A4B (MoE) — This is what I actually use. It’s efficient without sacrificing quality. Reported benchmarks put it nearly even with the 31B dense model; the real difference is speed, since fewer parameters activate per token.

Gemma 4 31B Dense — I didn’t test this one personally. It needs a 4090 and it’s slower than the MoE variant. If you’re going dense, why not just use Llama 4? I can’t answer that.
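A quick back-of-envelope on why I reach for the MoE over the dense model. The assumption here is mine, not Google's: per-token decode cost scales roughly with the parameters touched per token. Note the catch in the comments, too.

```python
# Why MoE decodes faster: only the routed experts run per token.
# Rough model (an assumption, not a published spec): decode cost
# is proportional to active parameters.

def relative_decode_cost(active_params_b: float, dense_params_b: float) -> float:
    """MoE per-token compute as a fraction of the dense model's."""
    return active_params_b / dense_params_b

# Numbers from the spec sheet above: 3.8B active vs 31B dense.
ratio = relative_decode_cost(3.8, 31.0)
print(f"MoE touches ~{ratio:.0%} of the compute per token")
# Catch: all 26B parameters still have to sit in VRAM -- MoE saves
# compute per token, not memory.
```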

The Numbers (But With Caveats)

Okay, the benchmarks. 89.2% on AIME 2026. That’s real. I pulled the paper and the math checks out. But here’s what people don’t tell you: AIME is a test of specific mathematical reasoning, not “is this model smart.” It can solve competition math but it still occasionally makes spelling mistakes in basic sentences.

Coding (LiveCodeBench v6): 80%. That tracks with my testing. But “80% of benchmark problems” doesn’t mean “solves 80% of your actual problems.” It solved my fuzzy matcher. It would probably fail at building an entire web framework.

Reasoning (GPQA Diamond): 84.3%. Again, real. But I didn’t test this myself, so I’m just repeating Google here.

The Thing Nobody Talks About

Gemma 4 has “thinking mode” where it reasons step-by-step for 4,000+ tokens before answering. This is why the math scores are so good. But here’s what’s annoying: sometimes it overthinks simple problems. I asked it a straightforward question about array indexing and it spent 2,000 tokens reasoning before saying “index 0.”

It’s like hiring an overly-cautious consultant.

What Actually Beats It

Qwen 3.5 is better at multilingual work. If you need Chinese, Japanese, or Korean, Qwen is your pick. I haven’t tested it, but the benchmarks are clear.

Llama 4 has a 10-million-token context window, which is insane. But you need 109GB of VRAM, and the 17B active parameters mean it’s slower. For most tasks, Gemma 4’s 256K context is enough.

Honestly? For open-source, Gemma 4 is the best option right now. Not because it’s perfect. But because it’s free, runs locally, and doesn’t lie about what it can do.

The Real Catches

VRAM is brutal with long context. The 31B model at 4-bit quantization is 18-20GB for weights. Add a 128K context window and you’re looking at 60+ GB of VRAM for inference. On a consumer GPU, you’re limited to 8-32K context. The 256K spec is achievable if you have a datacenter.
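That 60+ GB figure isn't hand-waving; it falls out of simple arithmetic. Weights are fixed, but the KV cache grows linearly with context length. The architecture numbers below (layer count, KV heads, head dimension) are illustrative guesses on my part, not published specs I've verified; the shape of the formula is what matters.

```python
# Rough VRAM math: fixed quantized weights plus a KV cache that
# scales linearly with context. Architecture numbers are hypothetical.

def weights_gb(params_b: float, bits: int) -> float:
    """Size of quantized weights in GB."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(context: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * heads * head_dim * context, fp16."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_val / 1e9

w = weights_gb(31, 4)                   # 31B at 4-bit: ~15.5 GB
kv = kv_cache_gb(128_000, 48, 16, 128)  # guessed architecture, 128K context
print(f"weights ~{w:.1f} GB + KV cache ~{kv:.1f} GB at 128K context")
```

With these (made-up but plausible) numbers you land around 65 GB total, which is why a 24GB consumer card caps you well short of the headline 256K.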

Inference speed is inconsistent. My MoE setup runs at 15 tok/s normally but drops to 8 tok/s with long context. That’s fine for one-shot requests, terrible for real-time chat.

No audio on the bigger models. The E2B can transcribe speech; the E4B, 26B, and 31B can’t. If you need audio plus serious reasoning, you’re out of luck.

My Actual Recommendation

If you’re an AI developer and you haven’t tried Gemma 4 yet, download it this week. Spend an hour testing it on your actual problems, not benchmarks.

Is it a ChatGPT killer? No. It’s slower, it hallucinates more, and it doesn’t understand subtlety like GPT-4 does.

Is it useful? Absolutely. For code, math, and reasoning tasks, it’s legitimately good.

Is it free, and does it run locally? Yes. That’s worth something.

I’m using it for actual work now. That’s my vote.

— BluntAI

P.S. — Google claims Apache 2.0 licensing. I verified it. It’s real. You can use this commercially if you want. Meta’s still hiding behind the community license.