What hardware runs Gemma 4 at 21 tok/s?

Minisforum UM760 Slim (Ryzen 5 7640HS, Radeon 760M, 96GB RAM) on Ubuntu 24.04 with llama.cpp Vulkan build.

How to fix Open WebUI no models error?

Point OLLAMA_BASE_URL to http://host.docker.internal:8080. Ensure llama-server runs first. Match users, check ports.

Best quantization for speed vs quality on iGPU?

Q4_K_M. Balances 20+ tok/s with decent smarts. Q8_0 for quality, slower layers.

Gemma 4 at 21 tok/s on Ryzen Mini PC: Vulkan's Messy Win

Forget cloud LLMs. A $500 Ryzen mini PC cranks Gemma 4 at 21 tokens per second—locally. But it's a Vulkan-fueled headache that exposes local AI's dirty secrets.

theAIcatchup Apr 10, 2026 4 min read

Minisforum UM760 Slim Ryzen mini PC running llama.cpp with Gemma 4 model inference

⚡ Key Takeaways

Ryzen mini PC with 96GB RAM hits 21 tok/s on Gemma 4 via llama.cpp Vulkan—no cloud needed. 𝕏
Setup's messy: BIOS tweaks, compiles, OOM fights. Not for casuals. 𝕏
Unique edge: Local APIs mimic OpenAI, perfect for VS Code/Copilot alternatives. 𝕏

Published by

theAIcatchup

Ship faster. Build smarter.

#Gemma 4 #Ryzen mini PC #Vulkan inference #llama.cpp #local AI #local AI inference #local LLMs

Worth sharing?

Get the best Developer Tools stories of the week in your inbox — no noise, no spam.

Originally reported by dev.to

⚡ Key Takeaways

The 60-Second TL;DR

theAIcatchup

Share this article

Worth sharing?

Related Stories

Why the Intel Arc B580 Crushes Local AI Dreams on a $249 Budget

Browser AI Hits Escape Velocity: Transformers.js Delivers Zero-Cost LLMs on Your Device

Ditch Dumb Routing: Build a Hybrid LLM Brain

Running Llama 3.1 on an RTX 5070 Ti From My Home Office—And Why It Actually Works

Stay in the loop