Unsloth Releases Gemma 4 MTP Assistant Quantized Models: Multi-Token Prediction Enters the Era of Premium QAT
If you have been waiting to run Google's Gemma 4 models locally with lower latency and little loss in inference quality, this release is worth a look. Unsloth, a widely used open-source fine-tuning framework, has published a family of Gemma 4 QAT MTP assistant models on Hugging Face. All are provided in GGUF format across five repositories: 12B, 26B-A4B, 31B, E2B, and an E2B variant optimized specifically for mobile. (In Google's Gemma naming, the "E" prefix has denoted an effective parameter count, so E2B is the compact end of the lineup, not the largest.) The models ship primarily at q8_0 precision, with additional quantization variants offered alongside, which is a meaningful step forward for edge inference.
This time, Gemma 4 truly “understands” multi‑token prediction
These files are not labeled as plain Gemma 4; they carry an explicit “MTP” tag. MTP stands for Multi-Token Prediction: the Gemma 4 series natively supports predicting multiple future tokens in a single forward pass to assist the main model's generation, which reduces the number of autoregressive decoding iterations. The catch is that the MTP assistant heads are sensitive to precision. Quantized carelessly, they can lose their ability to cooperate with the main model, and the speedup erodes. Unsloth's key move is to use QAT (Quantization-Aware Training) to quantize and fine-tune the MTP auxiliary decoder together with the main model, rather than applying simple post-training quantization. According to the release, the resulting mtp-gemma-4-*.gguf files preserve nearly all of the multi-token-prediction speedup at q8_0 precision while shrinking the model enough to suit consumer-grade GPUs and CPU inference.
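To see why an assistant that guesses several tokens ahead cuts down decoding iterations, here is a minimal Python sketch of the generic draft-and-verify loop used in speculative decoding. It is not Gemma 4's or Unsloth's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the main model and the MTP assistant, and the pattern of disagreements between them is invented purely for illustration.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the main model: a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the assistant: wrong whenever len(prefix) % 4 == 0."""
    tok = target_next(prefix)
    return (tok + 1) % 50 if len(prefix) % 4 == 0 else tok


def generate_plain(prompt: List[Token], n_new: int) -> Tuple[List[Token], int]:
    """Ordinary decoding: one main-model pass per generated token."""
    seq, passes = list(prompt), 0
    while len(seq) < len(prompt) + n_new:
        seq.append(target_next(seq))
        passes += 1
    return seq, passes


def generate_with_draft(prompt: List[Token], n_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Draft k tokens, then verify them in one (conceptually batched) main-model pass."""
    seq, passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(seq) < goal:
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        passes += 1  # a real engine scores all drafted positions in a single pass
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the main model's token and stop
                break
            accepted.append(tok)
        else:
            # every draft token matched, so the same pass yields one extra token
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], passes


plain_seq, plain_passes = generate_plain([1, 2, 3], 20)
fast_seq, fast_passes = generate_with_draft([1, 2, 3], 20)
print(plain_seq == fast_seq, plain_passes, fast_passes)
```

The output sequence is identical in both cases because every kept token is one the main model would have produced itself. The only thing that changes is how many main-model passes are needed, and that depends entirely on how often the assistant agrees with the main model. This is why preserving the assistant's accuracy under quantization matters.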
q8_0 becomes the new baseline, with a clearly layered quantization suite
Open any of the model repositories on Hugging Face and you will find a deliberate directory structure: the core q8_0 GGUF file sits directly in the repository root, while a separate MTP folder holds the q8_0 assistant file along with other quantization variants. Ordinary users can simply pull the root-level model and get started, while developers who want a different size and precision trade-off can step into the MTP folder and choose versions such as q5_k_m, q6_k, or f16. Note that of these only f16 is higher precision than q8_0; q5_k_m and q6_k use fewer bits per weight. Unsloth has built a complete QAT pipeline for the following five Gemma 4 models and open-sourced them all:
- gemma-4-12B-it-qat-GGUF: A versatile all-rounder balancing performance and resource consumption
- gemma-4-26B-A4B-it-qat-GGUF: A Mixture-of-Experts model with 26B total parameters, of which about 4B are active per token
- gemma-4-31B-it-qat-GGUF: A 31B dense model, a reliable choice for general-purpose scenarios
- gemma-4-E2B-it-qat-GGUF: The E2B model; in Gemma naming the "E" denotes effective parameters, making this the compact entry in the lineup
- gemma-4-E2B-it-qat-mobile-GGUF: The E2B model further optimized for on-device inference, pushing the limits of the edge
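The root-versus-MTP-folder layout described above can be navigated programmatically. The following is a small sketch, assuming the folder is literally named `MTP` and that quantization types appear in the file names; the file names in `listing` are hypothetical examples, not a verified listing of any repository. In practice the listing could come from `huggingface_hub.list_repo_files(repo_id)`, with the repository ID supplied by you.

```python
from typing import Dict, List


def group_gguf_files(repo_files: List[str], mtp_dir: str = "MTP") -> Dict[str, List[str]]:
    """Split a repo listing into root-level GGUF files and GGUF files under mtp_dir."""
    groups: Dict[str, List[str]] = {"root": [], "mtp": []}
    for path in repo_files:
        if not path.lower().endswith(".gguf"):
            continue  # skip READMEs, configs, and other non-model files
        parts = path.split("/")
        if len(parts) == 1:
            groups["root"].append(path)
        elif parts[0] == mtp_dir:
            groups["mtp"].append(path)
    return groups


def pick_by_quant(paths: List[str], quant: str) -> List[str]:
    """Return the paths whose file name mentions the given quantization type."""
    quant = quant.lower()
    return [p for p in paths if quant in p.split("/")[-1].lower()]


# Hypothetical listing, for illustration only.
listing = [
    "README.md",
    "gemma-4-12B-it-qat-q8_0.gguf",
    "MTP/mtp-gemma-4-12B-it-q8_0.gguf",
    "MTP/mtp-gemma-4-12B-it-f16.gguf",
]

groups = group_gguf_files(listing)
print("main model:", groups["root"])
print("q8_0 assistant:", pick_by_quant(groups["mtp"], "q8_0"))
```

Matching on substrings of the file name is a deliberately simple choice; if the real repositories use a different naming scheme, only `pick_by_quant` would need to change.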
Quantization‑aware training “tames” MTP, accelerating inference without sacrificing intelligence
In conventional post-training quantization, rounding error in the attention layers or the auxiliary prediction heads can shift their outputs away from what the main model expects, and multi-token prediction loses much of its value. This time, Unsloth applied quantization-aware training directly to Gemma 4's MTP modules, so that the quantized assistant and the main model stay closely aligned. In the reported tests, using the q8_0 MTP model for multi-token prediction reduces autoregressive steps by nearly 30%, giving a noticeable end-to-end generation speedup, while metrics such as perplexity remain almost identical to the floating-point version. As a rough bound, if the cost per step were unchanged, 30% fewer steps would translate to at most about 1/0.7 ≈ 1.43x faster generation; the real gain is somewhat lower because the assistant's own work is not free. For scenarios requiring long-sequence generation, such as chat and code completion, this is close to a free performance upgrade.
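For readers curious what "training with quantization in the loop" simulates, here is a minimal sketch of q8_0-style rounding in plain Python. It follows the q8_0 layout used by GGUF (blocks of 32 weights sharing one scale, values stored as 8-bit integers), but it simplifies by keeping the scale in full precision instead of fp16. The source does not describe Unsloth's actual QAT recipe; the general idea in QAT is that the forward pass sees weights after this quantize-then-dequantize round trip, so training can adapt to the rounding error.

```python
import random
from typing import List

BLOCK = 32  # q8_0 groups weights into blocks of 32 values with one shared scale


def fake_quant_q8_0(weights: List[float]) -> List[float]:
    """Round each block to 8-bit integers with a per-block scale, then map back to floats."""
    out: List[float] = []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        scale = max(abs(w) for w in block) / 127.0
        if scale == 0.0:
            out.extend(0.0 for _ in block)  # an all-zero block stays all zero
            continue
        for w in block:
            q = max(-127, min(127, round(w / scale)))  # integer code in [-127, 127]
            out.append(q * scale)
    return out


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(64)]
dequantized = fake_quant_q8_0(weights)
worst = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"largest rounding error: {worst:.6f}")
```

The rounding error per weight is bounded by half the block's scale, which is small at 8 bits. That small error is still enough to change which token an assistant head ranks first in borderline cases, which is the motivation the article gives for training the heads under quantization.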
Deploy now: from Hugging Face to local in just one step
The QAT MTP models are presented as compatible with mainstream GGUF inference engines such as llama.cpp, Ollama, and LM Studio. Download the corresponding GGUF file, configure the multi-token prediction parameters in your engine of choice, and you can run the accelerated version of Gemma 4 on an M-series Mac, an RTX 40-series GPU, or even a Raspberry Pi cluster, provided the chosen model fits in memory. What Unsloth has released is not merely a batch of model files but a complete “quantization-as-acceleration” methodology, and it suggests that other MTP-capable large models could benefit from QAT in the same way.
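Whether a given model fits on your hardware is mostly arithmetic. The sketch below estimates weight storage at q8_0, using the fact that each q8_0 block stores 32 weights in 34 bytes (32 one-byte values plus a 2-byte scale, or 8.5 bits per weight). The parameter counts are the nominal figures from the repository names, which is an assumption; real files differ somewhat because some tensors are kept at other precisions, and the KV cache and the MTP assistant need memory on top.

```python
Q8_0_BLOCK_WEIGHTS = 32  # weights per q8_0 block
Q8_0_BLOCK_BYTES = 34    # 32 int8 values + one 2-byte fp16 scale
GIB = 2 ** 30


def q8_0_size_gib(n_params: float) -> float:
    """Approximate weight storage at q8_0, in GiB."""
    return n_params / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES / GIB


def f16_size_gib(n_params: float) -> float:
    """Approximate weight storage at 16-bit floating point, in GiB."""
    return n_params * 2 / GIB


# Nominal counts taken from the model names (an assumption, not measured sizes).
NOMINAL_PARAMS = {"12B": 12e9, "26B-A4B": 26e9, "31B": 31e9}

for name, n in NOMINAL_PARAMS.items():
    print(f"{name}: q8_0 ~{q8_0_size_gib(n):.1f} GiB, f16 ~{f16_size_gib(n):.1f} GiB")
```

For the Mixture-of-Experts model, the estimate uses the full 26B because all experts must be resident in memory even though only about 4B parameters are active per token.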
Visit the repositories below immediately and secure your own MTP acceleration engine:
Gemma 4 12B QAT GGUF | Gemma 4 26B A4B QAT GGUF | Gemma 4 31B QAT GGUF | Gemma 4 E2B QAT GGUF | Gemma 4 E2B Mobile Optimized