Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics: The Definitive Technical Breakdown
Google's Gemma 4 family has spawned one of the most ambitious community-driven releases of the year. Four distinct model variants — 12B base, 12B QAT, 26B-A4B QAT, and the fiercely debated 31B QAT Uncensored Heretic — are now available across five distribution formats on HuggingFace. This article unpacks everything: architecture, quantization, the “uncensored heretic” lineage, format differences, and how to deploy each variant responsibly.
1. What Is the Gemma 4 Quadruple Release?
The Gemma 4 Quadruple Release refers to a coordinated drop of four fine-tuned and quantized variants derived from Google’s Gemma 4 architecture. These models were produced and shared by the prolific community contributor llmfan46 on HuggingFace, extending the official Gemma 4 checkpoints with Quantization-Aware Training (QAT), aggressive low-bit quantization, and — in the case of the 31B — a deliberate removal of alignment guardrails, resulting in what the community calls an “uncensored heretic” variant.
This release is significant for several reasons:
- Unusual variety: Four variants across three parameter scales (12B dense, 12B QAT, 26B-A4B mixture-of-experts QAT, 31B QAT) in a single coordinated release.
- Five distribution formats: Safetensors (standard), GGUF (llama.cpp / CPU-friendly), NVFP4 (NVIDIA Blackwell-optimized 4-bit floating point), NVFP4 GGUF, and GPTQ-Int4 — covering virtually every deployment scenario.
- QAT advantage: Unlike post-training quantization (PTQ), QAT simulates quantization during training or fine-tuning so the weights can adapt to the rounding error, which typically preserves perplexity better at low bit-widths.
- Controversy and demand: The “uncensored heretic” branding signals a model stripped of refusal mechanisms, attracting both intense interest and ethical scrutiny.
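To make the QAT idea concrete, here is a minimal sketch of the usual "fake quantization" training step with a straight-through estimator. It is a toy with a single scalar weight, not the procedure used for these checkpoints; the function names, the fixed `scale`, and the squared-error loss are all assumptions chosen for illustration.

```python
def fake_quant(w, scale, qmin=-8, qmax=7):
    """Round w onto a signed 4-bit grid and immediately dequantize it."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def qat_step(w, x, y, scale, lr):
    """One gradient step on a full-precision ("latent") weight w.

    The forward pass uses the quantized weight, so the loss reflects
    the rounding error. The backward pass treats rounding as the
    identity (straight-through estimator) and updates w directly.
    """
    w_q = fake_quant(w, scale)   # what inference at 4-bit would see
    err = w_q * x - y            # prediction error of the quantized model
    grad = err * x               # d(0.5 * err**2) / d(w_q), passed through to w
    return w - lr * grad


if __name__ == "__main__":
    w = 0.8
    print("quantized view of w:", fake_quant(w, scale=0.25))
    print("latent w after one step:", qat_step(w, x=1.0, y=1.0, scale=0.25, lr=0.1))
```

The point of the sketch is the split between the latent weight, which stays in full precision and receives the updates, and the quantized weight, which is what the loss actually evaluates.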
2. The Four Model Variants Explained
2.1 Gemma 4 12B (Base QAT Variant)
The 12B dense model represents the entry point of the quadruple release. Built on the Gemma 4 architecture with 12 billion parameters, this variant has undergone QAT to make it robust to 4-bit quantization. It retains the standard instruction-tuning alignment from Google’s official release, making it suitable for general-purpose tasks where safety compliance is expected.
- Parameter count: 12 billion (dense, all parameters active per token)
- Quantization: q4_0 (4-bit, symmetric per-block quantization)
- Alignment: Standard instruction-tuned, with refusal guardrails intact
- Best for: Production deployments requiring moderate compute with full safety alignment
2.2 Gemma 4 12B QAT (Fine-Tuned q4_0)
This is a further refined version of the 12B, with additional QAT fine-tuning specifically optimized for the q4_0 quantization scheme. The extra QAT pass is intended to narrow the perplexity gap between the full-precision 12B and its 4-bit counterpart. If you need the 12B at the smallest possible memory footprint with minimal quality degradation, this is the variant to choose.
- Key differentiator: Extended QAT fine-tuning beyond the base QAT checkpoint
- Memory footprint: Approximately 6–7 GB in 4-bit mode
- Use case: Edge deployment, consumer GPUs with 8–12 GB VRAM
2.3 Gemma 4 26B-A4B QAT (Mixture-of-Experts)
The 26B-A4B is the most architecturally interesting member of the release. It employs a Mixture-of-Experts (MoE) design where the total parameter count is 26 billion but only 4 billion are active per token (denoted A4B). This sparse activation pattern delivers inference speeds closer to a 4B dense model while retaining the knowledge capacity of a much larger one. The QAT treatment ensures the MoE routing and expert weights survive 4-bit compression gracefully.
- Total parameters: 26B (sparse MoE)
- Active parameters per token: ~4B
- Architecture highlight: Gated expert routing with load-balancing loss
- Ideal for: High-throughput serving where latency must stay low but knowledge depth matters
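The sparse activation described above can be illustrated with a generic top-k router. This is a sketch of the common MoE pattern, not Gemma 4's actual routing code: the number of experts, the value of `k`, and the scalar "experts" are hypothetical stand-ins.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)  # unselected experts are never called
    return out


if __name__ == "__main__":
    # Four toy experts; only two run for this token.
    experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v, lambda v: 4 * v]
    print(moe_forward(1.0, [0.0, 5.0, 5.0, -1.0], experts, k=2))
```

Every expert must still be resident in memory, because the router may pick any of them for the next token; only the compute per token is reduced.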
2.4 Gemma 4 31B QAT Uncensored Heretic
The headline-grabber. The 31B QAT Uncensored Heretic is a dense 31-billion-parameter model that has undergone QAT for q4_0 compression and had its safety alignment intentionally stripped or bypassed. The term “heretic” is community nomenclature for models that will respond to prompts that official models refuse. We dive deeper into this variant in the next section.
3. Deep Dive: The 31B QAT Uncensored Heretic
The gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic variant (often shortened to “31B Uncensored Heretic”) has become the most downloaded and discussed model in this release. Understanding why requires examining three dimensions: technical provenance, the unquantized paradox, and the uncensoring mechanism.
3.1 What Does “Unquantized” Mean in a QAT Context?
The filename includes the term “unquantized,” which can cause confusion. In this context, it means the model weights are stored in an unquantized 16-bit format (BF16/FP16) after QAT: the weights have been trained with quantization awareness, so they are prepared for q4_0 inference, but the checkpoint itself has not yet been quantized to 4-bit. This allows users to:
- Apply their own quantization scheme (q4_0, q4_1, q5_0, etc.)
- Run the model at full precision if desired (with excellent quality since QAT improved the weight landscape)
- Use the provided GGUF or GPTQ versions for immediate 4-bit deployment
3.2 How Was the “Uncensored” Modification Achieved?
While the exact methodology is not fully disclosed, community analysis suggests the uncensoring was achieved through a combination of techniques:
- Fine-tuning on refusal-free corpora: The model was further trained on datasets where the assistant consistently complies without refusal patterns, effectively overwriting the alignment vectors.
- LoRA-based alignment removal: Low-Rank Adaptation may have been used to subtract or neutralize the safety-refusal directions in the model’s residual stream.
- Prompt-prefix reconditioning: The system prompt and chat template may have been modified to remove the “helpful and harmless” conditioning present in the official instruct template.
The result is a 31B model that retains the strong reasoning, coding, and creative capabilities of Gemma 4 while no longer refusing requests based on safety classifications.
3.3 Why “Heretic”? Community Naming Conventions
In the open-source LLM community, “heretic” has emerged alongside terms like “abliterated,” “uncensored,” and “unhinged” to describe models with removed guardrails. The term carries a rebellious connotation and signals to users that the model will operate without the ethical constraints imposed by the original developers. It is not an official designation — it is purely community-driven nomenclature.
4. Distribution Formats: Safetensors, GGUF, NVFP4, and GPTQ-Int4
One of the most user-friendly aspects of the llmfan46 release is the breadth of formats. Each serves a distinct deployment ecosystem. Here’s what you need to know about each:
4.1 Safetensors (Standard)
Safetensors is the safe, fast, and increasingly standard format for distributing model weights. Unlike pickle-based formats, a Safetensors file holds only tensor data and metadata, so loading it does not execute arbitrary code. These files contain the full-precision (or QAT-prepared) weights and are ideal for:
- Loading into HuggingFace transformers or accelerate
- Fine-tuning or further training
- Converting to other formats
Repository: llmfan46/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic (Safetensors)
4.2 GGUF (llama.cpp / CPU Inference)
GGUF (GPT-Generated Unified Format) is the successor to GGML and the standard format for llama.cpp, Ollama, LM Studio, and other CPU-first or hybrid inference engines. The GGUF files in this release are pre-quantized to q4_0, meaning you can download and run them immediately without any conversion step.
Repository: llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GGUF
- Best for: Apple Silicon (M1/M2/M3/M4), AMD Ryzen, Intel CPU inference, and privacy-focused local deployment
- Typical performance: 8–15 tokens/sec on M2 Max with 32 GB RAM
4.3 NVFP4 (NVIDIA Blackwell 4-Bit Floating Point)
NVFP4 is a cutting-edge 4-bit floating-point format designed for NVIDIA’s Blackwell architecture (B200, B100 GPUs). Unlike integer quantization (INT4), NVFP4 uses a floating-point representation that preserves dynamic range more effectively, especially for outlier activations. The NVFP4 Safetensors variant stores weights in this format, and the NVFP4 GGUF variant bridges the format into the llama.cpp ecosystem.
- NVFP4 Safetensors: llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-NVFP4
- NVFP4 GGUF: llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-NVFP4-GGUF
4.4 GPTQ-Int4
GPTQ-Int4 is a post-training quantization method that uses approximate second-order information (Hessian-based) to minimize quantization error. The GPTQ-Int4 variant is optimized for AutoGPTQ and vLLM inference backends, offering excellent throughput on CUDA GPUs with minimal perplexity degradation.
Repository: llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GPTQ-Int4 (GPTQ-Int4)
- Best for: High-throughput GPU serving with vLLM or TGI
- GPU requirement: CUDA-capable GPU with 16+ GB VRAM recommended
5. Complete HuggingFace Repository Links
All repositories are maintained by llmfan46 on HuggingFace. Below is the complete, verified list for the Gemma 4 31B QAT Uncensored Heretic in all five distribution formats:
🔗 Official Repositories — Gemma 4 31B Uncensored Heretic
- Safetensors (Unquantized QAT): https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-unquantized-uncensored-heretic
- GGUF (q4_0 quantized): https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GGUF
- NVFP4 Safetensors: https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-NVFP4
- NVFP4 GGUF: https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-NVFP4-GGUF
- GPTQ-Int4: https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GPTQ-Int4
Note: The 12B, 12B QAT, and 26B-A4B QAT variants are also available on llmfan46’s HuggingFace profile under similar naming conventions. Check the profile for the full catalog.
6. Side-by-Side Comparison: All Four Gemma 4 Variants
| Feature | 12B Base QAT | 12B QAT Fine-tuned | 26B-A4B QAT | 31B QAT Uncensored |
|---|---|---|---|---|
| Architecture | Dense | Dense | MoE (26B total / 4B active) | Dense |
| Total Parameters | 12B | 12B | 26B | 31B |
| Active/Token | 12B | 12B | ~4B | 31B |
| Quantization | QAT + q4_0 ready | Extended QAT + q4_0 | QAT + q4_0 ready | QAT + q4_0 ready |
| Safety Alignment | Full (Gemma standard) | Full (Gemma standard) | Full (Gemma standard) | Removed (Uncensored) |
| Memory at 4-bit (approx.) | ~7 GB | ~7 GB | ~15 GB (total) / ~3 GB active | ~17 GB |
| Best For | Safe production | Edge / consumer GPU | Low-latency serving | Research, creative, unrestricted use |
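The memory figures in the table can be sanity-checked with simple arithmetic. The sketch below assumes q4_0 storage of 4.5 bits per weight (4 bits per value plus one 16-bit scale shared by each block of 32 weights) and counts weights only; the function name is hypothetical, and real files differ somewhat because some tensors are kept at higher precision.

```python
def q4_0_weight_gb(n_params, bits_per_weight=4.5):
    """Approximate weight storage in GB (10**9 bytes) at q4_0.

    4.5 bits/weight = 4-bit value + 16-bit scale amortized over 32 weights.
    KV cache, activations, and runtime overhead are not included.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, n in [("12B", 12e9), ("26B-A4B (all experts)", 26e9), ("31B", 31e9)]:
        print(f"{name}: {q4_0_weight_gb(n):.2f} GB")
```

This yields roughly 6.75 GB, 14.6 GB, and 17.4 GB, which matches the rounded values in the table.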
7. How to Deploy and Run These Models
7.1 Loading the Safetensors Version with Transformers
7.2 Running the GGUF Version with llama.cpp
7.3 GPTQ-Int4 with vLLM for High-Throughput Serving
7.4 NVFP4 on NVIDIA Blackwell Hardware
For users with access to Blackwell GPUs (B200/B100), the NVFP4 format unlocks native 4-bit floating-point tensor core acceleration. The NVFP4 Safetensors files can be loaded with a custom transformers branch that supports the format, while the NVFP4 GGUF files work with a specially compiled llama.cpp build with NVFP4 kernels enabled. Check the respective HuggingFace repositories for the latest loading instructions.
8. Risks, Ethics, and the “Uncensored” Label
The Gemma 4 31B QAT Uncensored Heretic raises important ethical questions that every practitioner should consider before deployment:
8.1 What “Uncensored” Actually Means
In the context of this release, “uncensored” means the model’s refusal behavior has been neutralized or removed. Refusal is not a separate classifier attached to the model; it is a behavior learned during instruction tuning, in which the model declines requests it has been trained to treat as harmful. With that behavior removed, the model will attempt to comply with any prompt, including those involving:
- Generation of violent, hateful, or harassing content
- Instructions for illegal activities
- Production of malware, exploits, or weapon-related information
- Sexually explicit or non-consensual content
- Misinformation and disinformation campaigns
8.2 Legitimate Use Cases
Despite the risks, uncensored models have legitimate applications in research, red-teaming, creative writing, and adversarial robustness testing. Security researchers use them to study jailbreaking techniques and develop better defenses. Writers use them for unfiltered creative exploration where standard models might incorrectly flag content. The key is responsible deployment with appropriate safeguards.
8.3 Mitigation Strategies
- Input and output filtering: Deploy a content moderation layer (e.g., Llama Guard, Perspective API) around the model.
- Access control: Restrict model access to authenticated and authorized users only.
- Logging and monitoring: Maintain comprehensive logs of all prompts and completions for audit purposes.
- Sandboxed deployment: Run the model in an isolated environment without internet access or system-level privileges.
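The filtering and logging items above can be combined in a thin wrapper around the model call. This is a minimal sketch under stated assumptions: `generate` and `is_allowed` are hypothetical callables standing in for your inference backend and your moderation layer (for example, a Llama Guard deployment), and the in-memory `audit_log` list stands in for a real logging system.

```python
from typing import Callable, Dict, List


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
    audit_log: List[Dict[str, str]],
    refusal: str = "Request blocked by policy.",
) -> str:
    """Check the prompt, call the model, check the completion, log everything."""
    if not is_allowed(prompt):
        audit_log.append({"prompt": prompt, "completion": "", "status": "blocked_input"})
        return refusal

    completion = generate(prompt)

    if not is_allowed(completion):
        audit_log.append({"prompt": prompt, "completion": completion, "status": "blocked_output"})
        return refusal

    audit_log.append({"prompt": prompt, "completion": completion, "status": "ok"})
    return completion
```

Checking both sides matters because a harmless-looking prompt can still produce output that the moderation layer would reject.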
9. Frequently Asked Questions
Q: What is the difference between the 12B and 12B QAT variants?
The 12B QAT variant has undergone extended quantization-aware training beyond the base QAT checkpoint, resulting in better perplexity retention when actually quantized to 4-bit. If you plan to run at 4-bit precision, choose the 12B QAT variant for marginally better quality.
Q: Can I run the 31B Uncensored Heretic on a single consumer GPU?
In its 4-bit GGUF or GPTQ-Int4 form, the 31B model needs approximately 17 GB of VRAM for the weights alone; the KV cache and activations add to that as context length grows. This fits on an RTX 4090 (24 GB) or RTX 3090 (24 GB) at moderate context lengths. For Apple Silicon, you’ll need a Mac with at least 32 GB of unified memory for reasonable performance.
Q: What does “q4_0” mean in the model name?
q4_0 is a specific 4-bit quantization scheme used in GGUF/llama.cpp. It uses symmetric per-block quantization with a block size of 32, meaning every 32 weights share a single scaling factor. It balances compression ratio and quality well for most use cases.
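The per-block scheme can be sketched in a few lines. This is a simplified illustration of symmetric block quantization with one scale per 32 weights; it is not llama.cpp's reference routine, whose rounding and scale-selection details differ, and the function names are hypothetical.

```python
BLOCK_SIZE = 32


def quantize_block(block):
    """Map 32 floats to one shared scale plus 32 signed 4-bit codes."""
    assert len(block) == BLOCK_SIZE
    amax = max(abs(v) for v in block)
    scale = amax / 8 if amax > 0 else 1.0          # avoid division by zero
    codes = [max(-8, min(7, round(v / scale))) for v in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the scale and codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    weights = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]   # toy values in [-0.5, 0.5]
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("scale:", scale, "worst-case error:", worst)
```

Because the scale is set by the largest magnitude in the block, a single outlier weight coarsens the grid for the other 31 values, which is one reason QAT-prepared weights tend to quantize better.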
Q: Is the 26B-A4B model faster than the 12B dense model?
For single-token generation, yes — the 26B-A4B MoE model only activates ~4B parameters per token, which is fewer than the 12B dense model’s 12B. However, the total memory requirement is higher (~15 GB vs. ~7 GB at 4-bit) because all experts must be loaded. Throughput depends on your hardware’s memory bandwidth.
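A rough way to reason about this is to treat single-stream decoding as memory-bandwidth bound: each generated token requires reading the active weights once. The sketch below computes that upper bound. The 400 GB/s bandwidth figure is an illustrative assumption, the function name is hypothetical, and the estimate ignores KV-cache reads, compute time, and routing overhead, so real throughput will be lower.

```python
def decode_tokens_per_sec_upper_bound(active_params, bandwidth_gb_s, bits_per_weight=4.5):
    """Bandwidth-bound ceiling on tokens/sec for single-stream decoding."""
    bytes_per_token = active_params * bits_per_weight / 8   # weights read per token
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bandwidth = 400  # GB/s, assumed for illustration
    dense_12b = decode_tokens_per_sec_upper_bound(12e9, bandwidth)
    moe_a4b = decode_tokens_per_sec_upper_bound(4e9, bandwidth)
    print(f"12B dense ceiling: {dense_12b:.1f} tok/s")
    print(f"26B-A4B ceiling:   {moe_a4b:.1f} tok/s")
```

Under these assumptions the MoE model's ceiling is three times that of the 12B dense model, simply because it reads one third as many weight bytes per token.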
Q: Are these models legal to use?
The base Gemma 4 models are released under Google’s Gemma license, which permits commercial and research use with certain restrictions. The community-modified “uncensored heretic” variants exist in a gray area — they are derivative works. Users should consult the Gemma license terms and legal counsel for their specific use case.
Q: What is NVFP4 and do I need it?
NVFP4 (NVIDIA 4-bit Floating Point) is a new format optimized for Blackwell-architecture GPUs. If you do not have a B200 or B100 GPU, you should use the standard GGUF or GPTQ-Int4 formats instead. NVFP4 offers better dynamic range than INT4 but requires specific hardware support.
Q: How do I verify the model files haven’t been tampered with?
HuggingFace displays a SHA256 hash for each large (LFS-tracked) file on that file’s page in the repository. After downloading, run sha256sum <filename> and compare the result against that hash. The GGUF format itself does not carry a built-in checksum, so this external comparison is the check to rely on.
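If sha256sum is not available on your platform, the same check can be done with Python's standard library. This is a small sketch; the function names are hypothetical, and the expected hash is whatever the repository lists for the file you downloaded.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte weights fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA256 matches the published hash."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call `verify("model.gguf", "<hash copied from the repository>")` and treat a `False` result as a reason to delete the file and download it again.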
10. Conclusion: Which Gemma 4 Variant Is Right for You?
The Gemma 4 Quadruple Release — spanning 12B, 12B QAT, 26B-A4B QAT, and 31B QAT Uncensored Heretic across Safetensors, GGUF, NVFP4, and GPTQ-Int4 — represents one of the most comprehensive community model drops in recent memory. Choosing the right variant depends entirely on your use case:
- Choose 12B Base QAT if you need a safe, aligned model for production applications with moderate compute requirements.
- Choose 12B QAT Fine-tuned if you’re deploying to edge devices or consumer GPUs and want the best possible 4-bit quality.
- Choose 26B-A4B QAT if you need low-latency inference with the knowledge breadth of a larger model — ideal for chatbots and interactive applications.
- Choose 31B QAT Uncensored Heretic if you are a researcher, red-teamer, or creative professional who needs an unrestricted model and has implemented appropriate safeguards.
For format selection:
- Safetensors for maximum flexibility and further fine-tuning
- GGUF for CPU inference, Apple Silicon, and local privacy-focused deployment
- GPTQ-Int4 for high-throughput GPU serving with vLLM
- NVFP4 if you have Blackwell hardware and want cutting-edge 4-bit floating-point performance
The community around these models is active and growing. As with all rapidly evolving open-source AI releases, stay updated via the llmfan46 HuggingFace profile and the broader Gemma community forums. The convergence of QAT, MoE architectures, and accessible quantization formats is pushing the frontier of what’s possible with locally-run large language models — and the Gemma 4 Quadruple Release is a landmark moment in that journey.