MiniMaxAI/MiniMax-M3 · Hugging Face: Minimax m3 weights are out !! It has ~428B parameters and ~23B activated parameters

📅 2026-06-13 Reddit - LocalLLaMA

MiniMax-M3 Weights Released on Hugging Face | 428B Parameter MoE Model Deep Dive


The wait is over: the MiniMax-M3 weights have officially landed on Hugging Face. This is more than another model release. It is a large step in Mixture-of-Experts (MoE) scale, packing ~428 billion total parameters while activating only ~23 billion parameters per token. First spotted and shared by Reddit user /u/mlon_eusk-_-, the release has sparked discussion across forums and Discord servers. This guide covers what you need to know, from the architecture and hands-on deployment steps to licensing and community reactions.

Total parameters: ~428B
Activated parameters: ~23B
Architecture type: MoE
Hosting platform: Hugging Face

⚠️ Breaking: The MiniMaxAI/MiniMax-M3 repository on Hugging Face now hosts the full model weights, making M3 one of the most parameter-rich open-weight models released to date. Because only ~23B parameters are activated per token, the compute per token is modest for a model of this size; the full ~428B weights, however, still have to be held in GPU memory, CPU RAM, or both.

1. What Is MiniMax-M3? A New Era of Sparse Giant Models

MiniMax-M3 is the third-generation large language model developed by MiniMaxAI, a research organization that has rapidly gained prominence for pushing the boundaries of sparse model design. Unlike dense models such as LLaMA-3-70B, where every parameter participates in every forward pass, MiniMax-M3 uses a Mixture-of-Experts strategy. The model contains many specialized "expert" sub-networks, and a gating mechanism (the router) dynamically selects which experts to engage for each input token.

The headline numbers, ~428B total parameters with only ~23B activated, imply a sparsity ratio of roughly 18.6:1 (428 ÷ 23). In plain language, only about 5.4% of the model's parameters take part in processing any given token. This design aims for a sweet spot: it preserves the knowledge capacity of a 400B+ scale model while keeping per-token compute close to that of a ~23B dense model.
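The arithmetic behind these figures can be checked in a few lines. This sketch uses the rounded ~428B/~23B figures from the post and applies the common rule of thumb that a transformer forward pass costs roughly 2 FLOPs per parameter used per token; that rule is an approximation and ignores the attention-over-context cost. The constant and function names are illustrative.

```python
# Back-of-the-envelope check of MiniMax-M3's headline numbers.
# ASSUMPTIONS: rounded figures from the announcement (~428B total, ~23B active)
# and ~2 FLOPs per used parameter per token (attention-over-context ignored).

TOTAL_PARAMS = 428e9
ACTIVE_PARAMS = 23e9


def sparsity_ratio(total: float, active: float) -> float:
    """Total parameters per activated parameter."""
    return total / active


def active_fraction(total: float, active: float) -> float:
    """Share of all parameters that take part in one token's forward pass."""
    return active / total


def forward_flops_per_token(params_used: float) -> float:
    """Approximate forward-pass FLOPs per token: ~2 FLOPs per parameter used."""
    return 2.0 * params_used


if __name__ == "__main__":
    print(f"sparsity ratio : {sparsity_ratio(TOTAL_PARAMS, ACTIVE_PARAMS):.1f}:1")
    print(f"active share   : {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
    moe = forward_flops_per_token(ACTIVE_PARAMS)
    dense = forward_flops_per_token(TOTAL_PARAMS)
    print(f"MoE ~{moe / 1e9:.0f} GFLOPs/token vs. a 428B dense model ~{dense / 1e9:.0f} GFLOPs/token")
```

Under these assumptions the per-token forward pass is about 46 GFLOPs, versus about 856 GFLOPs if all 428B parameters were used for every token.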

1.1 The MoE Architecture Explained

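Mixture-of-Experts ideas date back to the early 1990s; sparsely gated MoE layers for large neural networks were later developed at Google Brain, and the approach was popularized in open-weight models by Mixtral 8x7B and DeepSeek-V2. MiniMax-M3 takes this paradigm further; the repository's config.json is the authoritative source for its exact design, but the key ingredients are: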

Hundreds of expert feed-forward blocks distributed across multiple transformer layers.
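A learned routing mechanism that assigns each token to its top-k highest-scoring experts (k is 2 in Mixtral 8x7B and 6 routed experts in DeepSeek-V2; check the repository's config for the value MiniMax-M3 uses).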
Load-balanced training objectives to prevent expert collapse, ensuring all experts receive sufficient gradient signal.
Dense attention layers that every token passes through, with expert specialization confined primarily to the feed-forward network (FFN) layers.
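To make the routing step concrete, here is a minimal, framework-free sketch of top-k gating for a single token. It is a generic illustration of the technique, not MiniMax-M3's actual router: the expert count, k, the logits, and the toy "experts" are made-up values, and real implementations operate on batched tensors, often with extra normalization or bias terms.

```python
import math
from typing import Callable, List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their gate weights.

    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(
    x: float,
    router_logits: List[float],
    experts: List[Callable[[float], float]],
    k: int,
) -> float:
    """Run only the selected experts and mix their outputs by gate weight.

    The token representation is a single float here to keep the sketch small;
    in a real model it is a hidden-state vector and each expert is an FFN.
    """
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


if __name__ == "__main__":
    # Hypothetical router scores for one token over 8 experts.
    logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
    toy_experts = [lambda x, s=s: s * x for s in range(8)]  # expert i scales input by i
    print(top_k_route(logits, k=2))  # experts 1 and 4 are chosen
    print(moe_layer(1.0, logits, toy_experts, k=2))
```

Applying the softmax only to the selected logits gives the same weights as taking a softmax over all experts and renormalizing the top-k entries, which is a common way routers are described.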

This sparse activation is what makes the MiniMax-M3 weights release so significant: you get the breadth of a colossal model without paying the per-token compute of a dense model that size, although all ~428B weights still have to be stored somewhere.

1.2 Why the ~23B Activated Figure Matters

In dense models, total parameters equal activated parameters. A 70B dense model requires hardware capable of holding and computing across all 70 billion weights simultaneously. With MiniMax-M3, the ~23B activated parameter count means:

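Per-token compute is drastically lower than for a 400B dense model, closer to that of a ~23B dense model. Memory is a different story: because the router can choose any expert for the next token, all ~428B weights must stay reachable, so total memory still scales with the full parameter count. What the small active set does enable is offloading most experts to CPU RAM at a speed cost far smaller than offloading a 400B dense model would incur.
Per-token decoding cost scales with the activated count rather than the total, as long as the needed weights sit in fast memory, which makes interactive applications more feasible.
Fine-tuning can target specific expert modules, enabling efficient domain adaptation without updating all 428B weights.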
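A quick estimate of raw weight storage makes the difference between total and activated parameters concrete. This sketch counts weight bytes only (no KV cache, activations, or quantization scales), uses decimal gigabytes, and assumes the standard sizes of 2 bytes per parameter for FP16/BF16, 1 byte for FP8/INT8, and 0.5 bytes for 4-bit formats.

```python
# Rough weight-memory estimate for a ~428B-total / ~23B-active MoE model.
# ASSUMPTION: raw weight bytes only; KV cache, activations and quantization
# scales/zero-points are ignored, so real requirements are somewhat higher.

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8/int8": 1.0,
    "4-bit": 0.5,
}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    total, active = 428e9, 23e9
    for fmt, nbytes in BYTES_PER_PARAM.items():
        print(
            f"{fmt:>9}: all weights ~{weight_gb(total, nbytes):.0f} GB, "
            f"weights used per token ~{weight_gb(active, nbytes):.1f} GB"
        )
```

The full model comes to about 856 GB in FP16/BF16 and still about 214 GB at 4 bits, while the weights actually used for one token are only about 46 GB and 11.5 GB respectively. That gap is why offloading is viable, but it is also why the total footprint never shrinks to a "23B-sized" model.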

2. Accessing the Weights on Hugging Face

The official repository, MiniMaxAI/MiniMax-M3 on Hugging Face, hosts the complete model artifacts. As first flagged on Reddit by /u/mlon_eusk-_-, the weights are now publicly accessible (subject to the model's license terms). Here is the direct path to get started:

# Repository path on Hugging Face
MiniMaxAI/MiniMax-M3

# Direct URL format
https://huggingface.co/MiniMaxAI/MiniMax-M3

The repository includes:

Full model weights in safetensors format (sharded across multiple files for efficient downloading).
Tokenizer files compatible with the model's vocabulary.
Configuration JSON detailing the MoE architecture, expert counts, hidden dimensions, and routing parameters.
Inference code examples and a model card with usage guidelines.
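Once config.json is downloaded, a few lines of standard-library Python can pull out the MoE settings. The key names are an assumption borrowed from other Hugging Face MoE configs (Mixtral uses num_local_experts, DeepSeek-V2 uses n_routed_experts, Qwen MoE models use num_experts, and these families use num_experts_per_tok for k); MiniMax-M3's config may name its fields differently, so inspect the actual file.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional, Sequence

# Candidate key names seen in other Hugging Face MoE configs.
# ASSUMPTION: MiniMax-M3's config.json may use different names.
EXPERT_COUNT_KEYS = ("num_local_experts", "n_routed_experts", "num_experts")
TOP_K_KEYS = ("num_experts_per_tok",)


def first_present(cfg: Dict[str, Any], keys: Sequence[str]) -> Optional[Any]:
    """Return the value of the first key found in cfg, else None."""
    for key in keys:
        if key in cfg:
            return cfg[key]
    return None


def summarize_moe_config(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the MoE-related fields of interest from a parsed config."""
    return {
        "model_type": cfg.get("model_type"),
        "num_layers": cfg.get("num_hidden_layers"),
        "hidden_size": cfg.get("hidden_size"),
        "num_experts": first_present(cfg, EXPERT_COUNT_KEYS),
        "experts_per_token": first_present(cfg, TOP_K_KEYS),
    }


def load_config(path: str) -> Dict[str, Any]:
    """Read config.json from a local download directory or a direct file path."""
    p = Path(path)
    if p.is_dir():
        p = p / "config.json"
    return json.loads(p.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Hypothetical Mixtral-style config, used only to demonstrate the output shape.
    demo = {"model_type": "mixtral", "num_hidden_layers": 32, "hidden_size": 4096,
            "num_local_experts": 8, "num_experts_per_tok": 2}
    print(summarize_moe_config(demo))
```

For the real model, point load_config at the directory holding the downloaded files and pass the result to summarize_moe_config; any field that prints as None is a sign the config uses a different key name.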

2.1 Step-by-Step: Downloading and Loading MiniMax-M3

Install the required libraries: pip install transformers accelerate safetensors torch
Ensure you have sufficient disk space: at 2 bytes per parameter, ~428B parameters come to roughly 856GB in FP16/BF16 (check the repo for the actual storage precision and exact shard sizes).
Use transformers.AutoModelForCausalLM to load the MoE architecture, passing trust_remote_code=True if the repository ships custom modeling code.
Consider using device_map="auto" with accelerate to distribute experts across multiple GPUs if available.
Verify download integrity by comparing each shard's SHA-256 hash against the value shown for that file on the Hugging Face repository page.
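For the integrity check, a streaming SHA-256 avoids reading a multi-gigabyte shard into memory at once. This sketch assumes you copy the expected hash for each file from its page on the Hugging Face repository (files stored with Git LFS list a SHA256 value there); the script and shard file names in the usage comment are hypothetical.

```python
import hashlib
import sys


def sha256_of_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large shards never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shard(path: str, expected_sha256: str) -> bool:
    """Compare a local file's hash with the value shown on the repository page."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    # Usage (names are hypothetical): python verify_shard.py model-00001.safetensors <sha256>
    if len(sys.argv) == 3:
        print("OK" if verify_shard(sys.argv[1], sys.argv[2]) else "MISMATCH")
    else:
        print("usage: verify_shard.py <file> <expected_sha256>")
```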

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "MiniMaxAI/MiniMax-M3"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

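# Load model with automatic device mapping.
# Note: in BF16 the ~428B weights take roughly 856 GB, so device_map="auto" fills
# the visible GPUs first, then CPU RAM, then offload_folder on disk.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",  # disk spill-over for weights that fit in neither VRAM nor RAM
    trust_remote_code=True,  # only needed if the repo ships custom modeling code
)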

print(f"Model loaded. Total parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")

💡 Pro Tip: For systems with limited VRAM, explore 4-bit or 8-bit quantization via bitsandbytes. Keep in mind that quantization shrinks all ~428B weights, not just the ~23B active ones: at 4 bits the full model still needs roughly 214GB, so a single 48GB GPU (e.g., NVIDIA RTX A6000 or L40S) can only run it with most experts offloaded to CPU RAM, at a significant speed cost. Check the Hugging Face community tab for quantization-ready forks.

3. Performance Benchmarks and Capabilities

While the reported benchmark numbers have yet to be independently verified, early reports and the model card suggest MiniMax-M3 delivers competitive performance across:

MMLU (Massive Multitask Language Understanding) — strong scores in STEM and humanities categories.
HumanEval and MBPP — code generation and reasoning tasks.
Multilingual benchmarks — support for English, Chinese, and several other languages.
Long-context reasoning — native support for sequences exceeding 32K tokens, with some reports of effective performance up to 128K.
Instruction-following — a chat-tuned variant may also be available or forthcoming, optimized for conversational and agentic workflows.

The ~428B total parameter count should provide ample storage for facts, rare entities, and nuanced domain expertise that smaller models often struggle with. Combined with the ~23B activated parameters, this suggests the model may punch well above its inference-cost class, pending independent evaluation.

3.1 Comparison with Other MoE Models

To contextualize the MiniMax-M3 release, here is how it stacks up against other notable Mixture-of-Experts models in the open-weight ecosystem:

Model	Total Params	Activated Params	Sparsity Ratio (total ÷ activated)
MiniMax-M3	~428B	~23B	~18.6:1
Mixtral 8x7B	46.7B	12.9B	~3.6:1
DeepSeek-V2	236B	21B	~11.2:1
Qwen1.5-MoE-A2.7B	14.3B	2.7B	~5.3:1

As the table illustrates, MiniMax-M3 has the highest sparsity ratio of the models listed, well above even DeepSeek-V2. This suits knowledge-intensive tasks where a massive parameter memory is advantageous but per-token inference cost must stay low.
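The ratio column is total parameters divided by activated parameters, which can be reproduced directly from the figures in the comparison table (values in billions, rounded as listed). The dictionary below simply restates those rows.

```python
# Parameter counts (in billions) restated from the comparison table.
MODELS = {
    "MiniMax-M3": (428.0, 23.0),
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V2": (236.0, 21.0),
    "Qwen MoE": (14.3, 2.7),
}


def sparsity(total_b: float, active_b: float) -> float:
    """Total-to-activated parameter ratio."""
    return total_b / active_b


if __name__ == "__main__":
    # Print rows from most to least sparse, with the share of weights active per token.
    for name, (total_b, active_b) in sorted(MODELS.items(), key=lambda kv: -sparsity(*kv[1])):
        print(f"{name:<14} {sparsity(total_b, active_b):5.1f}:1  ({active_b / total_b:.1%} active)")
```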

4. Community Reaction and Significance

The Reddit post by /u/mlon_eusk-_- titled "Minimax m3 weights are out !! It has ~428B parameters and ~23B activated parameters" quickly became one of the most upvoted threads on r/LocalLLaMA. Commenters highlighted several key themes:

Excitement about open-weight access: Many praised MiniMaxAI for releasing such a capable model to the research community, enabling reproducibility and downstream fine-tuning.
Hardware discussions: Threads rapidly filled with estimates of VRAM requirements, quantization strategies, and multi-GPU setups for running the model locally.
Skepticism and verification: Some users called for independent benchmark evaluations to confirm the model's claimed performance, a healthy and expected part of the open-source ML lifecycle.
Comparisons to proprietary models: Early testers speculated whether MiniMax-M3 could rival closed-source offerings like Claude 3.5 Sonnet or GPT-4o on specific reasoning tasks.

The broader implication is clear: open-weight MoE models are entering a new tier of scale. MiniMax-M3 demonstrates that the community now has access to architectures that were once confined to the largest corporate labs. This democratizes research into sparse model training, alignment, and interpretability.

5. Actionable Insights: How to Leverage MiniMax-M3 Today

Whether you're an ML engineer, researcher, or hobbyist, here are concrete ways to start extracting value from the MiniMaxAI/MiniMax-M3 weights on Hugging Face immediately:

5.1 Local Deployment for Research

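Quantize aggressively: Use bitsandbytes 4-bit (NF4) or GPTQ/AWQ quantization to shrink the weights. Note that all ~428B parameters are quantized, so the 4-bit model is still roughly 214GB; a single 48GB GPU can hold only part of it, and the rest must be offloaded. Expect some quality degradation, but for many research tasks it remains highly usable.
Multi-GPU sharding: Leverage accelerate or DeepSpeed ZeRO-3 to split the layers across 2–4 consumer GPUs (e.g., 2x RTX 4090 24GB or 4x RTX 3090 24GB). Even 4x24GB (96GB) cannot hold the ~214GB 4-bit weights, so plan on combining sharding with CPU offloading.
CPU offloading: Combine GPU inference with CPU offloading using device_map="auto", optionally with a max_memory budget per device and offload_folder set for disk spill-over. Note that accelerate decides placement by module size and order, not by how often each expert is used.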
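A small budget calculation shows how much must spill to CPU RAM for the GPU setups mentioned above, and builds a dictionary in the device-to-size shape that from_pretrained accepts through its max_memory argument. The 2GB per-GPU headroom and the weights-only 4-bit size (428B × 0.5 bytes) are illustrative assumptions, not measured values.

```python
import math
from typing import Dict, Union

# ASSUMPTIONS: weights-only sizes (no KV cache / activations), decimal GB,
# and a fixed per-GPU headroom guess; tune all of these for real hardware.


def cpu_spill_gb(weights_gb: float, num_gpus: int, gpu_gb: float,
                 headroom_gb: float = 2.0) -> float:
    """GB of weights that do not fit on the GPUs and must go to CPU RAM (or disk)."""
    usable = max(gpu_gb - headroom_gb, 0.0) * num_gpus
    return max(weights_gb - usable, 0.0)


def plan_max_memory(weights_gb: float, num_gpus: int, gpu_gb: float,
                    headroom_gb: float = 2.0) -> Dict[Union[int, str], str]:
    """Build a {device: "NGB"} budget: GPUs get their usable share, CPU gets the rest."""
    per_gpu = int(max(gpu_gb - headroom_gb, 0.0))
    plan: Dict[Union[int, str], str] = {i: f"{per_gpu}GB" for i in range(num_gpus)}
    spill = cpu_spill_gb(weights_gb, num_gpus, gpu_gb, headroom_gb)
    plan["cpu"] = f"{math.ceil(spill)}GB"
    return plan


if __name__ == "__main__":
    weights_4bit_gb = 428e9 * 0.5 / 1e9  # ~214 GB of 4-bit weights
    for num_gpus, gpu_gb in [(1, 48), (2, 24), (4, 24)]:
        spill = cpu_spill_gb(weights_4bit_gb, num_gpus, gpu_gb)
        print(f"{num_gpus}x{gpu_gb}GB GPUs: ~{spill:.0f} GB must live in CPU RAM")
    print(plan_max_memory(weights_4bit_gb, num_gpus=4, gpu_gb=24))
```

The resulting dictionary can be passed as max_memory alongside device_map="auto". In practice, give the "cpu" entry extra margin; whatever still does not fit is sent to disk, which requires offload_folder to be set.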

5.2 Fine-Tuning and Domain Adaptation

Because the MoE structure isolates expertise in distinct feed-forward blocks, you can adopt parameter-efficient fine-tuning methods:

LoRA on expert layers: Apply Low-Rank Adaptation specifically to the top-k most relevant experts for your domain, leaving the rest of the ~428B parameters frozen.
Expert pruning and merging: Identify and prune experts that contribute minimally to your target tasks, further reducing the memory footprint.
Continual pre-training on niche corpora: Medical, legal, or scientific domains can benefit from additional training on specialized text, with the model's vast capacity absorbing new knowledge efficiently.
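The savings from LoRA come from replacing a full update of a weight matrix W (d_out × d_in) with the product of two thin matrices B (d_out × r) and A (r × d_in), which adds only r × (d_in + d_out) trainable parameters per adapted matrix. The sketch below counts both; every dimension in it is a hypothetical placeholder rather than MiniMax-M3's real configuration, which you would read from the repository's config.json.

```python
# Trainable-parameter arithmetic for LoRA on expert FFN matrices.
# ASSUMPTION: all sizes below are hypothetical placeholders, not MiniMax-M3's
# published configuration.


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    hidden, expert_ffn = 4096, 1536            # hypothetical dimensions
    matrices_per_expert = 3                    # e.g. gate, up, down; each has hidden*ffn entries
    adapted_experts, layers, rank = 8, 60, 16  # adapt 8 experts in each of 60 layers

    n = matrices_per_expert * adapted_experts * layers
    full = n * full_params(hidden, expert_ffn)
    lora = n * lora_params(hidden, expert_ffn, rank)
    print(f"full update of those matrices: {full / 1e9:.2f}B params")
    print(f"LoRA rank {rank}: {lora / 1e6:.1f}M params ({lora / full:.2%} of full)")
```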

5.3 API and Production Serving

For teams looking to deploy MiniMax-M3 in production:

Use vLLM or TGI: Both vLLM and Text Generation Inference (TGI) support a range of MoE architectures. Check their latest documentation and release notes for MiniMax-M3 support before planning a deployment.
Batch inference optimization: Batching multiple requests amortizes weight reads across tokens, raising throughput. In an MoE, however, a larger batch routes to more distinct experts per step, so per-step weight traffic grows with batch size toward the full ~428B, unlike a dense model where it stays fixed.
Monitor expert utilization: Log which experts are activated per prompt category to understand usage patterns and optimize the routing configuration if the framework allows.
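Two small helpers illustrate the serving-side points above. The first tallies routing decisions per expert, assuming your serving stack can expose the selected expert indices for each token (how to obtain them depends on the framework and is not shown). The second estimates how many distinct experts one batch touches in a layer under the simplifying assumption of uniform, independent routing, E × (1 − (1 − k/E)^B) for E experts, top-k routing, and B tokens; the 256-expert, top-8 setting in the demo is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def tally_expert_usage(routed: Iterable[Sequence[int]]) -> Counter:
    """Count how often each expert index was selected.

    `routed` yields, per token, the expert indices the router chose for one
    layer (hypothetical input format; adapt it to your framework's hooks).
    """
    counts: Counter = Counter()
    for experts in routed:
        counts.update(experts)
    return counts


def usage_share(counts: Counter, num_experts: int) -> Dict[int, float]:
    """Fraction of all routing slots that went to each expert (0.0 if unused)."""
    total = sum(counts.values()) or 1
    return {e: counts.get(e, 0) / total for e in range(num_experts)}


def expected_distinct_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected distinct experts hit by a batch, assuming each token picks top_k
    distinct experts uniformly at random and tokens route independently."""
    p_missed_by_one_token = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_missed_by_one_token ** batch_tokens)


if __name__ == "__main__":
    log = [[1, 4], [1, 7], [4, 1], [2, 4]]  # toy routing log: top-2 of 8 experts
    counts = tally_expert_usage(log)
    print(counts.most_common(3))
    print(usage_share(counts, num_experts=8))
    for batch in (1, 8, 64, 512):
        print(batch, round(expected_distinct_experts(256, 8, batch), 1))
```

Real routers are far from uniform, so logged utilization (the first helper) is the better guide for a specific workload; the formula only shows the trend that bigger batches pull in more of the model's weights per step.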

6. Licensing and Responsible Use

As with any major open-weight release, understanding the license is critical. At the time of writing, the MiniMax-M3 weights are distributed under a custom license that likely includes:

Research and non-commercial use allowances by default.
Commercial use may require explicit permission or a separate agreement with MiniMaxAI.
Usage restrictions prohibiting harmful applications, generation of illegal content, and violation of applicable laws.

Always review the full license file in the Hugging Face repository (LICENSE or LICENSE.txt) before integrating MiniMax-M3 into any product or service. The open-source community thrives on clarity and respect for model creators' terms.

7. Technical Deep Dive: What Makes ~23B Activated Parameters Work So Well?

The magic of MiniMax-M3 lies in the interplay between its routing mechanism and its expert granularity. Unlike early MoE models that used a small number of large experts (e.g., Mixtral 8x7B with 8 experts per layer), MiniMax-M3 is rumored to employ a fine-grained expert structure with potentially hundreds of smaller experts per layer. This design:

Increases combinatorial expressiveness: With many small experts, the routing combinatorics explode, allowing the model to capture highly specialized patterns.
Improves load balancing: Fine granularity makes it easier to distribute tokens evenly, mitigating the "expert collapse" problem.
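May enable efficient hardware utilization: with grouped matrix-multiplication kernels, many small expert computations can be batched together, although very small experts are harder to run at full tensor-core efficiency than a few large ones.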
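The combinatorial point can be quantified by counting how many different expert subsets a top-k router can choose per token per layer, the binomial coefficient C(E, k). Mixtral's 8 experts with top-2 routing give 28 combinations; the fine-grained setting in the demo (256 experts, top-8) is a hypothetical configuration, since MiniMax-M3's exact counts are not confirmed here. math.comb requires Python 3.8 or later.

```python
import math


def expert_combinations(num_experts: int, top_k: int) -> int:
    """Distinct expert subsets a top-k router can choose per token per layer."""
    return math.comb(num_experts, top_k)


if __name__ == "__main__":
    print("Mixtral-style, 8 experts, top-2:", expert_combinations(8, 2))  # 28
    # Hypothetical fine-grained layer; not MiniMax-M3's confirmed configuration.
    print(f"Fine-grained, 256 experts, top-8: {expert_combinations(256, 8):.3e}")
```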

The ~428B total parameters are not just a vanity metric — they represent a vast distributed memory that the ~23B activated subset can selectively query. This is analogous to having an enormous library where you only need to consult a few relevant books for each question.

8. Frequently Asked Questions (FAQ)

Q: Where exactly can I find the MiniMax-M3 weights?

A: The weights are hosted on Hugging Face under the repository MiniMaxAI/MiniMax-M3. You can access them directly at https://huggingface.co/MiniMaxAI/MiniMax-M3. The repository surfaced prominently after being shared by Reddit user /u/mlon_eusk-_-.

Q: What does "~428B parameters and ~23B activated parameters" actually mean for my hardware?

A: It means you need enough combined memory (VRAM + RAM, with disk as a slow last resort) to hold ~428B parameters in your chosen precision (about 856GB in FP16/BF16, or roughly 214GB at 4 bits). For inference, however, only ~23B parameters are active for any given token, so the per-token compute is closer to that of a 23B–30B dense model. Even with quantization the weights do not fit on a single GPU, so a single high-end GPU or a small cluster of consumer GPUs has to offload most experts to CPU RAM.

Q: Is MiniMax-M3 better than GPT-4 or Claude?

A: Early community evaluations are promising, but it is too soon for definitive conclusions. The ~428B total parameter count gives it immense knowledge capacity, but real-world performance depends on training data quality, alignment, and the specific task. Independent benchmarks are in progress — check the Hugging Face model card and community leaderboards for updates.

Q: Can I fine-tune MiniMax-M3 on my own dataset?

A: Yes, but full fine-tuning of all ~428B parameters would be extremely resource-intensive. Most practitioners will opt for parameter-efficient fine-tuning (PEFT) methods like LoRA, focusing on specific expert layers. This dramatically reduces the memory and compute needed for adaptation.

Q: What license does MiniMax-M3 use?

A: Refer to the license file in the Hugging Face repository. As of this writing, it is a custom license that permits research use, with commercial applications potentially requiring separate authorization. Always verify the latest terms before deployment.

Q: Who is behind MiniMaxAI?

A: MiniMaxAI is an AI research company that has been steadily releasing increasingly capable models. Their focus on Mixture-of-Experts efficiency and open-weight releases has earned them a strong reputation in the ML community. The MiniMax-M3 release marks their most ambitious open model to date.

9. Conclusion: The Open-Weight MoE Revolution Is Here

The release of the MiniMaxAI/MiniMax-M3 weights on Hugging Face, announced to many through the Reddit post "Minimax m3 weights are out !! It has ~428B parameters and ~23B activated parameters", marks a significant moment for open-source AI. It shows that sparse, ultra-large models need not remain locked behind corporate APIs. Pairing a ~428B parameter memory with a lean ~23B activated inference footprint offers a pragmatic path to frontier-scale capacity at a fraction of the per-token compute of an equally large dense model.

As the community dives into quantization recipes, fine-tuning experiments, and independent evaluations, the true capabilities of MiniMax-M3 will come into sharper focus. One thing is already certain: the era of giant open-weight MoE models has officially begun, and MiniMax-M3 is leading the charge. Whether you are a researcher probing model internals, a developer building the next generation of AI applications, or an enthusiast eager to run a 428B-parameter behemoth on your own rig — the weights are out, the code is available, and the future is sparse.


Disclaimer: This article reflects information available at the time of writing. Model specifications, licensing terms, and community resources may evolve. Always consult the official MiniMaxAI/MiniMax-M3 Hugging Face repository for the latest documentation and usage guidelines. The mention of Reddit user /u/mlon_eusk-_- and the linked post is for contextual attribution and does not imply endorsement.

Published by the Model Release Hub.