What Models You Guys Running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB? — The Complete Local AI Stacks Guide

📅 2026-06-13 Reddit - LocalLLaMA

What Models to Run on 8GB, 16GB, 24GB, 32GB, and 48GB VRAM — The Definitive Local AI Guide

What Models You Guys Running on 8GB? 16GB VRAM? 24GB? 32GB? 48GB? — The Complete Local AI Stacks Guide

The local AI landscape moves at breakneck speed. One month a model dominates every leaderboard; the next, a new quantization technique or inference engine reshapes what's possible on consumer hardware. This article is a living snapshot, aggregating real-world community experiences about what models people are actually running at each VRAM tier — from budget 8GB cards all the way up to 48GB workstation beasts. We cover model choices, KV cache configurations, context length trade-offs, tokens-per-second performance, underlying hardware, and the diverse use cases driving these setups. Whether you're building a privacy-first coding assistant, a research analysis pipeline, or a creative storytelling companion, this guide will help you dial in your stack with confidence.

📋 In This Guide

8GB VRAM Tier — The Efficiency Sweet Spot
16GB VRAM Tier — The Mainstream Powerhouse
24GB VRAM Tier — The Enthusiast's Playground
32GB VRAM Tier — Prosumer & Multi-GPU Bridge
48GB VRAM Tier — The Workstation Class
KV Cache & Context Length Deep Dive
Hardware-Aware Model Selection Matrix
Real-World Performance Benchmarks
What Are People Actually Using These Models For?
Frequently Asked Questions
Conclusion & Community Wisdom

8GB VRAM Tier — The Efficiency Sweet Spot

Eight gigabytes of VRAM is the entry point that still unlocks genuinely useful local AI. With 8GB, you're not running unquantized 70B monsters, but a wave of highly optimized 7B–13B parameter models at 4-bit or 5-bit quantization (Q4_K_M, Q5_K_M) delivers surprisingly capable results. The community has converged on a few standout performers that balance intelligence, speed, and memory footprint.

Top Model Picks for 8GB VRAM

Mistral-7B-Instruct (v0.3 / v0.4) — Q5_K_M — The reigning champ for general-purpose chat, summarization, and light coding on constrained hardware. Snappy inference, strong instruction following.
Llama-3-8B-Instruct — Q4_K_M — Meta's 8B offers remarkable reasoning depth for its size. Q4_K_M fits comfortably with room for a 4K–8K context window.
Gemma-2-9B-Instruct — Q4_K_M or IQ4_NL — Google's 9B punches above its weight class, especially for factual recall and structured output. The IQ4_NL quant saves precious VRAM with minimal quality loss.
Phi-3-mini-4k (3.8B) — Q8_0 or FP16 — When speed and low latency are paramount, Microsoft's tiny powerhouse runs fully unquantized on 8GB and handles RAG, classification, and lightweight tool-calling admirably.
Qwen2.5-7B-Instruct — Q5_K_M — Exceptional for multilingual tasks and code generation; Qwen's 7B at Q5_K_M fits 8GB with a healthy context buffer.

KV Cache & Context Settings for 8GB

KV cache memory is the hidden tax on your VRAM budget. On 8GB cards, every token of context consumes significant space — roughly 0.5MB to 1.2MB per 1K tokens for a 7B model at 4-bit, depending on the attention implementation. Community wisdom suggests:

Default context: 4096–8192 tokens for 7B–8B models at Q4/Q5 quants.
KV cache quantization (FP8 / Q8_0 cache): Enabling KV cache quantization in llama.cpp or exllamav2 can reclaim 30–40% of cache memory, letting you stretch to 12K–16K context on 8GB.
Flash Attention: If your backend supports it, flash attention dramatically reduces peak memory during prefill, improving context headroom.
Swapping to shared GPU memory (CUDA system fallback): Some users let overflow spill into shared memory, but this tanks token generation speed below 5 t/s — generally not recommended for interactive use.

Typical Hardware for 8GB Setups

NVIDIA RTX 3070 / RTX 3060 Ti / RTX 4060 Ti 8GB
NVIDIA RTX 2070 Super / GTX 1080
AMD Radeon RX 6600 XT / RX 7600 (via ROCm or Vulkan backends)
Apple M1/M2 with 8GB unified memory (Metal-accelerated via llama.cpp)

Performance Expectations

Model	Quantization	Context	Hardware	Tokens/sec
Mistral-7B-Instruct	Q5_K_M	4096	RTX 3070 8GB	45–55 t/s
Llama-3-8B-Instruct	Q4_K_M	8192	RTX 4060 Ti 8GB	38–48 t/s
Gemma-2-9B	IQ4_NL	6144	RTX 3070 8GB	40–50 t/s
Phi-3-mini (3.8B)	FP16	4096	RTX 3060 Ti 8GB	90–120 t/s
Qwen2.5-7B	Q5_K_M	4096	AMD RX 7600 (Vulkan)	25–35 t/s

16GB VRAM Tier — The Mainstream Powerhouse

Sixteen gigabytes is where local AI truly opens up. This is the most common VRAM capacity among serious hobbyists, and it comfortably hosts 7B–13B models at high quantization (Q6_K, Q8_0) or smaller models at full FP16, while also unlocking entry to 20B–34B class models at aggressive quants (IQ3_XXS, Q3_K_M). The 16GB tier is also the first rung where running a Mixture of Experts (MoE) model like a quantized Mixtral becomes viable.

Top Model Picks for 16GB VRAM

Llama-3.1-8B-Instruct — Q8_0 — Running the 8B class at near-lossless Q8_0 quality with ample context room. Fantastic for long-form writing and complex multi-turn conversations.
Mistral-Nemo-12B (Mistral + Nvidia) — Q5_K_M — A 12B joint effort with a 128K native context window. At Q5_K_M it fits 16GB with 8K–16K usable context and delivers excellent multilingual performance.
Qwen2.5-14B-Instruct — Q4_K_M — The 14B Qwen sits in a goldilocks zone: significantly smarter than 7B models, yet still fits 16GB at Q4 with comfortable 8K context.
Phi-3-medium-14B — Q4_K_M — Microsoft's mid-tier Phi model excels at reasoning-heavy tasks and fits 16GB with room to spare.
Mixtral-8x7B-Instruct — IQ3_XXS or Q2_K — MoE architecture means only ~12.9B active parameters per token, but the full model spans ~46B. Aggressive quants run on 16GB, yielding surprisingly coherent outputs for creative writing and brainstorming.
CodeQwen1.5-7B-Chat — Q8_0 — For developers, running a dedicated code model at Q8_0 on 16GB leaves VRAM for LSP integration and large codebase context.

KV Cache & Context Settings for 16GB

8B models at Q8_0: Comfortable at 16K–32K context with KV cache quantization enabled.
12B–14B models at Q4/Q5: 8K–16K context is the sweet spot; pushing to 32K requires aggressive KV cache quantization (Q4_0 cache) and may slow generation slightly.
MoE models (Mixtral): KV cache overhead is proportional to total parameter count, not active parameters. Keep context at 4K–8K for smooth performance on 16GB.
Tool of choice: exllamav2 with its 8-bit cache is widely praised in the community for maximizing context on 16GB cards.

Typical Hardware for 16GB Setups

NVIDIA RTX 4080 / RTX 4070 Ti Super / RTX 3080
NVIDIA RTX 4060 Ti 16GB
AMD Radeon RX 6800 / RX 6900 XT / RX 7800 XT
Apple M2 Pro / M3 with 16GB unified memory
Intel Arc A770 16GB (via IPEX-LLM or llama.cpp Vulkan)

Performance Expectations

Model	Quantization	Context	Hardware	Tokens/sec
Llama-3.1-8B	Q8_0	16K	RTX 4080 16GB	55–70 t/s
Mistral-Nemo-12B	Q5_K_M	12K	RTX 4070 Ti Super 16GB	35–45 t/s
Qwen2.5-14B	Q4_K_M	8K	RTX 3080 16GB (modded)	30–40 t/s
Mixtral-8x7B	IQ3_XXS	4K	RTX 4080 16GB	25–35 t/s
CodeQwen1.5-7B	Q8_0	32K	RX 7800 XT (ROCm)	40–50 t/s

24GB VRAM Tier — The Enthusiast's Playground

Twenty-four gigabytes is the enthusiast sweet spot — the domain of the RTX 3090, RTX 4090, and high-end workstation cards. Here, 13B–20B models run at Q6_K or Q8_0 with generous 16K–32K context, and 34B class models become viable at Q4_K_M. This tier also supports running Mixtral-8x7B at Q4_K_M and similar MoE models with comfortable context, making it a favorite for those who prioritize quality over raw speed.

Top Model Picks for 24GB VRAM

Llama-3.1-70B — IQ2_XXS / IQ3_XXS (via 24GB) — Yes, a 70B model on 24GB. With the newest ultra-low quants from the IQ series, a 70B Llama can just squeeze onto a 24GB card with 2K–4K context. Quality is degraded but still surpasses many smaller models for certain reasoning tasks.
Qwen2.5-32B-Instruct — Q4_K_M — The 32B Qwen is arguably the best single-card 24GB model for complex reasoning, advanced code generation, and long-form structured output. At Q4_K_M it fits with 8K–16K context.
Gemma-2-27B-Instruct — Q4_K_M — Google's 27B excels at instruction following and factual accuracy. Fits 24GB at Q4 with 8K context and delivers strong performance.
Mixtral-8x7B-Instruct — Q5_K_M — The MoE sweet spot: Q5_K_M on 24GB with 8K–12K context. Excellent for creative writing, roleplay, and multilingual tasks.
Command-R-Plus (Cohere, 104B) — IQ2_XXS — Another ultra-quant experiment that fits 24GB. Primarily for research and experimentation; not recommended for production use but fascinating for testing the limits.
CodeLlama-34B-Instruct — Q4_K_M — A dedicated 34B code model for serious software engineering tasks, fitting 24GB with comfortable context for large codebase reasoning.

KV Cache & Context Settings for 24GB

32B models at Q4: 8K–16K context is standard; 32K achievable with Q8_0 KV cache and flash attention.
MoE models at Q5: 8K–12K context is the practical ceiling before generation speed degrades below 15 t/s.
70B ultra-quants: 2K–4K context only; KV cache consumes a huge fraction of remaining VRAM. Consider this an experimental playground, not a daily driver.
Multi-GPU offloading preview: Some 24GB owners pair a secondary card (e.g., RTX 3060 12GB) to offload layers, unlocking larger models with higher quants — a bridge to the 32GB+ tier.

Typical Hardware for 24GB Setups

NVIDIA RTX 4090 / RTX 3090 / RTX 3090 Ti
NVIDIA RTX A5000 / A5500 (workstation cards)
NVIDIA Titan RTX
AMD Radeon RX 7900 XTX (24GB, via ROCm)
Dual RTX 3060 12GB setups (combined 24GB via llama.cpp layer splitting)

Performance Expectations

Model	Quantization	Context	Hardware	Tokens/sec
Qwen2.5-32B	Q4_K_M	12K	RTX 4090 24GB	28–38 t/s
Gemma-2-27B	Q4_K_M	8K	RTX 3090 24GB	25–35 t/s
Mixtral-8x7B	Q5_K_M	10K	RTX 4090 24GB	22–32 t/s
Llama-3.1-70B	IQ3_XXS	3K	RTX 4090 24GB	6–10 t/s
CodeLlama-34B	Q4_K_M	8K	RTX 3090 24GB	20–30 t/s

32GB VRAM Tier — Prosumer & Multi-GPU Bridge

The 32GB tier is less about single consumer GPUs and more about multi-GPU setups, Apple Silicon Macs with large unified memory, and professional workstation cards. Two RTX 3090s in NVLink or pooled via llama.cpp, an Apple M2 Ultra with 32GB+ unified memory, or a single RTX 5000 Ada / A6000-class card all fall here. This capacity comfortably runs 34B–70B models at Q4_K_M to Q5_K_M with 16K+ context.

Top Model Picks for 32GB VRAM

Llama-3.1-70B-Instruct — Q4_K_M — The community's most-cited "daily driver" for 32GB. Full 70B power at Q4 with 8K–16K context. Excellent for research, advanced analysis, and professional writing.
Qwen2.5-72B-Instruct — Q4_K_M — A strong 70B-class alternative with exceptional multilingual and coding capabilities. Fits 32GB with 8K–12K context.
Command-R-Plus (104B) — Q3_K_M — Cohere's massive model at Q3_K_M squeezes onto 32GB with 4K–6K context. Impressive for RAG-style enterprise tasks.
Mixtral-8x22B-Instruct — Q4_K_M — The bigger MoE sibling with 22B experts. Total ~141B parameters but only ~39B active. Fits 32GB at Q4 with 6K–8K context and delivers top-tier multilingual reasoning.
DeepSeek-V2-Lite-Chat (16B MoE) — Q6_K — DeepSeek's efficient architecture runs luxuriously on 32GB with high quant and long context for coding and math.

KV Cache & Context Settings for 32GB

70B at Q4: 8K–16K context standard; 32K possible with Q8_0 KV cache and flash attention, though generation speed may dip to 8–12 t/s at long contexts.
MoE 141B at Q4: 6K–10K context; KV cache is the primary constraint due to total parameter count.
Multi-GPU splitting: When using llama.cpp with tensor parallelism across two 16GB GPUs, KV cache is typically replicated (not sharded), so the per-GPU cache budget is half the total — plan accordingly.
Apple Silicon unified memory: On M2 Ultra with 32GB, Metal-backed llama.cpp handles 70B Q4 with 8K context smoothly; the unified memory architecture eliminates PCIe bottlenecks entirely.

Typical Hardware for 32GB Setups

Dual RTX 3090 24GB (pooled, 48GB total but often reported in 32GB-usable configs for model+KV cache)
Single RTX A6000 / RTX 5000 Ada (32GB workstation card)
Apple M2 Ultra with 32GB unified memory (or M3 Max with 36GB)
Dual RTX 4060 Ti 16GB (32GB combined via layer splitting)
AMD Radeon Pro W6800 32GB

Performance Expectations

Model	Quantization	Context	Hardware	Tokens/sec
Llama-3.1-70B	Q4_K_M	12K	Dual RTX 3090 (48GB total)	14–22 t/s
Qwen2.5-72B	Q4_K_M	8K	Dual RTX 3090	12–20 t/s
Mixtral-8x22B	Q4_K_M	8K	RTX A6000 32GB	15–22 t/s
Command-R-Plus (104B)	Q3_K_M	4K	Apple M2 Ultra 32GB	6–10 t/s

48GB VRAM Tier — The Workstation Class

Forty-eight gigabytes is the realm of dual RTX 3090/4090 setups in NVLink, RTX A6000 Ada (48GB), and high-end Apple Silicon (M2 Ultra 48GB+). This tier comfortably runs 70B models at Q6_K or Q8_0 with 16K–32K context, and can even host 120B+ models at Q4. It's the target for those running local AI as a primary work tool — researchers, indie developers building AI-native apps, and enterprises keeping data in-house.

Top Model Picks for 48GB VRAM

Llama-3.1-70B-Instruct — Q6_K or Q8_0 — At near-lossless quantization with 32K context, this is the local AI experience most comparable to hosted APIs. Stunning quality for professional writing, analysis, and agentic workflows.
Qwen2.5-72B-Instruct — Q6_K — Running a 72B at Q6_K with 16K+ context is a premium experience for coding, math, and structured data tasks.
Command-R-Plus (104B) — Q4_K_M — Fits 48GB with 6K–10K context; a strong choice for enterprise RAG pipelines and long-document summarization.
Falcon-40B-Instruct — Q8_0 or FP16 — While older, Falcon's 40B at full precision on 48GB is a research darling for fine-tuning experiments and structured output.
Yi-34B-200K — Q5_K_M — Yi's massive 200K native context window becomes practically usable on 48GB. At Q5_K_M with 32K–64K context, it's ideal for legal document review and academic research.
DeepSeek-V2-Chat (236B MoE) — IQ3_XXS — The full DeepSeek MoE at ultra-low quants can just fit 48GB with 2K–4K context. A glimpse into the frontier of local MoE inference.

KV Cache & Context Settings for 48GB

70B at Q6/Q8: 16K–32K context is comfortable; with flash attention and KV cache quantization, 64K+ is achievable for some architectures.
100B+ models at Q4: 6K–12K context is the practical range; the larger parameter count means larger per-token KV cache entries.
200K native context models (Yi): True 200K context requires disabling KV cache quantization and accepting slower speeds (5–10 t/s), but 32K–64K is perfectly usable at full speed.
NVLink benefits: On dual 3090/4090 setups with NVLink, peer-to-peer memory access reduces KV cache replication overhead, effectively increasing usable cache by 15–25% compared to non-NVLink pooling.

Typical Hardware for 48GB Setups

Dual RTX 4090 24GB (NVLink) or Dual RTX 3090 24GB
Single NVIDIA RTX A6000 Ada 48GB
NVIDIA L40 / L40S 48GB (data center GPUs)
Apple M2 Ultra with 48GB–64GB unified memory
Dual AMD Radeon Pro W7900 24GB (48GB combined)

Performance Expectations

Model	Quantization	Context	Hardware	Tokens/sec
Llama-3.1-70B	Q8_0	32K	Dual RTX 4090 48GB	18–28 t/s
Qwen2.5-72B	Q6_K	16K	RTX A6000 Ada 48GB	15–24 t/s
Command-R-Plus (104B)	Q4_K_M	8K	Dual RTX 3090 48GB	10–16 t/s
Yi-34B-200K	Q5_K_M	48K	Dual RTX 4090 48GB	12–18 t/s
DeepSeek-V2 (236B MoE)	IQ3_XXS	3K	Apple M2 Ultra 64GB	3–6 t/s

KV Cache & Context Length — The Silent Performance Knob

If model size is the engine, KV cache configuration is the transmission. The key-value cache stores the attention keys and values for every token in your context window, and it grows linearly with both model size and context length. Misconfigure it, and you'll either crash with out-of-memory errors or leave significant VRAM idle.

How Much VRAM Does KV Cache Consume?

A rough formula used across the community for a model with N layers, H hidden dimensions, and G KV heads, running C context tokens at B bytes per cache element:

KV_cache_bytes ≈ 2 × N × G × (H / total_heads) × C × B × 2  (for K and V matrices)

In practice, for a 7B model at 4K context with FP16 KV cache, expect ~0.8–1.2 GB consumed by the cache alone. At 32K context, that balloons to 6–10 GB. This is why KV cache quantization (FP8, Q8_0, Q4_0) is the most impactful optimization after model quantization itself.

Community KV Cache Strategies

Flash Attention 2/3: Reduces peak memory during prefill by avoiding materialization of the full attention matrix. Supported in exllamav2, vLLM, and recent llama.cpp builds.
KV Cache Quantization (FP8 / Q8_0 / Q4_0): Trade a tiny amount of output quality for 30–60% cache memory savings. On 8GB and 16GB cards, this is often the difference between a 4K and a 12K context window.
Sliding Window Attention: Some models (Mistral, some Qwen variants) use sliding window attention, which bounds cache growth and enables longer effective contexts without linear memory scaling.
Context Offloading: In llama.cpp, unused KV cache portions can be offloaded to CPU RAM, but this incurs a significant latency penalty on token generation — best reserved for batch processing, not interactive chat.
Cache Pruning / Eviction Policies: Advanced backends like vLLM implement intelligent eviction of less-important KV entries, maintaining quality while capping memory usage — increasingly adopted for long-context serving.

Hardware-Aware Model Selection Matrix

Use this quick-reference table to map your hardware to the optimal model tier and expected experience level:

Your VRAM	Recommended Model Class	Quantization Range	Comfortable Context	Experience Level
8GB	3B–8B	Q4_K_M to Q8_0 (for <5B)	4K–12K	Everyday assistant, light coding, summarization
16GB	8B–14B (or MoE at IQ3)	Q4_K_M to Q8_0	8K–32K	Serious hobbyist, professional writing, mid-complexity coding
24GB	14B–34B (or 70B at IQ2)	Q4_K_M to Q6_K	8K–32K	Enthusiast, advanced coding, research, creative work
32GB	34B–72B	Q4_K_M to Q5_K_M	8K–32K	Prosumer, enterprise RAG, multilingual analysis
48GB	70B–104B (or MoE at Q4+)	Q4_K_M to Q8_0	16K–64K	Workstation, fine-tuning, agentic systems, legal/academic research

Real-World Performance Benchmarks — Tokens Per Second & Quality Trade-offs

Performance is a nuanced concept in local AI. Raw tokens-per-second is just one axis; time-to-first-token (TTFT), prompt processing speed, and output quality at a given quant all matter. Community benchmarks consistently show:

TTFT becomes the bottleneck at long contexts: Processing a 32K-token prompt on a 70B model can take 30–90 seconds before the first token appears, even on 48GB dual-GPU setups. Flash attention and prompt caching in backends like vLLM mitigate this.
IQ quants vs K-quants: The newer IQ (Integer Quantization) series from llama.cpp generally preserve more quality at equivalent bit widths compared to the older K-quant series, especially at 2-bit and 3-bit levels. For 70B on 24GB, IQ3_XXS often outperforms Q3_K_S in human preference tests.
exllamav2 vs llama.cpp: For pure GPU inference on NVIDIA hardware, exllamav2 consistently delivers 10–25% higher throughput and lower latency. llama.cpp remains the king of cross-platform compatibility (Apple Silicon, AMD, Intel, CPU fallback).
Batch size matters for throughput: If you're serving multiple users or running batched evaluations, vLLM with continuous batching can multiply effective throughput 3–5× compared to single-stream inference in llama.cpp.

                ⚡ Community Pro Tip: For the smoothest interactive experience, target 20+ t/s generation speed. Below 10 t/s, the experience feels sluggish for chat. Reserve sub-10 t/s setups for batch jobs, overnight research runs, or situations where model intelligence justifies the wait.
            

What Are People Actually Using These Models For?

The question "What are you using your models for?" reveals the incredible diversity of local AI applications. Based on aggregated community responses, here are the most common use cases at each tier:

8GB Tier — Everyday AI Assistants

Privacy-first Personal journaling and reflection with local chat (no data leaves the machine)
Coding Lightweight code autocomplete and inline suggestions (Continue.dev + Ollama)
Education Language learning partners, flashcard generation, textbook Q&A
Creative Short-form story drafting, D&D campaign notes, NPC dialogue generation
Home automation On-device intent parsing for Home Assistant voice control

16GB Tier — Professional & Creative Powerhouses

Development Full-stack code generation, refactoring, and test writing with dedicated code models
Writing Long-form content drafting, editing, and style transfer (novels, screenplays, marketing copy)
Research Paper summarization, citation extraction, literature review assistance
Multilingual Translation and cross-lingual content creation with Qwen or Mistral-Nemo
Gaming AI-driven NPCs in modded games (Skyrim, Mount & Blade) via local API servers

24GB+ Tier — Advanced & Enterprise Workloads

Agentic AI Multi-step autonomous agents for research, data analysis, and task automation
Legal Contract review, clause extraction, compliance checking with long-context models
Academic Full-paper analysis, cross-reference verification, hypothesis generation
Enterprise RAG Internal knowledge base Q&A with 70B+ models on proprietary documents
Fine-tuning LoRA/QLoRA fine-tuning of 7B–13B models for domain-specific tasks, using the larger GPU for training while inference runs elsewhere
Medical/Health On-premise analysis of clinical notes (HIPAA-compliant, no cloud exposure)

Frequently Asked Questions

What is the absolute best model I can run on 8GB VRAM right now?

As of mid-2025, the community consensus points to Llama-3.1-8B-Instruct at Q4_K_M or Gemma-2-9B-Instruct at IQ4_NL as the top contenders. Gemma-2-9B offers slightly better factual accuracy, while Llama-3.1-8B excels at creative tasks and conversational nuance. Both fit 8GB with 4K–8K context. For pure speed, Phi-3-mini (3.8B) at FP16 delivers blazing 90+ t/s on an RTX 3070.

Can I run a 70B model on a single 24GB GPU?

Yes, but with significant caveats. Using IQ2_XXS or IQ3_XXS quantization from the latest llama.cpp, a 70B model can load onto 24GB with about 2–4GB remaining for KV cache — enough for a 2K–4K context window. Output quality is degraded compared to Q4, but for certain analytical tasks that benefit from the 70B's deeper reasoning, it can still outperform smaller models. This is an experimental configuration, not a daily driver for most users.

How do I choose between exllamav2, llama.cpp, and vLLM?

exllamav2: Best raw performance on NVIDIA GPUs. Supports flash attention, FP8 KV cache, and efficient tensor parallelism. Ideal for single-user interactive inference on 8GB–48GB NVIDIA cards.
llama.cpp: The universal choice. Runs on NVIDIA, AMD, Apple Silicon, Intel, and even CPU-only. Supports the widest range of quantization formats (GGUF, IQ series). Best for cross-platform setups and Apple Silicon users.
vLLM: Built for serving. If you need an OpenAI-compatible API endpoint with continuous batching for multiple concurrent users, vLLM is the gold standard. Requires more setup but delivers unmatched throughput for production deployments.

What KV cache settings should I use for long-context (32K+) work?

Enable flash attention and set KV cache quantization to Q8_0 or FP8. On a 16GB card with a 8B model at Q8_0, this typically allows 32K context without overflow. Monitor your VRAM usage during prefill — if you see spikes near 95% utilization, reduce context by 2K–4K increments until stable. For 48GB+ setups running 70B models at Q6+, 32K–64K context is routinely achievable with these optimizations.

Is Apple Silicon competitive for local AI?

Absolutely. The unified memory architecture on M2 Ultra (48GB–64GB) and M3 Max (36GB+) is a game-changer. While raw GPU compute is lower than an RTX 4090, the ability to allocate all unified memory to the model eliminates PCIe bottlenecks and enables running 70B models at Q4 with 8K+ context at 8–15 t/s. For Mac-first developers, this is a seamless and quiet local AI experience. The Metal backend in llama.cpp has matured significantly.

What's the deal with Mixture of Experts (MoE) models and VRAM?

MoE models like Mixtral-8x7B and DeepSeek-V2 keep total parameter counts high but only activate a fraction per token. This means VRAM must hold the entire model (all experts), but compute cost per token is much lower. The VRAM requirement is dictated by total parameters, not active parameters. This is why a 46B-total Mixtral at Q4 fits 24GB, but a dense 46B model at Q4 would not. MoE models are an excellent way to "punch above" your VRAM weight class for generation quality, but they don't reduce memory footprint.

Conclusion — Community Wisdom on Building Your Local AI Stack

The question "What models you guys running?" elicits a different answer every few months — and that's the beauty of the local AI movement. Hardware that seemed constrained yesterday runs a polished 8B model with 32K context today. The collective tinkering, benchmarking, and quant-pushing from the open-source community continuously redefines what's possible on consumer silicon.

If there's one meta-insight from hundreds of community responses, it's this: start with the best model your VRAM comfortably hosts at Q4_K_M or higher, dial in your KV cache for 8K–16K context, and resist the urge to chase bleeding-edge ultra-quants unless you genuinely need the larger model's reasoning depth. A snappy, reliable 8B setup often beats a sluggish, memory-starved 70B for daily use.

Key takeaways to future-proof your local AI journey:

Quantization is your best friend. The IQ series and K-quants make models 2–4× smaller with minimal quality loss. Always prefer Q4_K_M or Q5_K_M as your baseline; go lower only when necessary.
KV cache tuning is not optional. Spend time dialing in context length, cache quantization, and flash attention. This is the difference between a smooth experience and constant OOM crashes.
Backend choice matters. exllamav2 for NVIDIA speed, llama.cpp for universal compatibility, vLLM for serving. Don't hesitate to switch backends as your needs evolve.
Community knowledge compounds. The setups documented here represent a snapshot of mid-2025. Follow the active threads, Discord servers, and GitHub discussions — the next breakthrough quant or architecture is probably weeks away.
Define your use case first. A code model for Cursor integration, a creative model for novel drafting, and a reasoning model for research are different tools. Build your stack around what you actually do daily, not around benchmark scores.

This guide aggregates community experiences and is updated periodically as new models, quantization methods, and inference backends emerge. Last updated: June 2025. Your mileage may vary based on driver versions, backend builds, and specific hardware configurations. Always test with your own workload before committing to a production stack.