Quantization, context, throughput: the knobs that actually matter when you move an LLM off the cloud and onto your desk. No hand-waving, just the numbers that hold up in practice.
The back-of-envelope math that tells you whether a model will load at all, and whether it will run fast enough to be useful. Memorize these before you download a 40 GB file.
Multiply the parameter count by the quantization's bytes-per-weight, then add roughly 20 to 25 percent for KV cache and activations. At Q4 (~0.5 B/param), a 7B fits in ~5 GB; at Q8, ~9 GB; at FP16, ~16 GB.
```
# VRAM ≈ params × bytes-per-weight × 1.2
Q4_K_M → params × 0.55 GB
Q6_K   → params × 0.80 GB
Q8_0   → params × 1.05 GB
FP16   → params × 2.00 GB
```
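As a sanity check before downloading, the rule above fits in a few lines of Python. The GB-per-billion-params figures are the rough ones quoted here, not exact for any particular GGUF file:

```python
# Rough VRAM estimate: weights plus ~20% overhead for KV cache and activations.
GB_PER_B_PARAMS = {"Q4_K_M": 0.55, "Q6_K": 0.80, "Q8_0": 1.05, "FP16": 2.00}

def vram_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    return params_b * GB_PER_B_PARAMS[quant] * overhead

print(round(vram_gb(7, "Q4_K_M"), 1))   # 7B at Q4 → ~4.6 GB
print(round(vram_gb(13, "Q8_0"), 1))    # 13B at Q8 → ~16.4 GB
```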
Context isn't free. Each token in the window consumes VRAM roughly equal to 2 × layers × hidden_dim × bytes. A 32k context on a 13B model eats 3 to 6 GB on top of the weights. Cut context aggressively if you're close to the edge.
```
# Rough KV cache for 7B-class models
4k ctx   → ~0.5 GB
8k ctx   → ~1.0 GB
32k ctx  → ~4.0 GB
128k ctx → ~16 GB
```
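The per-token formula can be made concrete. The sketch below assumes a Llama-3-8B-style shape (32 layers, 8 KV heads of dimension 128, courtesy of grouped-query attention) and an FP16 cache; older models without GQA use several times more:

```python
def kv_cache_gb(ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    # 2 (K and V) × layers × kv_dim × bytes per element, per token
    per_token = 2 * layers * kv_heads * head_dim * bytes_per
    return ctx * per_token / 1024**3

print(kv_cache_gb(8_192))    # → 1.0 GB
print(kv_cache_gb(32_768))   # → 4.0 GB
```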
3B models are autocomplete-grade; 7 to 8B handle simple chat and structured extraction; 13 to 14B cross the "feels coherent" line; 30 to 34B is where reasoning starts to hold; 70B and up rivals frontier APIs on most tasks except the hardest math and code.
```
≤3B     classification · autocomplete
7–8B    chat · RAG · extraction
13–14B  agents · drafting · summary
30–34B  reasoning · long context
70B+    near-frontier quality
```
With llama.cpp, you pick how many layers live on GPU (-ngl). Full offload is always fastest. If even one layer spills to CPU, throughput collapses disproportionately: aim for 100 percent or don't bother. Prefer a smaller quant that fully fits over a larger one that partially offloads.
```
# Priority order when VRAM-limited:
1. smaller quant (Q4 over Q6)
2. shorter context
3. smaller model (7B over 13B)
4. last resort: partial offload
```
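One way to read that priority list as code. This is a sketch: the candidate configurations and GB figures are illustrative estimates in the spirit of this guide, not measurements of any specific build:

```python
def fit_plan(vram_gb: float) -> str:
    # Candidate configs in preference order; sizes are rough
    # weights-plus-cache estimates, not measured values.
    options = [
        ("13B Q6_K, 8k ctx",   13.5),
        ("13B Q4_K_M, 8k ctx",  9.6),  # 1. smaller quant first
        ("13B Q4_K_M, 4k ctx",  9.0),  # 2. then shorter context
        ("7B Q4_K_M, 8k ctx",   5.5),  # 3. then a smaller model
    ]
    for desc, gb in options:
        if gb <= vram_gb:
            return desc
    return "partial offload (last resort)"

print(fit_plan(10.0))  # → 13B Q4_K_M, 8k ctx
print(fit_plan(4.0))   # → partial offload (last resort)
```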
A 7B Q4 model (~4 GB) on a 1 TB/s GPU has a theoretical ceiling of ~250 tok/s. Halve the bandwidth, halve the speed. This is why Apple Silicon with unified memory (400 to 800 GB/s) runs laps around discrete GPUs forced to use DDR5 system RAM (~60 GB/s).
```
# tok/s ≈ memory bandwidth × 0.6 ÷ model_GB
RTX 4090        → 1008 GB/s
M4 Max          →  546 GB/s
M2 Ultra        →  800 GB/s
DDR5 (CPU-only) →  ~60 GB/s
```
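That rule of thumb in code. The 0.6 efficiency factor is the rough figure used above; real numbers vary by runtime and kernel quality:

```python
def tok_per_s(bandwidth_gbs: float, model_gb: float,
              efficiency: float = 0.6) -> float:
    # Decode is memory-bound: every generated token streams the full
    # weights from memory once, so the ceiling is bandwidth / model size.
    return bandwidth_gbs * efficiency / model_gb

print(round(tok_per_s(1008, 4.4)))  # RTX 4090, 7B Q4_K_M → ~137 tok/s
print(round(tok_per_s(60, 4.4)))    # CPU-only DDR5 → ~8 tok/s
```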
For most chat tasks: temperature 0.7, top_p 0.9, min_p 0.05, repeat_penalty 1.05. For code and JSON: drop temp to 0.1 to 0.3. For creative writing: push temp to 0.9 to 1.1 and lean on min_p instead of top_k. Avoid stacking top_k + top_p + min_p: pick one filter.
```
chat      temp=0.7  min_p=0.05
code/JSON temp=0.2  top_p=0.95
creative  temp=1.0  min_p=0.02
extract   temp=0.0  (greedy)
```
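The same presets as a lookup table you can drop into a script. Key names follow llama.cpp's conventions; other frontends may spell them differently:

```python
SAMPLER_PRESETS = {
    "chat":     {"temperature": 0.7, "min_p": 0.05, "repeat_penalty": 1.05},
    "code":     {"temperature": 0.2, "top_p": 0.95},
    "creative": {"temperature": 1.0, "min_p": 0.02},
    "extract":  {"temperature": 0.0},  # greedy: always take the top token
}

print(SAMPLER_PRESETS["code"])
```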
The compression levels you'll see in GGUF filenames. Lower numbers mean fewer bits per weight: smaller files, faster inference, more quality loss. The suffix (K_S, K_M, K_L) indicates which layers are quantized more aggressively.
Quantization replaces 16-bit floats with lower-precision integers. Q4 means roughly 4 bits per weight, Q6 roughly 6, and so on. The "K" family (K-quants) uses mixed precision: K_S quantizes everything uniformly, K_M keeps important tensors at higher precision, K_L is the most conservative.
The practical sweet spot for most models is Q4_K_M: about a 70 percent file-size reduction versus FP16 with quality loss typically under 2 percent on benchmarks. Going below Q4 degrades rapidly; going above Q6 offers diminishing returns.
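The size claim checks out against the 7B figures in the quant table:

```python
# 7B file sizes from the quant table: FP16 vs Q4_K_M
fp16_gb, q4_gb = 13.5, 4.4
print(f"reduction: {1 - q4_gb / fp16_gb:.0%}")  # → reduction: 67%
```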
| Quant | Bits/w | 7B size | Verdict | When to use |
|---|---|---|---|---|
| Q2_K | ~2.6 | 2.8 GB | Avoid | Toy experiments only. Coherence breaks down badly on models below 30B. |
| Q3_K_M | ~3.4 | 3.5 GB | Budget | Only on 30B+ models where the larger parameter count absorbs the damage. |
| Q4_K_S | ~4.1 | 4.1 GB | Budget | Tight VRAM on 7B class. Noticeable but usable quality trade-off. |
| Q4_K_M | ~4.8 | 4.4 GB | Sweet spot | The default choice. Best size-to-quality ratio for almost any model. Start here. |
| Q5_K_M | ~5.7 | 5.1 GB | Quality | When you have 15 to 20 percent more VRAM to spare. Slightly sharper on reasoning tasks. |
| Q6_K | ~6.6 | 5.9 GB | Quality | Nearly indistinguishable from FP16 on most benchmarks. Ceiling for practical use. |
| Q8_0 | ~8.5 | 7.2 GB | Reference | Functionally lossless. Useful as a quality reference or for fine-tuning base. |
| FP16 / F16 | 16.0 | 13.5 GB | Reference | Original precision. Only use if you're training, evaluating, or VRAM-rich. |
The file format dictates which runtime you can use and, more subtly, how well the model runs on your specific hardware. For most people, GGUF is the answer, but the alternatives matter on specific rigs.
**GGUF.** Successor to GGML. Single-file packaging with metadata embedded. Runs on CPU, GPU, and hybrid via llama.cpp, Ollama, LM Studio, and koboldcpp. The universal format: if you only learn one, learn this.

**EXL2.** Mixed-precision quantization targeting NVIDIA GPUs. Faster than GGUF on pure GPU workloads, especially for batched inference. Run via exllamav2 or TabbyAPI. Pays off on 3090/4090/5090-class cards.

**MLX.** Apple's native ML framework. On M-series chips, MLX versions often run 15 to 30 percent faster than equivalent GGUF builds because they exploit the unified memory architecture and Metal Performance Shaders directly.

**AWQ.** 4-bit GPU quantization with excellent quality retention. Favored by vLLM, TGI, and SGLang for production serving with request batching. Overkill for single-user desktop use, but the right answer for internal inference endpoints.

**GPTQ.** Earlier GPU quantization scheme. Largely superseded by AWQ and EXL2 for new deployments, but still widespread on Hugging Face. Works with transformers + auto-gptq. Fine if it's what's available.

**Safetensors.** The original weights as released by the lab. A safe replacement for pickle-based checkpoints. You want these only for training, fine-tuning, or converting to another format. Too large and slow for direct inference on most desks.
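The guidance above collapses into a small chooser. A sketch only: the labels encode this section's rules of thumb, not hard rules:

```python
def pick_format(setup: str) -> str:
    # Defaults to GGUF, the universal answer for everything else.
    special_cases = {
        "nvidia-gpu": "EXL2",       # pure-GPU speed on 3090/4090-class cards
        "apple-silicon": "MLX",     # unified memory + Metal
        "batch-serving": "AWQ",     # vLLM / TGI / SGLang endpoints
        "training": "safetensors",  # original weights for fine-tuning
    }
    return special_cases.get(setup, "GGUF")

print(pick_format("apple-silicon"), pick_format("cpu-laptop"))  # MLX GGUF
```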
What tokens-per-second ranges feel like in practice, and what hardware you need to hit them. Numbers below are for a 7B class model at Q4_K_M with a 4k context. Scale down roughly linearly for larger models.
The things that aren't in the README but show up the moment you actually try to use this stuff for real work. Hard-won operational knowledge, distilled.
Most local-LLM disappointment comes from mismatched expectations rather than weak hardware. These are the adjustments that make a desktop model feel professional rather than like a toy.
In llama.cpp, pass -fa. Cuts KV cache memory and speeds up prompt processing, especially at long contexts. It's stable enough now that there's no reason to leave it off.
Use --cache-type-k q8_0 --cache-type-v q8_0 to halve KV memory with negligible quality loss. This alone can unlock 2× the context on a tight VRAM budget.
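The arithmetic behind "halve KV memory, double the context," using an illustrative 7B-class cache shape (32 layers, 8 KV heads × 128 dims):

```python
def max_ctx(budget_gb: float, cache_bytes_per_elem: int) -> int:
    # K and V across all layers, per token
    per_token = 2 * 32 * 8 * 128 * cache_bytes_per_elem
    return int(budget_gb * 1024**3 / per_token)

fp16, q8 = max_ctx(2.0, 2), max_ctx(2.0, 1)
print(fp16, q8)  # → 16384 32768: same 2 GB budget, twice the context
```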
Look for "IQ" variants (IQ3_M, IQ4_XS) or quants marked "imatrix" in the filename. They use a calibration dataset to preserve quality in the weights that matter most, often matching a standard quant a full bit level higher.
Perplexity differences between Q4 and Q6 are tiny on paper but can mean the difference between working and broken JSON output. Always test the quant on your actual workload before committing.
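A minimal version of "test on your actual workload" for JSON tasks: parse every output and count failures instead of eyeballing perplexity. `json_ok` is a trivial checker; wire it to your runtime's completion call:

```python
import json

def json_ok(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# e.g. a borderline quant might drop a quote and break parsing:
print(json_ok('{"name": "widget", "qty": 3}'))  # True
print(json_ok('{"name": widget, "qty": 3}'))    # False
```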
Every instruct-tuned model has a specific chat template (ChatML, Llama-3, Mistral, Gemma). Using the wrong one degrades output dramatically, sometimes silently. Check the model card and verify your frontend is applying it correctly.
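For reference, this is what a correctly rendered ChatML prompt looks like; compare it against what your frontend actually sends. A hand-rolled sketch for inspection only, not a replacement for the model card's own template:

```python
def chatml(messages: list[dict]) -> str:
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
    return rendered + "<|im_start|>assistant\n"  # generation prompt

print(chatml([{"role": "system", "content": "Be brief."},
              {"role": "user", "content": "Hi"}]))
```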
Models quote generation tok/s, but ingesting a 16k prompt can take 30+ seconds before the first output token. For RAG and long-context work, prompt processing is often the real bottleneck. Benchmark both.
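A benchmark should therefore report two numbers. The sketch below assumes a streaming API that yields tokens; `fake_stream` is a stand-in for your runtime:

```python
import time

def bench(stream):
    """Return (time_to_first_token_s, decode_tok_per_s) for a token stream."""
    t0 = time.perf_counter()
    it = iter(stream)
    next(it)                            # first token: prompt processing done
    ttft = time.perf_counter() - t0
    n_rest = sum(1 for _ in it)         # remaining decoded tokens
    decode_time = time.perf_counter() - t0 - ttft
    return ttft, n_rest / max(decode_time, 1e-9)

def fake_stream():                      # stand-in for a real runtime
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

ttft, tps = bench(fake_stream())
print(f"TTFT {ttft:.2f}s, decode {tps:.0f} tok/s")
```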
Pair a small "draft" model with your main model (e.g. 1B drafts for a 70B). The large model verifies the draft's guesses in parallel, often delivering 1.5 to 3× speedup with zero quality loss. Supported in llama.cpp via --model-draft.
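A back-of-envelope model for that speedup, assuming draft cost is negligible and that drafted token *i* only survives if all earlier drafts were accepted. This is a simplification for intuition, not llama.cpp's actual scheduler:

```python
def expected_speedup(accept_rate: float, k: int) -> float:
    # Tokens emitted per big-model pass: the one it generates itself,
    # plus drafted token i, which survives with probability accept_rate**i.
    return 1 + sum(accept_rate**i for i in range(1, k + 1))

print(round(expected_speedup(0.7, 4), 2))  # → 2.77, inside the 1.5–3× range
print(round(expected_speedup(0.4, 4), 2))  # → 1.65 with a weaker draft model
```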
Mixture-of-Experts models (Mixtral, Qwen3-MoE, DeepSeek) activate only a fraction of parameters per token. A 47B MoE often runs at 13B speeds while delivering 70B-class quality. Check the "active parameters" figure, not the total.
Enable prompt caching / context shifting so that repeat system prompts and recent turns don't get reprocessed. llama.cpp's --prompt-cache and Ollama's built-in caching can turn a 5-second chat turn into a 500ms one.
Sustained GPU inference pushes cards to their power limits for minutes at a time. If your tok/s drops 15 to 20 percent after the first few prompts, it's thermal throttling, not the software. Improve case airflow before blaming the model.
The "hello, tell me a joke" benchmark is useless. Measure with a representative prompt from your actual workload, ideally a RAG chunk at the context length you'll really use. Numbers from YouTube reviews rarely match your use case.
Progress in post-training means a well-tuned 8B released this quarter often outperforms last year's 34B on practical tasks. Check leaderboards by date, not just by size. The frontier moves quickly, especially for chat and tool use.