A practitioner's cheat sheet

Running models on your own iron.

Quantization, context, throughput: the knobs that actually matter when you move an LLM off the cloud and onto your desk. No hand-waving, just the numbers that hold up in practice.

Target reader: self-hoster
Stack: llama.cpp · Ollama · LM Studio
Formats covered: GGUF · EXL2 · AWQ · GPTQ · MLX
Precision: Q2 to FP16
Bias: practical over theoretical
01

Rules of thumb

The back-of-envelope math that tells you whether a model will load at all, and whether it will run fast enough to be useful. Memorize these before you download a 40 GB file.

§ 1.1  ·  VRAM budget

Parameters × bytes per weight + overhead

Multiply the parameter count by the quantization's bytes-per-weight, then add roughly 20 to 25 percent for KV cache and activations. At Q4 (~0.5 bytes per parameter), a 7B fits in ~5 GB; at Q8, ~9 GB; at FP16, ~16 GB.

# VRAM ≈ params (billions) × bpw × 1.2
Q4_K_M  → params × 0.55 GB
Q6_K    → params × 0.80 GB
Q8_0    → params × 1.05 GB
FP16    → params × 2.00 GB
        
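The rule of thumb above can be sketched in a few lines of Python; the bytes-per-weight table is the one from this section, and the helper names are mine:

```python
# GB per billion parameters for each quant (factors from the block above)
BPW_GB = {"Q4_K_M": 0.55, "Q6_K": 0.80, "Q8_0": 1.05, "FP16": 2.00}

def vram_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    """Estimated VRAM: weights plus ~20% for KV cache and activations."""
    return params_billion * BPW_GB[quant] * overhead

# A 7B at Q4_K_M needs roughly 4.6 GB; at FP16, roughly 16.8 GB.
```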
Quick glossary
VRAM
Video RAM. The memory physically on your GPU card. The model weights have to fit here or the whole thing spills to slow system memory.
Parameters
The numbers the model learned during training. A "7B model" has 7 billion of them. More parameters generally means more capability, but also more memory.
Bytes per weight (bpw)
How many bytes each parameter takes up. Full precision uses 2 bytes (FP16); Q4 uses about 0.5 bytes. This is where quantization saves space.
Overhead
Extra memory the model needs while running, on top of the weights themselves. Mostly the KV cache and intermediate activations. Budget 20 to 25 percent.
§ 1.2  ·  Context window cost

KV cache scales with tokens × layers

Context isn't free. Each token in the window consumes VRAM roughly equal to 2 × layers × hidden_dim × bytes. A 32k context on a 13B model eats 3 to 6 GB on top of the weights. Cut context aggressively if you're close to the edge.

# Rough KV cache for 7B-class models
4k   ctx  →  ~0.5 GB
8k   ctx  →  ~1.0 GB
32k  ctx  →  ~4.0 GB
128k ctx  →  ~16  GB
        
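As a sketch, the per-token formula in Python; the defaults below assume a modern GQA-style 7B (32 layers, effective KV width 1024, FP16 cache), so treat them as illustrative rather than exact for your model:

```python
def kv_cache_gb(tokens: int, layers: int = 32, kv_dim: int = 1024,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer per token, FP16 elements."""
    return 2 * layers * kv_dim * bytes_per_elem * tokens / 1e9

# 4k context → ~0.5 GB; 32k → ~4.3 GB, in line with the table above.
```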
Quick glossary
KV cache
Key/Value cache. The model's short-term memory of the conversation so far. Stored so it doesn't have to re-read the whole prompt on every new token. Grows linearly with context length.
Token
The unit the model reads and writes. Roughly 3 to 4 characters of English, or 1.5 to 2 characters of Portuguese. "Hello world" is about 2 tokens.
Context window
The maximum number of tokens the model can consider at once, system prompt plus chat history plus new input plus the reply it's generating. Exceed it and the earliest content gets dropped.
Layers / hidden_dim
Layers are the stacked blocks that make up the model; hidden_dim is how "wide" each block is. Bigger values mean more KV memory per token.
§ 1.3  ·  Model size vs task

Match parameter count to the job

3B models are autocomplete-grade; 7 to 8B handle simple chat and structured extraction; 13 to 14B cross the "feels coherent" line; 30 to 34B is where reasoning starts to hold; 70B and up rival frontier APIs on most tasks except the hardest math and code.

≤3B    classification · autocomplete
7–8B   chat · RAG · extraction
13–14B agents · drafting · summary
30–34B reasoning · long context
70B+   near-frontier quality
        
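The ladder reads naturally as a lookup table. The task keys below are my own labels for the rows above:

```python
# Rough minimum parameter count (billions) per task, from the ladder above.
# Classification and autocomplete work even at 3B and below.
MIN_PARAMS_B = {
    "chat": 7, "rag": 7, "extraction": 7,
    "agents": 13, "drafting": 13, "summary": 13,
    "reasoning": 30, "long_context": 30,
    "near_frontier": 70,
}

def big_enough(params_billion: float, task: str) -> bool:
    """Is the model at or above the rough capability floor for this task?"""
    return params_billion >= MIN_PARAMS_B[task]
```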
Quick glossary
"B" in 7B, 13B, 70B
Billion. "7B" means roughly 7 billion parameters. It's the most common rough proxy for model capability, though architecture and training matter too.
RAG
Retrieval-Augmented Generation. You fetch relevant documents (from a vector database, usually) and paste them into the prompt before the model answers. Lets small models reason over large knowledge bases.
Extraction
Pulling structured data (names, dates, fields, JSON) out of unstructured text. One of the most reliable uses of small local models.
Agents
Models that call tools, loop, and make multi-step plans. They need enough intelligence to decide what to do next without getting lost. 13B is roughly the floor for reliable agent behavior.
§ 1.4  ·  Offload math

Every layer on GPU is 5 to 20× faster

With llama.cpp, you pick how many layers live on GPU (-ngl). Full offload is always fastest. If even one layer spills to CPU, throughput collapses disproportionately: aim for 100 percent or don't bother. Prefer a smaller quant that fully fits over a larger one that partially offloads.

# Priority order when VRAM-limited:
1. smaller quant (Q4 over Q6)
2. shorter context
3. smaller model (7B over 13B)
4. last resort: partial offload
        
Quick glossary
Offload
Moving some of the model's layers onto the GPU instead of running them on CPU. More layers on GPU means faster inference, until you run out of VRAM.
-ngl flag
"Number of GPU Layers." The llama.cpp command-line option that tells it how many layers to put on the GPU. Setting it to a high number like 999 means "put everything possible."
Partial offload
When only some layers fit on GPU and the rest run on CPU. Sounds reasonable but performance falls off a cliff because every token has to bounce between the two.
Layers
Transformer blocks stacked on top of each other. A 7B model typically has 32 layers; 70B has 80. Each layer processes the data in turn.
§ 1.5  ·  Memory bandwidth is king

Inference speed ≈ bandwidth ÷ model size

A 7B Q4 model (~4 GB) on a 1 TB/s GPU has a theoretical ceiling of ~250 tok/s. Halve the bandwidth, halve the speed. This is why Apple Silicon with unified memory (400 to 800 GB/s) runs laps around setups that spill into DDR5 system RAM (~60 GB/s).

# Memory bandwidth × 0.6 ÷ model_GB
RTX 4090       →  1008 GB/s
M4 Max         →   546 GB/s
M2 Ultra       →   800 GB/s
DDR5 (CPU-only)→   ~60 GB/s
        
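The bandwidth rule is a one-liner; the 0.6 efficiency factor is the same rough discount used in the comment above:

```python
def tok_per_sec_ceiling(bandwidth_gbps: float, model_gb: float,
                        efficiency: float = 0.6) -> float:
    """Upper bound on decode speed: each token must read every weight once."""
    return bandwidth_gbps * efficiency / model_gb

# RTX 4090 (1008 GB/s) with a 4 GB Q4 7B → ~151 tok/s ceiling.
```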
Quick glossary
Memory bandwidth
How fast data can be moved between memory and the processor, measured in GB/s. Inference is bottlenecked by this far more than by raw compute: the GPU spends most of its time waiting for weights to arrive.
Unified memory
Apple Silicon architecture where CPU and GPU share one big fast memory pool. No copying between RAM and VRAM, and you get much higher bandwidth than regular system RAM.
tok/s
Tokens per second. The standard throughput metric. Human reading speed is around 5 to 10 tok/s; anything above 40 feels comfortable in chat.
DDR5
Current-generation system RAM. Fast for general computing (~60 GB/s), painfully slow for LLM inference compared to GPU memory.
§ 1.6  ·  Sampling parameters

Defaults you actually want

For most chat tasks: temperature 0.7, top_p 0.9, min_p 0.05, repeat_penalty 1.05. For code and JSON: drop temp to 0.1 to 0.3. For creative writing: push temp to 0.9 to 1.1 and lean on min_p instead of top_k. Avoid stacking top_k + top_p + min_p: pick one filter.

chat       temp=0.7  min_p=0.05
code/JSON  temp=0.2  top_p=0.95
creative   temp=1.0  min_p=0.02
extract    temp=0.0  (greedy)
        
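The presets above, expressed as request parameters. Names follow the llama.cpp-server / OpenAI-style convention; check your frontend's exact spelling before copying:

```python
# One filter per preset, per the advice above: don't stack top_k + top_p + min_p.
SAMPLING_PRESETS = {
    "chat":     {"temperature": 0.7, "min_p": 0.05, "repeat_penalty": 1.05},
    "code":     {"temperature": 0.2, "top_p": 0.95},
    "creative": {"temperature": 1.0, "min_p": 0.02},
    "extract":  {"temperature": 0.0},   # greedy: fully deterministic output
}
```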
Quick glossary
Temperature
Controls randomness. 0.0 always picks the single most likely next token (deterministic); 1.0 samples closer to the raw probability distribution; above 1.0 flattens the distribution and gets weird.
top_p (nucleus)
Keep only the most likely tokens whose cumulative probability adds up to this threshold. top_p = 0.9 means "consider the tokens that together cover 90 percent of the probability mass."
min_p
A newer, better filter. Discards any token less likely than min_p times the top candidate. Adapts to how confident the model is on each step. Safer default than top_p.
top_k
Keep only the k most likely tokens, throw out the rest. Blunt and somewhat outdated compared to min_p.
Greedy
Temperature = 0. Always picks the single most likely token. Fully deterministic; best for extraction and strict JSON output.
repeat_penalty
Slightly reduces the probability of tokens that just appeared. Prevents the model from getting stuck in loops. 1.05 to 1.1 is subtle and safe; higher values distort vocabulary.
02

Quantization ladder

The compression levels you'll see in GGUF filenames. Lower numbers mean fewer bits per weight: smaller files, faster inference, more quality loss. The suffix (K_S, K_M, K_L) indicates which layers are quantized more aggressively.

Quantization replaces 16-bit floats with lower-precision integers. Q4 means roughly 4 bits per weight, Q6 roughly 6, and so on. The "K" family (K-quants) uses mixed precision: K_S quantizes everything uniformly, K_M keeps important tensors at higher precision, K_L is the most conservative.

The practical sweet spot for most models is Q4_K_M: about a 70 percent file-size reduction versus FP16 with quality loss typically under 2 percent on benchmarks. Going below Q4 degrades rapidly; going above Q6 offers diminishing returns.

Quant      Bits/w  7B size  Verdict     When to use
Q2_K       ~2.6    2.8 GB   Avoid       Toy experiments only. Coherence breaks down badly on models below 30B.
Q3_K_M     ~3.4    3.5 GB   Budget      Only on 30B+ models where the larger parameter count absorbs the damage.
Q4_K_S     ~4.1    4.1 GB   Budget      Tight VRAM on 7B class. Noticeable but usable quality trade-off.
Q4_K_M     ~4.8    4.4 GB   Sweet spot  The default choice. Best size-to-quality ratio for almost any model. Start here.
Q5_K_M     ~5.7    5.1 GB   Quality     When you have 15 to 20 percent more VRAM to spare. Slightly sharper on reasoning tasks.
Q6_K       ~6.6    5.9 GB   Quality     Nearly indistinguishable from FP16 on most benchmarks. Ceiling for practical use.
Q8_0       ~8.5    7.2 GB   Reference   Functionally lossless. Useful as a quality reference or for fine-tuning base.
FP16/F16   16.0    13.5 GB  Reference   Original precision. Only use if you're training, evaluating, or VRAM-rich.
quick decision
Which quant should I actually download?
IF you have more VRAM than the model needs at Q6, pick Q6_K. Done.
ELSE IF the model fits comfortably at Q4_K_M with full GPU offload, pick Q4_K_M.
ELSE IF you're on a 30B+ model and tight, Q3_K_M is acceptable.
ELSE drop to a smaller model at Q4_K_M rather than the same model at Q2.
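The decision tree can be sketched as a function. Size estimates reuse the bytes-per-weight factors from § 1.1 (Q3_K_M at ~3.4 bits ≈ 0.43 GB per billion params); the "fits" threshold is my reading of the rules above:

```python
BPW_GB = {"Q3_K_M": 0.43, "Q4_K_M": 0.55, "Q6_K": 0.80}

def pick_quant(vram_gb: float, params_billion: float) -> str:
    def fits(quant: str) -> bool:
        # weights plus ~20% overhead must fit entirely in VRAM
        return params_billion * BPW_GB[quant] * 1.2 <= vram_gb
    if fits("Q6_K"):
        return "Q6_K"
    if fits("Q4_K_M"):
        return "Q4_K_M"
    if params_billion >= 30 and fits("Q3_K_M"):
        return "Q3_K_M"
    return "smaller model at Q4_K_M"
```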
03

Formats & containers

The file format dictates which runtime you can use and, more subtly, how well the model runs on your specific hardware. For most people, GGUF is the answer, but the alternatives matter on specific rigs.

Primary choice
GGUF
GPT-generated unified format

Successor to GGML. Single-file packaging with metadata embedded. Runs on CPU, GPU, and hybrid via llama.cpp, Ollama, LM Studio, koboldcpp. The universal format: if you only learn one, learn this.

Runtime: llama.cpp family
Hardware: Any (CPU/GPU/mixed)
Quants: Q2 to Q8, FP16
Best for: Everything
Speed-focused
EXL2
ExLlamaV2 · GPU only

Mixed-precision quantization targeting NVIDIA GPUs. Faster than GGUF on pure GPU workloads, especially for batched inference. Run via exllamav2 or TabbyAPI. Pays off on 3090/4090/5090-class cards.

Runtime: ExLlamaV2, TabbyAPI
Hardware: NVIDIA GPU only
Quants: 2.0 to 8.0 bpw
Best for: Max GPU throughput
Apple Silicon
MLX
Apple machine learning

Apple's native ML framework. On M-series chips, MLX versions often run 15 to 30 percent faster than equivalent GGUF builds because they exploit the unified memory architecture and Metal Performance Shaders directly.

Runtime: mlx-lm, LM Studio
Hardware: Apple Silicon only
Quants: 4-bit, 8-bit, FP16
Best for: M1/M2/M3/M4 Macs
Inference servers
AWQ
Activation-aware weight quant.

4-bit GPU quantization with excellent quality retention. Favored by vLLM, TGI, and SGLang for production serving with request batching. Overkill for single-user desktop use, but the right answer for internal inference endpoints.

Runtime: vLLM, TGI, SGLang
Hardware: NVIDIA GPU
Quants: Mostly 4-bit
Best for: Multi-user serving
Legacy GPU
GPTQ
Post-training quantization

Earlier GPU quantization scheme. Largely superseded by AWQ and EXL2 for new deployments, but still widespread on Hugging Face. Works with transformers + auto-gptq. Fine if it's what's available.

Runtime: transformers, AutoGPTQ
Hardware: NVIDIA GPU
Quants: 3, 4, 8-bit
Best for: Existing pipelines
Original
Safetensors
Unquantized FP16 / BF16

The original weights as released by the lab. Safe replacement for pickle-based checkpoints. You want these only for training, fine-tuning, or converting to another format. Too large and slow for direct inference on most desks.

Runtime: transformers, diffusers
Hardware: GPU w/ 2 × params GB
Precision: FP16, BF16, FP32
Best for: Training & conversion
04

Throughput vs quality

What tokens-per-second ranges feel like in practice, and what hardware you need to hit them. Numbers below are for a 7B-class model at Q4_K_M with a 4k context; throughput falls roughly in proportion to model size.

Perceptual speed bands
tokens per second, as felt by a reader
Painful: < 5 tok/s
Slower than reading aloud. Usable for batch jobs only.
Tolerable: 5 to 15 tok/s
Roughly human reading speed. Acceptable for long-form answers.
Comfortable: 15 to 40 tok/s
Feels responsive. The minimum for interactive chat.
Fluent: 40 to 100 tok/s
Output faster than you can read. Great for agent loops.
Instant: 100+ tok/s
Perceptually complete. Enables real-time tool use and streaming UX.
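The bands map to a trivial classifier, useful when logging benchmark runs (band names as above):

```python
def speed_band(tok_per_sec: float) -> str:
    """Map measured throughput to the perceptual bands above."""
    if tok_per_sec < 5:
        return "Painful"
    if tok_per_sec < 15:
        return "Tolerable"
    if tok_per_sec < 40:
        return "Comfortable"
    if tok_per_sec < 100:
        return "Fluent"
    return "Instant"
```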
CPU only
Modern x86, 32 GB DDR5
7B Q4: 3 to 7 tok/s
13B Q4: 1 to 3 tok/s
Last-resort tier. Avoid for interactive use.
RTX 3060 12GB
360 GB/s bandwidth
7B Q4: 40 to 55 tok/s
13B Q4: 20 to 28 tok/s
Budget sweet spot for single-user chat.
RTX 4090
1008 GB/s, 24 GB VRAM
7B Q4: 120 to 180 tok/s
34B Q4: 35 to 50 tok/s
The prosumer standard. Handles 34B comfortably.
M4 Max 64GB
546 GB/s unified memory
7B Q4: 70 to 95 tok/s
70B Q4: 8 to 12 tok/s
Huge VRAM pool. Can actually load 70B models.
M2/M3 Ultra 192GB
800 GB/s unified memory
70B Q4: 12 to 20 tok/s
120B Q4: 5 to 8 tok/s
The large-model desk option. Slow but feasible.
Dual 3090 / 4090
48 GB VRAM total
70B Q4: 15 to 25 tok/s
34B Q8: 30 to 45 tok/s
Best price-per-token for 70B-class local inference.
05

Tips & tricks

The things that aren't in the README but show up the moment you actually try to use this stuff for real work. Hard-won operational knowledge, distilled.

Most local-LLM disappointment comes from mismatched expectations rather than weak hardware. These are the adjustments that make a desktop model feel professional rather than like a toy.

i.

Use flash attention when available

In llama.cpp, pass -fa. Cuts KV cache memory and speeds up prompt processing, especially at long contexts. It's stable enough now that there's no reason to leave it off.

ii.

Quantize the KV cache separately

Use --cache-type-k q8_0 --cache-type-v q8_0 to halve KV memory with negligible quality loss. This alone can unlock 2× the context on a tight VRAM budget.

iii.

Imatrix quants are better than plain ones

Look for "IQ" variants (IQ3_M, IQ4_XS) or quants marked "imatrix" in the filename. They use a calibration dataset to decide which weights to preserve most precisely, and often match standard quants a full bit higher.

iv.

Don't trust perplexity; trust task benchmarks

Perplexity differences between Q4 and Q6 are tiny on paper but can mean the difference between working and broken JSON output. Always test the quant on your actual workload before committing.

v.

Match the prompt template exactly

Every instruct-tuned model has a specific chat template (ChatML, Llama-3, Mistral, Gemma). Using the wrong one degrades output dramatically, sometimes silently. Check the model card and verify your frontend is applying it correctly.

vi.

Prompt processing is not generation speed

Benchmarks quote generation tok/s, but ingesting a 16k prompt can take 30+ seconds before the first output token. For RAG and long-context work, prompt processing is often the real bottleneck. Measure both.

vii.

Speculative decoding is real magic

Pair a small "draft" model with your main model (e.g. 1B drafts for a 70B). The large model verifies the draft's guesses in parallel, often delivering 1.5 to 3× speedup with zero quality loss. Supported in llama.cpp via --model-draft.

viii.

MoE models punch above their weight

Mixture-of-Experts models (Mixtral, Qwen3-MoE, DeepSeek) activate only a fraction of parameters per token. A 47B MoE often runs at 13B speeds while delivering 70B-class quality. Check the "active parameters" figure, not the total.

ix.

Context shifting saves your latency

Enable prompt caching / context shifting so that repeat system prompts and recent turns don't get reprocessed. llama.cpp's --prompt-cache and Ollama's built-in caching can turn a 5-second chat turn into a 500ms one.

x.

Thermals throttle inference silently

Sustained GPU inference pushes cards to their power limits for minutes at a time. If your tok/s drops 15 to 20 percent after the first few prompts, it's thermal throttling, not the software. Improve case airflow before blaming the model.

xi.

Benchmark with realistic prompts

The "hello, tell me a joke" benchmark is useless. Measure with a representative prompt from your actual workload, ideally a RAG chunk at the context length you'll really use. Numbers from YouTube reviews rarely match your use case.

xii.

A newer small model beats an older big one

Progress in post-training means a well-tuned 8B released this quarter often outperforms last year's 34B on practical tasks. Check leaderboards by date, not just by size. The frontier moves quickly, especially for chat and tool use.