Quantization, context, throughput: the knobs that actually matter when you move an LLM off the cloud and onto your desk. No hand-waving, just the numbers that hold up in practice.
The back-of-envelope math that tells you whether a model will load at all, and whether it will run fast enough to be useful. Memorize these before you download a 40 GB file.
Multiply the parameter count by the quantization's bytes-per-weight, then add roughly 20 to 25 percent for KV cache and activations. At Q4 (~0.5 B/param), a 7B fits in ~5 GB; at Q8, ~9 GB; at FP16, ~16 GB.
```
# VRAM ≈ params × bytes-per-weight × 1.2
Q4_K_M → params × 0.55 GB
Q6_K   → params × 0.80 GB
Q8_0   → params × 1.05 GB
FP16   → params × 2.00 GB
```
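As a sanity check before downloading, the rule above fits in a few lines of Python. The GB-per-billion-params figures are the rough ones quoted here, not exact for any particular GGUF file:

```python
# Rough VRAM estimate: weights plus ~20% overhead for KV cache and activations.
GB_PER_B_PARAMS = {"Q4_K_M": 0.55, "Q6_K": 0.80, "Q8_0": 1.05, "FP16": 2.00}

def vram_gb(params_b: float, quant: str, overhead: float = 1.2) -> float:
    return params_b * GB_PER_B_PARAMS[quant] * overhead

print(round(vram_gb(7, "Q4_K_M"), 1))   # 7B at Q4 → ~4.6 GB
print(round(vram_gb(13, "Q8_0"), 1))    # 13B at Q8 → ~16.4 GB
```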
Context isn't free. Each token in the window consumes VRAM roughly equal to 2 × layers × hidden_dim × bytes. A 32k context on a 13B model eats 3 to 6 GB on top of the weights. Cut context aggressively if you're close to the edge.
```
# Rough KV cache for 7B-class models
4k ctx   → ~0.5 GB
8k ctx   → ~1.0 GB
32k ctx  → ~4.0 GB
128k ctx → ~16 GB
```
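The per-token formula can be made concrete. The sketch below assumes a Llama-3-8B-style shape (32 layers, 8 KV heads of dimension 128, courtesy of grouped-query attention) and an FP16 cache; older models without GQA use several times more:

```python
def kv_cache_gb(ctx: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per: int = 2) -> float:
    # 2 (K and V) × layers × kv_dim × bytes per element, per token
    per_token = 2 * layers * kv_heads * head_dim * bytes_per
    return ctx * per_token / 1024**3

print(kv_cache_gb(8_192))    # → 1.0 GB
print(kv_cache_gb(32_768))   # → 4.0 GB
```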
3B models are autocomplete-grade; 7 to 8B handle simple chat and structured extraction; 13 to 14B cross the "feels coherent" line; 30 to 34B is where reasoning starts to hold; 70B and up rivals frontier APIs on most tasks except the hardest math and code.
```
≤3B     classification · autocomplete
7–8B    chat · RAG · extraction
13–14B  agents · drafting · summary
30–34B  reasoning · long context
70B+    near-frontier quality
```
With llama.cpp, you pick how many layers live on GPU (-ngl). Full offload is always fastest. If even one layer spills to CPU, throughput collapses disproportionately: aim for 100 percent or don't bother. Prefer a smaller quant that fully fits over a larger one that partially offloads.
```
# Priority order when VRAM-limited:
1. smaller quant (Q4 over Q6)
2. shorter context
3. smaller model (7B over 13B)
4. last resort: partial offload
```
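One way to read that priority list as code. This is a sketch: the candidate configurations and GB figures are illustrative estimates in the spirit of this guide, not measurements of any specific build:

```python
def fit_plan(vram_gb: float) -> str:
    # Candidate configs in preference order; sizes are rough
    # weights-plus-cache estimates, not measured values.
    options = [
        ("13B Q6_K, 8k ctx",   13.5),
        ("13B Q4_K_M, 8k ctx",  9.6),  # 1. smaller quant first
        ("13B Q4_K_M, 4k ctx",  9.0),  # 2. then shorter context
        ("7B Q4_K_M, 8k ctx",   5.5),  # 3. then a smaller model
    ]
    for desc, gb in options:
        if gb <= vram_gb:
            return desc
    return "partial offload (last resort)"

print(fit_plan(10.0))  # → 13B Q4_K_M, 8k ctx
print(fit_plan(4.0))   # → partial offload (last resort)
```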
A 7B Q4 model (~4 GB) on a 1 TB/s GPU has a theoretical ceiling of ~250 tok/s. Halve the bandwidth, halve the speed. This is why Apple Silicon with unified memory (400 to 800 GB/s) runs laps around discrete GPUs forced to use DDR5 system RAM (~60 GB/s).
```
# tok/s ≈ memory bandwidth × 0.6 ÷ model_GB
RTX 4090        → 1008 GB/s
M4 Max          →  546 GB/s
M2 Ultra        →  800 GB/s
DDR5 (CPU-only) →  ~60 GB/s
```
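That rule of thumb in code. The 0.6 efficiency factor is the rough figure used above; real numbers vary by runtime and kernel quality:

```python
def tok_per_s(bandwidth_gbs: float, model_gb: float,
              efficiency: float = 0.6) -> float:
    # Decode is memory-bound: every generated token streams the full
    # weights from memory once, so the ceiling is bandwidth / model size.
    return bandwidth_gbs * efficiency / model_gb

print(round(tok_per_s(1008, 4.4)))  # RTX 4090, 7B Q4_K_M → ~137 tok/s
print(round(tok_per_s(60, 4.4)))    # CPU-only DDR5 → ~8 tok/s
```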
For most chat tasks: temperature 0.7, top_p 0.9, min_p 0.05, repeat_penalty 1.05. For code and JSON: drop temp to 0.1 to 0.3. For creative writing: push temp to 0.9 to 1.1 and lean on min_p instead of top_k. Avoid stacking top_k + top_p + min_p: pick one filter.
```
chat      temp=0.7  min_p=0.05
code/JSON temp=0.2  top_p=0.95
creative  temp=1.0  min_p=0.02
extract   temp=0.0  (greedy)
```
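The same presets as a lookup table you can drop into a script. Key names follow llama.cpp's conventions; other frontends may spell them differently:

```python
SAMPLER_PRESETS = {
    "chat":     {"temperature": 0.7, "min_p": 0.05, "repeat_penalty": 1.05},
    "code":     {"temperature": 0.2, "top_p": 0.95},
    "creative": {"temperature": 1.0, "min_p": 0.02},
    "extract":  {"temperature": 0.0},  # greedy: always take the top token
}

print(SAMPLER_PRESETS["code"])
```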
The compression levels you'll see in GGUF filenames. Lower numbers mean fewer bits per weight: smaller files, faster inference, more quality loss. The suffix (K_S, K_M, K_L) indicates which layers are quantized more aggressively.
Quantization replaces 16-bit floats with lower-precision integers. Q4 means roughly 4 bits per weight, Q6 roughly 6, and so on. The "K" family (K-quants) uses mixed precision: K_S quantizes everything uniformly, K_M keeps important tensors at higher precision, K_L is the most conservative.
The practical sweet spot for most models is Q4_K_M: about a 70 percent file-size reduction versus FP16 with quality loss typically under 2 percent on benchmarks. Going below Q4 degrades rapidly; going above Q6 offers diminishing returns.
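The size claim checks out against the 7B figures in the quant table:

```python
# 7B file sizes from the quant table: FP16 vs Q4_K_M
fp16_gb, q4_gb = 13.5, 4.4
print(f"reduction: {1 - q4_gb / fp16_gb:.0%}")  # → reduction: 67%
```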
| Quant | Bits/w | 7B size | Verdict | When to use |
|---|---|---|---|---|
| Q2_K | ~2.6 | 2.8 GB | Avoid | Toy experiments only. Coherence breaks down badly on models below 30B. |
| Q3_K_M | ~3.4 | 3.5 GB | Budget | Only on 30B+ models where the larger parameter count absorbs the damage. |
| Q4_K_S | ~4.1 | 4.1 GB | Budget | Tight VRAM on 7B class. Noticeable but usable quality trade-off. |
| Q4_K_M | ~4.8 | 4.4 GB | Sweet spot | The default choice. Best size-to-quality ratio for almost any model. Start here. |
| Q5_K_M | ~5.7 | 5.1 GB | Quality | When you have 15 to 20 percent more VRAM to spare. Slightly sharper on reasoning tasks. |
| Q6_K | ~6.6 | 5.9 GB | Quality | Nearly indistinguishable from FP16 on most benchmarks. Ceiling for practical use. |
| Q8_0 | ~8.5 | 7.2 GB | Reference | Functionally lossless. Useful as a quality reference or for fine-tuning base. |
| FP16 / F16 | 16.0 | 13.5 GB | Reference | Original precision. Only use if you're training, evaluating, or VRAM-rich. |
The file format dictates which runtime you can use and, more subtly, how well the model runs on your specific hardware. For most people, GGUF is the answer, but the alternatives matter on specific rigs.
**GGUF.** Successor to GGML. Single-file packaging with metadata embedded. Runs on CPU, GPU, and hybrid via llama.cpp, Ollama, LM Studio, and koboldcpp. The universal format: if you only learn one, learn this.

**EXL2.** Mixed-precision quantization targeting NVIDIA GPUs. Faster than GGUF on pure GPU workloads, especially for batched inference. Run via exllamav2 or TabbyAPI. Pays off on 3090/4090/5090-class cards.

**MLX.** Apple's native ML framework. On M-series chips, MLX versions often run 15 to 30 percent faster than equivalent GGUF builds because they exploit the unified memory architecture and Metal Performance Shaders directly.

**AWQ.** 4-bit GPU quantization with excellent quality retention. Favored by vLLM, TGI, and SGLang for production serving with request batching. Overkill for single-user desktop use, but the right answer for internal inference endpoints.

**GPTQ.** Earlier GPU quantization scheme. Largely superseded by AWQ and EXL2 for new deployments, but still widespread on Hugging Face. Works with transformers + auto-gptq. Fine if it's what's available.

**Safetensors.** The original weights as released by the lab. A safe replacement for pickle-based checkpoints. You want these only for training, fine-tuning, or converting to another format. Too large and slow for direct inference on most desks.
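The guidance above collapses into a small chooser. A sketch only: the labels encode this section's rules of thumb, not hard rules:

```python
def pick_format(setup: str) -> str:
    # Defaults to GGUF, the universal answer for everything else.
    special_cases = {
        "nvidia-gpu": "EXL2",       # pure-GPU speed on 3090/4090-class cards
        "apple-silicon": "MLX",     # unified memory + Metal
        "batch-serving": "AWQ",     # vLLM / TGI / SGLang endpoints
        "training": "safetensors",  # original weights for fine-tuning
    }
    return special_cases.get(setup, "GGUF")

print(pick_format("apple-silicon"), pick_format("cpu-laptop"))  # MLX GGUF
```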
What tokens-per-second ranges feel like in practice, and what hardware you need to hit them. Numbers below are for a 7B class model at Q4_K_M with a 4k context. Scale down roughly linearly for larger models.
The things that aren't in the README but show up the moment you actually try to use this stuff for real work. Hard-won operational knowledge, distilled.
Most local-LLM disappointment comes from mismatched expectations rather than weak hardware. These are the adjustments that make a desktop model feel professional rather than like a toy.
In llama.cpp, pass -fa. Cuts KV cache memory and speeds up prompt processing, especially at long contexts. It's stable enough now that there's no reason to leave it off.
Use --cache-type-k q8_0 --cache-type-v q8_0 to halve KV memory with negligible quality loss. This alone can unlock 2× the context on a tight VRAM budget.
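The arithmetic behind "halve KV memory, double the context," using an illustrative 7B-class cache shape (32 layers, 8 KV heads × 128 dims):

```python
def max_ctx(budget_gb: float, cache_bytes_per_elem: int) -> int:
    # K and V across all layers, per token
    per_token = 2 * 32 * 8 * 128 * cache_bytes_per_elem
    return int(budget_gb * 1024**3 / per_token)

fp16, q8 = max_ctx(2.0, 2), max_ctx(2.0, 1)
print(fp16, q8)  # → 16384 32768: same 2 GB budget, twice the context
```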
Look for "IQ" variants (IQ3_M, IQ4_XS) or quants marked "imatrix" in the filename. They use a calibration dataset to preserve quality in the weights that matter most, often matching a standard quant a full bit level higher.
Perplexity differences between Q4 and Q6 are tiny on paper but can mean the difference between working and broken JSON output. Always test the quant on your actual workload before committing.
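A minimal version of "test on your actual workload" for JSON tasks: parse every output and count failures instead of eyeballing perplexity. `json_ok` is a trivial checker; wire it to your runtime's completion call:

```python
import json

def json_ok(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# e.g. a borderline quant might drop a quote and break parsing:
print(json_ok('{"name": "widget", "qty": 3}'))  # True
print(json_ok('{"name": widget, "qty": 3}'))    # False
```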
Every instruct-tuned model has a specific chat template (ChatML, Llama-3, Mistral, Gemma). Using the wrong one degrades output dramatically, sometimes silently. Check the model card and verify your frontend is applying it correctly.
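For reference, this is what a correctly rendered ChatML prompt looks like; compare it against what your frontend actually sends. A hand-rolled sketch for inspection only, not a replacement for the model card's own template:

```python
def chatml(messages: list[dict]) -> str:
    rendered = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    )
    return rendered + "<|im_start|>assistant\n"  # generation prompt

print(chatml([{"role": "system", "content": "Be brief."},
              {"role": "user", "content": "Hi"}]))
```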
Models quote generation tok/s, but ingesting a 16k prompt can take 30+ seconds before the first output token. For RAG and long-context work, prompt processing is often the real bottleneck. Benchmark both.
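A benchmark should therefore report two numbers. The sketch below assumes a streaming API that yields tokens; `fake_stream` is a stand-in for your runtime:

```python
import time

def bench(stream):
    """Return (time_to_first_token_s, decode_tok_per_s) for a token stream."""
    t0 = time.perf_counter()
    it = iter(stream)
    next(it)                            # first token: prompt processing done
    ttft = time.perf_counter() - t0
    n_rest = sum(1 for _ in it)         # remaining decoded tokens
    decode_time = time.perf_counter() - t0 - ttft
    return ttft, n_rest / max(decode_time, 1e-9)

def fake_stream():                      # stand-in for a real runtime
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok

ttft, tps = bench(fake_stream())
print(f"TTFT {ttft:.2f}s, decode {tps:.0f} tok/s")
```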
Pair a small "draft" model with your main model (e.g. 1B drafts for a 70B). The large model verifies the draft's guesses in parallel, often delivering 1.5 to 3× speedup with zero quality loss. Supported in llama.cpp via --model-draft.
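A back-of-envelope model for that speedup, assuming draft cost is negligible and that drafted token *i* only survives if all earlier drafts were accepted. This is a simplification for intuition, not llama.cpp's actual scheduler:

```python
def expected_speedup(accept_rate: float, k: int) -> float:
    # Tokens emitted per big-model pass: the one it generates itself,
    # plus drafted token i, which survives with probability accept_rate**i.
    return 1 + sum(accept_rate**i for i in range(1, k + 1))

print(round(expected_speedup(0.7, 4), 2))  # → 2.77, inside the 1.5–3× range
print(round(expected_speedup(0.4, 4), 2))  # → 1.65 with a weaker draft model
```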
Mixture-of-Experts models (Mixtral, Qwen3-MoE, DeepSeek) activate only a fraction of parameters per token. A 47B MoE often runs at 13B speeds while delivering 70B-class quality. Check the "active parameters" figure, not the total.
Enable prompt caching / context shifting so that repeat system prompts and recent turns don't get reprocessed. llama.cpp's --prompt-cache and Ollama's built-in caching can turn a 5-second chat turn into a 500ms one.
Sustained GPU inference pushes cards to their power limits for minutes at a time. If your tok/s drops 15 to 20 percent after the first few prompts, it's thermal throttling, not the software. Improve case airflow before blaming the model.
The "hello, tell me a joke" benchmark is useless. Measure with a representative prompt from your actual workload, ideally a RAG chunk at the context length you'll really use. Numbers from YouTube reviews rarely match your use case.
Progress in post-training means a well-tuned 8B released this quarter often outperforms last year's 34B on practical tasks. Check leaderboards by date, not just by size. The frontier moves quickly, especially for chat and tool use.