Stop treating VRAM like a static bucket. Too many devs just tally up the model weights, pad it with a few gigs for “overhead,” and assume they’re in the clear. Then they crank the context window to 32k or 128k. Boom. Out-of-Memory error. It’s almost always the KV cache.
I was burned by this in production. I was sizing memory for an on-device inference project and padded 8 GB above the model weight requirement. Felt conservative. The model card said 32k context was supported. First long-context batch job: immediate OOM. The weights fit fine. The KV cache for that batch alone needed another 12 GB I hadn’t accounted for. That’s the trap. The weights are the cheap part. The KV cache is where the real VRAM war is fought.
The Mechanics of KV Cache VRAM Consumption
What the KV Cache Actually Stores
Transformers have to look at every single previous token to figure out what comes next. If you don’t use a cache, you’re recomputing the Key (K) and Value (V) vectors for every token in the sequence every single time you generate a new word. It’s a complete waste of cycles.
The KV cache fixes this by dumping these vectors into GPU memory the first time they’re computed. When the model hits token N+1, it just grabs the cached vectors for tokens 1 through N from VRAM. Simple. We’re trading VRAM for speed.
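To make the trade concrete, here is a minimal counting sketch. It involves no real model: it only tallies how many per-token K/V projections a decode loop would perform with and without a cache. The function name and the prompt and generation lengths are illustrative assumptions.

```python
def kv_projection_count(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Count per-token K/V projections performed while generating new_tokens.

    Illustrative counting model only: one "projection" means computing the
    K and V vectors for one token during one forward pass.
    """
    total = 0
    cached = 0            # tokens whose K/V already sit in the cache
    seq_len = prompt_len  # tokens visible at the current decode step
    for _ in range(new_tokens):
        if use_cache:
            total += seq_len - cached  # only tokens not yet cached
            cached = seq_len
        else:
            total += seq_len           # recompute K/V for the whole sequence
        seq_len += 1                   # the generated token joins the sequence
    return total


without_cache = kv_projection_count(1_000, 100, use_cache=False)
with_cache = kv_projection_count(1_000, 100, use_cache=True)
print(without_cache, with_cache)  # 104950 1099
```

The cached run pays for the prompt once and then one projection per new token, which is the work the cache buys back in exchange for memory.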
The Linear Scaling Problem
Linear scaling is the real killer here. Double your context length, and you double the VRAM the cache eats. It’s manageable at a few thousand tokens, but it turns into a wall the moment you try to implement long-context windows.
A million tokens will make the cache dwarf the model weights. I’ve run production setups where the model weights took 40 GB, but the KV cache for a high-concurrency workload demanded another 200 GB. That’s where the math stops working in your favor.
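A quick way to see where the cache overtakes the weights is to divide the weight footprint by the per-token cache cost. The sketch below assumes the Qwen3:8b shape used later in this article (36 layers, 8 KV heads, head_dim 128), an FP16 cache, and roughly 5 GB of quantized weights; `crossover_tokens` is a hypothetical helper name.

```python
def crossover_tokens(weights_gb: float, kv_bytes_per_token: int) -> int:
    """Total cached tokens (context length x batch size) at which the
    KV cache becomes as large as the model weights."""
    return int(weights_gb * 1024**3 / kv_bytes_per_token)


# Assumed shape: K and V (factor 2), 36 layers, 8 KV heads, head_dim 128, 2 bytes (FP16)
per_token = 2 * 36 * 8 * 128 * 2
print(per_token)                       # 147456
print(crossover_tokens(5, per_token))  # 36408
```

Under these assumptions the cache matches the weights at about 36k cached tokens in total, whether they come from one long request or several shorter concurrent ones.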
Computation-Bound vs. Memory-Bound Inference
This is the flip that catches most people. You profile prefill on a short prompt and see 1,000 tokens per second. Looks great. Then you ship code that generates into 20k contexts. The decode phase drags to 30 tokens per second and nobody knows why until you look at how much time the GPU is spending moving data instead of computing.
During the prefill phase, the GPU is hammering matrix multiplications. It’s computation-bound and fast. Everything changes during decoding. Once you’re generating tokens one by one, you’re memory-bound. The GPU spends more time shuffling the KV cache from VRAM to the compute cores than it does doing math. That’s why tokens-per-second tanks as the sequence grows, regardless of how expensive your GPU was.
I debugged exactly this failure once. An inference project was generating fine at 8k contexts, fast enough for interactive use. At 32k, generation speed fell off a cliff mid-response. GPU utilization looked fine. Memory bandwidth was saturated. The KV cache had grown to four times the model weight footprint and the hardware was just shuffling data.
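The memory-bound regime can be approximated with a back-of-the-envelope ceiling: if every generated token has to stream the weights and the whole KV cache from memory once, throughput cannot exceed bandwidth divided by bytes moved. This is an idealized sketch, not a profiler. The function name is hypothetical, and the 3350 GB/s bandwidth, 5 GB weight footprint, and cache sizes are assumed inputs.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float,
                                  weights_gb: float,
                                  kv_cache_gb: float) -> float:
    """Upper bound on single-stream decode speed in the memory-bound regime.

    Assumes each generated token reads all weights and the entire KV cache
    exactly once, and ignores compute time, kernel overhead and on-chip caches.
    """
    gb_moved_per_token = weights_gb + kv_cache_gb
    return bandwidth_gb_s / gb_moved_per_token


# Same hardware and weights, growing KV cache
for kv_gb in (1.0, 4.39, 17.58):
    ceiling = decode_tokens_per_sec_ceiling(3350, 5.0, kv_gb)
    print(f"KV cache {kv_gb:>5} GB -> at most {ceiling:.0f} tokens/s")
```

The ceiling falls as the cache grows even though the compute hardware is unchanged, which is the pattern described above. Real throughput will sit below this bound.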
Step-by-Step VRAM Calculation for Non-Engineers
The Core Variables
Forget the linear algebra PhD. You only need three numbers to figure out your VRAM. First: layer count. Second: KV heads. Be careful: GQA means this isn’t the same as query heads. Third: head dimension.
These constants set the vector shape. If the model uses GQA, you’ll save significant memory because you aren’t storing as many KV heads. It’s a major improvement over old multi-head attention.
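To see how much GQA saves, compare the per-token cost of the same model shape with and without shared KV heads. The 32-layer, 32-query-head configuration below is a hypothetical example rather than a specific released model, and `kv_bytes_per_token` is an illustrative helper.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies.

    The factor of 2 covers one K vector plus one V vector per layer.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Hypothetical model: 32 layers, 32 query heads, head_dim=128, FP16 cache
mha = kv_bytes_per_token(32, kv_heads=32, head_dim=128)  # one KV head per query head
gqa = kv_bytes_per_token(32, kv_heads=8, head_dim=128)   # 4 query heads share each KV head
print(mha / 1024**2, gqa / 1024**2)  # 0.5 0.125 (MiB per token)
```

Going from 32 to 8 KV heads cuts the cache by exactly 4x, because the cache scales with KV heads and not with query heads.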
The “Rule of Thumb” Metric
Want a shortcut? A conservative upper bound for large models without GQA is about 1 MB per token. Modern GQA models are more efficient, closer to 0.14–0.3 MB per token depending on architecture.
Even at the low end, this adds up fast. At roughly 0.2 MB per token, a 1-million-token context requires about 200 GB of VRAM for the KV cache alone; at the 1 MB-per-token full-MHA bound it would be closer to 1 TB. That’s before you touch model weights or activation memory. This is where the single-GPU fantasy ends and you have to buy a cluster or use heavy quantization.
I tried this once on a high-memory production inference server. I wanted a repository-aware coding agent with 128k context on a 13B model. Weights took 26 GB. The KV cache for a single 128k request needed 52 GB. Technically fits on an 80 GB card. Barely. Add a batch size of two and you’re immediately over. The agent was useful at 16k. At 128k, it needed a cluster.
Walkthrough of the Memory Formula
The formula is straightforward multiplication. Start with bytes per element: FP16 is 2 bytes, INT8 is 1. Multiply by the head dimension and the number of KV heads to get the size of one cached vector per layer. Double it, because every layer stores both a K and a V vector. Multiply by the number of layers to get the per-token cost. Finally, multiply by context length and batch size.
```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch_size=1, dtype_bytes=2):
    """
    Calculate KV cache memory in GB (1024**3 bytes).
    dtype_bytes: 2=FP16, 1=INT8/Q8, 0.5 approx Q4
    """
    # Factor of 2: each layer stores one K and one V vector per token
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    total_bytes = per_token * context_len * batch_size
    return total_bytes / (1024 ** 3)


# Qwen3:8b: 36 layers, 8 GQA KV heads, head_dim=128
fp16_32k = kv_cache_gb(36, 8, 128, 32_000)
fp16_128k = kv_cache_gb(36, 8, 128, 128_000)
q8_128k = kv_cache_gb(36, 8, 128, 128_000, dtype_bytes=1)

print(f"32k ctx, FP16: {fp16_32k:.2f} GB")    # 4.39 GB
print(f"128k ctx, FP16: {fp16_128k:.2f} GB")  # 17.58 GB
print(f"128k ctx, Q8: {q8_128k:.2f} GB")      # 8.79 GB
```

Substitute your model’s actual layer count, KV head count, and head dimension. For any model using GQA, the KV head count is usually 4–8 regardless of how many query heads the model has.
How Much Context Can Your Hardware Actually Handle?
The shortcut formula: max_context = (available_vram − model_weight_footprint) / kv_bytes_per_token. Simple. The catch is that “available VRAM” is not the same as “total VRAM”: you lose memory to activation buffers, the CUDA runtime, and OS overhead, typically 1–2 GB.
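The table below uses Qwen3:8b as the reference model: 36 layers, 8 GQA KV heads, head_dim=128, loaded at Q4_K_M quantization (~5 GB). Single batch, FP16 and Q8 KV cache. The figures subtract only the model weights from total memory, not the 1–2 GB of runtime overhead, so treat them as upper bounds.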
| Hardware | Total Memory | Max Context (FP16 KV) | Max Context (Q8 KV) |
|---|---|---|---|
| RTX 4090 | 24 GB VRAM | ~138k tokens | ~276k tokens |
| RTX 5090 | 32 GB VRAM | ~197k tokens | ~393k tokens |
| H100 SXM | 80 GB HBM | ~546k tokens | ~1.09M tokens |
| M4 Pro | 24 GB unified | ~138k tokens | ~276k tokens |
| M4 Max | 64 GB unified | ~430k tokens | ~860k tokens |
| M4 Max | 128 GB unified | ~896k tokens | ~1.79M tokens |
| AMD Ryzen AI MAX | 128 GB shared | ~896k tokens | ~1.79M tokens |
Two things to understand about this table. First: all figures assume a single inference request. Production batch sizes multiply the KV cache footprint directly. A batch size of 8 divides every row’s max context by 8. Second: the Apple Silicon and AMD APU rows achieve their large context numbers through unified memory, which is a different architecture from discrete GPU VRAM. HBM bandwidth on an H100 is 3.35 TB/s. LPDDR5X on an M4 Max is around 546 GB/s. You can hold a 900k-token context on an M4 Max 128 GB, but generation at that context length will be slower than an H100 at 500k tokens.
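The rows in the table can be reproduced with a few lines. This is a sketch under the same assumptions as the table (Qwen3:8b shape, ~5 GB of weights, nothing subtracted for runtime overhead unless you pass `overhead_gb`); the helper name and its parameters are illustrative.

```python
def max_context_tokens(total_gb: float, weights_gb: float, kv_bytes_per_token: int,
                       overhead_gb: float = 0.0, batch_size: int = 1) -> int:
    """Largest per-request context that fits after weights and overhead."""
    free_bytes = (total_gb - weights_gb - overhead_gb) * 1024**3
    return int(free_bytes / (kv_bytes_per_token * batch_size))


# Qwen3:8b shape: K and V, 36 layers, 8 KV heads, head_dim 128
PER_TOKEN_FP16 = 2 * 36 * 8 * 128 * 2
PER_TOKEN_Q8 = PER_TOKEN_FP16 // 2

for name, total_gb in [("RTX 4090", 24), ("RTX 5090", 32), ("H100 SXM", 80)]:
    fp16 = max_context_tokens(total_gb, 5, PER_TOKEN_FP16)
    q8 = max_context_tokens(total_gb, 5, PER_TOKEN_Q8)
    print(f"{name}: FP16 ~{fp16:,} tokens, Q8 ~{q8:,} tokens")

# The same 24 GB card with a batch of 8 concurrent requests
print(max_context_tokens(24, 5, PER_TOKEN_FP16, batch_size=8))  # 17294
```

The 24 GB row comes out at about 138k tokens for FP16, matching the table. Passing `overhead_gb=1.5` or a larger `batch_size` shows how quickly the practical ceiling drops below the headline number.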
For Apple Silicon specifically, the unified memory architecture means there is no separate VRAM pool; the same physical memory serves both the CPU and GPU. This is significant. You can run Qwen3:8b at 64 GB unified memory and get a maximum context roughly three times that of a 24 GB discrete GPU, and the model runs under local inference tools like Ollama and llama.cpp, which are well optimized for Apple Silicon via Metal and MLX.
For AMD APU setups (Ryzen AI MAX series), the situation is similar to Apple Silicon: system RAM doubles as GPU memory. The theoretical context limits match Apple Silicon at equivalent RAM, but inference bandwidth is typically lower than Apple’s memory subsystem on the AI workloads I’ve run.
If you’re doing serious local inference planning and haven’t nailed down the full VRAM budget beyond just the KV cache, start there first: model weights, quantization overhead, and activation memory all compete for the same pool.
Model-Specific VRAM Planning Tables
Qwen 3.5 9B Consumption Profile
If you’re sizing hardware for Qwen 3.5 9B, your KV cache precision is where you’ll win or lose. Based on a 32K context window, the VRAM requirements shift fast:
| KV Cache Config | VRAM Usage (32K ctx) | Recommendation |
|---|---|---|
| FP16 K + FP16 V | 16.0 GB | Use only if VRAM is plentiful |
| Q8 K + Q8 V | 8.0 GB | Production default |
| Q4 K + Q8 V | 6.0 GB | Use when desperate |
I’ve done this wrong in production. I built a deployment with FP16 KV cache because it was the default, tested it at 4k context in staging, shipped it. Production batch sizes were 8x staging. Hit OOM on the first real traffic spike. The KV cache per batch was 8x what I’d validated. The fix was Q8 KV, which cut the cache footprint in half immediately and ran on the same hardware without any other changes.
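The three rows of the table follow from simple scaling: half of an FP16 cache is K, half is V, and each half shrinks in proportion to its bytes per element. The sketch below reproduces the table from its own 16 GB FP16 figure. The helper name is hypothetical, and real block-quantized formats store small per-block scale factors, so actual usage runs slightly higher.

```python
def mixed_kv_cache_gb(fp16_total_gb: float, k_bytes: float, v_bytes: float) -> float:
    """Scale a known FP16 K+V footprint to other K and V precisions.

    FP16 uses 2 bytes per element for both K and V, so each accounts for
    half of the FP16 total.
    """
    k_part = fp16_total_gb / 2 * (k_bytes / 2)
    v_part = fp16_total_gb / 2 * (v_bytes / 2)
    return k_part + v_part


print(mixed_kv_cache_gb(16.0, 2, 2))    # 16.0  FP16 K + FP16 V
print(mixed_kv_cache_gb(16.0, 1, 1))    # 8.0   Q8 K + Q8 V
print(mixed_kv_cache_gb(16.0, 0.5, 1))  # 6.0   Q4 K + Q8 V
```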
Llama 3 and Mistral Memory Requirements
The logic is identical for Llama 3 and Mistral. Both use GQA, so confirm the KV head count for the specific variant you’re running, then plug your numbers into the formula above. Llama 3 70B has 80 layers and 8 GQA KV heads, which puts it in a different class than 8B models for context length planning. A 32k context on Llama 3 70B at FP16 requires significantly more KV cache than the same context on a 9B model, even with the same GQA configuration, purely because of the layer count.
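As a worked instance of the layer-count effect, the sketch below compares Llama 3 70B with the Qwen3:8b shape at the same 32,768-token context. The head dimension of 128 for Llama 3 70B is an assumption not stated above; check the model's config.json before relying on it.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int, context_len: int,
                 dtype_bytes: int = 2, batch_size: int = 1) -> float:
    """KV cache size in GiB; the factor of 2 covers K and V."""
    total_bytes = 2 * layers * kv_heads * head_dim * dtype_bytes * context_len * batch_size
    return total_bytes / 1024**3


llama3_70b = kv_cache_gib(layers=80, kv_heads=8, head_dim=128, context_len=32_768)
qwen3_8b = kv_cache_gib(layers=36, kv_heads=8, head_dim=128, context_len=32_768)
print(llama3_70b, qwen3_8b)               # 10.0 4.5
print(round(llama3_70b / qwen3_8b, 2))    # 2.22
```

With identical KV heads and head dimension, the ratio is simply 80 layers over 36.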
Impact of Batch Size on Total VRAM
Those tables assume a batch size of one. That’s fine for a demo but a fantasy for production. KV cache requirements don’t just grow: they multiply. Running a batch size of 16 means multiplying the entire KV cache footprint by 16. Most engineers plan for a 32K context window and then act surprised when ten concurrent users trigger an immediate OOM. The table above is your single-request ceiling. Every additional concurrent request cuts your per-request budget proportionally.
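Turning this around gives a concurrency budget: how many requests of a given cache size fit after the weights are loaded. The sketch uses the 80 GB card, 26 GB of weights, and 52 GB per 128k request from the coding-agent anecdote earlier; the helper name and the `overhead_gb` parameter are assumptions.

```python
import math


def max_concurrent_requests(total_gb: float, weights_gb: float,
                            kv_gb_per_request: float, overhead_gb: float = 0.0) -> int:
    """How many requests' KV caches fit alongside the weights."""
    free_gb = total_gb - weights_gb - overhead_gb
    return max(0, math.floor(free_gb / kv_gb_per_request))


print(max_concurrent_requests(80, 26, 52))      # 1  (128k context per request)
print(max_concurrent_requests(80, 26, 52 / 8))  # 8  (16k context, one eighth of the cache)
```

This treats every request as using its full context, which is the worst case a static allocation has to plan for.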
Quantization Trade-offs: FP16 vs. INT8 vs. INT4
I spent more time than I should have defaulting to FP16 KV because it was the obvious “safe” choice. It isn’t. The quality difference between FP16 and Q8 is negligible for almost every use case, and the memory savings are substantial. The only time FP16 KV makes sense is when VRAM is genuinely not a constraint.
The “Free Lunch” of Q8 (INT8)
Q8 (INT8) KV cache quantization is the closest thing to a free lunch in local inference. Switch from FP16 to Q8 and you immediately cut your VRAM usage in half. Quality loss is negligible: research shows a perplexity delta of under 0.1% across every model measured. Modern GPUs have hardware support for int-to-float conversion on the fly, so you get the memory savings without a real latency hit.
The deployment that OOM’d on my first traffic spike was fixed immediately. After switching to Q8 KV, the same workload ran stable. I ran eval on identical prompts before and after. Zero measurable quality difference. If you’re still running FP16 in production, you’re just burning money.
In llama.cpp, add --cache-type-k q8_0 --cache-type-v q8_0. In vLLM, set --kv-cache-dtype fp8; note that this stores the cache as 8-bit floating point (FP8) rather than INT8, with the same one-byte-per-element footprint. In MLX on Apple Silicon, the --kv-bits 8 flag sets Q8 KV.
The Risks of Q4 (INT4) Quantization
Drop to Q4 and you’re playing with fire, specifically on the V-cache. K-caches handle Q4 acceptably. V-caches don’t.
I tried full Q4 (both K and V) once on a long-context summarization task over 40k tokens. The model started losing references it had clearly seen, citing details from earlier in the context incorrectly, paraphrasing with invented specifics. Switched to the Asymmetric approach (Q4 K + Q8 V) and the problem vanished. The K-cache encodes positional and semantic lookup structure. The V-cache stores the actual content values that flow into the output. Quantizing those values to Q4 corrupts the information the model is actually trying to retrieve.
The rule: Q4 K-cache is acceptable when you need the memory. Q4 V-cache is not worth it unless you’ve exhausted every other option.
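To get a feel for how much more information 4 bits discard than 8, the sketch below round-trips a random vector through a naive quantizer and measures the error. It is an illustration under stated assumptions: symmetric per-vector absmax quantization on Gaussian data, not the block formats llama.cpp actually uses, and it does not model the difference between K and V behavior described above.

```python
import random


def quantize_roundtrip(values, bits):
    """Naive symmetric absmax quantization to signed integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole vector
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute difference between original and round-tripped values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
vector = [random.gauss(0.0, 1.0) for _ in range(4096)]
err8 = mean_abs_error(vector, 8)
err4 = mean_abs_error(vector, 4)
print(f"Q8 error: {err8:.4f}, Q4 error: {err4:.4f}, ratio: {err4 / err8:.1f}x")
```

The step size at 4 bits is about 18 times coarser than at 8 bits (127 levels versus 7 on each side of zero), which is why the error grows far more than the 2x memory saving would suggest.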
Performance Metrics: Perplexity and HumanEval
The numbers make the tradeoff clear. Moving from FP16 to Q8 has essentially zero impact on HumanEval benchmarks; the delta is within noise. Moving to the Asymmetric Q4 K + Q8 V configuration shows a HumanEval delta of -1.5 and a perplexity increase of 1.4%. That’s measurable, not catastrophic, but real. Full Q4 K + Q4 V is worse still. The signal is unambiguous: Q8 always, Asymmetric Q4/Q8 when desperate, full Q4 only as a last resort before you give up and add hardware.
TurboQuant: When 3-Bit KV Cache Actually Works
Most KV cache quantization research hits the same wall: compress below Q4 and quality collapses. TurboQuant, published at ICLR 2026 by Google Research (arXiv:2504.19874), breaks that pattern. It achieves an effective 3.5-bit average compression on the KV cache while maintaining quality metrics that hold up against FP16 baselines.
How PolarQuant + QJL Works
TurboQuant is a two-stage pipeline. The first stage, PolarQuant, applies a random rotation to KV vectors before quantization; nothing is learned from data. This rotation spreads energy evenly across dimensions, making the vectors far more amenable to low-bit representation. The distribution becomes easier to encode at low precision without the outlier explosion that kills naive Q4.
The second stage is QJL (Quantized Johnson-Lindenstrauss), a 1-bit residual that captures what PolarQuant left unrepresented. Together the two stages average out to roughly 3.5 bits per element: better compression than Q4 with quality characteristics closer to Q8. No retraining required. TurboQuant runs as a post-hoc transformation on any existing model.
This matters for Mixture of Experts deployments too. Expert routing applies to the feed-forward layers, so it does not change the KV cache cost per token, but MoE models carry a large total weight footprint that leaves less memory for the cache at long contexts. TurboQuant’s no-training requirement means it can compress KV caches for any model out of the box, including large MoE deployments where retraining would be prohibitive.
What 6× Compression Actually Buys You
The performance numbers are significant. Against an FP16 baseline: 6× smaller KV cache footprint, up to 8× faster attention on H100 GPUs, and quality loss within the noise floor of Q8. That 8× attention speedup comes from memory bandwidth being the bottleneck for long-context decode. A smaller cache means less data movement, which is where memory-bound generation time gets spent.
Practically: a workload that needed 200 GB for KV cache at FP16 drops to roughly 33 GB with TurboQuant. That’s the difference between a multi-GPU H100 deployment and a single 80 GB H100 for a 1M-token context. The implications for the hardware table above are significant: every max-context figure grows by roughly 2.3× to 3× relative to Q8 if you can use TurboQuant (8 bits versus 3.5 bits per element, or 6× versus 2× against FP16).
Using TurboQuant Today
TurboQuant is open-source with reference implementations targeting HuggingFace Transformers. If you’re on a serving stack that uses the Transformers backend, you can use it now. For vLLM, community integration is ongoing as of mid-2026. For llama.cpp and Ollama, native support requires format-level changes that haven’t landed yet.
The honest recommendation: if you’re running a Transformers-based datacenter serving stack, TurboQuant is worth testing now. If you’re on a local inference stack, watch the llama.cpp and MLX issue trackers. The memory savings are compelling enough that native support will follow.
Advanced Memory Management & Fragmentation
PagedAttention vs. Standard Caching
Standard contiguous caching wastes space by design. It forces you to allocate a huge, unbroken block of memory for the maximum possible sequence length, which creates massive internal fragmentation. One of the things that pushed me to vLLM early was watching utilization stats on a standard caching setup: 60–70% memory efficiency, the rest internal fragmentation, memory allocated but never touched because sequences ended before their reserved blocks were used.
PagedAttention, the core innovation behind vLLM, fixes this by treating the KV cache like virtual memory in an OS. It breaks the cache into smaller pages. The raw memory per token doesn’t change, but the waste does. Dramatically. I’ve seen effective utilization jump from that 60% range to over 90% after switching to a vLLM serving backend for the same workload.
In production, I hit an OOM that illustrated the fragmentation problem directly. nvidia-smi showed 15 GB free. The allocation for the next request failed anyway. The KV cache manager had reserved contiguous blocks for active sequences and there was no single contiguous block of the required size left. Total free memory was fine; the layout was fragmented.
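The gap between the two allocation strategies can be approximated with a toy calculation. The sketch below compares reserving a full maximum-length slab per sequence against allocating fixed-size blocks on demand. The sequence lengths, the 8,192-token reservation, and the 16-token block size are illustrative assumptions, and this models only the bookkeeping, not vLLM's actual allocator.

```python
import math


def contiguous_utilization(seq_lens, max_len):
    """Fraction of reserved token slots in use when every sequence
    reserves a contiguous slab sized for max_len tokens."""
    reserved = len(seq_lens) * max_len
    return sum(seq_lens) / reserved


def paged_utilization(seq_lens, block_size):
    """Fraction of reserved token slots in use when each sequence only
    holds as many fixed-size blocks as it has filled so far."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return sum(seq_lens) / reserved


seq_lens = [500, 2_000, 8_000, 1_200]  # actual tokens per active sequence
print(round(contiguous_utilization(seq_lens, 8_192), 3))  # 0.357
print(round(paged_utilization(seq_lens, 16), 3))          # 0.999
```

With paging, the only waste is the unfilled tail of each sequence's last block, so it is bounded by one block per sequence regardless of how long the maximum context is.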
Hybrid Memory Solutions and Offloading
Once your context window exceeds VRAM, you’ve got two choices: crash or offload. Hybrid memory solutions push the KV cache to CPU RAM or NVMe storage. This stops the OOM errors. The latency penalty is brutal.
I tested CPU offload for KV cache in an on-device inference setup when VRAM was tight. It stopped the OOM errors immediately. It also made interactive generation unusable, going from 35 tokens per second to under 3. The PCIe bus is not VRAM. For anything a human is watching in real time, offloading is just a slower crash. Don’t use it for anything requiring interactive generation. It’s only viable for asynchronous tasks or cold context retrieval where a 10-second response is acceptable.
Cache Eviction and Compression Strategies
If you’re fighting for every megabyte of VRAM in long-context windows, you have five real options: cache eviction (dropping tokens that don’t matter), cache compression (including quantization schemes like TurboQuant), hybrid memory solutions, novel attention mechanisms, and combination strategies. These aren’t academic exercises; they’re the only way to stop the linear scaling problem. You have to stop treating every single token like it needs a permanent home in VRAM. The arXiv 2603.20397 survey from Dell Technologies is worth reading for a systematic comparison of trade-offs across all five directions if you’re building a production serving stack.
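As one concrete example of eviction, the sketch below keeps a few initial "sink" tokens plus a sliding window of the most recent tokens and drops everything in between. This particular policy is an assumption chosen for illustration and is not taken from the survey; the function name and parameters are hypothetical, and it operates on token positions rather than real tensors.

```python
def evict_sliding_window(cached_positions, num_sink, window):
    """Return the token positions that stay in the cache.

    Keeps the first num_sink positions and the most recent `window`
    positions; everything in between is evicted.
    """
    if len(cached_positions) <= num_sink + window:
        return list(cached_positions)  # nothing to evict yet
    sinks = list(cached_positions[:num_sink])
    recent = list(cached_positions[len(cached_positions) - window:])
    return sinks + recent


positions = list(range(20))
print(evict_sliding_window(positions, num_sink=4, window=8))
# [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size is now capped at `num_sink + window` tokens no matter how long generation runs. The cost is that the model can no longer attend to the evicted middle of the context, which is the quality trade-off every eviction scheme has to manage.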
Deployment Scenarios and Hardware Mapping
Long-Context Coding Agents
Context length is where you’ll hit the wall first for coding agents that ingest entire repositories. Stick to Q8 or Asymmetric Q4/Q8 quantization. A single A100 (80 GB) is tighter than it looks. At 128k context on a 13B model, the KV cache alone eats most of the available headroom after model weights.
I ran this experiment on a production inference cluster. 80 GB total, 26 GB for model weights, 52 GB for KV cache at a single 128k request. That’s 78 GB out of 80 GB for one inference request. Add any real batch size and it’s gone. The agent worked at 16k context with comfortable headroom. At 128k, it needed cluster-level deployment. There’s no way around the physics.
For serious long-context agent work, you’re forced into high-VRAM hardware or Apple Silicon’s unified memory architecture, or you apply TurboQuant and push the numbers down by 6×.
High-Throughput Datacenter Serving
When you’re running in a datacenter, throughput is everything. Run vLLM with an 8-bit KV cache and per-tensor scaling (FP8 in vLLM’s case). It’s the industry standard for a reason. Native hardware conversion ensures quantization doesn’t slow down the attention inner loop, which lets you crank batch sizes and keeps the GPU actually working. PagedAttention handles the fragmentation problem, which is critical when you’re running hundreds of concurrent requests with variable sequence lengths.
Local Inference and Edge Devices
Switching to local setups, especially on Apple Silicon, changes the game. Unified memory is effectively a cheat code. Tools like MLX and llama.cpp let you use system RAM as VRAM, because on Apple Silicon there is no distinction. Mac Studio configurations with 192 GB of unified memory handle larger KV caches than any current discrete NVIDIA GPU.
I’ve run both for long-context work and the comparison is straightforward. The 4090 is faster at short contexts, until it isn’t. At 32k context on a 14B model, a 4090’s 24 GB VRAM is already at the edge. The Mac Studio at 64 GB unified memory runs the same session without pressure, and you can push to 64k or 96k before hitting limits. For sustained long-context generation where stability matters more than raw throughput, the Mac wins. For batch inference at short-to-medium contexts where you need maximum tokens per second, the discrete GPU wins.
For AMD APU setups (Ryzen AI MAX series), you get similar context capacity advantages to Apple Silicon at equivalent memory configurations. The inference bandwidth is lower than Apple Silicon’s memory subsystem in my testing, but the context window capability is comparable, and it runs on Windows, which matters for some deployment targets.
Use this formula: KV cache bytes = 2 × layers × kv_heads × head_dim × context_length × batch_size × bytes_per_element. For FP16, bytes_per_element is 2. For Q8, it is 1. Find your model's layer count, KV head count (not query heads), and head dimension in the model card or config.json. GQA models like Qwen3 and Llama 3 have far fewer KV heads than query heads, which significantly reduces the footprint.
Use Q8 (INT8) by default. It cuts VRAM usage in half with under 0.1% perplexity impact, and modern GPUs handle the int-to-float conversion without measurable latency overhead. If you need to go lower, use the Asymmetric approach: Q4 for the K-cache, Q8 for the V-cache. Avoid Q4 on the V-cache, as it causes measurable quality degradation at long contexts. Full Q4 on both K and V should only be a last resort.
Apple Silicon is fundamentally different from discrete GPUs here. It uses unified memory, where the CPU and GPU share the same physical memory pool, so there is no separate VRAM limit. A Mac Studio with 128 GB unified memory can hold a KV cache that would require a multi-GPU setup on discrete hardware. The trade-off is memory bandwidth: Apple Silicon's LPDDR5X peaks around 546 GB/s, versus over 3 TB/s for H100 HBM3. You get more context capacity but lower tokens-per-second at the high end.
TurboQuant is a two-stage KV cache compression method from Google Research (ICLR 2026) that achieves 3.5-bit average compression with quality comparable to Q8. It uses a rotation transform stage (PolarQuant) followed by a 1-bit residual (QJL) and requires no model retraining. It delivers 6x memory reduction and up to 8x faster attention on H100 GPUs. As of mid-2026, it works with HuggingFace Transformers stacks. llama.cpp and Ollama support is not yet native. If you are running a Transformers-based serving stack, it is worth testing now.
Generation slows down at long context because decoding is memory-bandwidth-bound, not compute-bound. During the prefill phase, the GPU is doing matrix multiplications and is compute-bound. During autoregressive decoding, it must read the entire KV cache for every generated token. As the KV cache grows, the GPU spends more time moving data from VRAM to compute cores than actually computing. No amount of compute throughput fixes this; only reducing the KV cache footprint (via quantization or compression) or increasing memory bandwidth helps.
When the KV cache outgrows available VRAM, you get an OOM error and the request fails, unless you have offloading configured. CPU offloading pushes the KV cache to system RAM over the PCIe bus, which stops the OOM but reduces generation speed dramatically, often from 30+ tokens per second to under 5. NVMe offloading is even slower. For interactive use cases, offloading is rarely acceptable. The practical options are: reduce context length, apply KV quantization, switch to a hardware platform with more memory, or use PagedAttention via vLLM to eliminate fragmentation waste.
