VRAM Estimator
Estimate the VRAM required for model inference, and review the formulas, deployment matrices, and advanced model-structure settings.
Inputs
Update any input and the estimate will refresh automatically.
Deployment Evaluation Matrix
Estimate VRAM usage across different prompt-token and concurrency combinations.
Running the matrix will temporarily override the current prompt-token and concurrency values with the lists below.
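The sweep the matrix performs can be pictured with a small sketch. It is not the estimator's own code. It assumes that effective KV tokens equal (prompt tokens + output tokens) × concurrency and that KV cache size scales linearly with that count. The names `effective_kv_tokens`, `evaluation_matrix`, and `kv_bytes_per_token` are hypothetical.

```python
from itertools import product

GIB = 1024 ** 3


def effective_kv_tokens(prompt_tokens, output_tokens, concurrency):
    # Assumption: every concurrent request holds its full prompt plus
    # its full output in the KV cache at the same time.
    return (prompt_tokens + output_tokens) * concurrency


def evaluation_matrix(prompt_list, concurrency_list, output_tokens,
                      kv_bytes_per_token):
    """Return one row per (prompt tokens, concurrency) combination."""
    rows = []
    for prompt_tokens, concurrency in product(prompt_list, concurrency_list):
        tokens = effective_kv_tokens(prompt_tokens, output_tokens, concurrency)
        rows.append({
            "prompt_tokens": prompt_tokens,
            "concurrency": concurrency,
            "kv_tokens": tokens,
            "kv_gib": tokens * kv_bytes_per_token / GIB,
        })
    return rows
```

Under these assumptions, a 4096-token prompt with 256 output tokens at concurrency 8 gives 34816 effective KV tokens. That matches the figure shown in the results, but the page does not state which inputs actually produced it.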
Estimate Results
LLM-only
Recommended total VRAM: 13.8 GiB
Recommended per-GPU VRAM: 13.8 GiB
Effective KV tokens: 34816
Model weights: 4.10 GiB
KV cache: 4.78 GiB
Activation overhead: 1.20 GiB
Runtime buffer: 1.44 GiB
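The figures above can be cross-checked with a short sketch. It uses the common KV cache formula (2 × layers × KV heads × head dimension × bytes per element × tokens). The architecture values (36 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen because they reproduce the 4.78 GiB shown; the page does not state the actual settings. The 20% headroom factor is likewise an inference from the numbers, not a documented rule.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Factor 2 accounts for storing both keys and values per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed architecture values (hypothetical, not taken from the page).
kv_gib = kv_cache_bytes(34816, layers=36, kv_heads=8, head_dim=128,
                        bytes_per_elem=2) / GIB

# Component values copied from the breakdown, in GiB.
components = {
    "model_weights": 4.10,
    "kv_cache": 4.78,
    "activation_overhead": 1.20,
    "runtime_buffer": 1.44,
}
subtotal = sum(components.values())

# Assumption: the recommended total adds about 20% headroom to the subtotal.
HEADROOM = 1.2
recommended = subtotal * HEADROOM

print(f"KV cache: {kv_gib:.2f} GiB")
print(f"Subtotal: {subtotal:.2f} GiB")
print(f"Recommended: {recommended:.1f} GiB")
```

The four components sum to 11.52 GiB, and 11.52 × 1.2 rounds to the 13.8 GiB recommendation, which is why the sketch treats the gap as headroom.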
| Common GPU tier | Memory per GPU | Covers current single-instance need |
|---|---|---|
| RTX 4090 24G | 24 GiB | Covers it |
| RTX 5090 32G | 32 GiB | Covers it |
| L40S / RTX 6000 Ada 48G | 48 GiB | Covers it |
| H100 / H800 80G | 80 GiB | Covers it |
| H200 141G | 141 GiB | Covers it |
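The coverage column amounts to comparing each tier's memory against the recommended per-GPU VRAM. A minimal sketch of that check follows, assuming a plain greater-or-equal comparison. The names `GPU_TIERS` and `coverage` are hypothetical, and the real tool may reserve extra margin.

```python
# Tier names and memory sizes (GiB) copied from the table above.
GPU_TIERS = [
    ("RTX 4090 24G", 24),
    ("RTX 5090 32G", 32),
    ("L40S / RTX 6000 Ada 48G", 48),
    ("H100 / H800 80G", 80),
    ("H200 141G", 141),
]


def coverage(required_gib, tiers=GPU_TIERS):
    """Return (name, memory_gib, covers) for each GPU tier."""
    # Assumption: a tier covers the need when its memory is at least
    # the recommended per-GPU VRAM, with no additional safety margin.
    return [(name, mem, mem >= required_gib) for name, mem in tiers]


for name, mem, covers in coverage(13.8):
    status = "Covers it" if covers else "Does not cover it"
    print(f"{name}: {mem} GiB, {status}")
```

With the 13.8 GiB recommendation, every listed tier passes the check, which agrees with the table.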
How to use
1. Choose a model preset. Start from a preset that matches your target model size.
2. Set runtime inputs. Adjust quantization, prompt tokens, output length, and concurrency values.
3. Review the estimate. Check the VRAM breakdown, GPU coverage, and optimization hints.