VRAM Estimator

Estimate model inference VRAM and review formulas, deployment matrices, and advanced structure settings.

Inputs

Update any input and the estimate will refresh automatically.

Deployment Evaluation Matrix

Estimate VRAM usage across different prompt-token and concurrency combinations.

Running the matrix will temporarily override the current prompt-token and concurrency values with the lists below.
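
A matrix run of this kind can be sketched as a sweep over every prompt-token and concurrency pair. The sketch below is not the tool's actual formula: it assumes each concurrent request holds its prompt plus its output tokens in the KV cache, and the names `kv_cache_gib`, `run_matrix`, and the 144 KiB per-token figure are hypothetical values chosen for illustration.

```python
from itertools import product

GIB = 1024 ** 3


def kv_cache_gib(prompt_tokens, output_tokens, concurrency, kv_bytes_per_token):
    """KV cache size in GiB, assuming every request holds prompt + output tokens."""
    effective_tokens = (prompt_tokens + output_tokens) * concurrency
    return effective_tokens * kv_bytes_per_token / GIB


def run_matrix(prompt_list, concurrency_list, output_tokens, kv_bytes_per_token):
    """Return {(prompt_tokens, concurrency): KV cache GiB} for every combination."""
    return {
        (p, c): kv_cache_gib(p, output_tokens, c, kv_bytes_per_token)
        for p, c in product(prompt_list, concurrency_list)
    }


# Hypothetical inputs: 144 KiB of KV cache per token, 256 output tokens per request.
matrix = run_matrix(
    prompt_list=[1024, 4096],
    concurrency_list=[1, 8],
    output_tokens=256,
    kv_bytes_per_token=144 * 1024,
)

for (p, c), gib in sorted(matrix.items()):
    print(f"prompt={p:>5}  concurrency={c:>2}  kv_cache={gib:.2f} GiB")
```

Under these assumptions, KV cache grows linearly in both dimensions, so the largest prompt and concurrency pair in the lists sets the worst-case requirement.
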

Estimate Results

LLM-only

Recommended total VRAM: 13.8 GiB
Recommended per-GPU VRAM: 13.8 GiB
Effective KV tokens: 34816

Model weights: 4.10 GiB
KV cache: 4.78 GiB
Activation overhead: 1.20 GiB
Runtime buffer: 1.44 GiB
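
As a rough cross-check of the breakdown, the sketch below applies the standard KV cache size formula (two tensors, key and value, per layer per token) and sums the four components. The architecture values (36 layers, 8 KV heads, head dimension 128, 2-byte cache elements) are hypothetical: they happen to land near the KV cache figure shown, but the page does not state the model structure behind this estimate, and the function name `kv_cache_bytes` is illustrative.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache for `tokens` cached tokens.

    The factor 2 accounts for one key and one value vector per layer per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Hypothetical architecture; 34816 is the effective KV token count shown above.
kv_gib = kv_cache_bytes(
    tokens=34816, layers=36, kv_heads=8, head_dim=128, bytes_per_elem=2
) / GIB

# Component values copied from the estimate results.
breakdown_gib = {
    "Model weights": 4.10,
    "KV cache": 4.78,
    "Activation overhead": 1.20,
    "Runtime buffer": 1.44,
}
subtotal_gib = sum(breakdown_gib.values())

print(f"KV cache from formula: {kv_gib:.2f} GiB")
print(f"Sum of components:     {subtotal_gib:.2f} GiB")
```

The four components sum to about 11.5 GiB, below the 13.8 GiB recommendation, so the recommended figure appears to include extra headroom on top of the breakdown. How that headroom is derived is not stated here.
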

Common GPU tier            Memory per GPU   Covers current single-instance need
RTX 4090 24G               24 GiB           Covers it
RTX 5090 32G               32 GiB           Covers it
L40S / RTX 6000 Ada 48G    48 GiB           Covers it
H100 / H800 80G            80 GiB           Covers it
H200 141G                  141 GiB          Covers it
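
The coverage column can be reproduced with a simple comparison. This sketch assumes "Covers it" means the GPU's memory is at least the recommended per-GPU VRAM; the tool may apply a stricter rule, and the names `GPU_TIERS_GIB` and `coverage` are hypothetical.

```python
# Memory per GPU in GiB, copied from the tier table above.
GPU_TIERS_GIB = {
    "RTX 4090 24G": 24,
    "RTX 5090 32G": 32,
    "L40S / RTX 6000 Ada 48G": 48,
    "H100 / H800 80G": 80,
    "H200 141G": 141,
}


def coverage(required_gib, tiers=GPU_TIERS_GIB):
    """Map each GPU tier to True if its memory meets the per-GPU requirement."""
    return {name: mem_gib >= required_gib for name, mem_gib in tiers.items()}


# 13.8 GiB is the recommended per-GPU VRAM from the estimate results.
for name, ok in coverage(13.8).items():
    print(f"{name:<26} {'Covers it' if ok else 'Does not cover it'}")
```
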

How to use

1. Choose a model preset

Start from a preset that matches your target model size.

2. Set runtime inputs

Adjust quantization, prompt tokens, output length, and concurrency values.

3. Review the estimate

Check VRAM breakdown, GPU coverage, and optimization hints.