VRAM Estimator

Estimate model inference VRAM and review formulas, deployment matrices, and advanced structure settings.

Inputs

Update any input and the estimate will refresh automatically.

Estimate VRAM usage across different prompt-token and concurrency combinations.

Running the matrix temporarily overrides the current prompt-token and concurrency values with the lists below.
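The sweep described above can be sketched in a few lines. This page does not show the estimator's formulas, so the sketch assumes a common approximation: weight memory plus KV cache, with activations and runtime overhead ignored. The function names `estimate_vram_gib` and `run_matrix`, the parameter names, and the example model shape are all hypothetical.

```python
from itertools import product

GIB = 1024 ** 3

def estimate_vram_gib(params_b, bytes_per_param, n_layers, n_kv_heads, head_dim,
                      prompt_tokens, output_tokens, concurrency, kv_bytes=2):
    """Rough estimate (assumed formula): weights + KV cache, no activations or overhead."""
    weights = params_b * 1e9 * bytes_per_param
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * (prompt_tokens + output_tokens) * concurrency
    return (weights + kv_cache) / GIB

def run_matrix(prompt_token_list, concurrency_list, **model):
    """Evaluate every (prompt_tokens, concurrency) pair without touching the base inputs."""
    return {(p, c): estimate_vram_gib(prompt_tokens=p, concurrency=c, **model)
            for p, c in product(prompt_token_list, concurrency_list)}

# Hypothetical 8B model: 4-bit weights (0.5 bytes/param), 32 layers, 8 KV heads, head_dim 128.
model = dict(params_b=8, bytes_per_param=0.5, n_layers=32, n_kv_heads=8,
             head_dim=128, output_tokens=512)
matrix = run_matrix([1024, 4096], [1, 8], **model)
for (p, c), gib in sorted(matrix.items()):
    print(f"prompt={p:>5} concurrency={c:>2} -> {gib:.2f} GiB")
```

Because the matrix takes its own lists as arguments, the single-run prompt-token and concurrency values stay unchanged, which mirrors the temporary override described above.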

LLM-only

Common GPU tier	Memory per GPU	Covers current single-instance need
RTX 4090 24G	24 GiB	Covers it
RTX 5090 32G	32 GiB	Covers it
L40S / RTX 6000 Ada 48G	48 GiB	Covers it
H100 / H800 80G	80 GiB	Covers it
H200 141G	141 GiB	Covers it
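The coverage column amounts to comparing the estimated requirement against each tier's memory. A minimal sketch, assuming a plain "required <= capacity" rule: the `headroom` parameter and the "Does not cover it" label are hypothetical additions, and the estimator may apply a different margin.

```python
# Tier names and capacities are taken from the table above.
GPU_TIERS_GIB = {
    "RTX 4090 24G": 24,
    "RTX 5090 32G": 32,
    "L40S / RTX 6000 Ada 48G": 48,
    "H100 / H800 80G": 80,
    "H200 141G": 141,
}

def coverage(required_gib, headroom=0.0):
    """Map each tier to True if it fits the requirement.

    headroom is an assumed fraction of GPU memory kept free (0.0 = none).
    """
    return {name: required_gib <= mem * (1 - headroom)
            for name, mem in GPU_TIERS_GIB.items()}

result = coverage(30.0)
for name, ok in result.items():
    # "Does not cover it" is a hypothetical label; the table above only shows "Covers it".
    print(f"{name}\t{'Covers it' if ok else 'Does not cover it'}")
```

With a 30 GiB requirement, only the 24 GiB tier falls short. A requirement at or below 24 GiB reproduces the all-"Covers it" table shown above.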

Start from a preset that matches your target model size.

Adjust quantization, prompt tokens, output length, and concurrency values.

Check VRAM breakdown, GPU coverage, and optimization hints.