How We Estimate LLM Inference Speed
Every GPU page on GPUDojo shows an estimated tokens-per-second figure for LLM inference. This article explains the methodology behind those numbers, the formulas we use, and the caveats you should be aware of.
Two Phases of LLM Inference
LLM inference has two distinct phases, each bottlenecked by different hardware:
1. Prefill (Prompt Processing)
During prefill, the model processes your entire input prompt at once. This phase is compute-bound - it depends on the GPU's raw TFLOPS (trillions of floating-point operations per second). All tokens in the prompt are processed in parallel through matrix multiplications.
2. Generation (Token Output)
During generation, the model produces tokens one at a time. This phase is memory-bandwidth-bound - the GPU must read the entire model weights from VRAM for each token generated. The speed at which it can read those weights determines your tokens per second.
The Formulas
Generation Speed (What Most People Care About)
For autoregressive generation, each token requires reading the full model weights once. The formula is:

Generation speed (tok/s) = Memory Bandwidth (GB/s) / Model Size (GB)
The model size in memory depends on the quantization level:
- FP16 (16-bit): Parameters x 2 bytes
- Q8 (8-bit): Parameters x 1 byte
- Q4 (4-bit): Parameters x 0.5 bytes
For a 7B parameter model at Q4 quantization: 7B x 0.5 = 3.5 GB in memory.
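The size calculation above can be expressed as a small helper. This is an illustrative sketch, not GPUDojo's actual code; the function and dictionary names are hypothetical:

```python
# Bytes per parameter for each quantization level described above.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate in-memory weight size in GB.

    Ignores KV cache and runtime overhead, matching the article's
    back-of-the-envelope methodology.
    """
    return params_billions * BYTES_PER_PARAM[quant]

print(model_size_gb(7, "q4"))    # 3.5  (the 7B Q4 example above)
print(model_size_gb(8, "fp16"))  # 16.0 (an 8B model at full FP16)
```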
Prefill Speed
For prefill, the bottleneck is compute. A transformer forward pass costs roughly 2 FLOPs per parameter per token, so the formula is:

Prefill speed (tok/s) = Effective FLOPS / (2 x Parameters)
In practice, prefill is much faster than generation (often 10-50x) because the GPU can process many tokens simultaneously using its compute cores efficiently.
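As a rough sketch of the prefill estimate: the forward pass costs about 2 FLOPs per parameter per token. The 35.6 TFLOPS figure in the example is the RTX 3090's peak FP16 rate; the 0.3 compute-utilization default is an assumption for illustration, not a measured value:

```python
def prefill_tokens_per_sec(tflops: float, params_billions: float,
                           utilization: float = 0.3) -> float:
    """Estimate prefill throughput from raw compute.

    tokens/sec ~= (TFLOPS * 1e12 * utilization) / (2 * params * 1e9)
    The utilization default is an assumed fraction of peak TFLOPS.
    """
    flops_per_token = 2 * params_billions * 1e9
    return tflops * 1e12 * utilization / flops_per_token

# RTX 3090 (35.6 peak FP16 TFLOPS) prefilling an 8B model:
print(f"{prefill_tokens_per_sec(35.6, 8.0):.0f} tok/s")  # 668 tok/s
```

Even with a conservative utilization assumption, this lands far above the generation-speed estimates below, which is why prefill is rarely the bottleneck for single-user inference.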
Real Examples
RTX 3090 running Llama 3 8B at Q4
Memory Bandwidth: 936 GB/s
Model size at Q4: 8B x 0.5 = 4.0 GB
936 / 4.0 = ~234 tokens/sec (theoretical max)
Real-world with overhead: ~120-150 tok/s
Tesla P40 running Llama 3 8B at Q4
Memory Bandwidth: 347 GB/s
Model size at Q4: 8B x 0.5 = 4.0 GB
347 / 4.0 = ~87 tokens/sec (theoretical max)
Real-world with overhead: ~40-55 tok/s
Tesla M40 running Llama 3 8B at Q4
Memory Bandwidth: 288 GB/s
Model size at Q4: 8B x 0.5 = 4.0 GB
288 / 4.0 = ~72 tokens/sec (theoretical max)
Real-world with overhead: ~30-40 tok/s
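The three worked examples all apply the same bandwidth-divided-by-size calculation, which can be reproduced in a few lines (GPU names and bandwidth figures taken from the examples above):

```python
def gen_tokens_per_sec_max(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Theoretical ceiling: every generated token re-reads all weights once."""
    return bandwidth_gbs / model_size_gb

# Llama 3 8B at Q4 is ~4.0 GB of weights.
for name, bw in [("RTX 3090", 936), ("Tesla P40", 347), ("Tesla M40", 288)]:
    print(f"{name}: ~{gen_tokens_per_sec_max(bw, 4.0):.0f} tok/s max")
# RTX 3090: ~234 tok/s max
# Tesla P40: ~87 tok/s max
# Tesla M40: ~72 tok/s max
```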
Efficiency Factor
Theoretical maximums are never achieved in practice. We apply an efficiency factor to account for:
- KV cache overhead - Memory bandwidth is shared between reading weights and reading/writing the key-value cache
- Software overhead - Framework overhead from llama.cpp, vLLM, etc.
- Memory controller efficiency - GPUs rarely sustain 100% bandwidth utilization
- Quantization kernel efficiency - Dequantization adds compute overhead
We use an efficiency factor of approximately 0.45-0.65 depending on the GPU architecture and model size. Newer architectures (Ampere, Ada Lovelace) tend toward the higher end; older architectures (Maxwell, Pascal) toward the lower end.
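Applying the efficiency factor is a single multiplication on top of the theoretical ceiling. A sketch, where the 0.55 default is an assumed midpoint for a newer architecture, not a measured constant:

```python
def realistic_tok_s(bandwidth_gbs: float, model_size_gb: float,
                    efficiency: float = 0.55) -> float:
    """Theoretical max scaled by an architecture-dependent efficiency factor."""
    return efficiency * bandwidth_gbs / model_size_gb

# RTX 3090, Llama 3 8B at Q4, assumed 0.55 efficiency:
print(f"~{realistic_tok_s(936, 4.0):.0f} tok/s")  # ~129 tok/s
```

The result lands inside the ~120-150 tok/s real-world range quoted in the RTX 3090 example above.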
What About Multi-GPU?
When a model is split across multiple GPUs (layer splitting), the GPUs process their layers sequentially for each token, so bandwidths do not simply add: generation speed stays close to what a single GPU reading the full model would achieve, minus inter-GPU transfer overhead at each split point. Multi-GPU setups with NVLink reduce that overhead, but consumer and used enterprise setups typically communicate over PCIe.
Caveats
Important Limitations of These Estimates
- These are estimates, not benchmarks. Real-world performance depends on your specific software stack, model, prompt length, and system configuration.
- Context length matters. Longer contexts mean larger KV caches, which consume both VRAM and bandwidth, reducing generation speed.
- Batch size = 1. Our estimates assume single-user inference (batch size 1). Server workloads with batching will see different characteristics.
- CPU offloading changes everything. If a model doesn't fit in VRAM and layers are offloaded to system RAM, speed drops dramatically (limited by DDR4/DDR5 bandwidth, typically 40-60 GB/s).
- Thermal throttling. Passively-cooled enterprise GPUs (P40, M40) may throttle under sustained load without adequate airflow.
- Software support varies. Older GPUs may not support the latest CUDA versions or optimized kernels, reducing real-world efficiency.
How We Display Speed on GPU Pages
On each GPU detail page, we show estimated tok/s for common model sizes at Q4 quantization. These represent what you might realistically achieve with llama.cpp on a single GPU with a short context window. We round to whole numbers and err on the conservative side.
If you want to see real benchmarks for a specific GPU/model combination, we recommend checking community-submitted results on Reddit's r/LocalLLaMA or the llama.cpp GitHub discussions.