How We Estimate LLM Inference Speed

Every GPU page on GPUDojo shows an estimated tokens-per-second figure for LLM inference. This article explains the methodology behind those numbers, the formulas we use, and the caveats you should be aware of.

Two Phases of LLM Inference

LLM inference has two distinct phases, each bottlenecked by different hardware:

1. Prefill (Prompt Processing)

During prefill, the model processes your entire input prompt at once. This phase is compute-bound: it depends on the GPU's raw TFLOPS (trillions of floating-point operations per second). All tokens in the prompt are processed in parallel through matrix multiplications.

2. Generation (Token Output)

During generation, the model produces tokens one at a time. This phase is memory-bandwidth-bound: the GPU must read the entire set of model weights from VRAM for each token generated. The speed at which it can read those weights determines your tokens per second.

The Formulas

Generation Speed (What Most People Care About)

For autoregressive generation, each token requires reading the full model weights once. The formula is:

Generation speed formula:

tokens/sec = Memory Bandwidth (GB/s) / Model Size in Memory (GB)

The model size in memory depends on the quantization level:

  • FP16: 2 bytes per parameter
  • Q8: ~1 byte per parameter
  • Q4: ~0.5 bytes per parameter

For a 7B parameter model at Q4 quantization: 7B x 0.5 = 3.5 GB in memory.
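The two rules of thumb above can be sketched in a few lines of Python. The byte-per-parameter values are approximations (real llama.cpp quantization formats carry extra scale metadata), and the function names are ours, not part of any library:

```python
# Approximate bytes per parameter by quantization level (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def model_size_gb(params_billions: float, quant: str) -> float:
    """Approximate weight footprint in GB (ignores KV cache and overhead)."""
    return params_billions * BYTES_PER_PARAM[quant]

def generation_tps(bandwidth_gbs: float, size_gb: float) -> float:
    """Theoretical max tokens/sec: one full read of the weights per token."""
    return bandwidth_gbs / size_gb

print(model_size_gb(7, "q4"))                               # 3.5
print(round(generation_tps(936, model_size_gb(8, "q4"))))   # 234
```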

Prefill Speed

For prefill, the bottleneck is compute. The formula is:

Prefill speed formula:

tokens/sec = (FP16 TFLOPS x 10^12) / (2 x Parameters)

Here 2 x Parameters is the approximate FLOP count per token: each parameter contributes roughly one multiply and one add.

In practice, prefill is much faster than generation (often 10-50x) because the GPU can process many tokens simultaneously using its compute cores efficiently.
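As a sketch, the compute-bound estimate looks like this. The 16 TFLOPS figure in the example is a hypothetical round number chosen for easy arithmetic, not any particular GPU's spec:

```python
def prefill_tps(fp16_tflops: float, params_billions: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token."""
    flops_per_token = 2 * params_billions * 1e9
    return fp16_tflops * 1e12 / flops_per_token

# A hypothetical 16-TFLOPS GPU running an 8B model:
print(prefill_tps(16, 8))  # 1000.0 tokens/sec
```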

Real Examples

RTX 3090 running Llama 3 8B at Q4

Memory Bandwidth: 936 GB/s

Model size at Q4: 8B x 0.5 = 4.0 GB

936 / 4.0 = ~234 tokens/sec (theoretical max)

Real-world with overhead: ~120-150 tok/s

Tesla P40 running Llama 3 8B at Q4

Memory Bandwidth: 347 GB/s

Model size at Q4: 8B x 0.5 = 4.0 GB

347 / 4.0 = ~87 tokens/sec (theoretical max)

Real-world with overhead: ~40-55 tok/s

Tesla M40 running Llama 3 8B at Q4

Memory Bandwidth: 288 GB/s

Model size at Q4: 8B x 0.5 = 4.0 GB

288 / 4.0 = ~72 tokens/sec (theoretical max)

Real-world with overhead: ~30-40 tok/s
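The three worked examples above reduce to the same division; a short loop reproduces the theoretical maxima (bandwidth figures are the ones quoted above):

```python
# Published memory bandwidth in GB/s for the three example cards.
GPUS = {"RTX 3090": 936, "Tesla P40": 347, "Tesla M40": 288}
MODEL_GB = 8 * 0.5  # Llama 3 8B at Q4

for name, bandwidth in GPUS.items():
    print(f"{name}: ~{bandwidth / MODEL_GB:.0f} tok/s theoretical max")
```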

Efficiency Factor

Theoretical maximums are never achieved in practice. We apply an efficiency factor to account for:

  • Imperfect memory bandwidth utilization
  • Dequantization and attention compute that runs alongside weight reads
  • KV cache reads, which grow with context length
  • Framework, sampling, and kernel-launch overhead

We use an efficiency factor of approximately 0.45-0.60 depending on the GPU architecture and model size. Newer architectures (Ampere, Ada Lovelace) tend toward the higher end; older architectures (Maxwell, Pascal) toward the lower end.
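Applying that band to a theoretical maximum is a single multiplication; a minimal sketch, using the RTX 3090 figure from above:

```python
def realistic_range(theoretical_tps: float, lo: float = 0.45, hi: float = 0.60):
    """Apply the 0.45-0.60 efficiency band to a theoretical max tok/s."""
    return theoretical_tps * lo, theoretical_tps * hi

low, high = realistic_range(234)  # RTX 3090, Llama 3 8B at Q4
print(f"~{low:.0f}-{high:.0f} tok/s realistic")  # ~105-140 tok/s
```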

What About Multi-GPU?

When a model is split across multiple GPUs (typically layer-wise), each GPU reads only the layers it holds, so generation speed is roughly that of a single GPU minus inter-GPU transfer overhead at each split point. Multi-GPU setups with NVLink fare better, but consumer and used enterprise setups typically connect over PCIe, where that overhead is larger.

Caveats

Important Limitations of These Estimates

  • These are estimates, not benchmarks. Real-world performance depends on your specific software stack, model, prompt length, and system configuration.
  • Context length matters. Longer contexts mean larger KV caches, which consume both VRAM and bandwidth, reducing generation speed.
  • Batch size = 1. Our estimates assume single-user inference (batch size 1). Server workloads with batching will see different characteristics.
  • CPU offloading changes everything. If a model doesn't fit in VRAM and layers are offloaded to system RAM, speed drops dramatically (limited by DDR4/DDR5 bandwidth, typically 40-60 GB/s).
  • Thermal throttling. Passively-cooled enterprise GPUs (P40, M40) may throttle under sustained load without adequate airflow.
  • Software support varies. Older GPUs may not support the latest CUDA versions or optimized kernels, reducing real-world efficiency.
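To make the KV cache caveat concrete, its FP16 size per token is roughly 2 (K and V) x layers x KV heads x head dimension x 2 bytes. The sketch below plugs in Llama 3 8B's published architecture values (32 layers, 8 KV heads via grouped-query attention, head dimension 128); treat those numbers as assumptions:

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate FP16 KV cache size in GB for a given context length."""
    # K and V each store n_layers x n_kv_heads x head_dim values per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * bytes_per_token / 1e9

print(kv_cache_gb(8192))  # ~1.07 GB at an 8k context
```

At an 8k context this adds roughly a quarter of the 4 GB weight footprint in extra VRAM and per-token reads, which is part of why long contexts slow generation down.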

How We Display Speed on GPU Pages

On each GPU detail page, we show estimated tok/s for common model sizes at Q4 quantization. These represent what you might realistically achieve with llama.cpp on a single GPU with a short context window. We round to whole numbers and err on the conservative side.

If you want to see real benchmarks for a specific GPU/model combination, we recommend checking community-submitted results on Reddit's r/LocalLLaMA or the llama.cpp GitHub discussions.