Optimizing LLM Inference: From 500ms to 50ms
The Latency Problem
Large language models are powerful but slow. A 70B parameter model running on a single A100 GPU generates tokens at roughly 30-40 tokens per second — that is 25-33ms per token. For a 200-token response, the total generation time is 5-7 seconds. Add network overhead, preprocessing, and queueing time, and users experience 8-10 seconds of latency. For interactive applications, this is unacceptable.
Over the past year, we have optimized LLM inference pipelines for dozens of clients, reducing end-to-end latency by 10x or more. This post covers the techniques that consistently deliver the biggest improvements, ordered by implementation complexity and impact.
Quantization: The Biggest Win
Quantization reduces model weights from 16-bit floating point to lower precision — 8-bit, 4-bit, or even 2-bit. The impact is dramatic: a 70B model in FP16 requires 140 GB of GPU memory and two A100 GPUs. The same model in 4-bit (AWQ or GPTQ) requires 35 GB and fits on a single GPU, immediately halving inference cost and eliminating inter-GPU communication overhead.
Modern quantization methods preserve quality remarkably well:
# AWQ quantization with vLLM
from vllm import LLM, SamplingParams

# Load 4-bit AWQ quantized model
llm = LLM(
    model="TheBloke/Llama-2-70B-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # single GPU!
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain quantum computing:"], params)
In our benchmarks, AWQ 4-bit quantization on Llama 2 70B increases perplexity by less than 0.5% relative to FP16 while increasing throughput by 3-4x. For most production use cases, this quality tradeoff is imperceptible.
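As a sanity check on the memory arithmetic above: weight footprint is just parameter count times bits per weight. A one-liner (illustrative only — it ignores activations, the KV cache, and the small overhead of quantization scales and zero-points):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate model weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 70B-parameter model:
print(weight_memory_gb(70, 16))  # FP16  -> 140.0 GB: needs two 80 GB A100s
print(weight_memory_gb(70, 4))   # 4-bit ->  35.0 GB: fits on one GPU
```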
Continuous Batching with vLLM
Traditional serving frameworks process requests one at a time or in fixed-size batches. vLLM popularized continuous batching (also called iteration-level batching, introduced by the Orca serving system), which dynamically adds new requests to the batch at every decoding step. This eliminates the waste of waiting for the longest sequence in a batch to finish before starting new requests.
The impact is substantial: continuous batching improves throughput by 2-5x compared to static batching, especially under variable-length workloads. Combined with PagedAttention — vLLM's memory management system that handles KV cache fragmentation — you can serve 3-4x more concurrent users per GPU.
# vLLM server with continuous batching (default)
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-70B-AWQ \
    --quantization awq \
    --tensor-parallel-size 1 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 128 \
    --gpu-memory-utilization 0.92 \
    --port 8000
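To build intuition for why iteration-level scheduling helps, here is a toy simulation (illustrative only — this is not vLLM's actual scheduler) comparing the two policies on a workload that mixes short and long responses:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled at the next decode step."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1  # one decoding iteration across the whole batch
        active = [n - 1 for n in active if n > 1]
    return steps

# Alternating short (10-token) and long (100-token) responses, batch of 4:
lengths = [10, 100] * 4
print(static_batch_steps(lengths, 4))      # 200 decode steps
print(continuous_batch_steps(lengths, 4))  # far fewer: short requests stop
                                           # waiting behind long ones
```

Under static batching every short request pays for the longest sequence in its batch; under continuous batching the freed slots are reused immediately, which is exactly where the 2-5x throughput gain on variable-length workloads comes from.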
Speculative Decoding
Speculative decoding is a technique where a small, fast "draft" model generates candidate tokens, and the large "target" model verifies them in parallel. Because verification is much faster than generation (it is a single forward pass for multiple tokens), this can accelerate generation by 2-3x without any quality loss.
The key insight is that for many tokens, the draft model's prediction matches the target model's output. If the draft model is correct 70% of the time and generates 5 candidate tokens per step, a back-of-the-envelope estimate is 5 × 0.7 = 3.5 tokens per forward pass of the target model instead of 1. In practice acceptance is sequential — the first rejection discards the remaining candidates — so the realized gain is somewhat lower, but still a large multiple.
This technique works best when the draft model is well-aligned with the target model. We typically use a much smaller model from the same family (e.g., Llama 2 7B as the draft for Llama 2 70B) and fine-tune it on representative prompts to maximize the acceptance rate.
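The speedup arithmetic can be made precise. If each draft token survives verification with per-token probability p and the first rejection truncates the window, the expected number of tokens emitted per target forward pass (counting the target's own correction/bonus token) is (1 - p^(k+1)) / (1 - p) for k draft tokens. A sketch, assuming a uniform acceptance rate:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when k draft
    tokens are verified and the first rejection discards the rest.
    Equals sum(p**i for i in 0..k) = (1 - p**(k + 1)) / (1 - p)."""
    return (1 - p ** (k + 1)) / (1 - p)

# 70% per-token acceptance, 5 draft tokens per step:
print(round(expected_tokens_per_pass(0.7, 5), 2))  # 2.94 tokens per pass
```

Even the sequential-acceptance figure of ~2.9 tokens per pass is close to a 3x reduction in target-model forward passes, which is where the 2-3x end-to-end speedup comes from.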
KV Cache Optimization
The key-value cache stores attention states from previous tokens, avoiding recomputation during autoregressive generation. For long contexts and realistic batch sizes, the KV cache can consume more GPU memory than the model weights themselves. With standard multi-head attention, a 70B-scale model needs roughly 10 GB of FP16 KV cache per 4096-token sequence — a batch of just four concurrent requests already consumes about 40 GB.
Optimization strategies:
- KV cache quantization: Quantize the KV cache to FP8 or INT8 during inference. This halves KV cache memory with negligible quality impact, allowing longer contexts or more concurrent requests.
- Grouped Query Attention (GQA): Models using GQA (like Llama 2 70B) share KV heads across query heads, reducing KV cache size by 4-8x compared to multi-head attention.
- Prefix caching: For applications with common system prompts, cache the KV state of the system prompt and reuse it across requests. This eliminates redundant computation for the shared prefix, reducing time-to-first-token by 50-80% for long system prompts.
- PagedAttention: vLLM's PagedAttention manages KV cache like virtual memory with paging, eliminating fragmentation and waste. This alone can increase the number of concurrent sequences per GPU by 2-4x.
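The savings from GQA and KV-cache quantization fall out of the standard size formula: 2 (for K and V) × layers × KV heads × head dimension × sequence length × bytes per element. A sketch, using Llama-2-70B-like shapes (80 layers, head dimension 128) as illustrative assumptions:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """KV cache size: a K and a V tensor for every layer and token position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

GiB = 2**30
# 70B-scale model, 4096-token context, FP16 cache (2 bytes per element):
mha = kv_cache_bytes(80, 64, 128, 4096, 2)  # 64 KV heads (multi-head attention)
gqa = kv_cache_bytes(80, 8, 128, 4096, 2)   # 8 KV heads (GQA, as in Llama 2 70B)
print(mha / GiB)       # 10.0  GiB per sequence
print(gqa / GiB)       # 1.25  GiB per sequence -- 8x smaller with GQA
print(gqa / 2 / GiB)   # 0.625 GiB -- FP8 KV cache halves it again
```

Multiplying the effects is the point: GQA plus an FP8 cache cuts per-sequence KV memory by 16x, which translates directly into more concurrent sequences per GPU.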
Infrastructure-Level Optimizations
Beyond model-level techniques, the serving infrastructure itself offers optimization opportunities:
- Request routing: Route requests to the GPU with the most available KV cache capacity, not just the lowest load. This maximizes batch efficiency and reduces queueing.
- Model sharding: For models that require tensor parallelism, use NVLink-connected GPUs within a single node. Cross-node tensor parallelism over InfiniBand adds 100-500 µs per layer, which compounds to significant latency at 80+ layers.
- Streaming responses: Use server-sent events (SSE) to stream tokens to the client as they are generated. While this does not reduce total generation time, it dramatically improves perceived latency — users see the first token within 100-200ms instead of waiting 5+ seconds for the complete response.
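The capacity-aware routing idea fits in a few lines. A hypothetical scheduler that prefers the replica with the most free KV-cache blocks, breaking ties by queue depth (the field names here are illustrative, not a real vLLM API):

```python
def pick_replica(replicas):
    """Route to the replica with the most free KV-cache blocks;
    break ties by the shortest request queue."""
    return max(replicas, key=lambda r: (r["free_kv_blocks"], -r["queued"]))

replicas = [
    {"name": "gpu-0", "free_kv_blocks": 120, "queued": 2},
    {"name": "gpu-1", "free_kv_blocks": 800, "queued": 5},
    {"name": "gpu-2", "free_kv_blocks": 800, "queued": 1},
]
print(pick_replica(replicas)["name"])  # gpu-2: most headroom, shortest queue
```

Routing purely by queue length would have chosen gpu-2 here too, but under mixed context lengths the two policies diverge: a replica with a short queue of long-context requests can have far less KV headroom than a busier replica serving short prompts.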
Putting It All Together
Combining all these techniques, we achieved the following results for a Llama 2 70B deployment on a single H100 GPU:
# Before optimization (FP16, naive serving)
Time to first token: 480ms
Token generation: 28 tokens/sec
Concurrent users: 4
Cost per 1M tokens: $2.80
# After optimization (AWQ 4-bit, vLLM, speculative decoding, KV cache optimization)
Time to first token: 45ms
Token generation: 120 tokens/sec
Concurrent users: 32
Cost per 1M tokens: $0.35
That is a 10x reduction in latency, 4x increase in throughput, 8x more concurrent users, and 8x cost reduction — all on the same hardware. The cumulative effect of these optimizations is multiplicative, not additive. Each technique removes a different bottleneck, and together they unlock the full potential of modern GPU hardware for LLM inference.
Want help implementing these practices?