# LLM Inference Optimization Techniques
Running LLMs in production is expensive. Here's how stacking a few optimizations can cut serving costs by 70% or more; in the worked example at the end, a normalized cost of $1.00 per request drops to $0.15.
## 1. Quantization
```python
from transformers import AutoModelForCausalLM

# Load Llama-2-7B with 4-bit weights; requires the `bitsandbytes` and `accelerate` packages.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    load_in_4bit=True,
    device_map="auto",
)
```
| Precision | Weight memory (7B) | Relative speed | Quality (vs FP16) |
|---|---|---|---|
| FP16 | 14 GB | 1x | 100% |
| INT8 | 7 GB | 1.2x | 99.5% |
| INT4 | 4 GB | 1.5x | 98% |
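Newer transformers releases express the same 4-bit load through a `BitsAndBytesConfig` rather than the bare `load_in_4bit` flag. A minimal sketch, assuming a recent transformers build with bitsandbytes installed; the NF4 settings and the test prompt are illustrative choices, not the only valid ones:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"

# NF4 4-bit weights with bf16 compute; double quantization shaves off a bit more memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Quick smoke test of the quantized model.
inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```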
## 2. vLLM for Serving
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = ["Summarize the benefits of paged KV caches."]  # any list of prompt strings
outputs = llm.generate(prompts, params)
```
vLLM stores the KV cache in fixed-size blocks managed by PagedAttention, which largely eliminates cache fragmentation and enables continuous batching; the vLLM authors report up to 24x higher throughput than naive HuggingFace Transformers serving.
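Most of that throughput comes from handing the engine many requests at once and letting the scheduler batch them continuously. A rough sketch of measuring it yourself; the prompt strings and batch size here are made up for the example:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Submit a whole batch at once; PagedAttention plus continuous batching
# keep the GPU busy instead of running prompts one by one.
prompts = [f"Write a one-line summary of topic {i}." for i in range(256)]  # illustrative workload

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/s across {len(prompts)} requests")
```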
## 3. Batching Strategies
```python
import asyncio
import time

async def batch_inference(request_queue, max_wait_ms=50, max_batch=32):
    # Collect requests until the batch fills or the deadline passes, then run one generate() call.
    batch, deadline = [], time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch and (remaining := deadline - time.monotonic()) > 0:
        try:
            batch.append(await asyncio.wait_for(request_queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break
    return model.generate(batch) if batch else []  # `model` is whatever engine you loaded above
```
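A minimal driver for the batcher above; it assumes requests arrive on an `asyncio.Queue` of prompt strings and that `model` is the engine loaded earlier (both are stand-ins for this sketch):

```python
import asyncio

async def main():
    request_queue = asyncio.Queue()

    # Simulate a burst of incoming prompts (illustrative only).
    for i in range(100):
        request_queue.put_nowait(f"Request {i}")

    # Drain the queue in dynamic batches instead of one request at a time.
    while not request_queue.empty():
        results = await batch_inference(request_queue, max_wait_ms=50, max_batch=32)
        print(f"Served a batch of {len(results)} responses")

asyncio.run(main())
```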
## 4. KV Cache Optimization
```python
from vllm import LLM

# Prefix caching reuses KV cache blocks for prompts that share a common prefix
# (e.g. the same system prompt), so those tokens are not recomputed per request.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_prefix_caching=True,
)
```
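Prefix caching pays off when many requests share the same leading tokens, such as a common system prompt. A small sketch under that assumption; the ACME system prompt and questions are invented for illustration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
params = SamplingParams(temperature=0.2, max_tokens=128)

# The long shared prefix is computed once; its KV blocks are reused
# for every request that starts with the same tokens.
system_prompt = "You are a support assistant for ACME Corp. Answer concisely.\n\n"
questions = [
    "How do I reset my password?",
    "What is the refund policy?",
    "How do I export my data?",
]

outputs = llm.generate([system_prompt + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text.strip()[:80])
```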
## Results
| Optimizations applied (cumulative) | Latency | Cost (relative) |
|---|---|---|
| Baseline | 500ms | $1.00 |
| + Quantization | 350ms | $0.50 |
| + vLLM | 150ms | $0.30 |
| + Batching | 80ms | $0.15 |