Runtime choices
Evaluate local runtimes against your latency and throughput targets before committing to one. Popular options include vLLM, TGI, and TensorRT-LLM. When comparing them, weigh batching support, memory efficiency, and compatibility with your model format.
- vLLM: High-throughput serving built on PagedAttention, which pages the KV cache to reduce memory fragmentation.
- TGI: Hugging Face's Text Generation Inference server, tightly integrated with Hub-hosted models.
- TensorRT-LLM: NVIDIA's stack that compiles models into optimized engines for maximum performance on NVIDIA GPUs.
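As a quick smoke test of throughput, the sketch below uses vLLM's offline engine to measure output tokens per second over a small batch. The model name and prompt are placeholders, not recommendations; the same measurement idea applies to TGI or TensorRT-LLM behind their own serving APIs.

```python
# Minimal throughput smoke test with vLLM's offline engine.
# Model name and prompt are placeholders; swap in your own.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumes a GPU with enough VRAM
params = SamplingParams(max_tokens=128, temperature=0.0)

# Batch of identical prompts to exercise continuous batching.
prompts = ["Summarize the benefits of continuous batching."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.1f} output tokens/sec across {len(prompts)} prompts")
```

Run the same batch size and max-token settings against each candidate runtime so the numbers are comparable.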
Parity tests
Compare self-hosted model outputs against your gateway baseline using the same eval suites, and run parity checks before switching any traffic; a minimal sketch follows the checklist below.
- Use the same test prompts across both deployments.
- Compare output quality, not just latency.
- Set regression thresholds before going live.
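The sketch below shows the shape of such a parity check: identical prompts sent to both OpenAI-compatible endpoints, outputs scored, and a pre-set threshold enforced. The base URLs, model names, prompts, and the 0.9 threshold are all placeholders, and the crude string-similarity scorer stands in for your real eval suite's quality metric.

```python
# Hedged parity-check sketch: same prompts against the gateway baseline and
# the self-hosted deployment, scored with a crude similarity ratio.
import difflib

from openai import OpenAI

gateway = OpenAI(base_url="https://gateway.example.com/v1", api_key="...")   # placeholder
selfhost = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")     # placeholder

PROMPTS = ["Explain idempotency in one sentence.", "What does p99 latency mean?"]
THRESHOLD = 0.9  # regression threshold, fixed before going live

def complete(client: OpenAI, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling noise so outputs are comparable
    )
    return resp.choices[0].message.content

failures = 0
for prompt in PROMPTS:
    baseline = complete(gateway, "gateway-model", prompt)      # placeholder model name
    candidate = complete(selfhost, "my-local-model", prompt)   # placeholder model name
    score = difflib.SequenceMatcher(None, baseline, candidate).ratio()
    if score < THRESHOLD:
        failures += 1
        print(f"PARITY FAIL ({score:.2f}): {prompt!r}")

print(f"{len(PROMPTS) - failures}/{len(PROMPTS)} prompts within threshold")
```

String similarity only catches gross divergence; for quality parity, replace the scorer with the graded metrics your eval suite already produces.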
Operations monitoring
Track GPU memory, queue depth, and error rates to keep deployments healthy; a small alerting sketch follows the checklist below.
- Monitor throughput and p99 latency.
- Set alerts for GPU memory utilization above 90%.
- Track request queue depth for capacity planning.
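To make the memory alert concrete, here is a minimal sketch using NVIDIA's NVML Python bindings (the nvidia-ml-py package). The 90% threshold mirrors the bullet above, and the print statements stand in for whatever alerting pipeline you actually run (Alertmanager, PagerDuty, etc.).

```python
# Minimal GPU-memory alert check via NVML (pip install nvidia-ml-py).
import pynvml

ALERT_THRESHOLD = 0.90  # matches the 90% utilization bullet above

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        utilization = mem.used / mem.total
        if utilization > ALERT_THRESHOLD:
            # Replace this print with a call into your real alerting pipeline.
            print(f"ALERT: GPU {i} memory at {utilization:.0%} (> {ALERT_THRESHOLD:.0%})")
        else:
            print(f"GPU {i} memory at {utilization:.0%}")
finally:
    pynvml.nvmlShutdown()
```

Run a check like this on a schedule, or export the same readings as metrics so throughput, p99 latency, and queue depth land on the same dashboard.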