# Self-hosted model deployment

Run open models locally with parity checks and cost controls.

- Date: Aug 11, 2025
- Reading time: 16 min
- Level: Advanced
- Tags: Open Models, Ops, Deployment

## Takeaways

- Benchmark latency and throughput before launch.
- Run parity evals against gateway outputs.
- Monitor GPU utilization and queue depth.

## Runtime choices

Evaluate local runtimes and pick the one that meets your latency and throughput targets. Popular options include vLLM, TGI, and TensorRT-LLM. Weigh batching support, memory efficiency, and compatibility with your model format; a quick benchmarking probe is sketched at the end of this post.

- vLLM: high-throughput serving built on PagedAttention.
- TGI: Hugging Face's optimized inference server.
- TensorRT-LLM: NVIDIA's runtime, tuned for maximum performance on NVIDIA hardware.

## Parity tests

Compare self-hosted model outputs against your gateway baseline using eval suites, and run parity checks before switching traffic; a minimal harness is sketched below.

- Use the same test prompts across both deployments.
- Compare output quality, not just latency.
- Set regression thresholds before going live.

## Operations monitoring

Track GPU memory, queue depth, and error rates to keep deployments healthy; a GPU-side check is sketched below.

- Monitor throughput and p99 latency.
- Set alerts for GPU memory utilization above 90%.
- Track request queue depth for capacity planning.
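The sketches below make each stage concrete, starting with benchmarking. This is a minimal latency and throughput probe against an OpenAI-compatible endpoint such as the one vLLM exposes via `vllm serve`; the endpoint URL, model id, prompts, and concurrency level are all placeholders to adapt to your deployment, not a one-size-fits-all benchmark.

```python
# Rough latency/throughput probe for an OpenAI-compatible endpoint.
# Assumes a server is already running, e.g. `vllm serve <model>` on
# its default port; the URL, model id, and prompts are placeholders.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8000/v1/completions"   # assumed local server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder model id
PROMPTS = ["Summarize the benefits of paged attention."] * 32

def one_request(prompt: str) -> float:
    """Send one completion request and return its wall-clock latency."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128},
        timeout=60,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 concurrent clients
    latencies = sorted(pool.map(one_request, PROMPTS))
elapsed = time.perf_counter() - start

p99 = latencies[min(len(latencies) - 1, round(0.99 * len(latencies)))]  # rough with few samples
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p99 latency: {p99:.2f}s")
print(f"throughput: {len(PROMPTS) / elapsed:.1f} req/s")
```

Run the same probe against each candidate runtime with identical prompts and concurrency so the numbers stay comparable.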
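For the parity step, here is a bare-bones harness under the same assumptions: both the gateway baseline and the self-hosted candidate speak an OpenAI-style completions API, and the two URLs, the model id, and the similarity scorer are stand-ins. A real run would swap `SequenceMatcher` for your eval suite's task-level quality metrics.

```python
# Parity-check sketch: send identical prompts to the gateway baseline
# and the self-hosted candidate, score the candidate against the
# baseline, and refuse to switch traffic below the agreed threshold.
from difflib import SequenceMatcher

import requests

BASELINE_URL = "https://gateway.example.com/v1/completions"  # placeholder
CANDIDATE_URL = "http://localhost:8000/v1/completions"       # placeholder
THRESHOLD = 0.85  # regression threshold, set before going live

def complete(url: str, prompt: str) -> str:
    """Fetch one completion from an OpenAI-style endpoint."""
    resp = requests.post(
        url,
        json={"model": "my-model", "prompt": prompt,   # placeholder id
              "max_tokens": 128, "temperature": 0},    # reduce sampling noise
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

def parity_score(prompts: list[str]) -> float:
    """Mean similarity of candidate outputs to baseline outputs."""
    scores = []
    for prompt in prompts:
        baseline = complete(BASELINE_URL, prompt)
        candidate = complete(CANDIDATE_URL, prompt)
        # Crude textual similarity; replace with real eval metrics.
        scores.append(SequenceMatcher(None, baseline, candidate).ratio())
    return sum(scores) / len(scores)

if __name__ == "__main__":
    prompts = ["Explain KV-cache reuse in two sentences."]  # your eval set
    score = parity_score(prompts)
    print(f"parity score: {score:.3f}")
    assert score >= THRESHOLD, "parity regression: do not switch traffic"
```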
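Finally, a monitoring sketch for the GPU side, assuming NVIDIA hardware and the NVML Python bindings (`pip install nvidia-ml-py`). It checks per-GPU memory against the 90% alert threshold from the checklist; queue depth is runtime-specific (vLLM and TGI export Prometheus metrics for it), so it is out of scope here.

```python
# Per-GPU memory check via NVML; flags anything over the 90% alert
# threshold. Assumes NVIDIA GPUs and the nvidia-ml-py package.
import pynvml

ALERT_THRESHOLD = 0.90  # alert above 90% memory utilization

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        frac = mem.used / mem.total
        status = "ALERT" if frac > ALERT_THRESHOLD else "ok"
        print(f"gpu{i}: mem {frac:.0%} ({mem.used >> 20}/"
              f"{mem.total >> 20} MiB), compute {util.gpu}% [{status}]")
finally:
    pynvml.nvmlShutdown()
```

In production you would scrape these values into your metrics stack and alert from there rather than polling ad hoc.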