# Self-hosted model deployment

Run open models locally with parity checks and cost controls.

- Date: Aug 11, 2025
- Reading time: 16 min
- Level: Advanced
- Tags: Open Models, Ops, Deployment

## Takeaways
- Benchmark latency and throughput before launch.
- Run parity evals against gateway outputs.
- Monitor GPU utilization and queue depth.

## Runtime choices

Evaluate local runtimes and pick the one that meets your latency and throughput targets. Popular options include vLLM, TGI, and TensorRT-LLM.

Consider factors such as batching support, memory efficiency, and compatibility with your model format; a quick benchmark sketch follows the list below.

- vLLM: High-throughput serving with PagedAttention.
- TGI: Hugging Face's Text Generation Inference server.
- TensorRT-LLM: NVIDIA's library, optimized for maximum performance on NVIDIA GPUs.
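
Before committing to a runtime, a quick probe against its HTTP endpoint gives a rough read on latency and throughput. The sketch below is a minimal example, assuming the runtime is already serving an OpenAI-compatible completions endpoint on localhost (vLLM's server exposes one; other runtimes may differ); the endpoint URL, model name, and prompt set are placeholders.

```python
# Minimal latency/throughput probe for a local OpenAI-compatible endpoint.
# Assumptions: the runtime serves http://localhost:8000/v1/completions and
# the model name below matches what the server was launched with.
import statistics
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder
MODEL = "my-org/my-model"                           # placeholder
PROMPTS = ["Summarize the plot of Hamlet in one sentence."] * 32

latencies = []
start = time.perf_counter()
for prompt in PROMPTS:
    t0 = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 128},
        timeout=60,
    )
    resp.raise_for_status()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
print(f"requests/sec: {len(PROMPTS) / elapsed:.2f}")
print(f"median latency: {statistics.median(latencies):.2f}s  p99: {p99:.2f}s")
```

This loop is sequential, so it mostly measures single-request latency; for throughput numbers that reflect batching behavior, drive the endpoint with concurrent clients or a dedicated load-testing tool.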

## Parity tests

Compare self-hosted model outputs against your gateway baseline using eval suites. Run parity checks before switching traffic.

- Use the same test prompts across both deployments.
- Compare output quality, not just latency.
- Set regression thresholds before going live; a scoring sketch follows this list.
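
One way to wire these points together is a small harness that sends the same prompts to both deployments and fails if a quality score drops below a threshold. The sketch below is only an illustration: the gateway URL, model names, prompts, and the token-overlap scorer are placeholder assumptions, and it presumes both sides expose OpenAI-compatible chat endpoints. Swap in whatever metric your eval suite already uses.

```python
# Hypothetical parity check: same prompts against the gateway baseline and
# the self-hosted deployment, scored with a crude token-overlap proxy.
# Endpoint URLs, model names, and the threshold are placeholders.
import requests

GATEWAY = {
    "url": "https://gateway.example.com/v1/chat/completions",  # placeholder
    "model": "gateway-baseline",                               # placeholder
}
SELF_HOSTED = {
    "url": "http://localhost:8000/v1/chat/completions",        # placeholder
    "model": "my-org/my-model",                                 # placeholder
}
REGRESSION_THRESHOLD = 0.95  # minimum acceptable mean similarity to baseline


def complete(target, prompt):
    """Fetch a completion from an OpenAI-compatible chat endpoint."""
    resp = requests.post(
        target["url"],
        json={
            "model": target["model"],
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # greedy decoding keeps the comparison stable
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def token_overlap(a, b):
    """Crude stand-in for a real eval metric: Jaccard overlap of tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


prompts = [
    "Explain PagedAttention in two sentences.",
    "List three operational risks of self-hosting an LLM.",
]

scores = [token_overlap(complete(GATEWAY, p), complete(SELF_HOSTED, p))
          for p in prompts]
mean_score = sum(scores) / len(scores)
print(f"mean parity score: {mean_score:.3f}")
if mean_score < REGRESSION_THRESHOLD:
    raise SystemExit("Parity regression: keep traffic on the gateway.")
```

In practice the overlap scorer would be replaced by your eval suite's metric or an LLM judge, and the prompt set should be large enough for the threshold to be meaningful.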

## Operations monitoring

Track GPU memory, queue depth, and error rates to keep deployments healthy.

- Monitor throughput and p99 latency.
- Set alerts for GPU memory above 90% utilization (see the polling sketch after this list).
- Track request queue depth for capacity planning.
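
One way to feed the GPU-memory alert is a small sidecar that polls NVML and reports into your existing alerting. The sketch below is a rough example using the pynvml bindings (an assumption: the nvidia-ml-py package is installed) and hard-codes the 90% threshold from the list above; the print statements stand in for whatever alert channel you actually use.

```python
# Hypothetical monitoring loop: poll NVML for GPU memory usage and flag
# anything above 90%. Requires the nvidia-ml-py package (imported as pynvml).
import time

import pynvml

MEMORY_ALERT_THRESHOLD = 0.90  # alert when more than 90% of GPU memory is used

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, handle in enumerate(handles):
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            used_frac = mem.used / mem.total
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            print(f"gpu{i}: mem {used_frac:.0%}, util {util}%")
            if used_frac > MEMORY_ALERT_THRESHOLD:
                # Placeholder: wire this into your real alerting channel.
                print(f"ALERT: gpu{i} memory above 90%")
        time.sleep(15)
finally:
    pynvml.nvmlShutdown()
```

Queue depth is usually better taken from the serving runtime itself; vLLM and TGI typically export Prometheus metrics that include queue or waiting-request counts, so scraping those is simpler than reimplementing the measurement.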