
Self-hosted model deployment

Run open models locally with parity checks and cost controls.

Advanced · 16 min read · Aug 11, 2025
Open Models · Ops · Deployment
Key takeaways
  • Benchmark latency and throughput before launch.
  • Run parity evals against gateway outputs.
  • Monitor GPU utilization and queue depth.

Runtime choices

Evaluate local runtimes and pick the one that meets your latency and throughput targets. Popular options include vLLM, TGI, and TensorRT-LLM.

Consider factors like batching support, memory efficiency, and compatibility with your model format.

  • vLLM: High-throughput serving with PagedAttention.
  • TGI: Hugging Face's Text Generation Inference server.
  • TensorRT-LLM: NVIDIA's runtime, optimized for maximum performance on NVIDIA GPUs.
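A minimal sketch of a local launch with vLLM, assuming it is installed and the model fits on your GPUs; the model name and sampling settings are example values, not recommendations.

```python
# Minimal vLLM sketch: load a model locally and generate one completion.
# The model ID and gpu_memory_utilization value are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in your model
    gpu_memory_utilization=0.90,               # leave headroom for spikes
)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize the parity-test checklist."], params)
print(outputs[0].outputs[0].text)
```

Measure latency and throughput with your own request mix and batch sizes; the same model can behave very differently across runtimes and configurations.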

Parity tests

Compare self-hosted model outputs against your gateway baseline using eval suites. Run parity checks before switching traffic.

  • Use the same test prompts across both deployments.
  • Compare output quality, not just latency.
  • Set regression thresholds before going live.
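A parity-check sketch built on the OpenAI-compatible APIs that most gateways and self-hosted servers expose. The base URLs, model names, threshold, and the score_output helper are placeholders; substitute your actual endpoints and eval metric.

```python
# Parity-check sketch: send the same prompts to the gateway and the
# self-hosted deployment, score each pair, and enforce a regression threshold.
from openai import OpenAI

gateway = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_KEY")
local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPTS = ["Explain PagedAttention in two sentences."]  # same prompts for both
THRESHOLD = 0.95  # decide this before going live

def score_output(reference: str, candidate: str) -> float:
    """Placeholder metric; replace with your eval suite (judge model, rubric, etc.)."""
    return float(reference.strip() == candidate.strip())

scores = []
for prompt in PROMPTS:
    ref = gateway.chat.completions.create(
        model="gateway-model", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    cand = local.chat.completions.create(
        model="local-model", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    scores.append(score_output(ref, cand))

parity = sum(scores) / len(scores)
print(f"parity={parity:.2f}")
assert parity >= THRESHOLD, "Parity regression: do not switch traffic"
```

Run the suite against a frozen prompt set so results are comparable across runtime or model upgrades, and keep the scored transcripts for later debugging.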

Operations monitoring

Track GPU memory, queue depth, and error rates to keep deployments healthy.

  • Monitor throughput and p99 latency.
  • Set alerts for GPU memory >90% utilization.
  • Track request queue depth for capacity planning.
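A small sketch of the GPU-memory alert using the NVML Python bindings (installed as nvidia-ml-py); the 90% threshold comes from the checklist above, while the print-based alert is a stand-in for whatever your monitoring stack uses.

```python
# GPU-memory alert sketch: read per-GPU memory via NVML and flag anything over 90%.
import pynvml

ALERT_THRESHOLD = 0.90  # alert on GPU memory >90% utilization

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        utilization = mem.used / mem.total
        print(f"gpu{i}: {utilization:.0%} memory used")
        if utilization > ALERT_THRESHOLD:
            print(f"ALERT: gpu{i} memory above {ALERT_THRESHOLD:.0%}")
finally:
    pynvml.nvmlShutdown()
```

Export the same readings, along with queue depth and p99 latency, to your metrics system so capacity planning uses the data you already alert on.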