Optimization
Optimization that protects latency and cost
Measure every step, budget tokens, and cache stable prompts to keep the gateway fast and affordable.
Tune token budgets, caching, and concurrency to deliver reliable performance at scale.
- Define a token budget for every workflow.
- Cache stable prompts and templates.
- Retry with backoff on transient failures.
- Monitor latency and rate-limit usage.
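The retry item above is the one most often implemented incorrectly. A minimal sketch of retry with exponential backoff and full jitter, assuming a hypothetical `TransientError` standing in for 429/5xx responses from the gateway:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable gateway failure (429 or 5xx) — hypothetical."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a callable on transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Budget exhausted; surface the failure to the caller.
            # Full jitter: sleep a random amount in [0, min(max_delay, base * 2^n)]
            # so concurrent clients do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Jitter matters here: without it, a burst of clients that failed together will all retry at the same instant and trip the rate limit again.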
Token budgets
Set max tokens and summarize context to control cost.
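One way to enforce a budget is to drop the oldest conversation turns until the prompt fits. A sketch using a crude character-count token estimate (a real tokenizer should replace `rough_tokens`; the function names are illustrative, not a gateway API):

```python
def rough_tokens(text: str) -> int:
    # Crude estimate: roughly 4 characters per token. Swap in your model's
    # real tokenizer for production budgeting.
    return max(1, len(text) // 4)


def fit_to_budget(system: str, history: list[str], user: str, budget: int) -> list[str]:
    """Keep the system prompt and latest user turn; trim old history to fit."""
    used = rough_tokens(system) + rough_tokens(user)
    kept: list[str] = []
    # Walk history newest-first so the most recent turns survive trimming.
    for turn in reversed(history):
        cost = rough_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system, *reversed(kept), user]
```

Summarizing dropped turns into a single synthetic turn (rather than discarding them) is a common refinement once trimming alone loses too much context.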
Caching
Cache stable prompt prefixes and tool results.
Concurrency
Cap in-flight requests and pair retries with backoff so bursts raise throughput without tripping rate limits.
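A common way to cap in-flight requests is a semaphore around `asyncio.gather`; this sketch assumes each unit of work is an awaitable (the helper name is illustrative):

```python
import asyncio


async def bounded_gather(coros, limit: int = 8):
    """Run coroutines with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        # Each task waits for a semaphore slot before starting its request.
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Set `limit` below your rate-limit ceiling, not at it, so retries from the backoff path have headroom.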
Telemetry
Track latency, errors, and cache hit rates.
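A sketch of the minimum worth recording per call, as simple in-process counters (the class is hypothetical; in production you would export these to your metrics backend rather than hold them in memory):

```python
import time
from dataclasses import dataclass, field


@dataclass
class GatewayMetrics:
    """Per-call latency, error, and cache-hit counters."""

    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0
    cache_hits: int = 0
    cache_misses: int = 0

    def observe(self, call, cached: bool = False):
        """Time a gateway call, recording latency, errors, and cache outcome."""
        start = time.perf_counter()
        try:
            result = call()
        except Exception:
            self.errors += 1
            raise
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        if cached:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
        return result

    @property
    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0
```

Tracking cache hit rate alongside latency is what tells you whether the caching section above is actually paying for itself.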
Optimization guides
Curated recipes, playbooks, and walkthroughs for this topic area.
Eval flywheel for prompt regressions
Generate test cases, score outputs, and track regressions.
Gateway API authentication guide
Secure your Gateway API integration with proper authentication and scopes.
Streaming formats and reconnects
Event schemas, heartbeats, and reconnect logic for SSE and WebSocket.
Self-hosted model deployment
Run open models locally with parity checks and cost controls.
Optimize prompts
Tune prompt structure, few-shot examples, and token budgets for consistency.
Prompt migration guide
Move legacy prompts into the Responses API with clearer roles and tool rules.
Prompt caching 101
Reduce latency and cost with cache-safe prompt blocks.