Optimization
Optimization that protects latency and cost
Measure every step, budget tokens, and cache stable prompts to keep the gateway fast and affordable.
Tune token budgets, caching, and concurrency to deliver reliable performance at scale.
- Define a token budget for every workflow.
- Cache stable prompts and templates.
- Retry with backoff on transient failures.
- Monitor latency and rate-limit usage.
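The retry item above is the one most often implemented incorrectly. A minimal sketch of retry with exponential backoff and full jitter, assuming a hypothetical `TransientError` standing in for 429/5xx responses from the gateway:

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable gateway failure (429 or 5xx) — hypothetical."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=8.0):
    """Retry a callable on transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Budget exhausted; surface the failure to the caller.
            # Full jitter: sleep a random amount in [0, min(max_delay, base * 2^n)]
            # so concurrent clients do not retry in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Jitter matters here: without it, a burst of clients that failed together will all retry at the same instant and trip the rate limit again.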
Token budgets
Set max tokens and summarize context to control cost.
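One way to enforce a budget is to drop the oldest conversation turns until the prompt fits. A sketch using a crude character-count token estimate (a real tokenizer should replace `rough_tokens`; the function names are illustrative, not a gateway API):

```python
def rough_tokens(text: str) -> int:
    # Crude estimate: roughly 4 characters per token. Swap in your model's
    # real tokenizer for production budgeting.
    return max(1, len(text) // 4)


def fit_to_budget(system: str, history: list[str], user: str, budget: int) -> list[str]:
    """Keep the system prompt and latest user turn; trim old history to fit."""
    used = rough_tokens(system) + rough_tokens(user)
    kept: list[str] = []
    # Walk history newest-first so the most recent turns survive trimming.
    for turn in reversed(history):
        cost = rough_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system, *reversed(kept), user]
```

Summarizing dropped turns into a single synthetic turn (rather than discarding them) is a common refinement once trimming alone loses too much context.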
Caching
Cache stable prompt prefixes and tool results.
Concurrency
Cap in-flight requests and pair retries with backoff so bursts raise throughput without tripping rate limits.
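A common way to cap in-flight requests is a semaphore around `asyncio.gather`; this sketch assumes each unit of work is an awaitable (the helper name is illustrative):

```python
import asyncio


async def bounded_gather(coros, limit: int = 8):
    """Run coroutines with at most `limit` in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        # Each task waits for a semaphore slot before starting its request.
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

Set `limit` below your rate-limit ceiling, not at it, so retries from the backoff path have headroom.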
Telemetry
Track latency, errors, and cache hit rates.
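A sketch of the minimum worth recording per call, as simple in-process counters (the class is hypothetical; in production you would export these to your metrics backend rather than hold them in memory):

```python
import time
from dataclasses import dataclass, field


@dataclass
class GatewayMetrics:
    """Per-call latency, error, and cache-hit counters."""

    latencies_ms: list[float] = field(default_factory=list)
    errors: int = 0
    cache_hits: int = 0
    cache_misses: int = 0

    def observe(self, call, cached: bool = False):
        """Time a gateway call, recording latency, errors, and cache outcome."""
        start = time.perf_counter()
        try:
            result = call()
        except Exception:
            self.errors += 1
            raise
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        if cached:
            self.cache_hits += 1
        else:
            self.cache_misses += 1
        return result

    @property
    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else 0.0
```

Tracking cache hit rate alongside latency is what tells you whether the caching section above is actually paying for itself.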
Optimization guides
Curated recipes, playbooks, and walkthroughs for this topic area.
Eval flywheel for prompt regressions
Generate test cases, score outputs, and track regressions.
Gateway API authentication guide
Secure your Gateway API integration with proper authentication and scopes.
Streaming formats and reconnects
Event schemas, heartbeats, and reconnect logic for SSE and WebSocket.
Self-hosted model deployment
Run open models locally with parity checks and cost controls.
Optimize prompts
Tune prompt structure, few-shot examples, and token budgets for consistency.
Prompt migration guide
Move legacy prompts into the Responses API with clearer roles and tool rules.
Prompt caching 101
Reduce latency and cost with cache-safe prompt blocks.