
Cost vs. Latency: The Deployment Trade-off


A Series A startup recently provisioned 20 GPU instances to guarantee sub-200ms latency for their chatbot. Their bill hit $48,000/month. Six months later, they discovered that adaptive batching with 4 GPUs could achieve the same latency SLA for $6,200/month—an 87% cost reduction. The difference? Understanding the economic trade-offs between infrastructure choices.

This guide breaks down the critical decisions that pit cost against latency: reserved versus on-demand capacity, batching overhead, and overprovisioning traps. You’ll learn how to calculate break-even points, optimize autoscaling policies, and avoid the hidden costs that inflate bills while degrading performance.

The economic impact of deployment choices is severe and often invisible. Based on verified pricing data from Azure OpenAI and Google Cloud Vertex AI, here’s what’s at stake:

Provisioned vs. On-Demand Break-Even: For a workload processing 1,000 requests/minute with 500 input and 200 output tokens per request:

  • On-demand cost: $3,780/day (GPT-5 Global at $1.25/M input, $10/M output tokens)
  • Provisioned cost: $360/day (15 PTUs at $1/hour)
  • Savings: 90.5% ($3,420/day)

However, provisioned throughput units (PTUs) are billed hourly regardless of usage. If your traffic drops well below the break-even point for this configuration (roughly 95 requests/minute), you are still paying $360/day while on-demand would cost less; at 50 requests/minute, for example, on-demand runs about $189/day.

Batching Overhead Impact: Batch API provides a 50% discount on token costs across major providers, but introduces a 24-hour completion window. For real-time applications, this trade-off is often unacceptable. However, for offline processing, the savings are substantial:

  • Gemini 2.5 Pro input cost: $1.25/M tokens (standard) → $0.625/M tokens (batch)
  • Claude Sonnet 4.5 input cost: $3.00/M tokens (standard) → $1.50/M tokens (batch)

Overprovisioning Penalties: Setting max_tokens too high reserves unnecessary compute capacity. Azure OpenAI documentation confirms that even when generation is shorter, high max_tokens values increase latency for all requests by reserving GPU memory and compute slots that could serve other requests. A max_tokens value of 4096 instead of 1024 can increase p95 latency by 15-25% while providing no benefit for typical 200-500 token responses.
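As a rough illustration of right-sizing this parameter, the sketch below derives a cap from observed completion lengths; the helper and its sample data are hypothetical, not part of any provider SDK.

// Sketch: derive a max_tokens cap from observed completion lengths instead of a blanket 4096.
// Assumes you already log output token counts per request; suggestMaxTokens is an illustrative helper.
function suggestMaxTokens(outputTokenSamples: number[], headroom = 1.2): number {
  const sorted = [...outputTokenSamples].sort((a, b) => a - b);
  const p99 = sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.99))];
  // Small margin above the p99 completion length, rounded up to a multiple of 64.
  return Math.ceil((p99 * headroom) / 64) * 64;
}

// Typical 200-500 token responses usually justify a cap around 512-1024, not 4096.
const maxTokensCap = suggestMaxTokens([180, 220, 260, 310, 350, 420, 480]); // → 576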

Choosing between reserved (provisioned) and on-demand infrastructure requires understanding three variables: traffic predictability, volume, and latency requirements.

Reserved capacity is optimal when you have predictable, high-volume traffic with strict latency SLAs.

When it makes sense:

  • Sustained traffic above the break-even point (≈95 requests/minute for GPT-5 Global PTU pricing; see the calculation below)
  • Predictable diurnal patterns (e.g., business hours traffic)
  • Latency SLA below 200ms p95
  • Stable model version (no frequent switching)

Cost structure:

  • Azure OpenAI: $1/hour per PTU for GPT-5 Global
  • Monthly reservation: $260/month per unit (vs $720/month on hourly rate)
  • Minimum commitment: 15 PTUs for GPT-5 Global

Latency characteristics:

  • Dedicated capacity eliminates queue wait time
  • Consistent performance regardless of other customers
  • Can achieve 100-150ms p95 latency with proper tuning


Use this flowchart to choose your deployment strategy:

  1. Measure your baseline: Capture 2 weeks of traffic data (requests/minute, token distribution, latency requirements).
  2. Calculate break-even: Use the formula below or the code example to find your threshold.
  3. Evaluate traffic pattern: Is it predictable (provisioned) or variable (on-demand/hybrid)?
  4. Test degradation: If considering hybrid, measure PayGo latency impact during peak.
  5. Implement monitoring: Set up alerts for cost and latency anomalies.

Break-even formula:

Daily Break-Even Requests = (Provisioned Daily Cost) / (On-Demand Cost per Request)
Where:
- Provisioned Daily Cost = PTU Units × $1/hour × 24 hours
- On-Demand Cost per Request = (Input Tokens × $1.25 + Output Tokens × $10.00) / 1,000,000

For GPT-5 Global with 500 input + 200 output tokens:

  • Provisioned: 15 PTUs × $1/hour × 24 = $360/day
  • On-Demand per request: (500 × $1.25 + 200 × $10) / 1M = $0.002625
  • Break-even: 137,143 requests/day ≈ 95 requests/minute
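The same arithmetic as a small helper (a minimal sketch that just restates the formula above; the defaults are the GPT-5 Global figures used in this example):

// Break-even requests/minute: daily provisioned cost divided by daily on-demand cost of 1 RPM.
function breakEvenRpm(
  ptuCount = 15,          // minimum PTUs for GPT-5 Global
  ptuHourlyUsd = 1.0,     // $ per PTU per hour
  inputTokens = 500,
  outputTokens = 200,
  inputPricePerM = 1.25,  // $ per 1M input tokens
  outputPricePerM = 10.0  // $ per 1M output tokens
): number {
  const provisionedDaily = ptuCount * ptuHourlyUsd * 24;                    // $360/day
  const costPerRequest =
    (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1e6;  // $0.002625
  return provisionedDaily / (costPerRequest * 1440);                        // requests per minute
}

console.log(breakEvenRpm()); // ≈ 95.2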

For offline workloads, the Batch API cuts costs by 50% but allows up to 24 hours for completion. Implement a dual-queue system; the routing logic and a sketch of the queues follow below:

# Pseudo-architecture
if latency_sla > 24_hours:
    use_batch_api()  # 50% cost savings
else:
    if traffic_volume > break_even:
        use_provisioned()
    else:
        use_on_demand()
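A minimal sketch of the dual-queue idea itself, under the same assumptions; the queue names and routing function are illustrative, not a provider API:

// Jobs whose SLA tolerates the 24-hour batch window go to the batch queue (50% cheaper);
// everything else stays on the real-time path and follows the break-even decision.
type Job = { prompt: string; slaMs: number };

const BATCH_WINDOW_MS = 24 * 60 * 60 * 1000;
const realtimeQueue: Job[] = []; // served by provisioned or on-demand endpoints
const batchQueue: Job[] = [];    // flushed periodically to the Batch API

function route(job: Job, currentRpm: number, breakEvenRpm: number): 'batch' | 'provisioned' | 'on-demand' {
  if (job.slaMs >= BATCH_WINDOW_MS) {
    batchQueue.push(job);
    return 'batch';
  }
  realtimeQueue.push(job);
  // Real-time traffic: provisioned capacity above break-even, pay-as-you-go below it.
  return currentRpm > breakEvenRpm ? 'provisioned' : 'on-demand';
}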

For self-hosted models (GKE/AKS), configure HPA based on queue size rather than GPU utilization (a sketch of the corresponding HPA spec follows this list):

  • Scale-up threshold: Queue > 5 requests per pod
  • Scale-down stabilization: 5 minutes (prevents thrashing)
  • Max batch size: 32 for throughput, 8-16 for latency-sensitive
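Expressed as a Kubernetes HorizontalPodAutoscaler (v2) spec, written here as a plain TypeScript object for readability, those thresholds look roughly like the sketch below. The per-pod queue-depth metric name is an assumption; your inference server or a custom-metrics adapter has to export it.

// Sketch of an HPA v2 spec using the queue-size thresholds above.
// "inference_queue_size" is an assumed metric name exposed via a custom-metrics adapter.
const llmHpaSpec = {
  apiVersion: 'autoscaling/v2',
  kind: 'HorizontalPodAutoscaler',
  metadata: { name: 'llm-server' },
  spec: {
    scaleTargetRef: { apiVersion: 'apps/v1', kind: 'Deployment', name: 'llm-server' },
    minReplicas: 1,
    maxReplicas: 20,
    metrics: [
      {
        type: 'Pods',
        pods: {
          metric: { name: 'inference_queue_size' },            // queued requests per pod
          target: { type: 'AverageValue', averageValue: '5' }, // scale up above ~5 per pod
        },
      },
    ],
    behavior: {
      scaleDown: { stabilizationWindowSeconds: 300 }, // 5 minutes, prevents thrashing
    },
  },
};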

This tool calculates the optimal deployment strategy and simulates monthly costs for provisioned, on-demand, and hybrid configurations.

cost-latency-simulator.ts
interface DeploymentConfig {
  model: 'GPT-5' | 'GPT-4.1' | 'GPT-5-mini';
  provider: 'azure' | 'gcp';
  region: string;
}

interface TrafficProfile {
  avgRequestsPerMinute: number;
  peakMultiplier: number; // Peak = avg × multiplier
  avgInputTokens: number;
  avgOutputTokens: number;
  slaLatencyMs: number;
}

class CostLatencySimulator {
  private pricing = {
    azure: {
      'GPT-5': { input: 1.25, output: 10, ptuHourly: 1.0, minPTU: 15 },
      'GPT-4.1': { input: 2.0, output: 8, ptuHourly: 1.0, minPTU: 15 },
      'GPT-5-mini': { input: 0.25, output: 2, ptuHourly: 0.25, minPTU: 15 }
    },
    gcp: {
      'GPT-5': { input: 1.25, output: 10, ptuHourly: 1.0, minPTU: 15 }, // Placeholder
      'GPT-4.1': { input: 2.0, output: 8, ptuHourly: 1.0, minPTU: 15 },
      'GPT-5-mini': { input: 0.25, output: 2, ptuHourly: 0.25, minPTU: 15 }
    }
  };

  constructor(private config: DeploymentConfig) {}

  /** Cost per request in USD, from per-million-token prices */
  private costPerRequest(traffic: TrafficProfile): number {
    const pricing = this.pricing[this.config.provider][this.config.model];
    return (traffic.avgInputTokens * pricing.input + traffic.avgOutputTokens * pricing.output) / 1_000_000;
  }

  /**
   * Calculate monthly cost for provisioned throughput
   */
  calculateProvisionedCost(traffic: TrafficProfile, utilization: number = 1.0): number {
    const pricing = this.pricing[this.config.provider][this.config.model];
    const peakRPM = traffic.avgRequestsPerMinute * traffic.peakMultiplier;
    // Estimate PTUs needed for peak load (~3,360 tokens/minute per PTU assumed)
    const tokensPerMinute = peakRPM * (traffic.avgInputTokens + traffic.avgOutputTokens);
    const estimatedPTU = Math.max(pricing.minPTU, Math.ceil(tokensPerMinute / 3360));
    // Monthly cost (assuming 730 hours)
    const monthlyCost = estimatedPTU * pricing.ptuHourly * 730;
    // Adjust for utilization (effective cost if underutilized)
    return monthlyCost / utilization;
  }

  /**
   * Calculate monthly cost for on-demand
   */
  calculateOnDemandCost(traffic: TrafficProfile): number {
    const dailyRequests = traffic.avgRequestsPerMinute * 60 * 24;
    const dailyCost = dailyRequests * this.costPerRequest(traffic);
    return dailyCost * 30; // Monthly
  }

  /**
   * Calculate monthly cost for hybrid (provisioned baseline + on-demand burst)
   */
  calculateHybridCost(traffic: TrafficProfile, baselineUtilization: number = 0.8): number {
    const pricing = this.pricing[this.config.provider][this.config.model];
    // Provisioned handles baseline (e.g., 80% of average)
    const baselineRPM = traffic.avgRequestsPerMinute * baselineUtilization;
    const baselineTokens = baselineRPM * (traffic.avgInputTokens + traffic.avgOutputTokens);
    const baselinePTU = Math.max(pricing.minPTU, Math.ceil(baselineTokens / 3360));
    const provisionedMonthly = baselinePTU * pricing.ptuHourly * 730;
    // On-demand handles bursts (20% of average + all peaks)
    const burstRPM = traffic.avgRequestsPerMinute * (1 - baselineUtilization);
    const peakRPM = traffic.avgRequestsPerMinute * traffic.peakMultiplier;
    const totalBurstRPM = burstRPM + Math.max(0, peakRPM - baselineRPM);
    const burstRequests = totalBurstRPM * 60 * 24 * 30;
    const onDemandMonthly = burstRequests * this.costPerRequest(traffic);
    return provisionedMonthly + onDemandMonthly;
  }

  /**
   * Generate recommendation with break-even analysis
   */
  recommend(traffic: TrafficProfile): string {
    const provisioned = this.calculateProvisionedCost(traffic);
    const onDemand = this.calculateOnDemandCost(traffic);
    const hybrid = this.calculateHybridCost(traffic);
    const savings = ((onDemand - provisioned) / onDemand * 100).toFixed(1);
    const hybridSavings = ((onDemand - hybrid) / onDemand * 100).toFixed(1);
    // Break-even: monthly provisioned cost / monthly on-demand cost of 1 request per minute
    const pricing = this.pricing[this.config.provider][this.config.model];
    const breakEvenRPM = (pricing.ptuHourly * 24 * 30 * pricing.minPTU) /
      (this.costPerRequest(traffic) * 60 * 24 * 30);
    return `
## Recommendation for ${this.config.model} on ${this.config.provider.toUpperCase()}
**Monthly Costs:**
- Provisioned: $${provisioned.toFixed(0)} (saves ${savings}%)
- On-Demand: $${onDemand.toFixed(0)}
- Hybrid (80/20): $${hybrid.toFixed(0)} (saves ${hybridSavings}%)
**Break-even:** ${breakEvenRPM.toFixed(0)} requests/minute
**Current Average:** ${traffic.avgRequestsPerMinute} requests/minute
**Decision:** ${traffic.avgRequestsPerMinute > breakEvenRPM * 1.2 ? '✅ Provisioned' : traffic.avgRequestsPerMinute < breakEvenRPM * 0.8 ? '✅ On-Demand' : '⚠️ Hybrid'}
`;
  }
}

// Example usage
const simulator = new CostLatencySimulator({
  model: 'GPT-5',
  provider: 'azure',
  region: 'eastus'
});

const traffic: TrafficProfile = {
  avgRequestsPerMinute: 150,
  peakMultiplier: 3,
  avgInputTokens: 500,
  avgOutputTokens: 200,
  slaLatencyMs: 200
};

console.log(simulator.recommend(traffic));

1. Autoscaling on GPU Utilization

GPU utilization is a poor metric for LLM autoscaling because it doesn’t correlate well with inference performance. A GPU can be at 90% utilization but still have a small batch size and a short queue, resulting in low latency. Conversely, it can be at 60% utilization with a large batch and a long queue, causing high latency. Relying on GPU utilization leads to poor scaling decisions and unpredictable performance (GKE Best Practices).

2. Overprovisioning max_tokens

Setting max_tokens far above the expected response length reserves unnecessary compute capacity. Azure OpenAI documentation confirms this increases latency for all requests because the system allocates resources for the maximum possible output, even if the actual generation is much shorter. This can increase p95 latency by 15-25% without any benefit for typical 200-500 token responses (Azure OpenAI Latency).

3. Mixing Workloads on a Single Deployment

Combining different workloads (e.g., short chat responses and long summarization tasks) on a single deployment harms performance. Short calls wait behind longer completions during batching, and mixed traffic patterns reduce cache hit rates. This increases tail latency and wastes cost. Use separate deployments for distinct workloads (Azure OpenAI Latency).

4. Overlooking Content Filtering Overhead

Azure OpenAI’s content filtering system adds measurable latency (typically 50-150ms per request). For low-risk use cases like internal tools or creative writing, evaluate whether disabling or modifying content filters is an acceptable safety trade-off to improve performance (Azure OpenAI Latency).

5. Overprovisioning Provisioned Throughput

PTUs are billed hourly regardless of usage. Provisioning for peak traffic and leaving capacity idle during off-hours is a common and expensive mistake. For variable traffic, consider hybrid strategies or on-demand to avoid paying for unused capacity (Azure OpenAI Pricing).

6. Relying on the Default Scale-Down Stabilization Window

The default Kubernetes HPA scale-down stabilization window is 5 minutes. If traffic drops suddenly, pods remain provisioned for 5 minutes, incurring unnecessary cost. For LLM workloads with bursty traffic, reduce the scale-down window or use queue-based metrics to scale down faster (GKE Best Practices).

7. Relying Solely on Queue Size for Latency-Critical Workloads

While queue size is excellent for cost optimization, it cannot guarantee latency below what the maximum batch size allows. For strict latency SLAs, you must also monitor and limit batch size, since larger batches increase prefill/decode time in continuous batching systems (GKE Best Practices).

The Batch API provides a 50% discount on token costs across major providers (Azure OpenAI, Google Vertex AI). However, it has a 24-hour completion window, making it unsuitable for real-time applications. For offline processing like data enrichment or report generation, always evaluate the Batch API (Azure OpenAI Pricing).

Enabling prefix caching (available in vLLM, TGI, and some managed services) can reduce redundant computation by 30-50% for workloads with repetitive prompts (e.g., system messages, few-shot examples). This directly reduces token generation cost and latency without code changes (GKE Best Practices).

Average latency hides tail latency spikes that directly impact user experience. Always monitor p95 and p99 latency: a service with a 150ms average but 800ms p99 will feel unreliable to users. Set alerts on tail latency, not averages (Azure OpenAI Latency).
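
As a quick sketch (assuming you collect per-request latencies in milliseconds; the budget values are illustrative), the tail percentiles can be computed and alerted on directly:

// Compute p95/p99 from a window of request latencies and alert on the tail, not the mean.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function checkTailLatency(latenciesMs: number[], p95BudgetMs = 200, p99BudgetMs = 500): void {
  const p95 = percentile(latenciesMs, 95);
  const p99 = percentile(latenciesMs, 99);
  // A 150ms average can hide an 800ms p99; budgets here are illustrative.
  if (p95 > p95BudgetMs || p99 > p99BudgetMs) {
    console.warn(`Tail latency breach: p95=${p95}ms, p99=${p99}ms`);
  }
}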


Decision matrix:

| Traffic Pattern | Volume | Latency SLA | Recommended Strategy | Expected Savings |
| --- | --- | --- | --- | --- |
| Steady, predictable | > ~95 RPM (above break-even) | < 200ms | Provisioned | 50-90% vs on-demand |
| Variable, bursty | ~50-95 RPM | < 200ms | Hybrid (PTU baseline + on-demand burst) | 30-60% vs full provisioned |
| Variable, bursty | < 50 RPM | < 500ms | On-demand | 0% (most flexible) |
| Offline/batch | Any | > 24h | Batch API | 50% vs on-demand |
Break-even quick reference:

Break-Even RPM = (PTU Hourly Cost × 24 × Min PTU) / (Cost per Request × 1,440)
where Cost per Request = (Input Tokens × Input Price + Output Tokens × Output Price) / 1M

Example (GPT-5 Global, 500 input + 200 output tokens):
Cost per Request = (500 × $1.25 + 200 × $10.00) / 1M = $0.002625
Break-Even RPM = ($1 × 24 × 15) / ($0.002625 × 1,440) = $360 / $3.78 ≈ 95 RPM (≈ 137,143 requests/day)
Pricing reference:

| Model | Input/1M | Output/1M | Batch Discount | Min PTU |
| --- | --- | --- | --- | --- |
| GPT-5 Global | $1.25 | $10.00 | 50% | 15 |
| GPT-4.1 Global | $2.00 | $8.00 | 50% | 15 |
| GPT-5-mini | $0.25 | $2.00 | None | 15 |
| Gemini 2.5 Pro | $1.25 | $10.00 | 50% | N/A |
| Claude Sonnet 4.5 | $3.00 | $15.00 | 50% | N/A |
Autoscaling thresholds:

| Metric | Scale-Up Threshold | Scale-Down Threshold | Stabilization |
| --- | --- | --- | --- |
| Queue Size | 5 requests/pod | 2 requests/pod | 5 min down |
| Batch Size | 16 requests | 8 requests | 2 min down |
| GPU Utilization | 80% | 30% | 5 min down |


The cost-latency trade-off is not a fixed choice but a dynamic optimization problem. The key insight is that provisioned throughput is not expensive—it’s often misused. The startup in our opening anecdote didn’t fail because they chose provisioned capacity; they failed because they provisioned for peak without considering traffic patterns or hybrid strategies.

Three actionable takeaways:

  1. Measure first, provision second: Capture 2 weeks of traffic data before committing to PTUs. Use the break-even