import time
import tiktoken
from openai import OpenAI
from typing import Dict, List

class LatencyDiagnoser:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model
        self.encoder = tiktoken.encoding_for_model(model)

    def diagnose(self, prompt: str, max_tokens: int = 150) -> Dict:
        """Comprehensive latency diagnosis across all phases"""
        results: Dict = {}
        # Phase 1: Tokenization
        start = time.time()
        prompt_tokens = len(self.encoder.encode(prompt))
        tokenization_time = time.time() - start
        results['tokenization'] = {
            'time_ms': tokenization_time * 1000,
            'rate': prompt_tokens / tokenization_time if tokenization_time > 0 else 0
        }
        # Phase 2: API request -- stream to measure time-to-first-token
        start = time.time()
        first_token_time = None
        output_content: List[str] = []
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            stream=True,
        )
        for chunk in response:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_time is None:
                    first_token_time = time.time()
                output_content.append(chunk.choices[0].delta.content)
        api_time = time.time() - start
        output_text = ''.join(output_content)
        output_tokens = len(self.encoder.encode(output_text))
        results['api'] = {
            'total_time_ms': api_time * 1000,
            'ttft_ms': (first_token_time - start) * 1000 if first_token_time else None,
            'output_tokens': output_tokens,
            'tokens_per_second': output_tokens / api_time if api_time > 0 else 0
        }
        # Phase 3: Bottleneck Detection
        results['bottleneck'] = self._identify_bottleneck(results)
        return results

    def _identify_bottleneck(self, results: Dict) -> str:
        """Identify the primary bottleneck based on thresholds"""
        token_time = results['tokenization']['time_ms']
        api_time = results['api']['total_time_ms']
        if api_time > 2000:
            return 'api'
        if token_time > 50:
            return 'tokenization'
        return 'none'
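A quick usage sketch of the diagnoser above (the class name `LatencyDiagnoser` and the sample prompt are illustrative):

```python
# Illustrative usage; assumes OPENAI_API_KEY is set in the environment.
diagnoser = LatencyDiagnoser(model="gpt-4o-mini")
report = diagnoser.diagnose("Summarize the plot of Hamlet in two sentences.", max_tokens=100)

print("TTFT (ms):      ", report['api']['ttft_ms'])
print("Total API (ms): ", report['api']['total_time_ms'])
print("Tokens/sec:     ", report['api']['tokens_per_second'])
print("Bottleneck:     ", report['bottleneck'])
```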
<Aside type="danger" title="Avoid These Production Mistakes">
These are the most common causes of latency degradation seen in production, based on Azure OpenAI's published latency guidance:
<TabItem label="Workload Mixing">
**Problem:** Running multiple workload types on the same endpoint creates unpredictable latency.
**Why it happens:** Short calls wait for longer completions during batching, and competing workloads reduce cache hit rates [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/latency).
**Fix:** Deploy separate endpoints for different workload patterns (e.g., chat vs. batch processing).
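One way to apply this is to give each workload type its own client and deployment and route requests explicitly. A minimal sketch, where the endpoints and deployment names are placeholders for your own:

```python
from openai import AzureOpenAI

# Placeholder endpoints and deployment names; the API key is read from AZURE_OPENAI_API_KEY.
CLIENTS = {
    "chat":  AzureOpenAI(azure_endpoint="https://chat-rt.openai.azure.com", api_version="2024-06-01"),
    "batch": AzureOpenAI(azure_endpoint="https://batch-bulk.openai.azure.com", api_version="2024-06-01"),
}
DEPLOYMENTS = {"chat": "gpt-4o-mini-chat", "batch": "gpt-4o-batch"}

def complete(workload: str, prompt: str, **kwargs):
    """Route each workload type to its own endpoint and deployment."""
    client = CLIENTS[workload]
    return client.chat.completions.create(
        model=DEPLOYMENTS[workload],
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
```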
<TabItem label="Over-provisioning max_tokens">
**Problem:** Setting `max_tokens` excessively high increases latency even when generation is shorter.
**Why it happens:** The model reserves compute time for the full `max_tokens` value upfront, then releases unused quota after completion [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/latency).
**Fix:** Set `max_tokens` as low as possible. Use stop sequences to prevent over-generation.
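For example (a sketch assuming an existing `client = OpenAI()`; the specific values are illustrative, not universal recommendations):

```python
# Cap max_tokens near the longest answer you actually expect,
# and add stop sequences so the model cannot run long.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "List three causes of high LLM latency."}],
    max_tokens=120,          # just above the expected answer length
    stop=["\n\n\n", "###"],  # cut off runaway generations early
)
```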
<TabItem label="Ignoring Prompt Size">
**Problem:** Treating prompt tokens as negligible.
**Why it happens:** While each prompt token adds less time than each output token, large prompts (1000+ tokens) still significantly impact total latency [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/latency).
**Fix:** Implement prompt compression, use caching, or consider context-aware summarization.
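One lightweight form of prompt control is trimming conversation history to a token budget before each call. A rough sketch (it counts content tokens only, ignores per-message overhead, and uses an illustrative budget):

```python
import tiktoken

def trim_history(messages: list[dict], budget: int = 1000, model: str = "gpt-4o-mini") -> list[dict]:
    """Keep the system message plus the most recent turns that fit in `budget` tokens."""
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(len(enc.encode(m["content"])) for m in system)
    kept = []
    for msg in reversed(rest):                 # walk from the newest turn backwards
        cost = len(enc.encode(msg["content"]))
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))
```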
<TabItem label="Missing Content Filter Overhead">
**Problem:** Latency budgets don't account for Azure OpenAI's content filtering.
**Why it happens:** Content filtering runs classification models on both prompt and completion, adding measurable latency [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/latency).
**Fix:** For low-risk use cases, request a modified content filtering policy. Measure baseline latency with and without filtering.
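To quantify the overhead, you can compare the same prompt against two otherwise identical deployments, one with the default filter and one with an approved modified policy. A sketch (the deployment names are hypothetical):

```python
import time

def measure_p50(client, deployment: str, prompt: str, runs: int = 10) -> float:
    """Median end-to-end latency in milliseconds for a given deployment."""
    samples = []
    for _ in range(runs):
        start = time.time()
        client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=50,
        )
        samples.append((time.time() - start) * 1000)
    samples.sort()
    return samples[len(samples) // 2]

# Hypothetical deployment names for the comparison:
# default_ms  = measure_p50(azure_client, "gpt-4o-mini-default-filter", "test prompt")
# modified_ms = measure_p50(azure_client, "gpt-4o-mini-modified-filter", "test prompt")
```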
### Latency Thresholds by Phase
| Phase | Acceptable | Warning | Critical | Action |
|-------|------------|---------|----------|--------|
| Tokenization | < 10ms | 10-50ms | > 50ms | Optimize prompt, use faster tokenizer |
| API Request | < 500ms | 500-2000ms | > 2000ms | Check deployment capacity, enable streaming |
| Response Parsing | < 5ms | 5-20ms | > 20ms | Stream responses, optimize parser |
| **Total (Non-Streaming)** | < 1s | 1-3s | > 3s | Review all phases |
| **Total (Streaming TTFT)** | < 300ms | 300-800ms | > 800ms | Model upgrade, provisioned throughput |
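These thresholds are easy to encode so the diagnoser can emit a status per phase. A small sketch using the values from the table above:

```python
# (warning, critical) thresholds in milliseconds, mirroring the table above.
THRESHOLDS_MS = {
    "tokenization": (10, 50),
    "api": (500, 2000),
    "parsing": (5, 20),
}

def classify(phase: str, elapsed_ms: float) -> str:
    warning, critical = THRESHOLDS_MS[phase]
    if elapsed_ms < warning:
        return "acceptable"
    return "warning" if elapsed_ms < critical else "critical"

# classify("api", 1200.0) -> "warning"
```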
### Model Selection Cheat Sheet
<Aside type="note" title="Latency vs Cost Trade-off">
Lower-latency options may cost more per token, but they often reduce overall cost through faster processing and better user engagement.
| Model | Input Cost | Output Cost | Context | Best For |
|-------|------------|-------------|---------|----------|
| **gpt-4o-mini** | $0.15/M | $0.60/M | 128K | Fast responses, cost-sensitive apps |
| **gpt-4o** | $5.00/M | $15.00/M | 128K | Balanced performance |
| **claude-3-5-haiku** | $1.25/M | $5.00/M | 200K | Fast reasoning, moderate cost |
| **claude-3-5-sonnet** | $3.00/M | $15.00/M | 200K | Complex tasks, high quality |
*Pricing verified from [OpenAI](https://openai.com/pricing) and [Anthropic](https://docs.anthropic.com/en/docs/about-claude/models) documentation.*
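To compare models on cost as well as latency, you can turn the table into a quick per-request estimate. A sketch using the prices listed above (USD per million tokens):

```python
# (input $/M tokens, output $/M tokens), as listed in the cheat sheet above.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (5.00, 15.00),
    "claude-3-5-haiku": (1.25, 5.00),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# request_cost("gpt-4o-mini", 1_000, 150) -> 0.00024 (about $0.0002 per request)
```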
# Check Azure OpenAI deployment utilization
az monitor metrics list \
  --resource <azure-openai-resource-id> \
  --metric "Provisioned-managed Utilization V2"
# Measure streaming time-to-response
curl -w "@curl-format.txt" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}],"stream":true}' \
https://api.openai.com/v1/chat/completions