Streaming Responses: Perceived Latency Reduction & UX

A user clicks “Generate Code” and stares at a loading spinner for 8 seconds. Your LLM generates 500 tokens, but the user has already switched tabs—assuming your app is broken. Now imagine they see the first word appear in 200ms, then watch code stream in real-time. That’s the difference between a 40% bounce rate and a 95% completion rate. Streaming isn’t a nice-to-have; it’s the difference between users trusting your AI product and abandoning it.

Perceived latency directly impacts user engagement and business metrics. Research shows that users perceive responses as “instant” when the first content appears within 200ms. Without streaming, even a fast model generating 100 tokens/second feels sluggish if you wait for the full 500-token response.

The cost implications are equally significant. Modern streaming implementations support cancellation, allowing users to stop generation mid-stream, which avoids paying for tokens they will never read. For a high-volume application handling 100,000 requests/day with a 15% cancellation rate, where each cancelled request saves roughly two-thirds of a 500-token response, that works out to about $2,250/month on a model priced like Claude 3.5 Sonnet ($15/1M output tokens); see the savings formula near the end of this section.

However, implementing streaming incorrectly can increase costs and create bugs. Common mistakes like enabling proxy buffering or failing to handle errors properly can negate all benefits and even create security vulnerabilities.

For LLM streaming, Server-Sent Events (SSE) is the industry standard. SSE is a one-way HTTP stream from server to client, standardized by WHATWG as part of the HTML Living Standard. It’s simpler, proxy-friendly, and perfectly suited for read-only LLM responses where the client only receives tokens.

WebSockets provide bidirectional communication, which is overkill for most LLM streaming scenarios and can cause issues with corporate firewalls and proxy configurations.

Both OpenAI and Anthropic use SSE with specific event structures:

Anthropic Claude Event Flow:

  1. message_start - Initial message metadata
  2. content_block_start - Beginning of a content block
  3. content_block_delta - Token deltas (the actual content)
  4. content_block_stop - End of content block
  5. message_delta - Final message metadata (usage, stop reason)
  6. message_stop - Stream completion

OpenAI Event Flow:

  • data: {"choices": [{"delta": {"content": "token"}}]} - Content chunks
  • data: {"usage": {...}} - Final usage statistics
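
On the wire, both look like standard SSE frames: one or more data: lines per event (Anthropic also sends a named event: line), terminated by a blank line. The snippet below is illustrative and abbreviated — field names match the providers' documented formats, but most metadata fields are omitted.

Abbreviated SSE Wire Format
OpenAI-style stream (terminated by a [DONE] sentinel):

data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"choices":[{"delta":{"content":" world"},"index":0}]}

data: [DONE]

Anthropic-style stream (named events):

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: message_stop
data: {"type":"message_stop"}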

Actual latency is the technical measurement from request to final token. Perceived latency is the user's subjective experience of waiting, which is dominated by how quickly the first useful content appears.

Streaming optimizes for perceived latency by:

  • Immediate feedback: First token within 200-500ms
  • Progress indicators: Users see generation happening
  • Reading while generating: Users can read early tokens while later ones are still being generated
  • Early engagement: Users can react or cancel before full generation

The following examples demonstrate production-ready streaming implementations that prioritize perceived latency reduction through immediate token display.

OpenAI-Style Streaming with SSE (Python)
import os
import sys

from openai import OpenAI


def stream_llm_response(
    prompt: str,
    model: str = "gpt-4o-mini",
    max_tokens: int = 500,
    timeout: float = 30.0,
) -> None:
    """
    Stream an LLM response with proper error handling.

    Key features:
    - Real-time token display (perceived latency reduction)
    - Graceful error handling
    - Timeout protection
    - Clean resource cleanup
    """
    client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
        timeout=timeout,
    )

    try:
        # Create streaming completion
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=max_tokens,
            stream_options={"include_usage": True},
        )

        print(f"\n=== Streaming Response from {model} ===")
        print("Response: ", end="", flush=True)
        usage_printed = False

        # Process each chunk as it arrives
        for chunk in stream:
            if not chunk.choices:
                # The final chunk carries usage statistics when
                # stream_options={"include_usage": True} is set
                if chunk.usage and not usage_printed:
                    print("\n\n=== Usage Statistics ===")
                    print(f"Input tokens: {chunk.usage.prompt_tokens}")
                    print(f"Output tokens: {chunk.usage.completion_tokens}")
                    print(f"Total tokens: {chunk.usage.total_tokens}")
                    usage_printed = True
                continue

            delta = chunk.choices[0].delta
            if delta.content:
                # Print immediately - creates the perception of speed
                print(delta.content, end="", flush=True)

        print("\n")

    except Exception as e:
        print(f"\n\nError during streaming: {type(e).__name__}: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    stream_llm_response(
        prompt="Explain quantum computing in simple terms.",
        model="gpt-4o-mini",
        max_tokens=200,
    )

Streaming implementation is deceptively simple, yet teams routinely defeat its purpose through subtle misconfigurations. These pitfalls don’t just degrade UX—they actively waste money by generating tokens users never see.

Proxy Buffering (The Silent Killer)

Reverse proxies such as nginx buffer upstream responses by default, and some CDN configurations do the same: the SSE stream is held until the response completes before anything is forwarded. This completely negates perceived latency reduction.

Nginx Proxy Buffering: Wrong vs Right
# nginx configuration that defeats streaming
location /api/stream {
    proxy_pass http://backend;
    proxy_buffering on;   # ❌ BAD: Buffers entire response
    proxy_cache off;
}

# Correct configuration
location /api/stream {
    proxy_pass http://backend;
    proxy_buffering off;  # ✅ GOOD: Streams immediately
    proxy_cache off;
    proxy_read_timeout 86400;  # Keep connection alive
    proxy_http_version 1.1;
    proxy_set_header Connection '';
    chunked_transfer_encoding off;
}
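
A quick way to verify the proxy path end to end is to watch the stream from a terminal. The URL below is a placeholder for your own endpoint; curl's -N (--no-buffer) flag disables curl's output buffering so events print as they arrive. If tokens only appear in one burst at the end, something between the client and your server is still buffering.

Verifying Streaming Through the Proxy
# -N / --no-buffer: print SSE events as they arrive
curl -N https://your-domain.example/api/stream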

Incorrect SSE Headers

Missing or wrong headers cause browsers to wait for complete responses or fail to parse the stream.

SSE Response Headers
// ❌ WRONG - Missing critical headers
HTTP/1.1 200 OK
Content-Type: application/json        // Wrong type
Cache-Control: public, max-age=3600   // Response will be cached

// ✅ CORRECT - Proper SSE headers
HTTP/1.1 200 OK
Content-Type: text/event-stream       // Required for SSE
Cache-Control: no-store, no-cache     // Prevents caching
Connection: keep-alive                // Keeps connection open
X-Accel-Buffering: no                 // Nginx-specific: disable buffering
Access-Control-Allow-Origin: *        // Only if the stream is cross-origin
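
Since the server examples later in this section use FastAPI/Starlette, here is a minimal sketch of how those headers might be set there. The endpoint path and the placeholder generator are illustrative assumptions, not part of any specific codebase.

SSE Headers with FastAPI (Sketch)
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def token_events():
    # Placeholder generator; in practice this wraps the LLM token stream
    for token in ["Hello", " ", "world"]:
        yield f"data: {token}\n\n"


@app.get("/api/stream")
async def stream():
    return StreamingResponse(
        token_events(),
        media_type="text/event-stream",            # Content-Type: text/event-stream
        headers={
            "Cache-Control": "no-store, no-cache",  # Prevent caching
            "X-Accel-Buffering": "no",              # Ask nginx not to buffer
        },
    )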

Mid-Stream Errors

Once tokens have been sent, the HTTP status code is already committed, so you cannot return a standard HTTP error. Errors must be sent as SSE events.

Error Handling in Streaming
# ❌ WRONG - Breaks the stream
def stream_naive():
    try:
        for chunk in stream:
            yield chunk
        # If an error occurs mid-loop, the connection just drops
    except Exception as e:
        # Too late! Tokens already sent
        return {"error": str(e)}  # Invalid payload appended after tokens

# ✅ CORRECT - Send error as SSE event
def stream_with_error_handling():
    try:
        for chunk in stream:
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "event: complete\ndata: {}\n\n"
    except Exception as e:
        # Send error event without breaking the stream format
        yield f"event: error\ndata: {json.dumps({'message': str(e)})}\n\n"
        yield "event: complete\ndata: {}\n\n"

Buffering on the Client

Displaying tokens only after accumulating several chunks defeats the purpose.

Client-Side Token Display
// ❌ WRONG - Delays the first visible output
let buffer = "";
for await (const token of stream) {
  buffer += token;
  if (buffer.length > 50) {  // Wait for 50 chars
    display(buffer);         // User waits longer
    buffer = "";
  }
}

// ✅ CORRECT - Immediate display
for await (const token of stream) {
  display(token);          // Perceived latency reduction
  await scrollToBottom();  // Keep view updated
}

Ignoring Backpressure

Sending tokens faster than the client can render causes memory issues and jank.

Backpressure Handling
// ✅ CORRECT - Handle backpressure
const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Wait for the DOM to be ready before rendering the next token
  await new Promise(resolve => requestAnimationFrame(resolve));
  display(value);
}

No Cancellation Support

Users navigating away or clicking “stop” should immediately halt generation.

Cancellation Detection
# ❌ WRONG - Wastes tokens after the client disconnects
def handle_request(prompt):
    for token in generate(prompt):
        yield token  # Continues even if the client disconnects

# ✅ CORRECT - Check the connection regularly
def handle_request(prompt, request):
    for token in generate(prompt):
        if request.is_disconnected():
            break  # Stop generation, save tokens
        yield token

# In FastAPI/Starlette, wrap the generator in a StreamingResponse
# (imports: from fastapi import FastAPI, Request;
#           from fastapi.responses import StreamingResponse)
@app.get("/stream")
async def stream(prompt: str, request: Request):
    async def event_source():
        async for token in generate(prompt):
            if await request.is_disconnected():
                break  # Client is gone; stop paying for tokens
            yield f"data: {token}\n\n"
    return StreamingResponse(event_source(), media_type="text/event-stream")

Assuming All Providers Support Cancellation

OpenRouter and most major providers do, but some smaller ones don’t. Always verify.

Provider Capability Detection
// Check provider capabilities
const providerCapabilities = {
  'openai':        { streaming: true, cancellation: true },
  'anthropic':     { streaming: true, cancellation: true },
  'openrouter':    { streaming: true, cancellation: true },
  'some-provider': { streaming: true, cancellation: false }  // ⚠️
};

// Always wrap in try-catch for unsupported cancellation
try {
  abortController.abort();
} catch (e) {
  console.warn("Cancellation not supported by provider");
  // Still stop client-side rendering
  streamActive = false;
}

Focusing Only on Total Time

Measuring only total generation time misses the point of streaming.

| Metric              | Non-Streaming | Streaming | User Perception  |
| ------------------- | ------------- | --------- | ---------------- |
| Total Time          | 3.2s          | 3.2s      | Same             |
| Time to First Token | 3.2s          | 0.3s      | 10x faster       |
| Engagement          | 40%           | 95%       | Massive increase |

Key Metrics to Track:

  • TTFT (Time to First Token): Should be less than 500ms
  • Token Latency: Time between tokens (affects smoothness)
  • Cancellation Rate: % of users stopping early (indicates they got enough)
  • Completion Rate: % who read full response
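
TTFT is easy to instrument directly in the streaming loop. The sketch below assumes the OpenAI Python client used earlier; the same pattern (timestamp before the request, timestamp at the first content delta) applies to any provider.

Measuring TTFT and Inter-Token Latency (Sketch)
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    stream=True,
)

first_token_at = None
token_times = []

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # Time to First Token
        token_times.append(now)

print(f"TTFT: {first_token_at - start:.3f}s")
if len(token_times) > 1:
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    print(f"Avg inter-token latency: {sum(gaps) / len(gaps) * 1000:.1f}ms")
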
Production Streaming Checklist
# Infrastructure
☐ Proxy buffering: OFF
☐ SSE headers: Content-Type: text/event-stream
☐ Cache headers: no-store, no-cache
☐ Connection: keep-alive
☐ Timeout: 86400s (or appropriate long duration)
# Server Implementation
☐ Stream enabled: stream=true
☐ Error events: Sent as SSE events
☐ Cancellation: Check disconnect regularly
☐ Usage tracking: Include final chunk
☐ Backpressure: Handle client capacity
# Client Implementation
☐ Immediate display: No buffering
☐ AbortController: Implemented
☐ Error handling: Graceful degradation
☐ Auto-scroll: For chat interfaces
☐ Metrics: TTFT, token latency tracked

OpenAI/Azure

  • Supports: Streaming, cancellation, usage stats
  • Event format: data: {"choices": [{"delta": {...}}]}
  • Special: stream_options: {"include_usage": true}

Anthropic

  • Supports: Streaming, cancellation, usage stats
  • Event flow: message_start → content_block_delta → message_stop
  • Helper: client.messages.stream() provides a text_stream iterator (see the sketch below)
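
A minimal sketch of that helper, assuming the official anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model id is illustrative.

Anthropic Streaming Helper (Sketch)
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

# messages.stream() is a context manager; leaving the block closes the
# connection, which also serves as early cancellation.
with client.messages.stream(
    model="claude-3-5-sonnet-latest",  # Illustrative model id
    max_tokens=500,
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)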

OpenRouter

  • Supports: Streaming, cancellation (most providers)
  • Benefit: Single API for multiple models
  • Note: Cancellation support varies by upstream provider
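
Because OpenRouter exposes an OpenAI-compatible API, the streaming example from earlier works with only the client configuration changed. A minimal sketch; the model id below is illustrative.

Streaming via OpenRouter (Sketch)
import os

from openai import OpenAI

# Point the OpenAI client at OpenRouter's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY"),
)

stream = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # Illustrative OpenRouter model id
    messages=[{"role": "user", "content": "Explain quantum computing simply."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)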
Estimating Cancellation Savings

Monthly Savings = (Requests/Day) × 30 × (Cancellation Rate) ×
                  (Avg Tokens Saved per Cancellation) × (Cost per 1M Output Tokens) / 1,000,000

Example: 100,000 requests/day × 30 days × 15% × ~333 tokens saved × $15 / 1,000,000 ≈ $2,250/month
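
The same arithmetic as a small helper, useful as the basis for a cost calculator; the defaults mirror the example above.

Cancellation Savings Calculator (Sketch)
def monthly_cancellation_savings(
    requests_per_day: int = 100_000,
    cancellation_rate: float = 0.15,
    avg_tokens_saved: float = 333,
    cost_per_1m_output_tokens: float = 15.0,
    days_per_month: int = 30,
) -> float:
    """Estimated monthly savings (USD) from cancelling generation early."""
    cancelled_requests = requests_per_day * days_per_month * cancellation_rate
    tokens_saved = cancelled_requests * avg_tokens_saved
    return tokens_saved * cost_per_1m_output_tokens / 1_000_000


print(f"${monthly_cancellation_savings():,.2f}/month")  # ≈ $2,250/month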

[Interactive demo: side-by-side “Streaming (Perceived Fast)” and “Batch (Perceived Slow)” panels render the same response token by token versus all at once, with per-panel timing metrics, demonstrating how streaming reduces perceived latency compared to batch responses.]
