A user clicks “Generate Code” and stares at a loading spinner for 8 seconds. Your LLM generates 500 tokens, but the user has already switched tabs—assuming your app is broken. Now imagine they see the first word appear in 200ms, then watch code stream in real-time. That’s the difference between a 40% bounce rate and a 95% completion rate. Streaming isn’t a nice-to-have; it’s the difference between users trusting your AI product and abandoning it.
Perceived latency directly impacts user engagement and business metrics. UX research consistently finds that users perceive a response as “instant” when the first content appears within roughly 200ms. Without streaming, even a fast model generating 100 tokens/second feels sluggish: a 500-token response means a 5-second wait before the user sees anything.
The cost implications are equally critical. Modern streaming implementations support cancellation, allowing users to stop generation mid-stream. This prevents wasted tokens on responses they’ve already read. For a high-volume application processing 100,000 requests/day with a 15% cancellation rate and roughly 500 output tokens saved per cancelled request, that works out to about $3,400/month on a model like Claude 3.5 Sonnet ($15/1M output tokens).
However, implementing streaming incorrectly can increase costs and introduce bugs. Common mistakes, such as leaving proxy buffering enabled or failing to handle mid-stream errors, can negate all of these benefits and even create security vulnerabilities.
For LLM streaming, Server-Sent Events (SSE) is the industry standard. SSE is a one-way HTTP stream from server to client, standardized by WHATWG as part of the HTML Living Standard. It’s simpler, proxy-friendly, and perfectly suited for read-only LLM responses where the client only receives tokens.
WebSockets provide bidirectional communication, which is overkill for most LLM streaming scenarios and can cause issues with corporate firewalls and proxy configurations.
Both OpenAI and Anthropic use SSE with specific event structures:
Anthropic Claude Event Flow:
message_start - Initial message metadata
content_block_start - Beginning of a content block
content_block_delta - Token deltas (the actual content)
content_block_stop - End of content block
message_delta - Final message metadata (usage, stop reason)
message_stop - Stream completion
OpenAI Event Flow:
data: {"choices": [{"delta": {"content": "token"}}]} - Content chunks
data: {"usage": {...}} - Final usage statistics
Actual latency is the technical measurement from request start to final token.
Perceived latency is the user’s subjective experience of how long they wait, which is dominated by the time until the first visible content.
Streaming optimizes for perceived latency by:
Immediate feedback: first token within 200-500ms
Progress indicators: users see generation happening
Reading while generating: users can read early tokens while later ones are still being produced
Early engagement: users can react or cancel before the full response is generated
The following examples demonstrate production-ready streaming implementations that prioritize perceived latency reduction through immediate token display.
import os
import sys
from openai import OpenAI


def stream_llm_response(
    prompt: str,
    model: str = "gpt-4o-mini",
) -> None:
    """
    Stream LLM response with proper error handling.

    Demonstrates:
    - Real-time token display (perceived latency reduction)
    - Graceful error handling
    - Usage statistics from the final chunk
    """
    client = OpenAI(
        api_key=os.getenv("OPENAI_API_KEY"),
    )
    try:
        # Create streaming completion
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )
        print(f"\n=== Streaming Response from {model} ===")
        print("Response: ", end="", flush=True)
        usage_printed = False
        # Process each chunk as it arrives
        for chunk in stream:
            if chunk.usage and not usage_printed:
                print("\n\n=== Usage Statistics ===")
                print(f"Input tokens: {chunk.usage.prompt_tokens}")
                print(f"Output tokens: {chunk.usage.completion_tokens}")
                print(f"Total tokens: {chunk.usage.total_tokens}")
                usage_printed = True
            if chunk.choices and chunk.choices[0].delta.content:
                delta = chunk.choices[0].delta
                # Print immediately - creates perception of speed
                print(delta.content, end="", flush=True)
    except Exception as e:
        print(f"\n\nError during streaming: {type(e).__name__}: {e}", file=sys.stderr)


if __name__ == "__main__":
    stream_llm_response(
        prompt="Explain quantum computing in simple terms.",
    )
/**
 * Browser-native SSE streaming without external dependencies.
 * Demonstrates perceived latency reduction through immediate DOM updates.
 */
interface StreamCallbacks {
  onProgress?: (text: string) => void;
  onError?: (error: Event) => void;
}

class BrowserLLMStream {
  private eventSource: EventSource | null = null;
  private buffer: string = "";
  private element: HTMLElement;

  constructor(targetElementId: string) {
    const el = document.getElementById(targetElementId);
    if (!el) throw new Error(`Element #${targetElementId} not found`);
    this.element = el;
  }

  /** Connect to SSE endpoint and stream tokens to DOM */
  streamFromEndpoint(endpoint: string, prompt: string, options: StreamCallbacks = {}): void {
    // Close existing connection
    this.cancel();
    this.buffer = "";

    // Construct query parameters (EventSource only supports GET)
    const params = new URLSearchParams({ prompt });
    const url = `${endpoint}?${params.toString()}`;

    // Create EventSource connection
    this.eventSource = new EventSource(url);

    this.eventSource.onmessage = (event) => {
      try {
        // Parse token from SSE data
        const token = event.data;
        this.buffer += token;
        // Immediately update DOM (perceived latency reduction)
        this.element.textContent = this.buffer;
        // Call progress callback
        options.onProgress?.(token);
      } catch (error) {
        console.error("Failed to parse SSE message:", error);
      }
    };

    this.eventSource.onerror = (error) => {
      console.error("SSE connection error:", error);
      options.onError?.(error);
      this.cancel();
    };
  }

  /** Cancel stream and free resources */
  cancel(): void {
    this.eventSource?.close();
    this.eventSource = null;
  }

  /** Clear the display buffer */
  clear(): void {
    this.buffer = "";
    this.element.textContent = "";
  }
}

function setupStreamingUI() {
  const streamer = new BrowserLLMStream("output");
  const input = document.getElementById("prompt") as HTMLInputElement;
  const button = document.getElementById("stream-btn") as HTMLButtonElement;
  const stopButton = document.getElementById("stop-btn") as HTMLButtonElement;

  button.addEventListener("click", () => {
    const prompt = input.value.trim();
    if (!prompt) return;
    stopButton.disabled = false;
    streamer.clear();
    streamer.streamFromEndpoint("/api/stream", prompt, {
      onProgress: (token) => console.log(`Token received: ${token}`),
      onError: (error) => {
        console.error("Stream failed:", error);
        stopButton.disabled = true;
      },
    });
  });

  stopButton.addEventListener("click", () => {
    streamer.cancel();
    stopButton.disabled = true;
  });
}
export { BrowserLLMStream, setupStreamingUI };
import os
import sys
from anthropic import Anthropic


def stream_claude_response(
    prompt: str,
    model: str = "claude-sonnet-4-5",
) -> None:
    """
    Stream Claude response with proper event handling.

    Demonstrates Anthropic's SSE event structure and perceived latency benefits.
    """
    client = Anthropic(
        api_key=os.getenv("ANTHROPIC_API_KEY"),
    )
    print(f"\n=== Claude Streaming ({model}) ===")
    print("Response: ", end="", flush=True)
    try:
        with client.messages.stream(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            # Process text deltas as they arrive
            for text in stream.text_stream:
                print(text, end="", flush=True)
            # Get final message and usage
            message = stream.get_final_message()
            usage = message.usage
            print("\n\n=== Usage ===")
            print(f"Input tokens: {usage.input_tokens}")
            print(f"Output tokens: {usage.output_tokens}")
    except Exception as e:
        print(f"\n\nError: {type(e).__name__}: {e}", file=sys.stderr)


if __name__ == "__main__":
    stream_claude_response(
        prompt="Explain the benefits of streaming for user experience.",
        model="claude-sonnet-4-5",
    )
import OpenAI from "openai";

interface StreamOptions {
  onToken?: (token: string) => void;
  onComplete?: (usage?: StreamUsage) => void;
  onError?: (error: Error) => void;
}

type StreamUsage = { promptTokens: number; completionTokens: number; totalTokens: number };

export class LLMStreamer {
  private client: OpenAI;
  private abortController: AbortController | null = null;

  constructor(apiKey: string) {
    this.client = new OpenAI({
      apiKey,
      baseURL: "https://api.openai.com/v1"
    });
  }

  /**
   * Stream LLM response with cancellation support.
   * Demonstrates perceived latency reduction through immediate token display.
   */
  async stream(prompt: string, model: string = "gpt-4o-mini", options: StreamOptions = {}): Promise<void> {
    const { onToken, onComplete, onError } = options;
    // Create abort controller for cancellation (with a request timeout)
    this.abortController = new AbortController();
    const timeoutId = setTimeout(() => {
      this.abortController?.abort(new Error("Request timeout"));
    }, 60_000);

    try {
      const stream = await this.client.chat.completions.create(
        {
          model,
          messages: [{ role: "user", content: prompt }],
          stream: true,
          stream_options: { include_usage: true }
        },
        { signal: this.abortController.signal }
      );

      let usage: StreamUsage | null = null;
      for await (const chunk of stream) {
        if (chunk.choices && chunk.choices.length > 0) {
          const delta = chunk.choices[0].delta;
          // Immediate token display (perceived latency reduction)
          if (delta.content) onToken?.(delta.content);
        } else if (chunk.usage) {
          usage = {
            promptTokens: chunk.usage.prompt_tokens,
            completionTokens: chunk.usage.completion_tokens,
            totalTokens: chunk.usage.total_tokens
          };
        }
      }
      onComplete?.(usage ?? undefined);
    } catch (error) {
      onError?.(error as Error);
    } finally {
      clearTimeout(timeoutId);
    }
  }

  /** Cancel the ongoing stream */
  cancel(): void {
    this.abortController?.abort(new Error("Stream cancelled by user"));
  }
}
Streaming implementation is deceptively simple, yet teams routinely defeat its purpose through subtle misconfigurations. These pitfalls don’t just degrade UX—they actively waste money by generating tokens users never see.
Proxy Buffering (The Silent Killer)
Reverse proxies like nginx or Cloudflare buffer SSE responses by default, waiting for complete responses before sending anything. This completely negates perceived latency reduction.
# ❌ nginx configuration that defeats streaming
location /api/stream {
    proxy_pass http://backend;
    proxy_buffering on;             # BAD: Buffers entire response
}

# ✅ nginx configuration that streams immediately
location /api/stream {
    proxy_pass http://backend;
    proxy_buffering off;            # GOOD: Streams immediately
    proxy_http_version 1.1;         # Pairs with the empty Connection header below
    proxy_set_header Connection '';
    proxy_read_timeout 86400;       # Keep connection alive
    chunked_transfer_encoding off;
}
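A quick way to confirm nothing in the path is still buffering is to time when chunks actually arrive at a client. The sketch below is a diagnostic assumption rather than part of the article’s stack: it uses the requests library against the hypothetical /api/stream endpoint used throughout this piece, and simply checks whether data lines arrive spread over time or all at once.

import time
import requests

def check_streaming(url: str = "http://localhost:8000/api/stream?prompt=hello") -> None:
    """Print chunk arrival timing; near-identical timestamps suggest buffering."""
    start = time.perf_counter()
    arrivals = []
    with requests.get(url, stream=True, timeout=90) as response:
        for line in response.iter_lines(decode_unicode=True):
            if line and line.startswith("data: "):
                arrivals.append(time.perf_counter() - start)
    if not arrivals:
        print("No SSE data lines received")
        return
    print(f"First chunk after {arrivals[0]:.2f}s, last after {arrivals[-1]:.2f}s")
    if arrivals[-1] - arrivals[0] < 0.05:
        print("All chunks arrived at once: something in the path is buffering")
    else:
        print("Chunks arrived over time: streaming works end to end")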
Incorrect SSE Headers
Missing or wrong headers cause browsers to wait for complete responses or fail to parse the stream.
// ❌ WRONG - Missing critical headers
Content-Type: application/json        // Wrong type
Cache-Control: public, max-age=3600   // Will be cached

// ✅ CORRECT - Proper SSE headers
Content-Type: text/event-stream       // Required for SSE
Cache-Control: no-store, no-cache     // Prevents caching
Connection: keep-alive                // Keeps connection open
X-Accel-Buffering: no                 // Nginx-specific: disable buffering
Access-Control-Allow-Origin: *        // If cross-origin
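If the backend is FastAPI (or any Starlette app), these headers can be attached to a StreamingResponse. The sketch below is an assumption-laden illustration rather than a drop-in endpoint: the /api/stream route and the placeholder token_generator stand in for your real LLM streaming code.

import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_generator(prompt: str):
    # Placeholder: in production this would wrap the provider's streaming API
    for token in f"Echoing: {prompt}".split():
        yield f"data: {token}\n\n"            # one SSE frame per token
        await asyncio.sleep(0.05)             # simulated generation cadence
    yield "event: complete\ndata: {}\n\n"     # explicit completion event

@app.get("/api/stream")
async def stream(prompt: str):
    return StreamingResponse(
        token_generator(prompt),
        media_type="text/event-stream",             # Required for SSE
        headers={
            "Cache-Control": "no-store, no-cache",  # Prevents caching
            "Connection": "keep-alive",             # Keeps connection open
            "X-Accel-Buffering": "no",              # Nginx-specific: disable buffering
        },
    )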
Mid-Stream Errors
Once tokens are sent, you cannot send a standard HTTP error. Errors must be sent as SSE events.
# ❌ WRONG - Breaks the stream
def stream_naive(prompt):
    try:
        for chunk in generate(prompt):
            yield f"data: {json.dumps(chunk)}\n\n"
    except Exception as e:
        # Too late! Tokens already sent, so the connection just drops
        return {"error": str(e)}  # Not valid SSE after tokens

# ✅ CORRECT - Send error as SSE event
def stream_with_error_handling(prompt):
    try:
        for chunk in generate(prompt):
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "event: complete\ndata: {}\n\n"
    except Exception as e:
        # Send error event without breaking stream format
        yield f"event: error\ndata: {json.dumps({'message': str(e)})}\n\n"
        yield "event: complete\ndata: {}\n\n"
Buffering on the Client
Displaying tokens only after accumulating several chunks defeats the purpose.
// ❌ WRONG - Delays perceived latency
let buffer = "";
for await (const token of stream) {
  buffer += token;
  if (buffer.length > 50) {  // Wait for 50 chars
    display(buffer);         // User waits longer
    buffer = "";
  }
}

// ✅ CORRECT - Immediate display
for await (const token of stream) {
  display(token);            // Perceived latency reduction
  await scrollToBottom();    // Keep view updated
}
Ignoring Backpressure
Sending tokens faster than the client can render causes memory issues and jank.
// ✅ CORRECT - Handle backpressure
const reader = stream.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  display(value);
  // Wait for the next frame before pulling the next token
  await new Promise(resolve => requestAnimationFrame(resolve));
}
No Cancellation Support
Users navigating away or clicking “stop” should immediately halt generation.
# ❌ WRONG - Wastes tokens on cancellation
def handle_request(prompt):
    for token in generate(prompt):
        yield token  # Continues even if client disconnects

# ✅ CORRECT - Check for disconnects regularly (FastAPI / Starlette)
async def stream(prompt: str, request: Request):
    async for token in generate(prompt):
        if await request.is_disconnected():
            break  # Stop generation, save tokens
        yield f"data: {token}\n\n"
Assuming All Providers Support Cancellation
OpenRouter and most major providers do, but some smaller ones don’t. Always verify.
// Check provider capabilities
const providerCapabilities = {
  'openai': { streaming: true, cancellation: true },
  'anthropic': { streaming: true, cancellation: true },
  'openrouter': { streaming: true, cancellation: true },
  'some-provider': { streaming: true, cancellation: false } // ⚠️
};

// Always wrap in try-catch for unsupported cancellation
try {
  streamer.cancel();
} catch (error) {
  console.warn("Cancellation not supported by provider");
  // Still stop client-side rendering even if the provider keeps generating
}
Focusing Only on Total Time
Measuring only total generation time misses the point of streaming.
| Metric | Non-Streaming | Streaming | User Perception |
| --- | --- | --- | --- |
| Total Time | 3.2s | 3.2s | Same |
| Time to First Token | 3.2s | 0.3s | 10x faster |
| Engagement | 40% | 95% | Massive increase |
Key Metrics to Track (a measurement sketch follows the list):
TTFT (Time to First Token): should be under 500ms
Token Latency: time between tokens (affects smoothness)
Cancellation Rate: % of users stopping early (often a sign they already got what they needed)
Completion Rate: % of users who read the full response
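As one way to capture these numbers, the sketch below times a single streamed OpenAI request and reports TTFT plus the average gap between content chunks. The metric names and the plain-dict output are illustrative assumptions, not a specific monitoring integration.

import os
import time
from openai import OpenAI

def measure_stream_metrics(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Measure TTFT and inter-token latency for one streamed request."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    start = time.perf_counter()
    first_token_at: float | None = None
    gaps: list[float] = []          # time between consecutive content chunks
    last = start

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now       # TTFT reference point
            else:
                gaps.append(now - last)    # inter-token latency sample
            last = now

    return {
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at is not None else None,
        "avg_token_gap_ms": (sum(gaps) / len(gaps)) * 1000 if gaps else None,
        "total_ms": (time.perf_counter() - start) * 1000,
    }

With these metrics in place, the checklist below summarizes the server, stream, and client requirements covered above.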
☐ SSE headers: Content-Type: text/event-stream
☐ Cache headers: no-store, no-cache
☐ Timeout: 86400s (or appropriate long duration)
☐ Stream enabled: stream=true
☐ Error events: Sent as SSE events
☐ Cancellation: Check disconnect regularly
☐ Usage tracking: Include final chunk
☐ Backpressure: Handle client capacity
☐ Immediate display: No buffering
☐ AbortController: Implemented
☐ Error handling: Graceful degradation
☐ Auto-scroll: For chat interfaces
☐ Metrics: TTFT, token latency tracked
OpenAI/Azure
Supports: Streaming, cancellation, usage stats
Event format: data: {"choices": [{"delta": {...}}]}
Special: stream_options: {"include_usage": true}
Anthropic
Supports: Streaming, cancellation, usage stats
Event flow: message_start → content_block_delta → message_stop
Helper: client.messages.stream() provides text_stream iterator
OpenRouter
Supports: Streaming, cancellation (most providers)
Benefit: Single API for multiple models
Note: Cancellation support varies by upstream provider
Monthly Savings = (Requests/Day) × (Days/Month) × (Cancellation Rate) ×
(Avg Output Tokens Saved per Cancel) × (Cost per 1M tokens) / 1,000,000
Example: 100,000 requests/day × 30 days × 15% × 500 tokens × $15/1M ≈ $3,375/month
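The same formula as a small Python helper makes the arithmetic easy to audit or adapt; the function name and default values are illustrative and simply mirror the example above.

def monthly_cancellation_savings(
    requests_per_day: int = 100_000,
    cancellation_rate: float = 0.15,
    avg_tokens_saved: int = 500,
    cost_per_1m_output_tokens: float = 15.0,
    days_per_month: int = 30,
) -> float:
    """Estimate monthly savings from stopping generation on cancelled requests."""
    tokens_saved = requests_per_day * days_per_month * cancellation_rate * avg_tokens_saved
    return tokens_saved / 1_000_000 * cost_per_1m_output_tokens

print(f"${monthly_cancellation_savings():,.2f}/month")  # ≈ $3,375.00/month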
This interactive widget demonstrates how streaming reduces perceived latency compared to batch responses.
<div id="streaming-demo" style="font-family: monospace; padding: 1rem; border: 1px solid #333;">
<div style="margin-bottom: 1rem;">
<button id="start-stream">Start Streaming</button>
<button id="start-batch" disabled>Batch Mode</button>
<button id="stop" disabled>Stop</button>
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 1rem;">
<strong>Streaming (Perceived Fast)</strong>
<div id="stream-output" style="min-height: 100px; background: #1a1a1a; padding: 0.5rem; margin-top: 0.5rem;"></div>
<div id="stream-metrics" style="font-size: 0.8rem; color: #888; margin-top: 0.5rem;"></div>
<strong>Batch (Perceived Slow)</strong>
<div id="batch-output" style="min-height: 100px; background: #1a1a1a; padding: 0.5rem; margin-top: 0.5rem;"></div>
<div id="batch-metrics" style="font-size: 0.8rem; color: #888; margin-top: 0.5rem;"></div>
const streamOutput = document.getElementById('stream-output');
const batchOutput = document.getElementById('batch-output');
const streamMetrics = document.getElementById('stream-metrics');
const batchMetrics = document.getElementById('batch-metrics');
const startStreamBtn = document.getElementById('start-stream');