How Real User LLM Performance Metrics Work

A visual guide to TTFW, streaming WPS, end-to-end WPS, and how they're measured.

By Ritesh Maheshwari

The request lifecycle

Every conversation breaks into three time segments (the wait before the first word, the streaming window, and the total round trip), plus a smoothness signal derived from the stream

TTFW
Time to First Word
Latency from when you hit send to the very first word appearing on screen. Perceived "responsiveness".
Streaming Speed
wordsPerSecond (WPS)
How fast words appear once streaming starts. Measured only over the streaming window — not penalised by TTFW latency. wordCount / streamingTime
TTLW
Time to Last Word
Total wall-clock time from send to the full response being complete. What the user actually waited.
Smoothness
longestStallMs
The longest freeze between chunks during streaming. A high WPS with a large stall still feels broken. Lower is better.
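The three segments reduce to simple subtraction over timestamps. A minimal TypeScript sketch, assuming three timestamps captured from page events (the interface and names are illustrative, not the page's actual instrumentation):

```typescript
// Timestamps captured at the rendering layer, e.g. via performance.now().
// Names are illustrative assumptions, not a fixed API.
interface PageEvents {
  sentAt: number;      // user hit send (ms)
  firstWordAt: number; // first word painted on screen
  lastWordAt: number;  // final word painted on screen
}

function timeSegments(e: PageEvents) {
  const ttfw = e.firstWordAt - e.sentAt; // Time to First Word
  const ttlw = e.lastWordAt - e.sentAt;  // Time to Last Word
  const streamingTime = ttlw - ttfw;     // generation window
  return { ttfw, ttlw, streamingTime };
}

// 800 ms to the first word, 2.8 s total → a 2 s streaming window.
const s = timeSegments({ sentAt: 0, firstWordAt: 800, lastWordAt: 2800 });
// s = { ttfw: 800, ttlw: 2800, streamingTime: 2000 }
```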

All six metrics at a glance

Latency
TTFW
How quickly the model starts responding. Lower is better — a model that "thinks" for a long time before writing will have a high TTFW even if it streams fast afterward.
measured directly from page events
Total Duration
TTLW
Total round-trip time from prompt submission to full response. The number a user feels most directly — "how long did I wait?"
measured directly from page events
Generation Window
streamingTime
The pure generation phase — excludes the initial wait. A short streaming duration combined with a high TTFW means the model thought for a while, then dumped words quickly.
TTLW − TTFW
Response Size
wordCount
Number of words in the model's response. Critical for interpreting speed metrics — a model with 10 WPS on a 5-word reply is very different from 10 WPS on a 500-word essay.
counted from response text
Streaming Speed
wordsPerSecond (WPS)
How fast the model generates once it starts. Measures raw throughput during the streaming window only — not affected by initial latency. High WPS + high TTFW = fast model, slow to start.
wordCount / (streamingTime / 1000)
Effective Throughput
endToEndWordsPerSecond
The user-perceived throughput — words delivered per second of total wait time. Penalises high TTFW. A model with 50 streaming WPS but 5s TTFW will have a much lower E2E WPS.
wordCount / (TTLW / 1000)
Streaming Smoothness
longestStallMs
The longest pause between consecutive chunks during streaming. Even a fast average WPS can feel janky if the stream freezes mid-response. Lower is better — ideally under 200 ms.
max inter-chunk gap during streaming
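All six metrics fall out of one pass over a trace of rendered chunks. A TypeScript sketch, assuming each chunk records the timestamp (ms) at which its text appeared on screen; the shape and names are illustrative, not the page's actual instrumentation:

```typescript
// A chunk of response text and the moment it was rendered.
interface Chunk { at: number; text: string }

// Derives all six metrics from a send timestamp and a non-empty
// chunk trace (assumed to have at least two chunks so the
// streaming window is non-zero).
function metricsFromTrace(sentAt: number, chunks: Chunk[]) {
  const ttfw = chunks[0].at - sentAt;
  const ttlw = chunks[chunks.length - 1].at - sentAt;
  const streamingTime = ttlw - ttfw;

  // Assumed counting rule: whitespace-separated words.
  const fullText = chunks.map(c => c.text).join("");
  const wordCount = fullText.trim().split(/\s+/).filter(Boolean).length;

  const wordsPerSecond = wordCount / (streamingTime / 1000);
  const endToEndWordsPerSecond = wordCount / (ttlw / 1000);

  // Worst single freeze between consecutive rendered chunks.
  let longestStallMs = 0;
  for (let i = 1; i < chunks.length; i++) {
    longestStallMs = Math.max(longestStallMs, chunks[i].at - chunks[i - 1].at);
  }

  return {
    ttfw, ttlw, streamingTime, wordCount,
    wordsPerSecond, endToEndWordsPerSecond, longestStallMs,
  };
}
```

For example, chunks rendered at 1000 ms, 1100 ms, and 1900 ms carrying four words give TTFW 1000 ms, a 900 ms streaming window, and an 800 ms longest stall.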

vs. standard API metrics

How these differ from TTFT, ITL, and TPS

The standard LLM performance vocabulary — Time to First Token (TTFT), Inter-Token Latency (ITL), and Tokens Per Second (TPS) — measures throughput at the infrastructure boundary: the API endpoint, the model server, the network edge. The metrics on this page measure at the rendering layer inside the browser, capturing what the person in front of the screen actually perceives. These are Real User Metrics (RUM), not infrastructure metrics.

API / infra metric · Real-user equivalent · What the gap captures
TTFT
Time to First Token
Measured at the HTTP stream
TTFW
Time to First Word
Measured when text appears on screen
Chunk buffering, markdown parsing, React/DOM render cycles, and browser paint frames. A token arriving at the network layer is not the same as a word appearing in the UI. TTFW is always ≥ TTFT.
ITL
Inter-Token Latency
Avg. gap between successive tokens
longestStallMs
Worst-case freeze during streaming
Measured between rendered word chunks
ITL is an average; it masks the single long freeze that breaks perceived smoothness. A 20 ms average ITL with one 800 ms spike feels broken — but ITL alone won't tell you that. longestStallMs surfaces exactly that worst case.
TPS
Tokens Per Second
Model throughput at the API
wordsPerSecond
Visible words per second (streaming only)
Measured in the rendering layer
Tokens and words are not the same unit. On average ~1.3 tokens make one word, but code, URLs, and non-English text skew this heavily. WPS counts what the user reads. TPS also doesn't account for frontend buffering — the UI may batch multiple tokens into a single DOM update, making the visible rate lower than the raw token rate.
TPS (E2E variant)
Sometimes: total tokens / total time
Blended rate including TTFT wait
endToEndWordsPerSecond
Words per second of total wait
Penalises TTFW latency explicitly
Same intent, but measured at the UI layer in words, not tokens, making it directly interpretable by product teams without needing to know a model's tokenizer. The word unit also scales intuitively with response length as the user experiences it.
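The ITL-vs-stall distinction is easy to see in code. A short TypeScript sketch with invented gap values: forty 20 ms inter-chunk gaps plus one 800 ms freeze still average out to about 39 ms, while the max surfaces the spike the user actually felt.

```typescript
// Inter-chunk gaps in ms: smooth streaming with one long freeze.
// Values are illustrative, not measured data.
const gaps: number[] = [...Array(40).fill(20), 800];

// The ITL-style view: an average over all gaps.
const averageGap = gaps.reduce((a, b) => a + b, 0) / gaps.length;

// The longestStallMs view: the single worst gap.
const longestStallMs = Math.max(...gaps);

// averageGap ≈ 39 ms — looks smooth on paper
// longestStallMs = 800 ms — the freeze the user notices
```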
Measurement layer
TTFT and TPS are infrastructure signals — useful for model operators and SREs. TTFW and WPS are product signals — useful for anyone making decisions about user experience.
The rendering gap is real
A chatbot UI that re-renders on every token will feel faster than one that batches updates every 100 ms — even with identical TTFT and TPS. Only real-user measurement captures this difference.
Words, not tokens
Reporting in words makes metrics model-agnostic and human-readable. You don't need to know GPT-4's BPE tokenizer or Claude's vocabulary to compare "15 words/s vs 22 words/s" across providers.
Averages hide jank
ITL is an average over all inter-token gaps. longestStallMs deliberately surfaces the worst gap, because a single long freeze is what users notice — not the median smoothness they never consciously register.
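The word-based unit can be sketched in a few lines of TypeScript. The whitespace-split counting rule is an assumption (the page does not specify its exact rule), and the conversion helper just applies the rough ~1.3 tokens-per-word average quoted above, which code, URLs, and non-English text will skew:

```typescript
// Count the unit the user actually reads. Assumed rule:
// whitespace-separated words, ignoring leading/trailing space.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Illustrative TPS → WPS ballpark under the ~1.3 tokens/word average.
const approxWps = (tps: number): number => tps / 1.3;

countWords("The quick brown fox"); // → 4
approxWps(26);                     // ≈ 20 words/s
```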

The key distinction

wordsPerSecond vs endToEndWordsPerSecond — why both matter

Streaming WPS
Measures the model's engine
Calculated only over the streaming window. If the model waited 4 seconds then streamed 200 words in 2 seconds, this metric says 100 w/s — fast generation engine.
wordCount / streamingSeconds
vs
End-to-End WPS
Measures the user's experience
Calculated over the entire wait including TTFW latency. Same scenario — 200 words, 6 seconds total — gives just 33 w/s. This is what you actually feel.
wordCount / totalSeconds
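The two-sided scenario above, written out as TypeScript arithmetic:

```typescript
// 4 s of TTFW, then 200 words streamed over 2 s.
const ttfwMs = 4000;
const streamingMs = 2000;
const wordCount = 200;

// The engine's view: only the streaming window counts.
const streamingWps = wordCount / (streamingMs / 1000);           // 100 w/s

// The user's view: the initial wait counts too.
const endToEndWps = wordCount / ((ttfwMs + streamingMs) / 1000); // ≈ 33 w/s
```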

Interactive example

Adjust the sliders to see how metrics relate

Move TTFW or word count and watch the derived metrics update instantly.

Derived readouts: TTLW, streaming WPS, E2E WPS, and the WPS / E2E ratio.
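What the sliders compute can be sketched as a pure function (the signature is illustrative). Holding the streaming window and word count fixed while growing TTFW inflates TTLW, drags E2E WPS down, and pushes the WPS / E2E ratio further above 1:

```typescript
// Derived metrics from the three slider inputs.
function derived(ttfwMs: number, streamingMs: number, wordCount: number) {
  const ttlwMs = ttfwMs + streamingMs;
  const streamingWps = wordCount / (streamingMs / 1000);
  const e2eWps = wordCount / (ttlwMs / 1000);
  // Ratio of 1 means no start-up penalty; higher means TTFW dominates.
  return { ttlwMs, streamingWps, e2eWps, ratio: streamingWps / e2eWps };
}

// Doubling TTFW leaves streaming WPS untouched but raises the ratio:
derived(1000, 4000, 120); // ratio 1.25
derived(2000, 4000, 120); // ratio 1.5
```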

What drives each metric