How Real User LLM Performance Metrics Work

A visual guide to TTFW, streaming WPS, end-to-end WPS, and how they're measured.

By Ritesh Maheshwari

The request lifecycle

Every conversation breaks into three time segments (the wait before the first word, the streaming window, and the total round trip), plus a smoothness signal derived from the stream

TTFW
Time to First Word
Latency from when you hit send to the very first word appearing on screen. Perceived "responsiveness".
Streaming Speed
wordsPerSecond (WPS)
How fast words appear once streaming starts. Measured only over the streaming window — not penalised by TTFW latency. wordCount / streamingTime
TTLW
Time to Last Word
Total wall-clock time from send to the full response being complete. What the user actually waited.
Smoothness
longestStallMs
The longest freeze between chunks during streaming. A high WPS with a large stall still feels broken. Lower is better.
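The three segments reduce to simple subtraction over timestamps. A minimal TypeScript sketch, assuming three timestamps captured from page events (the interface and names are illustrative, not the page's actual instrumentation):

```typescript
// Timestamps captured at the rendering layer, e.g. via performance.now().
// Names are illustrative assumptions, not a fixed API.
interface PageEvents {
  sentAt: number;      // user hit send (ms)
  firstWordAt: number; // first word painted on screen
  lastWordAt: number;  // final word painted on screen
}

function timeSegments(e: PageEvents) {
  const ttfw = e.firstWordAt - e.sentAt; // Time to First Word
  const ttlw = e.lastWordAt - e.sentAt;  // Time to Last Word
  const streamingTime = ttlw - ttfw;     // generation window
  return { ttfw, ttlw, streamingTime };
}

// 800 ms to the first word, 2.8 s total → a 2 s streaming window.
const s = timeSegments({ sentAt: 0, firstWordAt: 800, lastWordAt: 2800 });
// s = { ttfw: 800, ttlw: 2800, streamingTime: 2000 }
```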

All six metrics at a glance

Latency
TTFW
How quickly the model starts responding. Lower is better — a model that "thinks" for a long time before writing will have a high TTFW even if it streams fast afterward.
measured directly from page events
Total Duration
TTLW
Total round-trip time from prompt submission to full response. The number a user feels most directly — "how long did I wait?"
measured directly from page events
Generation Window
streamingTime
The pure generation phase — excludes the initial wait. A short streaming duration combined with a high TTFW means the model thought for a while, then dumped words quickly.
TTLW − TTFW
Response Size
wordCount
Number of words in the model's response. Critical for interpreting speed metrics — a model with 10 WPS on a 5-word reply is very different from 10 WPS on a 500-word essay.
counted from response text
Streaming Speed
wordsPerSecond (WPS)
How fast the model generates once it starts. Measures raw throughput during the streaming window only — not affected by initial latency. High WPS + high TTFW = fast model, slow to start.
wordCount / (streamingTime / 1000)
Effective Throughput
endToEndWordsPerSecond
The user-perceived throughput — words delivered per second of total wait time. Penalises high TTFW. A model with 50 streaming WPS but 5s TTFW will have a much lower E2E WPS.
wordCount / (TTLW / 1000)
Streaming Smoothness
longestStallMs
The longest pause between consecutive chunks during streaming. Even a fast average WPS can feel janky if the stream freezes mid-response. Lower is better — ideally under 200 ms.
max inter-chunk gap during streaming
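All six metrics fall out of one pass over a trace of rendered chunks. A TypeScript sketch, assuming each chunk records the timestamp (ms) at which its text appeared on screen; the shape and names are illustrative, not the page's actual instrumentation:

```typescript
// A chunk of response text and the moment it was rendered.
interface Chunk { at: number; text: string }

// Derives all six metrics from a send timestamp and a non-empty
// chunk trace (assumed to have at least two chunks so the
// streaming window is non-zero).
function metricsFromTrace(sentAt: number, chunks: Chunk[]) {
  const ttfw = chunks[0].at - sentAt;
  const ttlw = chunks[chunks.length - 1].at - sentAt;
  const streamingTime = ttlw - ttfw;

  // Assumed counting rule: whitespace-separated words.
  const fullText = chunks.map(c => c.text).join("");
  const wordCount = fullText.trim().split(/\s+/).filter(Boolean).length;

  const wordsPerSecond = wordCount / (streamingTime / 1000);
  const endToEndWordsPerSecond = wordCount / (ttlw / 1000);

  // Worst single freeze between consecutive rendered chunks.
  let longestStallMs = 0;
  for (let i = 1; i < chunks.length; i++) {
    longestStallMs = Math.max(longestStallMs, chunks[i].at - chunks[i - 1].at);
  }

  return {
    ttfw, ttlw, streamingTime, wordCount,
    wordsPerSecond, endToEndWordsPerSecond, longestStallMs,
  };
}
```

For example, chunks rendered at 1000 ms, 1100 ms, and 1900 ms carrying four words give TTFW 1000 ms, a 900 ms streaming window, and an 800 ms longest stall.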

vs. standard API metrics

How these differ from TTFT, ITL, and TPS

The standard LLM performance vocabulary — Time to First Token (TTFT), Inter-Token Latency (ITL), and Tokens Per Second (TPS) — measures throughput at the infrastructure boundary: the API endpoint, the model server, the network edge. The metrics on this page measure at the rendering layer inside the browser, capturing what the person in front of the screen actually perceives. These are Real User Metrics (RUM), not infrastructure metrics.

API / infra metric · Real-user equivalent · What the gap captures
TTFT
Time to First Token
Measured at the HTTP stream
TTFW
Time to First Word
Measured when text appears on screen
Chunk buffering, markdown parsing, React/DOM render cycles, and browser paint frames. A token arriving at the network layer is not the same as a word appearing in the UI. TTFW is always ≥ TTFT.
ITL
Inter-Token Latency
Avg. gap between successive tokens
longestStallMs
Worst-case freeze during streaming
Measured between rendered word chunks
ITL is an average; it masks the single long freeze that breaks perceived smoothness. A 20 ms average ITL with one 800 ms spike feels broken — but ITL alone won't tell you that. longestStallMs surfaces exactly that worst case.
TPS
Tokens Per Second
Model throughput at the API
wordsPerSecond
Visible words per second (streaming only)
Measured in the rendering layer
Tokens and words are not the same unit. On average ~1.3 tokens make one word, but code, URLs, and non-English text skew this heavily. WPS counts what the user reads. TPS also doesn't account for frontend buffering — the UI may batch multiple tokens into a single DOM update, making the visible rate lower than the raw token rate.
TPS (E2E variant)
Sometimes: total tokens / total time
Blended rate including TTFT wait
endToEndWordsPerSecond
Words per second of total wait
Penalises TTFW latency explicitly
Same intent, but measured at the UI layer in words, not tokens, making it directly interpretable by product teams without needing to know a model's tokenizer. The word unit also scales intuitively with response length as the user experiences it.
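The ITL-vs-stall distinction is easy to see in code. A short TypeScript sketch with invented gap values: forty 20 ms inter-chunk gaps plus one 800 ms freeze still average out to about 39 ms, while the max surfaces the spike the user actually felt.

```typescript
// Inter-chunk gaps in ms: smooth streaming with one long freeze.
// Values are illustrative, not measured data.
const gaps: number[] = [...Array(40).fill(20), 800];

// The ITL-style view: an average over all gaps.
const averageGap = gaps.reduce((a, b) => a + b, 0) / gaps.length;

// The longestStallMs view: the single worst gap.
const longestStallMs = Math.max(...gaps);

// averageGap ≈ 39 ms — looks smooth on paper
// longestStallMs = 800 ms — the freeze the user notices
```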
Measurement layer
TTFT and TPS are infrastructure signals — useful for model operators and SREs. TTFW and WPS are product signals — useful for anyone making decisions about user experience.
The rendering gap is real
A chatbot UI that re-renders on every token will feel faster than one that batches updates every 100 ms — even with identical TTFT and TPS. Only real-user measurement captures this difference.
Words, not tokens
Reporting in words makes metrics model-agnostic and human-readable. You don't need to know GPT-4's BPE tokenizer or Claude's vocabulary to compare "15 words/s vs 22 words/s" across providers.
Averages hide jank
ITL is an average over all inter-token gaps. longestStallMs deliberately surfaces the worst gap, because a single long freeze is what users notice — not the median smoothness they never consciously register.
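The word-based unit can be sketched in a few lines of TypeScript. The whitespace-split counting rule is an assumption (the page does not specify its exact rule), and the conversion helper just applies the rough ~1.3 tokens-per-word average quoted above, which code, URLs, and non-English text will skew:

```typescript
// Count the unit the user actually reads. Assumed rule:
// whitespace-separated words, ignoring leading/trailing space.
function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Illustrative TPS → WPS ballpark under the ~1.3 tokens/word average.
const approxWps = (tps: number): number => tps / 1.3;

countWords("The quick brown fox"); // → 4
approxWps(26);                     // ≈ 20 words/s
```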

The key distinction

wordsPerSecond vs endToEndWordsPerSecond — why both matter

Streaming WPS
Measures the model's engine
Calculated only over the streaming window. If the model waited 4 seconds then streamed 200 words in 2 seconds, this metric says 100 w/s — fast generation engine.
wordCount / streamingSeconds
vs
End-to-End WPS
Measures the user's experience
Calculated over the entire wait including TTFW latency. Same scenario — 200 words, 6 seconds total — gives just 33 w/s. This is what you actually feel.
wordCount / totalSeconds
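The two-sided scenario above, written out as TypeScript arithmetic:

```typescript
// 4 s of TTFW, then 200 words streamed over 2 s.
const ttfwMs = 4000;
const streamingMs = 2000;
const wordCount = 200;

// The engine's view: only the streaming window counts.
const streamingWps = wordCount / (streamingMs / 1000);           // 100 w/s

// The user's view: the initial wait counts too.
const endToEndWps = wordCount / ((ttfwMs + streamingMs) / 1000); // ≈ 33 w/s
```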

Interactive example

Adjust the sliders to see how metrics relate

Move TTFW or word count and watch the derived metrics update instantly.

Derived readouts: TTLW, streaming WPS, E2E WPS, and the WPS / E2E ratio.
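What the sliders compute can be sketched as a pure function (the signature is illustrative). Holding the streaming window and word count fixed while growing TTFW inflates TTLW, drags E2E WPS down, and pushes the WPS / E2E ratio further above 1:

```typescript
// Derived metrics from the three slider inputs.
function derived(ttfwMs: number, streamingMs: number, wordCount: number) {
  const ttlwMs = ttfwMs + streamingMs;
  const streamingWps = wordCount / (streamingMs / 1000);
  const e2eWps = wordCount / (ttlwMs / 1000);
  // Ratio of 1 means no start-up penalty; higher means TTFW dominates.
  return { ttlwMs, streamingWps, e2eWps, ratio: streamingWps / e2eWps };
}

// Doubling TTFW leaves streaming WPS untouched but raises the ratio:
derived(1000, 4000, 120); // ratio 1.25
derived(2000, 4000, 120); // ratio 1.5
```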

What drives each metric