Whether LLM agents can reliably serve as synthetic users in usability
studies rests on an assumption that has yet to be systematically
tested: that agent behavior meaningfully mirrors how real humans
experience websites. We present FlowBench, the first
benchmark for measuring behavioral similarity between LLM agents and
human users on real-world web tasks.
FlowBench comprises 10 diverse websites spanning
e-commerce and SaaS domains, each paired with a website-specific task
completed by 20 human participants under structured
usability protocols. Similarity is evaluated across three dimensions:
System Usability Scale (SUS) scores, stepwise
Single Ease Question (SEQ) trajectories, and
think-aloud transcripts compared via cosine similarity of
semantic embeddings. These three measures are aggregated into a unified
FlowBench score, where 1.0 indicates perfect
behavioral alignment.
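To make the aggregation concrete, a minimal sketch follows, assuming equal
weighting of the three dimensions and simple range normalization for SUS
(scored 0-100) and SEQ (rated 1-7); the function names, weights, and
normalizations here are illustrative assumptions, not the benchmark's
specified implementation.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flowbench_score(sus_human: float, sus_agent: float,
                    seq_human: list[float], seq_agent: list[float],
                    emb_human: np.ndarray, emb_agent: np.ndarray) -> float:
    """Hypothetical equal-weight aggregation of the three dimensions."""
    # SUS similarity: 1 minus the normalized score gap (SUS range 0-100).
    sus_sim = 1.0 - abs(sus_human - sus_agent) / 100.0

    # SEQ similarity: mean per-step agreement on the 1-7 scale,
    # assuming aligned trajectories of equal length.
    seq_h = np.asarray(seq_human, dtype=float)
    seq_a = np.asarray(seq_agent, dtype=float)
    seq_sim = float(np.mean(1.0 - np.abs(seq_h - seq_a) / 6.0))

    # Think-aloud similarity: cosine similarity of transcript embeddings
    # (typically near-nonnegative for sentence embeddings).
    ta_sim = cosine_similarity(emb_human, emb_agent)

    # Equal-weight mean; 1.0 corresponds to perfect alignment.
    return (sus_sim + seq_sim + ta_sim) / 3.0
```

Under this assumed scheme, each dimension contributes a value in [0, 1]
(given nonnegative embedding similarity), so the aggregate likewise lies in
[0, 1], with 1.0 as perfect behavioral alignment.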
We instantiate the benchmark using
OpenFlo
as the reference agent harness and report preliminary results,
establishing a reproducible foundation for evaluating the fidelity of
LLM agents as proxies for human usability testers.