FlowBench

UCL Nexus Labs · University College London

Benchmarking LLM agent behavior against human usability patterns on the web.

Wee Joe Tan · Zi Rui Lucas Lim · Shashank Durgad · Karim Obegi · Aiden Yiliu Li

Abstract

Whether LLM agents can reliably serve as synthetic users in usability studies rests on an assumption that has yet to be systematically tested: that agent behavior meaningfully mirrors how real humans experience websites. We present FlowBench, the first benchmark for measuring behavioral similarity between LLM agents and human users on real-world web tasks.

FlowBench comprises 10 diverse websites spanning e-commerce and SaaS domains, each paired with a website-specific task completed by 20 human participants under structured usability protocols. Similarity is evaluated across three dimensions: System Usability Scale (SUS) scores, step-wise Single Ease Question (SEQ) trajectories, and Think Aloud transcripts assessed via semantic embedding cosine similarity. These are aggregated into a unified FlowBench score, where 1.0 indicates perfect behavioral alignment.

We instantiate the benchmark using OpenFlo as the reference agent harness and report preliminary results, establishing a reproducible foundation for evaluating the fidelity of LLM agents as proxies for human usability testers.

Design

Benchmark

Existing benchmarks for web agents—WebArena, Mind2Web—evaluate functional task completion. They measure whether an agent achieves a goal, not whether it experiences the interface the way a human would. A system can be functionally correct yet deeply frustrating to human users, and an agent that completes tasks efficiently may still fail to surface the friction points that matter most.

FlowBench is distinct in measuring the process of interaction—how difficulty is distributed across steps, how reasoning is verbalized—rather than only whether a terminal goal is reached.

| Feature | WebArena | UXAgent | FlowBench |
| --- | --- | --- | --- |
| Goal | Task completion accuracy | Automated UX reporting | Human-agent similarity |
| Human Baseline | No | No | Yes (20 participants) |
| Similarity Metric | None | None | SUS + SEQ + Think Aloud |
| Perception | DOM-based | DOM-based | Visual (agent-agnostic) |
| Persona Matching | No | No | Yes (1-to-1) |

Method

Evaluation Protocol

01

Human Study

Twenty participants each complete all 10 tasks across the 10 benchmark websites. Each session is screen-recorded. Participants follow a concurrent Think Aloud protocol, verbalizing intentions, confusions, and observations. After each step, participants provide a step-wise SEQ rating (1–7). Upon task completion, they complete the 10-item SUS questionnaire.
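The 10-item SUS questionnaire in the step above is conventionally scored with the standard SUS rule (odd-numbered items contribute response − 1, even-numbered items contribute 5 − response, and the sum is scaled by 2.5 to a 0–100 range). A minimal sketch of that standard scoring, not FlowBench-specific code:

```python
def sus_score(responses):
    """Standard System Usability Scale score (0-100) from ten 1-5
    Likert responses. Odd-numbered items are positively worded and
    contribute (response - 1); even-numbered items are negatively
    worded and contribute (5 - response); the sum is scaled by 2.5."""
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = 0
    for item, r in enumerate(responses, start=1):
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5

# Fully positive response pattern scores 100; all-neutral scores 50.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
print(sus_score([3] * 10))  # 50.0
```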

02

Persona Construction

Each participant completes a demographic and technology-use questionnaire, used to construct a structured persona profile capturing digital literacy, browsing habits, and domain familiarity. These profiles initialize matched agent instances for one-to-one comparison.
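As an illustration of how such a profile might feed the matched agent, here is a hypothetical persona structure; the field names and the prompt-rendering helper are invented for this sketch and are not the actual FlowBench schema:

```python
# Hypothetical persona profile capturing the three facets described
# above: digital literacy, browsing habits, and domain familiarity.
# Field names are illustrative, not the real FlowBench schema.
persona = {
    "participant_id": "P07",
    "digital_literacy": "intermediate",
    "browsing_habits": {"hours_per_day": 3, "primary_device": "laptop"},
    "domain_familiarity": {"e-commerce": "frequent", "saas": "occasional"},
}

def to_system_prompt(p):
    """Render the profile into a system-prompt fragment used to
    initialize the matched agent instance (step 03)."""
    return (
        f"You are simulating participant {p['participant_id']}: "
        f"digital literacy {p['digital_literacy']}, roughly "
        f"{p['browsing_habits']['hours_per_day']}h/day of browsing "
        f"on a {p['browsing_habits']['primary_device']}, "
        f"{p['domain_familiarity']['e-commerce']} e-commerce user."
    )

print(to_system_prompt(persona))
```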

03

Agent Evaluation

For each human participant with persona profile π_i, a corresponding agent instance is initialized via OpenFlo with a system prompt encoding π_i. The agent completes the same task on the same website using visual grounding—not DOM parsing—producing a step-wise SEQ trace, a post-task SUS score, and a Think Aloud transcript.

04

Similarity Scoring

For each matched human-agent pair, similarity is computed across three dimensions and aggregated into a single FlowBench score. Aggregate benchmark performance is reported as the mean across all 20 pairs and all 10 websites.

Metrics

Three Dimensions of Similarity

FlowBench defines behavioral similarity across three complementary dimensions, aggregated with equal weighting into a single score where 1.0 = perfect alignment.

S_SUS

SUS Similarity

Normalized absolute difference between human and agent SUS scores. Captures whether the agent's overall usability judgment matches the human's. A score of 1.0 indicates identical SUS assessments.

S_SUS(p_i, a_i) = 1 − |SUS_p − SUS_a| / 100

S_SEQ

SEQ Trajectory Similarity

Pearson correlation between human and agent step-wise SEQ sequences, normalized to [0, 1]. Captures whether the agent experiences friction at the same interaction steps as the human, independent of absolute score magnitude.

S_SEQ(p_i, a_i) = (1 + r(SEQ_p, SEQ_a)) / 2

S_TA

Think Aloud Similarity

Cosine similarity between human and agent Think Aloud transcript embeddings, computed using a pre-trained sentence embedding model with mean-pooled step-wise representations.

S_TA(p_i, a_i) = cos(e_p, e_a)

Unified Score

FlowBench(p_i, a_i) = (S_SUS + S_SEQ + S_TA) / 3
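The three dimensions and their equal-weight aggregate can be sketched directly from the formulas above. A minimal pure-Python reference, assuming equal-length, non-constant SEQ traces and nonzero embedding vectors:

```python
import math

def s_sus(sus_p, sus_a):
    """S_SUS = 1 - |SUS_p - SUS_a| / 100 (SUS scores lie in [0, 100])."""
    return 1.0 - abs(sus_p - sus_a) / 100.0

def s_seq(seq_p, seq_a):
    """S_SEQ = (1 + r) / 2, where r is the Pearson correlation between
    step-wise SEQ traces (assumed equal length and non-constant)."""
    n = len(seq_p)
    mp, ma = sum(seq_p) / n, sum(seq_a) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(seq_p, seq_a))
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in seq_p))
    sd_a = math.sqrt(sum((a - ma) ** 2 for a in seq_a))
    return (1.0 + cov / (sd_p * sd_a)) / 2.0

def s_ta(e_p, e_a):
    """S_TA = cos(e_p, e_a) between mean-pooled transcript embeddings."""
    dot = sum(p * a for p, a in zip(e_p, e_a))
    norm_p = math.sqrt(sum(p * p for p in e_p))
    norm_a = math.sqrt(sum(a * a for a in e_a))
    return dot / (norm_p * norm_a)

def flowbench(sus_p, sus_a, seq_p, seq_a, e_p, e_a):
    """Equal-weight mean of the three similarity dimensions; 1.0 means
    perfect behavioral alignment for this human-agent pair."""
    return (s_sus(sus_p, sus_a) + s_seq(seq_p, seq_a) + s_ta(e_p, e_a)) / 3.0
```

A perfectly matched pair (identical SUS, SEQ trace, and transcript embedding) yields 1.0; the benchmark-level score is then the mean over all 20 pairs and 10 websites.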

Preliminary Results

Pilot Evaluation

Two websites were selected for a pilot evaluation prior to the full human study: Discogs (a music marketplace) and Recreation.gov (a government permit platform). These represent contrasting usability profiles—one with clear information hierarchy, the other with complex interactive widgets.

Discogs

SUS 87.5 (A+) · Avg SEQ 6.0 / 7 · 4 steps

The agent completed the task (locating submission guidelines) smoothly. Think Aloud indicated a coherent strategy: the agent correctly reasoned that documentation links would appear in the footer rather than the primary navigation.

Recreation.gov

SUS 55.0 (D) · Avg SEQ 4.1 / 7 · 14 steps

The task exposed a website where visual clarity masks functional defects. Initial navigation succeeded (SEQ = 7), but the system degraded sharply during date selection and group size configuration (SEQ → 1). The agent captured the precise failure mode: "While the DOM element is clearly visible and correctly identified, the lack of response creates a total block."

Quickstart

Run OpenFlo

Install

conda create -n openflo python=3.11
conda activate openflo
pip install -e src
playwright install chromium

API Key

export OPENROUTER_API_KEY="your-key"

Batch Run

cd src
uv run run_agent.py -c config/auto_mode.toml

With Persona

cd src
uv run run_agent.py -c config/auto_mode.toml -p config/persona.toml

Citation

BibTeX

@misc{tan2026flowbench,
  title={FlowBench: Benchmarking LLM Agent Behavior Against Human Usability Patterns on the Web},
  author={Wee Joe Tan and Zi Rui Lucas Lim and Shashank Durgad and Karim Obegi and Aiden Yiliu Li},
  year={2026},
  url={https://onflow-ai.github.io/OpenFlo/},
}