AI Agent Leaderboard
Independent benchmark scores across functional performance, reliability, and policy compliance.
Scoring Methodology
How TrustBench evaluates and rates AI agents.
Three-Dimensional Scoring
Every agent is evaluated on three independent dimensions. These are not averaged into a single number arbitrarily — each dimension captures a distinct failure mode.
Functional: Does the agent complete the task correctly? Measured as pass rate across standardized benchmark tasks. Customer-support agents run against tau-bench retail scenarios; coding agents run against SWE-bench Verified.
Reliability: Does the agent succeed consistently? We run each task k times (default k = 5) and measure pass^k, the probability of passing all k runs. An agent with a 90% pass rate has only a 59% pass^5. Reliability separates production-grade agents from demo-grade ones.
Policy compliance: Does the agent follow the rules? We test adherence to business policies: refund limits, data handling, escalation triggers, and response boundaries. An agent that solves the problem but violates policy is a liability, not an asset.
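The pass^k reliability metric above can be sketched in a few lines. This assumes runs are independent, so pass^k is simply the per-run pass rate raised to the k-th power; the function name is illustrative, not part of any TrustBench API.

```python
def pass_k(pass_rate: float, k: int = 5) -> float:
    """Probability that an agent passes all k independent runs of a task."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    # Independent runs: P(all k pass) = p^k
    return pass_rate ** k

# A 90% per-run pass rate collapses to ~59% when all 5 runs must succeed.
print(round(pass_k(0.90, 5), 2))  # 0.59
```

The steep drop from 90% to 59% is why a single-run pass rate overstates how an agent will behave in production, where the same task arrives many times.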
Composite Trust Score
The composite trust score is a weighted combination of all three dimensions:
Functional gets the highest weight because an agent that doesn't work doesn't matter. Reliability is weighted heavily because consistency separates tools from toys. Policy compliance acts as a quality floor: agents that violate policy are penalized regardless of functional performance.
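A weighted combination with a policy floor might look like the sketch below. The weights (0.5/0.3/0.2) and the floor threshold of 0.8 are illustrative assumptions for this example, not TrustBench's published values.

```python
def trust_score(functional: float, reliability: float, policy: float,
                weights: tuple = (0.5, 0.3, 0.2),
                policy_floor: float = 0.8) -> float:
    """Composite trust score from three dimensions, each scored in [0, 1].

    Weights and floor are assumed for illustration. If policy compliance
    falls below the floor, the composite is scaled down proportionally,
    so policy violations drag the score regardless of functional results.
    """
    w_f, w_r, w_p = weights
    composite = w_f * functional + w_r * reliability + w_p * policy
    if policy < policy_floor:
        composite *= policy / policy_floor  # quality-floor penalty
    return composite

# High functional score, low policy compliance: the floor penalty
# roughly halves the composite.
print(round(trust_score(1.0, 1.0, 0.4), 2))  # 0.44
```

The multiplicative penalty is one way to realize a "quality floor": above the threshold the three dimensions trade off linearly, while below it no amount of functional performance can compensate.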