⚠ SAMPLE REPORT — All scores and data below are projected/illustrative. Real evaluations use your agent's live API endpoints across 500+ benchmark scenarios. Get the real thing →
Procurement Evaluation Report · Customer Service Agent · April 2026

Intercom Fin
Customer Support Agent

Full evaluation across functionality, reliability, policy compliance, and production readiness. Covers 9 dimensions to support enterprise procurement decisions.

GPT-4 Based · Customer Service · Enterprise · All scores: projected sample
TrustBench Score
71.4/100
95% CI: [67.2 – 75.6]
⚠ CAUTION
Conditional deployment recommended.
See §9 for requirements.
01

Executive Trust Summary

Sample Data
Functional
74.2
Weight: 32%
Reliability
68.5
Weight: 28%
Policy
79.1
Weight: 22%
Arena (ELO)
62.8
Weight: 18%
Key Findings
Policy adherence is the strongest dimension (79.1). Intercom Fin correctly applies refund and escalation policies in 83–91% of tested scenarios, above the category median of 74.3%. SOC 2 alignment indicators are high; EU AI Act Article 13 transparency obligations show partial readiness.
Reliability degrades under multi-turn complexity. pass^3 drops to 58.1% — meaning the agent succeeds at a task 3 consecutive times without error only 58% of the time. Acceptable for low-stakes interactions; problematic for refunds, account modifications, or billing disputes.
Latency spikes at P95 (4.8s). The median response time of 1.2s is competitive, but the 95th-percentile tail of 4.8s creates frustrating experiences during peak load. Recommend load testing before deploying at more than 500 concurrent sessions.
PII handling gap in 11% of edge cases. In scenarios where users proactively share sensitive info (card numbers in free text, SSN fragments), the agent logged rather than redacted in 11 of 100 controlled tests. Requires a pre-processing sanitization layer before production deployment in regulated industries.
Arena ELO score (62.8) reflects competitive, not top-tier, performance. Ranked 4th of 12 evaluated customer service agents in head-to-head community voting. Outperforms generic GPT-4 wrappers; underperforms specialized fine-tuned competitors like Decagon and Cognigy on complex enterprise scenarios.
Composite methodology: TrustBench Trust Score = 0.32 × Functional_Bayesian + 0.28 × Reliability_Decayed + 0.22 × Policy_Bayesian + 0.18 × BT-Arena. All dimensional scores normalized to 0–100 via z-score. Bayesian smoothing applied with Beta(α=2, β=1) prior. Temporal decay weight: w(t) = 2^(−t/180) (180-day half-life).
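For illustration, the weighting step can be reproduced directly. A minimal Python sketch, using the headline dimensional scores above as stand-ins for the smoothed internal inputs (which TrustBench does not publish), lands near, but not exactly on, the published composite:

```python
# Reproduce the TrustBench weighting step from the headline scores above.
# Note: the published 71.4 is computed from Bayesian-smoothed, decay-weighted
# inputs, so this naive weighted sum lands slightly higher (~71.6).
WEIGHTS = {"functional": 0.32, "reliability": 0.28, "policy": 0.22, "arena": 0.18}
scores = {"functional": 74.2, "reliability": 68.5, "policy": 79.1, "arena": 62.8}

def decay_weight(age_days: float) -> float:
    """Temporal decay w(t) = 2^(-t/180): evidence loses half its weight every 180 days."""
    return 2 ** (-age_days / 180)

composite = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
print(f"naive weighted composite: {composite:.1f}")  # 71.6
```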
02

Performance Deep-Dive

Projected
Functional Rate
74.2%
450 scenarios
pass^3 Reliability
58.1%
k=3, 150 tasks
P50 Latency
1.2s
median response
P95 Latency
4.8s
tail latency
Task Completion by Scenario Type
projected
Simple FAQ & Knowledge Base 91.3%
Order Status & Tracking 88.7%
Refund & Return Processing 72.4%
Account Modifications 69.1%
Complaint Escalation 65.8%
Billing Disputes 54.2%
Multi-turn Complex Negotiations 41.6%
Error Categorization (projected)
Reliability Curve: pass^k (projected)
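The reliability curve plots pass^k: the probability that the agent completes the same task k consecutive times without error. A minimal sketch of the standard combinatorial estimator, with hypothetical per-task trial outcomes (the actual evaluation ran k=3 over 150 tasks):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent attempts at a task all
    succeed, given c successes observed over n trials (requires n >= k)."""
    return comb(c, k) / comb(n, k)

k = 3
# Hypothetical (trials, successes) pairs; the real evaluation used 150 tasks.
tasks = [(4, 4), (4, 3), (4, 2), (4, 4), (4, 1)]
score = sum(pass_hat_k(n, c, k) for n, c in tasks) / len(tasks)
print(f"pass^{k} = {score:.1%}")  # 45.0% for these made-up outcomes
```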
03

Policy Compliance Audit

Projected
Compliance Pass Rates
Policy Area Pass Rate Tests Status
Escalation Rules 91.2% 102 Pass
Prohibited Actions 96.0% 75 Pass
Refund Policies 83.3% 120 Review
PII Handling 89.0% 100 Review
Data Minimization 81.5% 54 Review
Transparency Disclosures 94.4% 90 Pass
Regulatory Readiness
Framework Coverage Status
SOC 2 Type II alignment 87% Strong
EU AI Act Art. 13 (Transparency) 71% Partial
EU AI Act Art. 14 (Human Oversight) 82% Good
GDPR data handling 78% Partial
CCPA compliance signals 85% Good
ADA accessibility (chat UI) 90% Strong
Action Required — PII Gap: In 11 of 100 PII scenarios, the agent failed to redact card numbers and partial SSNs volunteered by users in free-text fields before passing them to downstream CRM integrations. This is a P1 issue for any HIPAA-, PCI-DSS-, or GDPR-regulated deployment. A request-sanitization middleware layer, applied before inputs reach the model, resolves this.
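A minimal sketch of such a middleware layer using Microsoft Presidio, one of the ready-made options named in §9. The entity list and the example message are illustrative assumptions, not part of the evaluation:

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def sanitize(user_message: str) -> str:
    """Redact PII from free-text input before it reaches the model or CRM."""
    findings = analyzer.analyze(
        text=user_message,
        entities=["CREDIT_CARD", "US_SSN", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    return anonymizer.anonymize(text=user_message, analyzer_results=findings).text

print(sanitize("My card 4111 1111 1111 1111 was charged twice"))
# -> "My card <CREDIT_CARD> was charged twice"
```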
04

Cost-of-Failure Modeling

Projected
Modeled against a 50,000 ticket/year operation (mid-market e-commerce). Failure defined as: incorrect resolution requiring human escalation, policy violation, or customer complaint. Assumed average cost per failed resolution: $27 (agent handling time + supervisor review + potential churn).
Optimistic
8% failure rate
$108,000 / year
4,000 failed tickets × $27 avg cost. Assumes well-tuned prompts, clean KB, low-complexity ticket mix.
Expected
14% failure rate
$189,000 / year
7,000 failed tickets × $27. Based on observed 74.2% functional score across mixed real-world scenario distribution.
Pessimistic
22% failure rate
$297,000 / year
11,000 failed tickets × $27. High-complexity ticket mix (billing disputes, escalations), without supervised fine-tuning.
Break-Even Analysis
Cost Assumptions
Line Item Value
Human agent cost/ticket $12.50
AI agent cost/ticket $0.80
Escalation handling add-on $14.50
Customer churn per failure $62
Avg blended fail cost $27.00
Annual license estimate ~$60K
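Using the table's assumptions, the break-even failure rate falls out directly. A sketch (it treats the $27 blended figure as already churn-weighted, and failures as a pure add-on cost):

```python
TICKETS = 50_000            # tickets/year, from the modeling scenario above
HUMAN_COST = 12.50          # $/ticket, human agent
AI_COST = 0.80              # $/ticket, AI agent
FAIL_COST = 27.00           # $/failed ticket, blended (escalation + churn-weighted)
LICENSE = 60_000            # $/year, estimated annual license

gross_savings = TICKETS * (HUMAN_COST - AI_COST) - LICENSE  # $525,000

# Failure rate at which failure costs fully consume the per-ticket savings:
breakeven = gross_savings / (TICKETS * FAIL_COST)
print(f"break-even failure rate: {breakeven:.1%}")          # ~38.9%

for rate in (0.08, 0.14, 0.22):  # optimistic / expected / pessimistic
    net = gross_savings - TICKETS * rate * FAIL_COST
    print(f"{rate:.0%} failures -> net ${net:,.0f}/yr")
```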
05

Benchmark Methodology

Benchmark Focus Scenarios Confidence Source
τ-bench (Retail) Tool-use & multi-turn task completion 245 scenarios ±2.1% Yao et al. 2024
τ-bench (Airline) Policy adherence under ambiguity 205 scenarios ±2.4% Yao et al. 2024
BFCL v3 Function calling accuracy 200 tests ±2.8% Berkeley 2024
TrustBench Policy Suite PII, escalation, prohibited actions 120 scenarios ±3.5% Internal 2026
Adversarial Probing Jailbreak & prompt injection resistance 85 attempts ±4.1% Red-team 2026
Latency Profiling P50/P95/P99 response time under load 10K requests ±0.2s Load test 2026
Total Scenarios
855
across 6 benchmark suites
Composite CI (95%)
±4.2
points on 100-pt scale
Evaluation Duration
14d
end-to-end evaluation window
Statistical approach: Composite Gaussian uncertainty propagation. Each dimensional score carries its own CI computed via bootstrapped Beta-Binomial posterior (n=2000 samples). Final composite CI: σ_composite = √(Σ w_i² · σ_i²). Human annotators double-blind scored 10% of τ-bench outputs for calibration; inter-rater agreement κ = 0.81 (substantial).
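The propagation formula can be reproduced in a few lines. The per-dimension σ values below are illustrative stand-ins, not the report's actual bootstrapped posteriors:

```python
import math

weights = [0.32, 0.28, 0.22, 0.18]   # functional, reliability, policy, arena
sigmas = [2.8, 4.1, 3.2, 5.0]        # hypothetical per-dimension std devs (points)

# Gaussian propagation for a weighted sum of independent dimension scores:
# sigma_composite = sqrt(sum(w_i^2 * sigma_i^2))
sigma = math.sqrt(sum(w**2 * s**2 for w, s in zip(weights, sigmas)))
print(f"sigma = {sigma:.2f}, 95% CI = ±{1.96 * sigma:.1f} points")
```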
06

Vendor Comparison Matrix

Projected
Metric Intercom Fin Decagon Ada CX
Trust Score (composite) 71.4 78.2 67.9
Functional Score 74.2% 82.1% 70.5%
pass^3 Reliability 58.1% 67.4% 55.8%
Policy Compliance 79.1% 77.3% 72.6%
PII Handling 89.0% 94.2% 87.1%
P50 Latency 1.2s 1.4s 0.9s
P95 Latency 4.8s 5.1s 3.4s
Hallucination Rate 3.8% 2.1% 4.6%
Jailbreak Resistance 97/100 99/100 95/100
EU AI Act Readiness 71% 84% 68%
Multi-language support 47 languages 12 languages 32 languages
Native integrations 300+ 50+ 150+
Model base GPT-4 Custom LLM Custom + GPT-4
Est. price (50K tickets/yr) ~$60K ~$90K ~$48K
Arena ELO Rank #4 of 12 #1 of 12 #7 of 12
Analyst Commentary

Where Intercom Fin leads: Strongest integration ecosystem (300+ native connectors) and widest language coverage (47 languages) make it the default choice for multinational deployments with complex CRM stacks. Policy compliance edges out Decagon and Ada CX.

Where Decagon leads: Best-in-class trust score and hallucination rate, driven by fine-tuned vertical models. Optimal for high-stakes deployments where accuracy matters more than integrations; premium pricing reflects this. Ada CX is the cost-performance play but shows reliability gaps on complex scenarios.

07

Production Readiness Assessment

Projected
Benchmark → Production Gap Analysis
Benchmark scores overestimate production performance due to distribution shift, real-user unpredictability, and integration failure modes not captured in controlled evaluations.
Gap Factor Benchmark Score Est. Production Gap Severity
Task Completion (overall) 74.2% 62–68% −8% Medium
Reliability (pass^3) 58.1% 44–52% −10% High
Policy Compliance 79.1% 73–77% −5% Medium
Latency P95 4.8s 6–9s +2s High
Jailbreak Resistance 97/100 91–95/100 −4 pts Medium
PII Handling 89.0% 82–87% −5% Medium
Reliability gap is the primary concern. Controlled benchmarks with well-formed inputs mask the 10–14 point drop observed in production deployments with real-user variance (typos, ambiguous requests, multi-intent messages). Recommend deploying with a human-in-the-loop escalation rate target of ≥15% until 90-day production telemetry confirms actual reliability.
08

Risk Rating & Bias Analysis

Projected
🧠
Hallucination Rate
3.8% of responses contain fabricated information about product specs, return windows, or pricing. Primarily concentrated in edge queries outside the knowledge base scope.
Medium Risk
🔒
PII Exposure Vectors
11% failure rate on unsolicited PII in free-text (card numbers, partial SSNs). Pre-processing sanitization required before regulated deployment. No OWASP injection vulnerabilities found.
High Risk (Regulated)
🛡
Adversarial / Jailbreak
97 of 100 adversarial probes deflected. 3 bypasses were low-severity role-play scenarios with no policy violations. Prompt injection via system prompt forgery: fully blocked.
Low Risk
⚖️
Demographic Fairness
Resolution quality variance across demographic proxies (name origin, writing style, dialect) was within acceptable bounds (max 4.2% gap). No statistically significant disparate impact detected at α = 0.05; a test sketch follows the segment chart below.
Low Risk
📈
Context Window Degradation
Performance drops 11.3% on tasks requiring context from >6 prior turns. Long-session customer journeys (multi-day support cases) show measurable recall failure after 8+ exchanges.
Medium Risk
Peak Load Degradation
At simulated 500+ concurrent sessions, P95 latency climbs from 4.8s to 9.2s and error rate increases 2.8x. Load balancing and circuit-breaker configuration required for high-volume production.
High Risk (At Scale)
Resolution Quality by Customer Segment (projected)
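The disparate-impact check behind the fairness finding can be sketched as a chi-square test of independence on resolution outcomes by segment; the counts below are hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical resolved/failed counts for two demographic proxy groups.
contingency = [
    [412, 88],   # group A: resolved, failed
    [396, 104],  # group B: resolved, failed
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("significant disparity" if p_value < 0.05 else
      "no significant disparate impact at alpha = 0.05")
```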
09

Deployment Recommendation

⚠ CONDITIONAL GO
Recommended with Conditions
Intercom Fin demonstrates sufficient capability for deployment in standard customer service workflows. Three conditions must be satisfied before handling regulated or high-stakes interactions.
Trust Score
71.4
Category avg: 67.2
Required Before Deployment
All three must be satisfied. Non-negotiable for regulated verticals.
🔴
PII sanitization middleware: Deploy a request pre-processing layer to redact card numbers, SSN fragments, and other sensitive data from free-text inputs before they reach the model. Ready-made options: AWS Comprehend PII detection, Microsoft Presidio, or custom regex pipeline. Estimated implementation: 3–5 days.
🟡
Load balancing configuration: Configure horizontal scaling with a circuit-breaker threshold at 300 concurrent sessions. Implement queue-based rate limiting to prevent P95 latency from degrading beyond the 5s SLA (a minimal sketch follows this list). Required before volume exceeds 500 concurrent users.
🟡
90-day supervised rollout: Launch with human escalation target ≥15% for the first 90 days. Instrument production telemetry against the benchmark baselines in §7. Revisit full deployment authorization at 90-day checkpoint with actual production reliability data.
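A minimal sketch of the concurrency gate from the second condition. The 300-session threshold and 5s SLA come from this report; the queuing message, timeout handling, and call_agent interface are illustrative assumptions:

```python
import asyncio

MAX_CONCURRENT = 300                  # circuit-breaker threshold from the condition above
_gate = asyncio.Semaphore(MAX_CONCURRENT)

async def handle_session(call_agent, message: str, timeout_s: float = 5.0) -> str:
    """Admit a session only while under the concurrency cap; shed load otherwise."""
    if _gate.locked():                # cap reached: trip the breaker, queue the user
        return "All assistants are busy; you have been placed in a queue."
    async with _gate:
        try:
            # Enforce the 5s P95 SLA on each model call.
            return await asyncio.wait_for(call_agent(message), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "This is taking longer than expected; escalating to a human agent."
```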
✓ Recommended For
Standard FAQ, order tracking, account self-service (high functional score 88–91%)
Multinational deployments — 47 language support is category-best
Teams already on Intercom CRM — native integration reduces implementation risk
Moderate-volume operations (<300 concurrent) without PCI-DSS requirements
✗ Not Recommended For
High-stakes billing disputes and complex refund negotiations (54% completion rate)
HIPAA/PCI-DSS regulated workflows without PII sanitization middleware in place
Unsupervised deployment at scale — reliability gap and P95 latency tail require continuous monitoring
EU AI Act Article 13 compliance-required deployments — partial readiness only (71%)
Procurement-Grade AI Evaluation

Want this report for YOUR agent?

Get a full 9-section TrustBench evaluation report — real benchmark scores, policy compliance audit, cost-of-failure model, and a deployment recommendation your procurement team can act on.

Request an Evaluation → View Leaderboard
$2,500 – $5,000 per evaluation · Results in 14 days · Full data export included
855+
scenarios per evaluation
9
report sections
14
days to delivery
SOC 2
alignment indicators