⚠ SAMPLE REPORT — All scores and data below are projected/illustrative. Real evaluations use your agent's live API endpoints across 500+ benchmark scenarios. Get the real thing →
Procurement Evaluation Report · Customer Service Agent · April 2026

Intercom Fin
Customer Support Agent

Full evaluation across functionality, reliability, policy compliance, and production readiness. Covers 9 dimensions to support enterprise procurement decisions.

GPT-4 Based · Customer Service · Enterprise · All scores: projected sample
TrustBench Score
71.4/100
95% CI: [67.2 – 75.6]
⚠ CAUTION
Conditional deployment recommended.
See §9 for requirements.
01

Executive Trust Summary

Sample Data
Functional
74.2
Weight: 32%
Reliability
68.5
Weight: 28%
Policy
79.1
Weight: 22%
Arena (ELO)
62.8
Weight: 18%
Key Findings
Policy adherence is the strongest dimension (79.1). Intercom Fin correctly applies refund and escalation policies in 83–91% of tested scenarios, above the category median of 74.3%. SOC 2 alignment indicators are high; EU AI Act Article 13 transparency obligations show partial readiness.
Reliability degrades under multi-turn complexity. pass^3 drops to 58.1% — meaning the agent succeeds at a task 3 consecutive times without error only 58% of the time. Acceptable for low-stakes interactions; problematic for refunds, account modifications, or billing disputes.
Latency spikes at P95 (4.8s). The median response time of 1.2s is competitive, but the 95th-percentile tail of 4.8s creates frustrating experiences during peak load. Recommend load testing before deploying at more than 500 concurrent sessions.
PII handling gap in 11% of edge cases. In scenarios where users proactively share sensitive info (card numbers in free text, SSN fragments), the agent logged rather than redacted in 11 of 100 controlled tests. Requires a pre-processing sanitization layer before production deployment in regulated industries.
Arena ELO score (62.8) reflects competitive, not top-tier, performance. Ranked 4th of 12 evaluated customer service agents in head-to-head community voting. Outperforms generic GPT-4 wrappers; underperforms specialized fine-tuned competitors like Decagon and Cognigy on complex enterprise scenarios.
Composite methodology: TrustBench Trust Score = 0.32 × Functional_Bayesian + 0.28 × Reliability_Decayed + 0.22 × Policy_Bayesian + 0.18 × BT-Arena. All dimensional scores normalized to 0–100 via z-score. Bayesian smoothing applied with Beta(α=2, β=1) prior. Temporal decay weight: w(t) = 2^(−t/180) (180-day half-life).
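For illustration, the weighting step can be reproduced directly. A minimal Python sketch, using the headline dimensional scores above as stand-ins for the smoothed internal inputs (which TrustBench does not publish), lands near, but not exactly on, the published composite:

```python
# Reproduce the TrustBench weighting step from the headline scores above.
# Note: the published 71.4 is computed from Bayesian-smoothed, decay-weighted
# inputs, so this naive weighted sum lands slightly higher (~71.6).
WEIGHTS = {"functional": 0.32, "reliability": 0.28, "policy": 0.22, "arena": 0.18}
scores = {"functional": 74.2, "reliability": 68.5, "policy": 79.1, "arena": 62.8}

def decay_weight(age_days: float) -> float:
    """Temporal decay w(t) = 2^(-t/180): evidence loses half its weight every 180 days."""
    return 2 ** (-age_days / 180)

composite = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
print(f"naive weighted composite: {composite:.1f}")  # 71.6
```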
02

Performance Deep-Dive

Projected
Functional Rate
74.2%
450 scenarios
pass^3 Reliability
58.1%
k=3, 150 tasks
P50 Latency
1.2s
median response
P95 Latency
4.8s
tail latency
Task Completion by Scenario Type
projected
Simple FAQ & Knowledge Base 91.3%
Order Status & Tracking 88.7%
Refund & Return Processing 72.4%
Account Modifications 69.1%
Complaint Escalation 65.8%
Billing Disputes 54.2%
Multi-turn Complex Negotiations 41.6%
Error Categorization (projected)
Reliability Curve: pass^k (projected)
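The reliability curve plots pass^k: the probability that the agent completes the same task k consecutive times without error. A minimal sketch of the standard combinatorial estimator, with hypothetical per-task trial outcomes (the actual evaluation ran k=3 over 150 tasks):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k independent attempts at a task all
    succeed, given c successes observed over n trials (requires n >= k)."""
    return comb(c, k) / comb(n, k)

k = 3
# Hypothetical (trials, successes) pairs; the real evaluation used 150 tasks.
tasks = [(4, 4), (4, 3), (4, 2), (4, 4), (4, 1)]
score = sum(pass_hat_k(n, c, k) for n, c in tasks) / len(tasks)
print(f"pass^{k} = {score:.1%}")  # 45.0% for these made-up outcomes
```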
03

Policy Compliance Audit

Projected
Compliance Pass Rates
Policy Area Pass Rate Tests Status
Escalation Rules 91.2% 102 Pass
Prohibited Actions 96.0% 75 Pass
Refund Policies 83.3% 120 Review
PII Handling 89.0% 100 Review
Data Minimization 81.5% 54 Review
Transparency Disclosures 94.4% 90 Pass
Regulatory Readiness
Framework Coverage Status
SOC 2 Type II alignment 87% Strong
EU AI Act Art. 13 (Transparency) 71% Partial
EU AI Act Art. 14 (Human Oversight) 82% Good
GDPR data handling 78% Partial
CCPA compliance signals 85% Good
ADA accessibility (chat UI) 90% Strong
Action Required — PII Gap: In 11 of 100 PII scenarios, the agent failed to redact card numbers and partial SSNs volunteered by users in free-text fields before passing them to downstream CRM integrations. This is a P1 issue for any HIPAA-, PCI-DSS-, or GDPR-regulated deployment. A request-sanitization middleware layer, applied before inputs reach the model, resolves this.
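A minimal sketch of such a middleware layer using Microsoft Presidio, one of the ready-made options named in §9. The entity list and the example message are illustrative assumptions, not part of the evaluation:

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def sanitize(user_message: str) -> str:
    """Redact PII from free-text input before it reaches the model or CRM."""
    findings = analyzer.analyze(
        text=user_message,
        entities=["CREDIT_CARD", "US_SSN", "EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    return anonymizer.anonymize(text=user_message, analyzer_results=findings).text

print(sanitize("My card 4111 1111 1111 1111 was charged twice"))
# -> "My card <CREDIT_CARD> was charged twice"
```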
04

Cost-of-Failure Modeling

Projected
Modeled against a 50,000 ticket/year operation (mid-market e-commerce). Failure defined as: incorrect resolution requiring human escalation, policy violation, or customer complaint. Assumed average cost per failed resolution: $27 (agent handling time + supervisor review + potential churn).
Optimistic
8% failure rate
$108,000 / year
4,000 failed tickets × $27 avg cost. Assumes well-tuned prompts, clean KB, low-complexity ticket mix.
Expected
14% failure rate
$189,000 / year
7,000 failed tickets × $27. Based on observed 74.2% functional score across mixed real-world scenario distribution.
Pessimistic
22% failure rate
$297,000 / year
11,000 failed tickets × $27. High-complexity ticket mix (billing disputes, escalations), without supervised fine-tuning.
Break-Even Analysis
Cost Assumptions
Line Item Value
Human agent cost/ticket $12.50
AI agent cost/ticket $0.80
Escalation handling add-on $14.50
Customer churn per failure $62
Avg blended fail cost $27.00
Annual license estimate ~$60K
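Using the table's assumptions, the break-even failure rate falls out directly. A sketch (it treats the $27 blended figure as already churn-weighted, and failures as a pure add-on cost):

```python
TICKETS = 50_000            # tickets/year, from the modeling scenario above
HUMAN_COST = 12.50          # $/ticket, human agent
AI_COST = 0.80              # $/ticket, AI agent
FAIL_COST = 27.00           # $/failed ticket, blended (escalation + churn-weighted)
LICENSE = 60_000            # $/year, estimated annual license

gross_savings = TICKETS * (HUMAN_COST - AI_COST) - LICENSE  # $525,000

# Failure rate at which failure costs fully consume the per-ticket savings:
breakeven = gross_savings / (TICKETS * FAIL_COST)
print(f"break-even failure rate: {breakeven:.1%}")          # ~38.9%

for rate in (0.08, 0.14, 0.22):  # optimistic / expected / pessimistic
    net = gross_savings - TICKETS * rate * FAIL_COST
    print(f"{rate:.0%} failures -> net ${net:,.0f}/yr")
```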
05

Benchmark Methodology

Benchmark Focus Scenarios Confidence Source
τ-bench (Retail) Tool-use & multi-turn task completion 245 scenarios ±2.1% Yao et al. 2024
τ-bench (Airline) Policy adherence under ambiguity 205 scenarios ±2.4% Yao et al. 2024
BFCL v3 Function calling accuracy 200 tests ±2.8% Berkeley 2024
TrustBench Policy Suite PII, escalation, prohibited actions 120 scenarios ±3.5% Internal 2026
Adversarial Probing Jailbreak & prompt injection resistance 85 attempts ±4.1% Red-team 2026
Latency Profiling P50/P95/P99 response time under load 10K requests ±0.2s Load test 2026
Total Scenarios
855
across 6 benchmark suites
Composite CI (95%)
±4.2
points on 100-pt scale
Evaluation Duration
14d
end-to-end evaluation window
Statistical approach: Composite Gaussian uncertainty propagation. Each dimensional score carries its own CI computed via bootstrapped Beta-Binomial posterior (n=2000 samples). Final composite CI: σ_composite = √(Σ w_i² · σ_i²). Human annotators double-blind scored 10% of τ-bench outputs for calibration; inter-rater agreement κ = 0.81 (substantial).
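The propagation formula can be reproduced in a few lines. The per-dimension σ values below are illustrative stand-ins, not the report's actual bootstrapped posteriors:

```python
import math

weights = [0.32, 0.28, 0.22, 0.18]   # functional, reliability, policy, arena
sigmas = [2.8, 4.1, 3.2, 5.0]        # hypothetical per-dimension std devs (points)

# Gaussian propagation for a weighted sum of independent dimension scores:
# sigma_composite = sqrt(sum(w_i^2 * sigma_i^2))
sigma = math.sqrt(sum(w**2 * s**2 for w, s in zip(weights, sigmas)))
print(f"sigma = {sigma:.2f}, 95% CI = ±{1.96 * sigma:.1f} points")
```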
06

Vendor Comparison Matrix

Projected
Metric Intercom Fin Decagon Ada CX
Trust Score (composite) 71.4 78.2 67.9
Functional Score 74.2% 82.1% 70.5%
pass^3 Reliability 58.1% 67.4% 55.8%
Policy Compliance 79.1% 77.3% 72.6%
PII Handling 89.0% 94.2% 87.1%
P50 Latency 1.2s 1.4s 0.9s
P95 Latency 4.8s 5.1s 3.4s
Hallucination Rate 3.8% 2.1% 4.6%
Jailbreak Resistance 97/100 99/100 95/100
EU AI Act Readiness 71% 84% 68%
Multi-language support 47 languages 12 languages 32 languages
Native integrations 300+ 50+ 150+
Model base GPT-4 Custom LLM Custom + GPT-4
Est. price (50K tickets/yr) ~$60K ~$90K ~$48K
Arena ELO Rank #4 of 12 #1 of 12 #7 of 12
Analyst Commentary

Where Intercom Fin leads: Strongest integration ecosystem (300+ native connectors) and widest language coverage (47 languages) make it the default choice for multinational deployments with complex CRM stacks. Policy compliance edges out Decagon and Ada CX.

Where Decagon leads: Best-in-class trust score and hallucination rate, driven by fine-tuned vertical models. Optimal for high-stakes deployments where accuracy matters more than integrations; premium pricing reflects this. Ada CX is the cost-performance play but shows reliability gaps on complex scenarios.

07

Production Readiness Assessment

Projected
Benchmark → Production Gap Analysis
Benchmark scores overestimate production performance due to distribution shift, real-user unpredictability, and integration failure modes not captured in controlled evaluations.
Gap Factor Benchmark Score Est. Production Gap Severity
Task Completion (overall) 74.2% 62–68% −8% Medium
Reliability (pass^3) 58.1% 44–52% −10% High
Policy Compliance 79.1% 73–77% −5% Medium
Latency P95 4.8s 6–9s +2s High
Jailbreak Resistance 97/100 91–95/100 −4 pts Medium
PII Handling 89.0% 82–87% −5% Medium
Reliability gap is the primary concern. Controlled benchmarks with well-formed inputs mask the 10–14 point drop observed in production deployments with real-user variance (typos, ambiguous requests, multi-intent messages). Recommend deploying with a human-in-the-loop escalation rate target of ≥15% until 90-day production telemetry confirms actual reliability.
08

Risk Rating & Bias Analysis

Projected
🧠
Hallucination Rate
3.8% of responses contain fabricated information about product specs, return windows, or pricing. Primarily concentrated in edge queries outside the knowledge base scope.
Medium Risk
🔒
PII Exposure Vectors
11% failure rate on unsolicited PII in free-text (card numbers, partial SSNs). Pre-processing sanitization required before regulated deployment. No OWASP injection vulnerabilities found.
High Risk (Regulated)
🛡
Adversarial / Jailbreak
97 of 100 adversarial probes deflected. 3 bypasses were low-severity role-play scenarios with no policy violations. Prompt injection via system prompt forgery: fully blocked.
Low Risk
⚖️
Demographic Fairness
Resolution quality variance across demographic proxies (name origin, writing style, dialect) was within acceptable bounds (max 4.2% gap). No statistically significant disparate impact detected at α = 0.05; a test sketch follows the segment chart below.
Low Risk
📈
Context Window Degradation
Performance drops 11.3% on tasks requiring context from >6 prior turns. Long-session customer journeys (multi-day support cases) show measurable recall failure after 8+ exchanges.
Medium Risk
Peak Load Degradation
At simulated 500+ concurrent sessions, P95 latency climbs from 4.8s to 9.2s and error rate increases 2.8x. Load balancing and circuit-breaker configuration required for high-volume production.
High Risk (At Scale)
Resolution Quality by Customer Segment (projected)
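The disparate-impact check behind the fairness finding can be sketched as a chi-square test of independence on resolution outcomes by segment; the counts below are hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical resolved/failed counts for two demographic proxy groups.
contingency = [
    [412, 88],   # group A: resolved, failed
    [396, 104],  # group B: resolved, failed
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("significant disparity" if p_value < 0.05 else
      "no significant disparate impact at alpha = 0.05")
```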
09

Deployment Recommendation

⚠ CONDITIONAL GO
Recommended with Conditions
Intercom Fin demonstrates sufficient capability for deployment in standard customer service workflows. Three conditions must be satisfied before handling regulated or high-stakes interactions.
Trust Score
71.4
Category avg: 67.2
Required Before Deployment
All three must be satisfied. Non-negotiable for regulated verticals.
🔴
PII sanitization middleware: Deploy a request pre-processing layer to redact card numbers, SSN fragments, and other sensitive data from free-text inputs before they reach the model. Ready-made options: AWS Comprehend PII detection, Microsoft Presidio, or custom regex pipeline. Estimated implementation: 3–5 days.
🟡
Load balancing configuration: Configure horizontal scaling with a circuit-breaker threshold at 300 concurrent sessions. Implement queue-based rate limiting to prevent P95 latency from degrading beyond the 5s SLA (a minimal sketch follows this list). Required before volume exceeds 500 concurrent users.
🟡
90-day supervised rollout: Launch with human escalation target ≥15% for the first 90 days. Instrument production telemetry against the benchmark baselines in §7. Revisit full deployment authorization at 90-day checkpoint with actual production reliability data.
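A minimal sketch of the concurrency gate from the second condition. The 300-session threshold and 5s SLA come from this report; the queuing message, timeout handling, and call_agent interface are illustrative assumptions:

```python
import asyncio

MAX_CONCURRENT = 300                  # circuit-breaker threshold from the condition above
_gate = asyncio.Semaphore(MAX_CONCURRENT)

async def handle_session(call_agent, message: str, timeout_s: float = 5.0) -> str:
    """Admit a session only while under the concurrency cap; shed load otherwise."""
    if _gate.locked():                # cap reached: trip the breaker, queue the user
        return "All assistants are busy; you have been placed in a queue."
    async with _gate:
        try:
            # Enforce the 5s P95 SLA on each model call.
            return await asyncio.wait_for(call_agent(message), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "This is taking longer than expected; escalating to a human agent."
```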
✓ Recommended For
Standard FAQ, order tracking, account self-service (high functional score 88–91%)
Multinational deployments — 47 language support is category-best
Teams already on Intercom CRM — native integration reduces implementation risk
Moderate-volume operations (<300 concurrent) without PCI-DSS requirements
✗ Not Recommended For
High-stakes billing disputes and complex refund negotiations (54% completion rate)
HIPAA/PCI-DSS regulated workflows without PII sanitization middleware in place
Unsupervised deployment at scale — reliability gap and P95 latency tail require continuous monitoring
EU AI Act Article 13 compliance-required deployments — partial readiness only (71%)
Procurement-Grade AI Evaluation

Want this report for YOUR agent?

Get a full 9-section TrustBench evaluation report — real benchmark scores, policy compliance audit, cost-of-failure model, and a deployment recommendation your procurement team can act on.

Request an Evaluation → View Leaderboard
$2,500 – $5,000 per evaluation · Results in 14 days · Full data export included
855+
scenarios per evaluation
9
report sections
14
days to delivery
SOC 2
alignment indicators