Field Guides 1 June 2026 7 min read

How to compare AI tools like-for-like: a practical framework

ChatGPT, Copilot, Claude and Gemini all publish their own impact numbers. None of them are comparable. Here's a five-minute framework to fix that, built around four dimensions any CFO can defend.

By Colin Cardwell

ChatGPT says it saves your team 4 hours a week. GitHub Copilot says it saves your team 6 hours. Microsoft Copilot says it saves your team 3 hours. Anthropic Claude says the team is "highly satisfied".

Your CFO walks in and asks the obvious question. "Which one is paying off?"

You can't answer that from these four claims. They're not comparable. They were measured by four different vendors, on four different scales, using four different definitions of "time saved". You can't subtract a 3 from a 4 when one is a self-reported median and the other is a survey-weighted mean.

This piece is about how to fix that. A practical, week-of-work framework you can apply to your real AI stack to produce a single comparable read of what each tool is actually delivering. It pairs with the previous piece in this series, which explained why vendor-led impact surveys can't do this job. This is the "OK, so how do I do it instead?" answer.

The four dimensions that matter

Before you can compare anything, you have to agree on what you're comparing. Most AI tool comparisons fail because the buyer compares whatever the vendor happened to publish. That's the wrong starting point.

There are four dimensions of AI Impact Measurement that actually matter to a renewal decision. If a tool can't be scored on all four, you don't have a real comparison.

1. Adoption

What percentage of the seats you're paying for are actually in active use?

This sounds like a basic activity metric, but it's the gateway question. At the same headline ROI number, a tool with 30% adoption is not equivalent to a tool with 90% adoption, because the 30% figure rests on a much smaller, much more self-selected population. Adoption is the denominator under every other dimension.

2. Hours saved

How much time is the tool returning to the team, per active user, per week?

This is the core productivity claim. Vendors measure it differently (more on that in a moment) but the underlying question is the same: if you take this tool away, how much extra work appears on the calendar?

3. ROI

What's the dollar value of the time saved, divided by the cost of the tool?

ROI is expressed as a multiplier. A 1× means the tool pays for itself exactly. A 5× means every dollar spent returns five dollars of recovered time. ROI sits on top of adoption × hours saved × hourly rate, but expressed cleanly so a CFO doesn't have to do the maths.

A note on ROI inflation. Be careful with extrapolation. We cap ours at 2.5× when the input data is thin; the reasoning behind that is in the 2.5× rule piece. Any AI tool quoting you a 50× ROI is selling you a model assumption, not a measurement.
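The ROI definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name, the `inputs_sparse` flag (how you decide that data is thin is left to you) and the figures in the usage line are all hypothetical. The 4.33 weeks-per-month factor is the one used in the spreadsheet template later in this piece.

```python
WEEKS_PER_MONTH = 4.33  # average weeks in a month (52 / 12, rounded)


def roi_multiplier(
    active_users: int,
    hours_saved_per_user_per_week: float,
    hourly_rate: float,
    seats: int,
    cost_per_seat_per_month: float,
    inputs_sparse: bool = False,
    cap: float = 2.5,
) -> float:
    """Dollar value of saved time divided by tool cost, as a multiplier."""
    monthly_cost = seats * cost_per_seat_per_month
    if monthly_cost <= 0:
        raise ValueError("monthly cost must be positive")
    monthly_value = (
        active_users * hours_saved_per_user_per_week * WEEKS_PER_MONTH * hourly_rate
    )
    roi = monthly_value / monthly_cost
    # Apply the cap only when the caller flags the inputs as thin.
    return min(roi, cap) if inputs_sparse else roi


# Hypothetical figures: 40 of 100 seats active, 1.5 h/week each, $50/h, $30 per seat.
print(round(roi_multiplier(40, 1.5, 50.0, 100, 30.0), 2))  # 4.33
print(roi_multiplier(40, 1.5, 50.0, 100, 30.0, inputs_sparse=True))  # 2.5
```

Note that the cost side uses every paid seat while the value side uses only active users, so low adoption drags the multiplier down automatically.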

4. Satisfaction

How does the team actually feel about the tool?

This is the qualitative read that stops you renewing a tool the team has quietly given up on. A high-adoption, high-hours-saved, low-satisfaction combination is a leading indicator that the team is white-knuckling through. Worth knowing before the contract renews.

Four dimensions. Adoption, hours, ROI, satisfaction. If your AI tool comparison has fewer than these, you're under-measuring.

Why each vendor measures these differently

You'll find these dimensions in most vendor impact dashboards. The catch is that each vendor measures them in a way that flatters their own product.

Some concrete methodology drift:

| Dimension | Vendor variation | Why it bites |
|---|---|---|
| Adoption | "Active users" = logged in within 28 days (one vendor) vs. used a core feature (another) | The 28-day login threshold double-counts dabblers and inflates the number |
| Hours saved | Self-reported on 0-60 min scale vs. 0-180 min scale vs. estimated from prompt volume | Same underlying behaviour produces wildly different averages |
| ROI | Loaded team rate vs. national-average wage vs. fixed $50/hr | Cost-of-time assumptions can swing ROI by 3-4× without changing reality |
| Satisfaction | 5-point Likert vs. 10-point NPS vs. sentiment-from-comments | None of these scales is linearly comparable |

No vendor is doing anything wrong here. Each one is making a reasonable choice for their own dashboard. The problem is that four reasonable-but-different choices don't add up to a single decision.

No vendor is lying. They're each just speaking a different language. Your job is to translate.

A normalisation framework

The fix is structural. You need one methodology, applied identically across every tool. Three rules.

Rule 1: Same question, same scale. Define your team's "hours saved" question once. Phrase it once. Set its scale once (we use 0-15 hours per week, which captures real signal without tail-inflation). Then ask that same question, that same way, of every team using every AI tool. Don't let the vendor's own survey override yours.

Rule 2: Normalise to a common unit. The four dimensions should land in common units across tools:

  • Adoption: percentage of seated users active in the last 14 days
  • Hours saved: hours per active user per week
  • ROI: dollar value of saved hours ÷ tool cost, capped at 2.5× when inputs are sparse
  • Satisfaction: 0-100 health score (more on why below)

Rule 3: Normalise to a 0-100 health score. Different dimensions have wildly different natural scales. Adoption is a percentage; hours saved is a small number of hours per week; ROI is a multiplier. Putting them on the same dashboard raw is like comparing temperatures in Celsius, Fahrenheit, and Kelvin. We map every metric onto a 0-100 health score so the eye can compare them at a glance. Here's how that maths works.
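One simple way to do such a mapping, assumed here purely for illustration and not necessarily the method referred to above, is a clamped linear rescale. The function name, the per-dimension ranges (in particular the 5× ROI ceiling) and the sample values are hypothetical; only the 0-15 hours scale comes from Rule 1.

```python
def health_score(value: float, floor: float, ceiling: float) -> float:
    """Linearly rescale value from [floor, ceiling] onto 0-100, clamping outliers."""
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    scaled = 100.0 * (value - floor) / (ceiling - floor)
    return max(0.0, min(100.0, scaled))


# Assumed ranges per dimension; pick your own and keep them fixed across tools.
RANGES = {
    "adoption_pct": (0.0, 100.0),
    "hours_per_user_per_week": (0.0, 15.0),  # the 0-15 survey scale from Rule 1
    "roi_multiplier": (0.0, 5.0),            # hypothetical ceiling
}

# Hypothetical raw readings for one tool.
raw = {"adoption_pct": 60.0, "hours_per_user_per_week": 3.0, "roi_multiplier": 2.0}
scores = {name: health_score(raw[name], *RANGES[name]) for name in raw}
print(scores)
# {'adoption_pct': 60.0, 'hours_per_user_per_week': 20.0, 'roi_multiplier': 40.0}
```

Whatever ranges you choose, the point is that they are chosen once and applied identically to every tool.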

A worked example

Say you're running ChatGPT Enterprise, GitHub Copilot, and Microsoft Copilot. You ask each team the same set of questions (yours, not the vendor's) once a week. You aggregate at month-end.

The BEFORE state, as each vendor reports it:

| Tool | Vendor's headline | Cost per seat |
|---|---|---|
| ChatGPT Enterprise | "4 hours saved per user per week" (self-report) | $60/mo |
| GitHub Copilot | "62% productivity confidence" (1-5 scale) | $19/mo |
| Microsoft Copilot | "+38 NPS" (sentiment) | $30/mo |

No two numbers are comparable. Your CFO can't work with this.

The AFTER state, same teams, your single methodology:

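| Tool | Adoption | Hrs/user/wk | ROI | Satisfaction | Composite |
|---|---|---|---|---|---|
| ChatGPT Enterprise | 78% | 2.8 | 2.1× | 71/100 | 68/100 |
| GitHub Copilot | 91% | 3.4 | 4.7× | 79/100 | 81/100 |
| Microsoft Copilot | 42% | 1.6 | 1.3× | 54/100 | 47/100 |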

Now you have a real comparison. GitHub Copilot is the clear winner on the engineering team's workload (high adoption, strong ROI, healthy satisfaction). Microsoft Copilot is underperforming and needs an intervention or a non-renewal conversation. ChatGPT Enterprise is solid but adoption could be lifted.

That's a CFO conversation you can defend, because the comparison is structurally sound.

Four traps that quietly invalidate AI tool comparisons

Trap 1: Vanity metrics

Daily active users sounds like adoption. It isn't. A 14-day active rate against your seated population is adoption. A vendor reporting "8,400 daily active users" without telling you the denominator is telling you nothing.
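As a minimal sketch, a 14-day active rate against paid seats could be computed like this. The `last_active` mapping stands in for whatever usage export your own measurement produces; its shape, the names in it and the dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Mapping


def adoption_pct(
    last_active: Mapping[str, date],
    seats: int,
    as_of: date,
    window_days: int = 14,
) -> float:
    """Percentage of paid seats with activity inside the trailing window."""
    if seats <= 0:
        raise ValueError("seats must be positive")
    cutoff = as_of - timedelta(days=window_days)
    active = sum(1 for day in last_active.values() if cutoff < day <= as_of)
    # The denominator is paid seats, not the number of people who ever logged in.
    return 100.0 * active / seats


# Hypothetical export: three people have ever used the tool, five seats are paid for.
last_active = {
    "ana": date(2026, 5, 30),
    "ben": date(2026, 5, 10),  # outside the 14-day window
    "cy": date(2026, 5, 20),
}
print(adoption_pct(last_active, seats=5, as_of=date(2026, 6, 1)))  # 40.0
```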

Trap 2: Self-reported time bias

Humans round up. When you ask "how much time did this save you?", the average response will skew higher than the truth, especially for tools the respondent likes. We counter this with a 2.5× extrapolation cap and aggregate-only reporting (see the 2.5× rule). If you're rolling your own framework, you'll want a similar guardrail.
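If you are rolling your own, one possible guardrail (an assumption for illustration, not the 2.5× rule itself) is to clamp each answer to the survey scale, report a median rather than a mean, and suppress the figure when too few people answered. The function name, the minimum of five respondents and the sample answers are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def aggregate_hours_saved(
    responses: Sequence[float],
    scale_max: float = 15.0,
    min_respondents: int = 5,
) -> Optional[float]:
    """Clamp answers to the survey scale and return their median.

    Returns None when the sample is too small to report as an aggregate.
    """
    if len(responses) < min_respondents:
        return None
    clamped = [max(0.0, min(scale_max, r)) for r in responses]
    return median(clamped)


# Hypothetical weekly pulse: one enthusiastic outlier claims 40 hours saved.
print(aggregate_hours_saved([2, 3, 1, 40, 2.5, 4]))  # 2.75
print(aggregate_hours_saved([6, 7]))  # None
```

The median keeps a single inflated answer from dragging the team figure upward, and the minimum sample size keeps the result aggregate-only.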

Trap 3: Sample selection

A vendor survey only reaches active users of the vendor's tool. The leavers and never-adopters are invisible. As we covered in the previous piece, this isn't a small effect; it can flip the verdict on a tool. Your independent framework should sample your team, not the vendor's user base.
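A small illustration of the effect, using made-up numbers: the same tool looks very different depending on whether you average over the people a vendor survey reaches or over everyone holding a seat. Counting non-users as zero hours saved is an assumption of this sketch.

```python
from statistics import mean

# Hypothetical team of 10 seat holders. None means the person never adopted
# the tool or stopped using it, so a vendor survey would never reach them.
reported_hours = [4.0, 5.0, 3.0, None, None, None, None, None, None, None]

active_only = [h for h in reported_hours if h is not None]
whole_team = [h if h is not None else 0.0 for h in reported_hours]

vendor_view = mean(active_only)  # what a survey of active users sees
team_view = mean(whole_team)     # what the seats you pay for deliver

print(vendor_view)  # 4.0
print(team_view)    # 1.2
```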

Trap 4: Unit mismatches

"Per active user" and "per seat" are different things. So are "per user per week" and "per team per month". When a vendor reports a number, check the unit. When you build your composite, make sure every input is in the same unit before you start adding.
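The two conversions mentioned above can be sketched as small helpers. The function names and sample figures are hypothetical; the 4.33 weeks-per-month factor matches the spreadsheet template in the next section.

```python
WEEKS_PER_MONTH = 4.33  # average weeks in a month (52 / 12, rounded)


def team_month_to_user_week(hours_per_team_per_month: float, active_users: int) -> float:
    """Convert 'hours per team per month' into 'hours per active user per week'."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return hours_per_team_per_month / active_users / WEEKS_PER_MONTH


def per_seat_to_per_active_user(value_per_seat: float, seats: int, active_users: int) -> float:
    """Re-base a 'per seat' figure onto the people actually using the tool."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return value_per_seat * seats / active_users


# Hypothetical vendor figures, converted before they go anywhere near a composite.
print(round(team_month_to_user_week(433.0, active_users=50), 2))  # 2.0
print(per_seat_to_per_active_user(1.5, seats=100, active_users=60))  # 2.5
```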

A five-minute spreadsheet template

If you want to run this framework today, here's the minimum viable spreadsheet:

| Column | What goes in it |
|---|---|
| Tool name | Plain-English name (ChatGPT, Copilot, Claude) |
| Seats | Number of paid seats |
| Active users (14d) | From your independent measurement, not the vendor's |
| Adoption % | Active users ÷ seats |
| Hours saved / user / week | From your weekly pulse, capped sensibly |
| Team hourly rate | Your team's loaded rate (use the same one for every tool) |
| Monthly value | Active users × hours × 4.33 × rate |
| Monthly cost | Seats × per-seat cost |
| ROI multiplier | Monthly value ÷ monthly cost (cap at 2.5× when inputs are sparse) |
| Satisfaction (0-100) | From your pulse, mapped onto a common health score |
| Composite (0-100) | Weighted average across the four dimensions |

Fill this in once a month. The first two months will be noisy. By month three you'll have a defensible read on which tools are paying off and which ones aren't.
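If you prefer a script to a spreadsheet, the derived columns of the template can be sketched as follows. The class and field names, the equal weights in the composite and every number in the usage lines are assumptions for illustration; the formulas are the ones in the table.

```python
from dataclasses import dataclass
from typing import Mapping

WEEKS_PER_MONTH = 4.33  # the factor used in the "Monthly value" column


@dataclass
class ToolRow:
    name: str
    seats: int
    active_users_14d: int
    hours_saved_per_user_per_week: float
    team_hourly_rate: float
    cost_per_seat_per_month: float
    inputs_sparse: bool = False

    @property
    def adoption_pct(self) -> float:
        return 100.0 * self.active_users_14d / self.seats

    @property
    def monthly_value(self) -> float:
        return (
            self.active_users_14d
            * self.hours_saved_per_user_per_week
            * WEEKS_PER_MONTH
            * self.team_hourly_rate
        )

    @property
    def monthly_cost(self) -> float:
        return self.seats * self.cost_per_seat_per_month

    @property
    def roi_multiplier(self) -> float:
        roi = self.monthly_value / self.monthly_cost
        # Cap at 2.5x only when the inputs are flagged as sparse.
        return min(roi, 2.5) if self.inputs_sparse else roi


def composite(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of 0-100 dimension scores."""
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight


# Hypothetical row: 60 of 100 seats active, 2 h/week each, $50/h, $30 per seat.
row = ToolRow("Example tool", 100, 60, 2.0, 50.0, 30.0)
print(row.adoption_pct, round(row.monthly_value), row.monthly_cost, round(row.roi_multiplier, 2))

# Hypothetical 0-100 scores per dimension, weighted equally.
scores = {"adoption": 60.0, "hours": 20.0, "roi": 40.0, "satisfaction": 70.0}
weights = {"adoption": 1.0, "hours": 1.0, "roi": 1.0, "satisfaction": 1.0}
print(composite(scores, weights))  # 47.5
```

Keeping the hourly rate and the weights identical for every row is what makes the resulting composites comparable across tools.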

If you'd rather not build the spreadsheet from scratch, we have a polished version baked into the product. It runs the same maths automatically against your real AI stack, with the pulse surveys handled in the background.

Where to go next

You don't need our product to do this. The framework is yours, and we'd rather you ran it badly than not at all.

If you want to see it applied to a real AI stack with the maths already done, explore a sample report (no signup). Or start a 14-day trial and we'll wire it up against your real tools, no card required, methodology fully published.

Apples to apples isn't a marketing claim. It's a discipline. Run the framework above and your next AI renewal meeting will be the shortest, clearest one you've had.
