News & Insights

On AI Impact Measurement.

Methodology, product updates, and candid thinking from the team building The GAiGE.

Methodology 22 July 2026 5 min read

Rising AI Spend, Elusive Returns: What KPMG's Data Says About Measurement

KPMG found AI spend and usage climbing while most companies still can't see their own AI costs clearly enough to prove return.

By Colin Cardwell

Methodology 15 July 2026 6 min read

The Trust Deficit Is the Missing Layer Beneath Every AI ROI Dashboard

Most AI returns fail because leaders expect probabilistic tools to behave like deterministic software, and no dashboard fixes an expectations problem.

By Colin Cardwell

Methodology 8 July 2026 6 min read

Microsoft Is Spending $190 Billion on AI and Cutting Jobs. Here Is What to Measure Before You Do the Same.

Microsoft's fresh layoffs against record AI spending are the clearest signal yet that capital expenditure and proof of return have come apart, and the gap is measurable long before it shows up as headcount.

By Colin Cardwell

Field Guides 6 July 2026 5 min read

The Vendors Just Told You Where AI ROI Is Won. Now Measure It.

Microsoft is putting $2.5 billion behind the idea that AI returns come from deployment, not model access. Here is how to measure whether yours are showing up.

By Colin Cardwell

Product Updates 29 June 2026 5 min read

What's new: redesigned reports, three sharper pulses, and Microsoft Edge

Our biggest reporting update yet, plus a sharper set of pulses and a few things you asked for. Every report now leads with at-a-glance gauges tailored to what each pulse measures, the three core pulses each have a clearer job, every pulse has its own colour, and the browser extension is now on Microsoft Edge.

We've shipped two releases since the last roundup. Rather than a stream of small notes, here's the catch-up on everything you'll actually see and use. Nothing below is internal plumbing; that all lives on /changelog.

Reports that lead with the answer

The headline change: every report now opens with gauges, and it shows only the gauges that pulse actually measures. No more empty dials for a number a pulse was never asking about.

Each report is now shaped around the question its pulse is built to answer, so the most important number is the first thing you see. The AI Pulse leads with ROI, hours saved and satisfaction. The AI Review leads with an NPS gauge, plus value, trust, and "would you miss it?". The AI Capability Check leads with a confidence gauge. Custom pulses adapt the same way, showing exactly what they measure and nothing they don't.

It's the same data, read faster.

See the spread, not just the average

Averages hide as much as they reveal. Every report now has a Breakdowns section that shows how people actually answered each question, as a full distribution. You can tell at a glance whether a "mostly positive" average is a wall of high scores or a split that happens to average out in the middle. That difference usually changes what you do next.
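A minimal sketch of why the spread matters, using made-up 1-to-10 scores rather than real report data. Both teams below average 7, but only the distribution shows that one of them is split. The function name and the scores are illustrative assumptions, not how the Breakdowns section is implemented.

```python
from collections import Counter
from statistics import mean

# Hypothetical 1-to-10 scores; not real report data.
consistent = [7, 7, 7, 7, 7, 7, 7, 7]
split = [10, 10, 10, 10, 4, 4, 4, 4]


def distribution(scores):
    """Count how many respondents gave each score, lowest score first."""
    counts = Counter(scores)
    return {score: counts[score] for score in sorted(counts)}


for name, scores in (("consistent", consistent), ("split", split)):
    print(name, "mean:", mean(scores), "distribution:", distribution(scores))
```

Same average, two very different next steps: the first team needs nothing, the second needs someone to find out why half of them scored a 4.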

Three pulses, three clear jobs

We've sharpened what each of the three core pulses is for, so they stop overlapping and each earns its place:

The AI Pulse (the frequent one): is a tool being used, and is it saving time? It now includes a quick 1-to-10 performance score, so you can see how well a tool is actually working day to day, not just whether it was opened.
The AI Review (the deeper one): is this tool actually worth it? It asks about the value it's delivering, how much you trust its output, and how likely people are to recommend it.
The AI Capability Check (the people one): are your team confident and growing with the tool, or quietly stuck? It looks at confidence, what's holding people back, and where they'd like to do more.

Same three pulses, much clearer signal.

A colour for every pulse

Each pulse now has its own colour, used consistently across your dashboard, your reports, and the extension. A small thing, but once you're running three pulses it means you can tell which one you're looking at without reading the name. The kind of quiet wayfinding you stop noticing because it just works.

Also new: Microsoft Edge, schedule control, and guided setup

A few more things that landed recently and are worth pointing at:

The extension is now on Microsoft Edge. The same browser extension is installable from the Edge Add-ons store as well as Chrome, so you can roll it out to your Edge users too. Install it here.
You can set how often each pulse fires. Core-pulse schedules used to be fixed. Admins can now adjust the cadence of each one to match how your team works.
A guided setup. A clearer, step-by-step path through connecting your AI tools, adding your team, and turning on pulses, so a new organisation gets to its first results faster.

Coming next

The browser extension update brings the last pieces of this release, and your team picks them up automatically: the per-pulse colours inside the extension, a rebuilt pulse screen, and a smarter delivery time, so the daily AI Pulse arrives in the afternoon, when "did you use it today?" has a full day to draw on. Nothing to install; installed extensions update themselves.

Want to see the redesigned reports without signing in? Open the sample report (no signup). Or start a free trial and point the new reports at your own AI stack.

As always, the methodology behind every gauge is published. If something here raises a question, the contact form is the fastest path; we read everything.

By Colin Cardwell

Product Updates 1 June 2026 5 min read

What's new: editorial refresh, sample reports, PDF downloads, and more

A roundup of what we've shipped over the last few weeks: a full editorial design refresh across the site and the app, a public sample report you can show your team before signing up, downloadable PDF reports, a cleaner sign-up flow, and a sharpened positioning around independent, cross-vendor measurement.

We've been heads-down. Rather than a stream of small posts, here's one post that catches you up on everything customer-facing that's landed (or will land in the next day or two). Every item below is something you'll see or use directly.

A full editorial design refresh

The biggest change is one you'll feel before you can name. We've rebuilt the visual register across both the marketing site and the in-app dashboard. Cream and charcoal replace the navy-and-white SaaS look. Serif headlines anchor every page; crimson hairlines signal where each section starts. The dense data UI (gauges, reports, tables) stays sans-serif and tightly packed because that's where the work happens.

The redesign isn't decoration. It signals the kind of company we are. Considered, independent, editorial. The same kind of attention you'd want behind your AI Impact Measurement.

Everything from the home page through to the user menu has been touched. If something looks off on your screen, refresh once and let us know.

A sample report anyone can view

Pointing a CFO or an exec team at a SaaS landing page rarely lands. Pointing them at a real, navigable AI Impact Report does.

/sample-report is now a public page. No signup. No card. It renders the exact same Reports surface a paid customer sees, against a demo organisation's data. ROI, hours saved, adoption by tool, satisfaction, and wasted spend, all on the same scale and the same methodology, ready to share around your buying committee.

Send the link. Let the report do the talking.

See the demo from inside your account

New in this release: a "View the demo" item in the sidebar (just below Settings) lets you flip into our live demo organisation any time without leaving your account. It runs the full Pro-tier experience against a real-world simulated AI stack, refreshed daily.

Useful in three situations:

Your data hasn't landed yet. The first weeks of a new pulse program are quiet by design. The demo gives you a preview of what your own dashboard will look like once responses start coming in.
You want to show a colleague. Bring a CFO, an exec, or a curious team lead into the demo to walk them through a populated report. Read-only, so they can poke around without changing anything.
You're scoping a new pulse. Look at how the demo org has theirs set up and what the resulting reports surface. Inspiration without committing.

Clicking the sidebar item flips your active organisation context to the demo; the same item then reads "Return to [your org]" so coming back is one click. If you've never run the demo, you'll also see a gentle prompt on your Dashboard and Reports pages while they're still filling up. Both prompts go away once you've visited the demo or your real data starts landing.

Downloadable PDF reports

Reports in the app are interactive, but some conversations want a flat document you can attach to a board pack or email to a sceptical stakeholder. The "Download PDF" button on any report now generates a branded, paginated copy with all the gauges, by-tool tables, comments, and ROI breakdowns rendered cleanly for print.

The PDF is generated server-side on demand, so it always reflects the live data and your current filters. A progress bar shows the wait (typically 15-40 seconds depending on the date range), and the file lands in your downloads automatically when it's done.

It's the same report. It's just portable now.

A cleaner sign-up flow

We've rebuilt the sign-up and sign-in surfaces. Single-column layout with a serif headline, an inline tagline ("Independent. Cross-vendor. AI Impact Measurement."), and clearer reassurance on the trial terms (14 days, no card, cancel any time). Sign in with Google or Microsoft is a click; the email path is two.

Onboarding (the post-signup workspace setup) has the same treatment. Three questions, no fluff, and a clear closing line on what happens next.

Existing customers won't see this often, but if you've invited a teammate recently they'll get the new experience.

Sharpened positioning: independent, cross-vendor

OpenAI is shipping Impact Surveys inside ChatGPT. Anthropic, Microsoft, Google and GitHub will follow. That's a positive shift for the category. It's also why the case for independent measurement is stronger now than it was six months ago.

We've sharpened how we describe ourselves across the site to reflect that. Three pillars you'll see in the chrome and the messaging:

Independent: we don't sell ChatGPT, Copilot, Claude or Gemini, so our measurement isn't motivated to make any of them look good.
Cross-vendor: one methodology applied to every AI tool in your stack, so the numbers actually compare.
Outcome-led: we measure what your team is getting, not what each vendor's dashboard counts as activity.

We've published two new pieces on this in the last week. If you want the full argument: Why AI vendor impact surveys can't replace independent measurement and How to compare AI tools like-for-like. Both are written for the CFO conversation.

Pricing page that actually compares

The pricing page has had two specific fixes that make decisions faster.

First, the three plan cards now align vertically so you can read across them at a glance. Previously the Lite card sat slightly higher than the other two because it didn't have a "Most popular" or "Best value" badge above the title. The badge spot is always reserved now, so the headlines line up.

Second, the "Compare all features" table at the bottom used to use a faint crimson hairline for "Included" and a barely-visible em-dash for "Not included". It was technically accurate and practically useless. Now you get a clear crimson checkmark for included and a muted X for not, so scanning takes a glance, not a squint.

Blog filter chips

Self-referential, but: this page (and the blog index you came from, if you did) now has category chips at the top. Founder Essays, Field Guides, Methodology, Product Updates. Sticky-post mechanic is gone; everything sorts by date.

If you're new to the blog and want a starting point, Field Guides is where most of the thought-leadership lives.

Coming next

Three things on the immediate runway:

Persona-led pages for the Strategy, Technology, and People audiences, rewritten around the independent + cross-vendor positioning.
A downloadable "AI Tool Comparison" spreadsheet template that pairs with the like-for-like framework piece. Soft email-gated, useful even if you never trial the product.
A "What changed" companion for every prod deploy that lands on /changelog. This blog continues to be the source of truth for the customer-visible items.

Want to see the new design and the sample report in one place? Open the sample report (no signup). Or start a free trial and put it against your real AI stack.

As always, the methodology behind every number we show is published. If something here prompts a question, the contact form is the fastest path; we read everything.

By Colin Cardwell

Field Guides 1 June 2026 9 min read

Why AI vendor impact surveys can't replace independent measurement

OpenAI is shipping Impact Surveys in ChatGPT. Anthropic, Microsoft, Google and GitHub will follow within the year. Here's why that's good news for the category, and why vendor-led measurement still can't answer your CFO's question.

The AI vendor measurement era just started. OpenAI is shipping Impact Surveys inside ChatGPT. Anthropic, Microsoft, Google and GitHub will follow within twelve months. Every AI vendor with a hand in the productivity story is racing to add outcome measurement to their own dashboard.

That's good news. For a year, the measurement conversation has been stuck on activity data. "We have 412 daily active users." "You've sent 18,000 prompts." Activity is not value. Vendors finally moving from "are people using us?" to "are people getting value from us?" is a healthy shift for the category.

It's also where the problem starts.

When the company selling you the AI tool is also the company grading whether it works, you have a problem the financial-services industry recognised decades ago. Self-marked tests are not tests. The grade is biased, the methodology is opaque, and the comparison to the next tool is non-existent. The fix in finance was independent audit. The fix for AI Impact Measurement is the same.

This piece is about why that matters, what "independent" actually means in practice, and how to evaluate the vendor-side surveys you're about to be drowning in.

The vendor survey arms race

OpenAI's Impact Surveys appear inside ChatGPT after a user completes certain workflows. They ask things like "how much time did this save you today?" and "how confident are you in the answer you got?". Aggregated, they produce a per-organisation impact report.

The launch is genuinely useful. It pushes the conversation past "but how many seats do we have?" toward "but what is the team actually getting?". Microsoft Copilot has had something similar in beta for months. Anthropic's Claude for Work team has signalled the same direction. GitHub Copilot's enterprise dashboard now includes self-reported productivity questions.

Every vendor with a productivity story is moving from activity measurement to outcome measurement. The race is on.

Within twelve months, every AI tool in your stack will produce its own "impact dashboard". Each one will:

Ask different questions
Use different scales
Sample at different cadences
Define ROI differently
Aggregate at different unit levels (per-user, per-team, per-prompt)
Publish (or not publish) their methodology

If you use five AI tools, you'll soon have five impact dashboards. None of them comparable. None of them independent. None of them complete.

Why vendor-led measurement is biased

We are not anti-vendor. Vendor-side measurement has real value when it's understood for what it is. The issue is treating it as a neutral score when it isn't.

Three sources of bias matter more than the rest.

Methodology bias

The vendor designs the questions. The vendor decides the scale. The vendor decides what counts as a "positive outcome". A vendor whose hero metric is "time saved" will ask about time saved. A vendor whose hero metric is "user confidence" will ask about confidence. None of those are wrong individually. But they are not the same question, and the answers will not roll up to a comparable number.

A concrete example. If a vendor asks "how much time did this save you?" on a 0-60 minute scale, the average answer will skew higher than if the same question were asked on a 0-180 minute scale. Buyers don't see the scale. Buyers see the average and assume comparability.

Sample bias

A vendor only sees its own users. If 30% of your team has churned away from a tool, that 30% does not get to answer the impact survey. The signal you see is from people who stayed. Survivors over-index on satisfaction. The team members the tool failed to serve are silent.

This is not a small effect. For an AI tool with a typical 60-day stickiness curve, impact-survey respondents are roughly the top quartile of users by engagement. A 30% "time saved" number from the top quartile becomes a 12% number across the whole seated population. The vendor isn't lying. The vendor is just sampling the wrong people for the question you're actually asking.
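The drop from 30% to 12% is a weighted average. As a sketch, assume the respondents are the top quartile (25% of seats) reporting 30%, and the remaining 75% of seats would have reported an average of 6%. That 6% is an assumption chosen here to reproduce the 12% figure; it is not a measured number.

```python
def population_average(segments):
    """Weighted average over (share_of_seats, reported_value) pairs."""
    total_share = sum(share for share, _ in segments)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("segment shares must sum to 1")
    return sum(share * value for share, value in segments)


# What the vendor survey sees: only the engaged top quartile.
respondents_only = 30.0

# What the whole seated population looks like under the assumption above.
whole_population = population_average([(0.25, 30.0), (0.75, 6.0)])

print(f"respondents: {respondents_only:.0f}%, all seats: {whole_population:.0f}%")
```

Change the 6% and the whole-population number moves with it, which is the point: without sampling the silent 75%, nobody knows what that figure is.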

Narrative bias

Vendors will not publish negative impact numbers. They cannot. Their commercial reality is that low scores threaten the renewal. The published impact figures will be the company-blessed read of the data. The disconfirming evidence will quietly fail to make the slide.

This is not unique to AI vendors. It is true of every market category where vendors measure their own product. But in AI it matters more because the prices are higher, the contracts are newer, and the buying committees have less experience triangulating.

The comparability problem

The deeper issue is structural. Even setting bias aside, vendor surveys cannot be compared across tools.

A common buyer scenario. You're a Head of AI Strategy at a 300-person company. You have ChatGPT Enterprise for the marketing team, GitHub Copilot for engineering, Microsoft Copilot for finance and ops, and Anthropic Claude for legal. Four AI tools, four vendors, four impact surveys.

Your CFO walks in and asks the obvious question. "Which of these is paying off?"

You cannot answer that question from the four vendor surveys. Here's why:

Tool	Sample scope	Question scale	ROI definition
ChatGPT	Active users, post-workflow	Time saved (0-60 min)	(Time × rate) / seat cost
GitHub Copilot	Per-developer, weekly	Confidence (1-5)	"Acceptance rate"
MS Copilot	Per-user, monthly NPS	Satisfaction (-100 to +100)	Self-reported task min
Claude (Work)	Per-team, quarterly	Multi-factor (not public)	Not published

Four vendors. Four methodologies. The numbers do not roll up.

Worse, the buying decision is rarely "which one of these is the worst?". It is "given a finite budget for next year, where do we double down, where do we hold, where do we cut?". That decision needs comparable measurement across the whole AI stack. Vendor surveys, by design, cannot provide it.

What "independent" actually means

Independence isn't a slogan. It has four operational meanings, and any AI Impact Measurement vendor (including us) should be willing to answer all four directly.

1. Who designs the questions

Independent measurement uses the same question, on the same scale, for every tool. The question is designed by an organisation whose commercial interest is in the measurement being trustworthy, not in any specific tool looking good. The methodology is published. The scale is fixed. The wording is open to review.

2. Who owns the data

If the AI vendor is the one collecting the answers, the AI vendor is the one deciding what to publish, what to retain, what to share with their sales team, and how the aggregate looks on a slide deck. Independent measurement holds the data outside the vendor's environment. The customer owns the raw responses. The vendor of the AI tool sees the same aggregate the customer does, with no privileged read.

3. Who publishes the methodology

You should be able to read, in plain English, exactly how every score in the report is calculated. What constitutes "active". What counts as "time saved". How wasted spend is computed. How satisfaction is normalised across tools. If you cannot read the methodology, the score is opinion, not measurement.

4. Who is incentivised by which outcome

This is the load-bearing one. A vendor selling you Tool X benefits if Tool X looks good. A vendor selling you measurement of every tool benefits if the measurement is trustworthy enough that you keep using it. Those are different incentives, and they pull in different directions.

A CFO framework: four questions to ask any AI impact data

Independence is a feature, not a marketing claim. Here is a fast diagnostic any CFO can run on any AI impact data crossing their desk.

1. Who chose the questions?
If the same company that sells the tool also designed the survey, treat the result as one input, not the answer. There may still be useful signal. You just cannot rely on it as a like-for-like comparison.

2. Who chose the scale?
A 5-point scale and an NPS that runs from -100 to +100 produce different numbers from the same underlying sentiment. If the vendor cannot tell you why they chose their scale (or worse, will not tell you), the number is not portable.

3. Who is in the sample?
A vendor's "impact survey" almost always polls active users of the vendor's tool. The leavers, the people who never adopted, the silently disengaged: those are missing from the denominator. The question to ask is "what percentage of seated users are in your sample, and what do you know about the rest?".

4. Who got to see the result before you did?
If the vendor saw the result first and decided what to surface, you are reading a curated read. If you saw the raw responses first and the vendor sees the aggregate at the same time you do, you are reading measurement.

Run those four questions over the next vendor impact dashboard that lands on your desk. You will find that one or two of the answers reliably embarrass the source.

Vendor surveys are part of the answer, not the whole answer

Read this piece carefully and you'll notice we did not say vendor impact surveys are bad. They are not. OpenAI shipping Impact Surveys is a positive development. Microsoft, Anthropic and Google adding their own will be too.

The point is that vendor surveys are an input to AI Impact Measurement, not a replacement for it. (See also our piece on why subjective and objective signals beat either one alone.)

The job of an independent, cross-vendor measurement layer is to:

Use one methodology across every tool in the stack
Sample the whole seated population, not just the most engaged
Publish the question wording and the scale openly
Hold the raw data outside any single vendor's environment
Present the result apples-to-apples so the CFO's renewal decision is data-driven

That's what The GAiGE is built to do. We're an independent measurement company. We don't sell ChatGPT, Copilot, Claude or Gemini. We measure all of them the same way, so the question "is this paying off?" has a real answer.

Where to go next

If this article describes a problem you already had (the spreadsheet of impact numbers from five vendors that don't add up to one decision), there is a faster way to see the alternative.

Explore an AI Impact Report (no signup) to see what independent, cross-vendor measurement looks like across a sample organisation's stack. Or start a free trial and we'll measure your real AI tools side by side, on one methodology, no card required.

The vendor survey era has just started. The independence era starts the day you ask the four questions above and don't accept "trust us" as an answer.

By Colin Cardwell

Field Guides 1 June 2026 7 min read

How to compare AI tools like-for-like: a practical framework

ChatGPT, Copilot, Claude and Gemini all publish their own impact numbers. None of them are comparable. Here's a five-minute framework to fix that, built around four dimensions any CFO can defend.

ChatGPT says it saves your team 4 hours a week. GitHub Copilot says it saves your team 6 hours. Microsoft Copilot says it saves your team 3 hours. Anthropic Claude says the team is "highly satisfied".

Your CFO walks in and asks the obvious question. "Which one is paying off?"

You can't answer that from these four numbers. They're not comparable. They were measured by four different vendors, on four different scales, using four different definitions of "time saved". You can't subtract a 4 from a 3 when one is a self-reported median and the other is a survey-weighted mean.

This piece is about how to fix that. A practical, week-of-work framework you can apply to your real AI stack to produce a single comparable read of what each tool is actually delivering. It pairs with the previous piece in this series, which explained why vendor-led impact surveys can't do this job. This is the "OK, so how do I do it instead?" answer.

The four dimensions that matter

Before you can compare anything, you have to agree on what you're comparing. Most AI tool comparisons fail because the buyer compares whatever the vendor happened to publish. That's the wrong starting point.

There are four dimensions of AI Impact Measurement that actually matter to a renewal decision. If a tool can't be scored on all four, you don't have a real comparison.

1. Adoption

What percentage of the seats you're paying for are actually in active use?

This sounds like a basic activity metric, but it's the gateway question. A tool with 30% adoption is not "better than" a tool with 90% adoption at the same headline ROI number, because the 30% tool's number comes from a much smaller, much more self-selected population. Adoption is the denominator under every other dimension.

2. Hours saved

How much time is the tool returning to the team, per active user, per week?

This is the core productivity claim. Vendors measure it differently (more on that in a moment) but the underlying question is the same: if you take this tool away, how much extra work appears on the calendar?

3. ROI

What's the dollar value of the time saved, divided by the cost of the tool?

ROI is expressed as a multiplier. A 1× means the tool pays for itself exactly. A 5× means every dollar spent returns five dollars of recovered time. ROI sits on top of adoption × hours saved × hourly rate, but expressed cleanly so a CFO doesn't have to do the maths.

A note on ROI inflation. Be careful with extrapolation. We cap ours at 2.5× when the input data is thin; the reasoning behind that is in the 2.5× rule piece. Any AI tool quoting you a 50× ROI is selling you a model assumption, not a measurement.
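A minimal sketch of that calculation, worked per seat per week. The function name, the `sparse_inputs` flag and the example inputs are illustrative assumptions; what counts as "thin" data is a judgment this sketch leaves to the caller.

```python
ROI_CAP = 2.5  # the cap applied when input data is thin


def roi_multiplier(adoption, hours_per_user_week, hourly_rate,
                   weekly_cost_per_seat, sparse_inputs=False):
    """Value of recovered time per seat per week, divided by cost per seat per week."""
    if weekly_cost_per_seat <= 0:
        raise ValueError("cost per seat must be positive")
    # adoption is a fraction (0.8 = 80% of seats active)
    value_per_seat = adoption * hours_per_user_week * hourly_rate
    roi = value_per_seat / weekly_cost_per_seat
    return min(roi, ROI_CAP) if sparse_inputs else roi


# Hypothetical inputs: 80% adoption, 2 hours/week, $50/hour, $20 per seat per week.
uncapped = roi_multiplier(0.8, 2.0, 50.0, 20.0)
capped = roi_multiplier(0.8, 2.0, 50.0, 20.0, sparse_inputs=True)
print(f"uncapped: {uncapped:.1f}x, capped: {capped:.1f}x")
```

The cap does nothing when the data is solid. It only stops a handful of enthusiastic responses from being extrapolated into a headline number.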

4. Satisfaction

How does the team actually feel about the tool?

This is the qualitative read that stops you renewing a tool the team has quietly given up on. A high-adoption, high-hours-saved, low-satisfaction combination is a leading indicator that the team is white-knuckling through. Worth knowing before the contract renews.

Four dimensions. Adoption, hours, ROI, satisfaction. If your AI tool comparison has fewer than these, you're under-measuring.

Why each vendor measures these differently

You'll find these dimensions in most vendor impact dashboards. The catch is that each vendor measures them in a way that flatters their own product.

Some concrete methodology drift:

Dimension	Vendor variation	Why it bites
Adoption	"Active users" = logged in within 28 days (one vendor) vs. used a core feature (another)	The 28-day login threshold counts dabblers as active and inflates the number
Hours saved	Self-reported on 0-60 min scale vs. 0-180 min scale vs. estimated from prompt volume	Same underlying behaviour produces wildly different averages
ROI	Loaded team rate vs. national-average wage vs. fixed $50/hr	Cost-of-time assumptions can swing ROI by 3-4× without changing reality
Satisfaction	5-point Likert vs. 0-10 NPS question vs. sentiment-from-comments	None of these scales is linearly comparable

No vendor is doing anything wrong here. Each one is making a reasonable choice for their own dashboard. The problem is that four reasonable-but-different choices don't add up to a single decision.
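To see why the satisfaction row bites, here is a small sketch with invented 0-10 responses. It uses the standard NPS definition (share of 9-10 scores minus share of 0-6 scores); the two response sets are hypothetical.

```python
from statistics import mean


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


# Two hypothetical teams answering the same 0-10 question.
team_a = [8, 8, 8, 8]      # everyone lukewarm-positive
team_b = [10, 10, 10, 2]   # three fans, one person the tool failed

for name, scores in (("team A", team_a), ("team B", team_b)):
    print(name, "mean:", mean(scores), "NPS:", nps(scores))
```

Both teams average 8 out of 10, yet one scores an NPS of 0 and the other +50. A vendor reporting the mean and a vendor reporting NPS can describe the same team and still not be comparable.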

No vendor is lying. They're each just speaking a different language. Your job is to translate.

A normalisation framework

The fix is structural. You need one methodology, applied identically across every tool. Three rules.

Rule 1: Same question, same scale. Define your team's "hours saved" question once. Phrase it once. Set its scale once (we use 0-15 hours per week, which captures real signal without tail-inflation). Then ask that same question, that same way, of every team using every AI tool. Don't let the vendor's own survey override yours.

Rule 2: Normalise to a common unit. The four dimensions should land in common units across tools:

Adoption: percentage of seated users active in the last 14 days
Hours saved: hours per active user per week
ROI: dollar value of saved hours ÷ tool cost, capped at 2.5× when inputs are sparse
Satisfaction: 0-100 health score (more on why below)

Rule 3: Normalise to a 0-100 health score. Different dimensions have wildly different natural scales. Adoption is a percentage; hours saved is a small number, usually with a decimal; ROI is a multiplier. Putting them on the same dashboard raw is like comparing temperatures in Celsius, Fahrenheit, and Kelvin. We map every metric onto a 0-100 health score so the eye can compare them at a glance. Here's how that maths works.
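If you're rolling your own, one simple way to do this is a clamped linear mapping. This is a sketch under stated assumptions, not The GAiGE's published mapping: the ranges below (including the 5× ROI ceiling) and the equal weights are hypothetical choices you would want to set for yourself.

```python
def health_score(value, floor, ceiling):
    """Linearly map value from [floor, ceiling] onto 0-100, clamped at both ends."""
    if ceiling <= floor:
        raise ValueError("ceiling must exceed floor")
    fraction = (value - floor) / (ceiling - floor)
    return 100.0 * min(1.0, max(0.0, fraction))


# Hypothetical ranges for each dimension.
RANGES = {
    "adoption_pct": (0.0, 100.0),
    "hours_per_week": (0.0, 15.0),  # matches the 0-15 hour question scale
    "roi_multiplier": (0.0, 5.0),   # assumed ceiling
    "satisfaction": (0.0, 100.0),
}


def composite(metrics, weights=None):
    """Weighted average of per-dimension health scores (equal weights by default)."""
    weights = weights or {name: 1.0 for name in metrics}
    total_weight = sum(weights[name] for name in metrics)
    weighted = sum(weights[name] * health_score(value, *RANGES[name])
                   for name, value in metrics.items())
    return weighted / total_weight


example = {"adoption_pct": 80.0, "hours_per_week": 3.0,
           "roi_multiplier": 2.0, "satisfaction": 70.0}
print(round(composite(example), 1))
```

Whatever ranges and weights you pick, the discipline is the same: fix them once and apply them to every tool.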

A worked example

Say you're running ChatGPT Enterprise, GitHub Copilot, and Microsoft Copilot. You ask each team the same set of questions (yours, not the vendor's) once a week. You aggregate at month-end.

The BEFORE state, as each vendor reports it:

Tool	Vendor's headline	Cost per seat
ChatGPT Enterprise	"4 hours saved per user per week" (self-report)	$60/mo
GitHub Copilot	"62% productivity confidence" (1-5 scale)	$19/mo
Microsoft Copilot	"+38 NPS" (sentiment)	$30/mo

No two numbers are comparable. Your CFO can't work with this.

The AFTER state, same teams, your single methodology:

Tool	Adoption	Hrs/user/wk	ROI	Sat	Composite
ChatGPT Enterprise	78%	2.8	2.1×	71/100	68/100
GitHub Copilot	91%	3.4	4.7×	79/100	81/100
Microsoft Copilot	42%	1.6	1.3×	54/100	47/100

Now you have a real comparison. GitHub Copilot is the clear winner on the engineering team's workload (high adoption, strong ROI, healthy satisfaction). Microsoft Copilot is underperforming and needs an intervention or a non-renewal conversation. ChatGPT Enterprise is solid but adoption could be lifted.

That's a CFO conversation you can defend, because the comparison is structurally sound.

Four traps that quietly invalidate AI tool comparisons

Trap 1: Vanity metrics

Daily active users sounds like adoption. It isn't. A 14-day active rate against your seated population is adoption. A vendor reporting "8,400 daily active users" without telling you the denominator is telling you nothing.

Trap 2: Self-reported time bias

Humans round up. When you ask "how much time did this save you?", the average response will skew higher than the truth, especially for tools the respondent likes. We counter this with a 2.5× extrapolation cap and aggregate-only reporting (see the 2.5× rule). If you're rolling your own framework, you'll want a similar guardrail.

Trap 3: Sample selection

A vendor survey only reaches active users of the vendor's tool. The leavers and never-adopters are invisible. As we covered in the previous piece, this isn't a small effect; it can flip the verdict on a tool. Your independent framework should sample your team, not the vendor's user base.

Trap 4: Unit mismatches

"Per active user" and "per seat" are different things. So are "per user per week" and "per team per month". When a vendor reports a number, check the unit. When you build your composite, make sure every input is in the same unit before you start adding.
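A quick sketch of how far the same measurement moves when the unit changes. The function names and inputs are hypothetical; the 4.33 weeks-per-month factor is the one used in the spreadsheet template in this piece.

```python
WEEKS_PER_MONTH = 4.33


def per_seat_weekly(hours_per_active_user_week, active_users, seats):
    """Spread the active users' saved hours across every paid seat."""
    if seats <= 0:
        raise ValueError("seats must be positive")
    return hours_per_active_user_week * active_users / seats


def per_team_monthly(hours_per_active_user_week, active_users):
    """Total hours the whole team gets back in a month."""
    return hours_per_active_user_week * active_users * WEEKS_PER_MONTH


# Hypothetical tool: 3 hours per active user per week, 40 active users, 100 seats.
print("per active user / week:", 3.0)
print("per seat / week:", per_seat_weekly(3.0, 40, 100))
print("per team / month:", round(per_team_monthly(3.0, 40), 1))
```

Three honest numbers, one underlying reality. Pick one unit before anything goes into the composite.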

A five-minute spreadsheet template

If you want to run this framework today, here's the minimum viable spreadsheet:

Column	What goes in it
Tool name	Plain-English name (ChatGPT, Copilot, Claude)
Seats	Number of paid seats
Active users (14d)	From your independent measurement, not the vendor's
Adoption %	Active users ÷ seats
Hours saved / user / week	From your weekly pulse, capped sensibly
Team hourly rate	Your team's loaded rate (use the same one for every tool)
Monthly value	Active users × hours × 4.33 × rate
Monthly cost	Seats × per-seat cost
ROI multiplier	Monthly value ÷ monthly cost (cap at 2.5× when inputs are sparse)
Satisfaction (0-100)	From your pulse, mapped onto a common health score
Composite (0-100)	Weighted average across the four dimensions

Fill this in once a month. The first two months will be noisy. By month three you'll have a defensible read on which tools are paying off and which ones aren't.
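
If a spreadsheet isn't your thing, the arithmetic in the table above fits in a short Python sketch. The class and field names are ours for illustration, and the sample figures are invented. The sparse-input cap and the composite are left out because both depend on thresholds and weights you'd need to choose yourself.

```python
from dataclasses import dataclass

WEEKS_PER_MONTH = 4.33  # same factor as the "Monthly value" row


@dataclass
class ToolRow:
    name: str
    seats: int
    active_users_14d: int
    hours_saved_per_user_week: float
    hourly_rate: float          # use the same loaded rate for every tool
    cost_per_seat_month: float

    @property
    def adoption(self):
        # Active users ÷ seats
        return self.active_users_14d / self.seats

    @property
    def monthly_value(self):
        # Active users × hours × 4.33 × rate
        return (self.active_users_14d * self.hours_saved_per_user_week
                * WEEKS_PER_MONTH * self.hourly_rate)

    @property
    def monthly_cost(self):
        # Seats × per-seat cost (idle seats still cost money)
        return self.seats * self.cost_per_seat_month

    @property
    def roi(self):
        # Monthly value ÷ monthly cost
        return self.monthly_value / self.monthly_cost


# Invented example: 50 seats, 40 active, 1 hour saved per user per week.
row = ToolRow("Example tool", seats=50, active_users_14d=40,
              hours_saved_per_user_week=1.0, hourly_rate=60.0,
              cost_per_seat_month=30.0)
print(f"{row.name}: adoption {row.adoption:.0%}, "
      f"value ${row.monthly_value:,.0f}, cost ${row.monthly_cost:,.0f}, "
      f"ROI {row.roi:.1f}x")
```

Note that value is driven by active users while cost is driven by seats, which is exactly why low adoption drags the multiplier down.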

If you'd rather not build the spreadsheet from scratch, we have a polished version baked into the product. It runs the same maths automatically against your real AI stack, with the pulse surveys handled in the background.

Where to go next

You don't need our product to do this. The framework is yours, and we'd rather you ran it badly than not at all.

If you want to see it applied to a real AI stack with the maths already done, explore a sample report (no signup). Or start a free trial and we'll wire it up against your real tools, no card required, methodology fully published.

Apples to apples isn't a marketing claim. It's a discipline. Run the framework above and your next AI renewal meeting will be the shortest, clearest one you've had.

By Colin Cardwell

Field Guides 20 May 2026 8 min read

What's the best way to measure AI? Five honest options compared

AI spend has exploded. Measurement hasn't kept up. Here's an honest look at the five approaches businesses actually use today — what each is good for, where each falls short, and how to pick the right combination for your team.

Every leadership team we talk to is asking some version of the same question. We've bought seats in five different AI tools across the company. Some people seem to love them. The bills are real. Is any of this actually working?

It's a fair question. It's also surprisingly hard to answer, because AI tools sit in an awkward spot — they don't fit neatly into the measurement frames we already have. They're not a hire, where you measure outputs. They're not a piece of software with a clear process metric, like Salesforce or HubSpot. They're a general-purpose lever, used differently by every person who touches them.

So how do you actually measure it? Five approaches are in common use today. Each has a sweet spot. None is a universal answer. The trick is picking the right one — or, more often, the right combination — for the question your team is genuinely trying to answer.

1. Annual engagement and pulse surveys

Examples: Culture Amp, Lattice, 15Five, Officevibe, the Gallup Q12.

These are the broad employee-experience tools that most mid-sized businesses already run. Quarterly or annually, you ask your team a battery of questions about how they're feeling, and you slot in two or three AI-specific questions next to the engagement and management items.

Sweet spot: tracking sentiment at a high level over time. Useful if your leadership team needs a single "are people happy with the AI rollout?" number for a board report.

Where it falls short: the response rate. Even the best-run annual surveys get 40-60% participation, and the people who respond are skewed — the enthusiasts and the complainers, with the silent majority absent. You're also asking people to remember three months back to a tool they used twice. The answers reflect vibes, not behaviour. And because AI questions live alongside thirty others, they get a few seconds of attention, not real reflection.

Use this if you're already running an engagement programme and you want a coarse trend line. Don't use it to make tool-by-tool budget decisions.

2. Vendor usage dashboards

Examples: the admin panel that ships with ChatGPT Enterprise, Microsoft Copilot's usage report, Claude for Work's analytics, GitHub Copilot's seat dashboard.

Every major AI vendor ships an admin dashboard for the seats you've bought. They tell you who's logged in, who's active, how many messages or completions or suggestions per user per week, sometimes with a feature breakdown.

Sweet spot: answering the activation question. "Did Sarah, who I gave a Copilot seat to, actually use it?" The data is objective, free (it's bundled with your subscription), and granular.

Where it falls short: usage is not effectiveness. Someone can use a tool a hundred times a week and find it net-unhelpful. Someone else can use it twice and save themselves a day. The dashboard tells you the count, not the consequence. Vendor metrics also tend to flatter: "your team submitted 1,200 prompts this quarter!" is a number designed to stop you cancelling the seats, not a number designed to help you decide whether to keep them.

Use this to spot dormant seats and to flag unhealthy patterns (e.g. a tool with strong adoption in Engineering but zero in Marketing — is that right?). Don't use it as your ROI proof.

3. SaaS spend management platforms

Examples: Zylo, Productiv, Vendr, Spendflo. Some also touch the AI-cost-management space (Zylo's AI Discovery, Productiv's AI tooling visibility).

These platforms started life as SaaS spend-rationalisation tools — surface every subscription, find the duplicates, negotiate them down at renewal. Most have a story about AI now: which tools are sprawling across your org, what the cost trajectory looks like, whether seats are sitting idle.

Sweet spot: cost visibility. If you've genuinely lost track of how many AI subscriptions are scattered across your company credit cards, departmental budgets, and shadow procurement, these tools find them and consolidate them.

Where it falls short: they answer the wrong question. Spend management tells you what you're spending; it tells you nothing about whether the spend is earning its place. Even the dormant-seat warnings only get you to "kill the seats nobody uses" — they don't tell you which tools are quietly working but under-adopted, or which ones people use a lot but secretly resent.

Use this if your AI spend is genuinely out of control and you need a defragmentation exercise. Don't expect it to answer "is this working?" — it isn't designed to.

4. Consultant-led audit (one-off engagement)

Examples: a six-week engagement with a Big Four consultancy, an AI-specialist boutique, or your in-house transformation team running a structured audit with interviews, focus groups, and a final report.

Done well, this is the most credible single snapshot you can get. A skilled consultant will interview a cross-section of your team, observe actual usage, build a quantitative case study or two, and hand you back a board-ready report with concrete recommendations.

Sweet spot: depth. You get qualitative texture (the "why" behind the numbers) plus a strategic recommendation written for your specific context. Hard to beat for a board presentation or a budget decision that justifies the engagement cost.

Where it falls short: it's a snapshot, not a sensor. Six weeks after the report is delivered, your team has bought two new AI tools, churned a third, and the usage patterns have shifted. By Q2 the report is a historical document. The other problem: cost. A serious AI audit from a top-tier consultancy lands at $40k–$200k+. You'll do it once and then go without measurement until you can justify another round.

Use this for a one-off strategic reset — the moment you're presenting AI ROI to the board for the first time, or making a multi-year platform decision. Don't expect it to be your ongoing instrument.

5. Continuous in-context micro-surveys

Example: The GAiGE. (We're aware this is our blog. Stick with us — we'll be honest about the trade-offs.)

The approach we built. A small browser extension delivers a thirty-second pulse to a team member right after they've used an AI tool — one or two short questions, in the moment. Pulses fire 3× a week per user. Responses flow to a dashboard that turns them into per-tool ROI, hours saved, satisfaction, adoption, and training-gap signals.

Sweet spot: continuous, defensible per-tool numbers. Because we ask one question in the moment, response rates run at 70-90% (compared to 40-60% for annual surveys), and the responses aren't self-selected — the silent majority shows up too. The methodology is published — a 2.5× extrapolation cap, your own blended hourly rate, minimum-N response thresholds before any number renders, full aggregate-only privacy. It survives board scrutiny because every number is auditable end-to-end.

Where it falls short: two things, honestly. First, it requires your team to install a Chrome extension. We've kept the install warning as tame as possible (browser-page access is requested only when needed, not at install) and most teams roll it out via MDM in minutes — but it's still a step. Second, we measure browser-based AI tools. If your team uses a desktop-app AI tool we don't yet cover, those interactions flow through the extension's inbox rather than as in-page pulses. The signal still reaches you; the immediacy is reduced.

Use this when you want a defensible per-tool ROI number on an ongoing basis — for the CFO who keeps asking, for renewal decisions, for spotting which tools your team secretly resents before they show up in churn. Don't use this if your AI usage is entirely off-browser (rare, but possible).

What the right answer usually looks like

Most mid-sized businesses we work with end up with two or three of the above, layered:

Vendor dashboards for activation and seat-utilisation hygiene (free, already there).
Continuous in-context pulses for the ongoing per-tool ROI signal that drives renewal and rollout decisions.
A periodic consultant audit for strategic resets and big platform decisions, every 18-24 months.

The annual employee survey can include two or three AI questions for the sentiment trend line, but it's not where you'll find the answers that matter. SaaS spend management is worth it only if your subscription sprawl is genuinely out of control.

Where to start

If you're just beginning, the order we'd suggest is the reverse of how most companies actually start. Most start with the annual survey and the vendor dashboards because they're already paid for. The problem is, neither tells you anything you can defend in a board pack.

Start with the question your CFO will ask in six months, work backwards from that, and pick the measurement that answers it. For most teams that's a continuous per-tool ROI signal. If you'd like to see what that looks like in practice, we've published the full methodology behind The GAiGE on our 2.5× rule post, and a deeper case for why surveys beat usage data for this specific question.

We've also opened a free trial (up to 30 days) — no credit card — if you'd rather see the dashboard with your own team's data than read about it. Start a trial here.

By Colin Cardwell

Founder Essays 22 Apr 2026 6 min read

Why we built The GAiGE — and what most companies get wrong about measuring AI ROI

Three years ago, a customer asked us how to measure whether their AI tools were actually working. We had a good answer for the nine other steps in our adoption guide. That tenth one — measurement — we didn't have a good answer for at all.

When we started AiGILE, we built a ten-step guide to AI adoption for mid-sized businesses. Step one was pick the right problems to attack. Step two was pick tools that fit those problems. And so on down to step ten, which was some variation of measure the effectiveness of the tools you're using. It was the obvious final step. No one finishes an adoption plan without it.

It was also the step that kept embarrassing us. Customers would nod politely through the first nine steps and then ask, at the end, "OK — so how do we actually measure this?" — and we didn't have a good answer. Vendor dashboards didn't exist yet in any meaningful form. Our fallback was "send a survey". Our customers dutifully sent surveys. The response rates were dismal, the results were thin, and nobody learned very much.

That wasn't enough to stop the AI hype. Adoption kept accelerating — wildly, in some places — in a way that was mostly unplanned, mostly uncoordinated, mostly invisible to the people paying for it. Especially at mid-sized firms without a dedicated AI officer to care about the question. Tools got bought, rolled out, used, ignored, renewed. Nobody really knew what any of it was doing.

We realised, somewhere in the middle of that, that the problem wasn't going to measure itself.

What most teams settle for — and why it's not enough

By year two, vendors had started publishing basic usage telemetry. Seats, sessions, prompt counts. That was genuine progress over having nothing, and we saw CTOs gratefully latch onto it.

Usage data is the most common measurement mistake we see now. Not because it's wrong — it isn't — but because it's so obviously insufficient that it's surprising how many leadership teams treat it as the whole answer. If you only know how often a tool gets opened, you don't know how your team is using it. You don't know what they're using it for, or how well it's going, or whether they'd recommend it to a colleague. You don't know why three people on the same team have quietly stopped using it. And you don't get anywhere near a defensible ROI number.

Usage tells you someone logged in. It doesn't tell you whether any of it mattered. A CTO who closes the laptop confident because "usage is up 40% quarter on quarter" is, in our experience, exactly the CTO who'll be surprised when the board asks about business impact.

The Chrome extension idea — and why it changed things

We knew asking people was the right direction. The catch was always response rate. An email survey is, quite literally, the definition of friction — it's in the way, it arrives at the wrong moment, it costs you ten minutes, and the reward for finishing is that you've filled in a form. Response rates of 20-40% are normal. We needed something radically different.

About a year ago we started playing with a different idea: a Chrome extension that delivered a short prompt in the moment someone used a tool. Right after the session, not weeks later. Two or three questions, not twenty. A few seconds, not minutes. The response rates would go up because the friction would go down.

There's a personal angle there too — before AiGILE, a lot of my background was in game design and gamification. Friction and reward are the two levers game designers think about every day. A badly-designed survey is all friction, no reward. A well-designed one is quick, specific, and slightly satisfying to complete. That's not cosmetic; it's the difference between a 35% response rate and an 85% one.

We started calling them Pulses instead of surveys. Not because "pulse" is cuter, but because the product needed to feel different — a brief check-in rather than a corporate form. The name set the design brief.

The methodology we didn't know we needed

Here's the admission. Our original plan was to build the extension, aggregate some numbers, and worry about the methodology behind "ROI" later. Good enough, ship it, we'd refine.

The tool we were building The GAiGE with — Claude — pushed back, repeatedly, every time we reached a shortcut. It kept asking awkward questions. "What happens when one user reports 20 hours saved and the rest report 1? Do you really want to extrapolate that? What about response-rate bias? What are you capping at?"

Every one of those was a question we'd planned to deal with after launch. Every one of those, it turned out, would have materially embarrassed us if a real customer had asked it first. So we leaned in, talked it through, argued with it, tested it, and ended up with a methodology we can publish — the 2.5× extrapolation cap, the response-rate-aware reporting, the aggregate-only privacy posture, the normalised health scores on every gauge. That work, more than any other single thing, is what separates The GAiGE from a slick survey tool. It's also the work I didn't plan to do.

I wanted to note that specifically because we're living in an era where "built with AI" is usually a boast about speed. For us it's also a genuine quality story. The product is more rigorous because we had a collaborator who wouldn't let us off the hook on the hard questions.

Why us, honestly

A fair question to ask: with AI-assisted development as cheap and fast as it now is, why does AiGILE specifically get to build this, rather than any of the dozens of teams who could spin up a similar product in a weekend?

Three years of sitting inside mid-sized businesses trying to get them to adopt AI gave us a set of intuitions we didn't know we had. We know which metrics executives actually trust in a board pack and which ones they'll ignore. We know how teams actually use these tools (not how the vendor demo says they'll use them). We know which questions cause defensive behaviour if asked the wrong way, and which ones unlock honest feedback. That's not consultant polish; it's accumulated pattern recognition.

My game-design background matters too. A weekend's worth of vibe-coded MVP can produce a working survey tool. It's unlikely to produce a survey tool with a 75%+ response rate. That gap is where we've spent most of our time.

Where we are, and what we'd like you to do

It's early days for The GAiGE. We've made a lot of decisions we believe in and we've published the methodology so you can push back on any of them. We also know there's plenty we haven't built yet, and plenty of signal we haven't yet figured out how to surface well.

We'd genuinely like you to join us on that journey. The trial is on us, no credit card — long enough to install the extension, roll it out to your team, and see whether the dashboard tells you something you didn't already know. If it does, you'll find the pricing pleasantly unexpected: we've fully internalised how AI is changing the economics of SaaS, and what you'll pay is a flat fee for your whole organisation, not per-user. We think that's where this category is going, and we'd rather lead than chase.

Three years after that first embarrassed moment with a customer asking "how do we measure this?" — we've finally got a good answer. Come try it.

The methodology read: The 2.5× rule. The privacy posture: Aggregate-only by default. If you've got questions or pushback, we'd love to hear them.

By Colin Cardwell

Methodology 21 Apr 2026 5 min read

Aggregate-only by default: why your team's answers aren't attributed

Honest answers need safety. Safety, in this case, is engineered.

Every survey has the same failure mode. You ask a question people have strong feelings about, you promise their answers are confidential, and somewhere between the survey form and the leadership deck the names quietly travel too. The respondents are not stupid. They notice. And the next time you ask, the honest answers are gone.

We've watched this play out in employee engagement data for decades. It's the single biggest reason long-form engagement surveys produce platitudes instead of signal. We did not want to rebuild that mistake for AI tool measurement.

So The GAiGE is aggregate-only by default. Admins, owners, and group leaders look at dashboards and reports that show averages, totals, trends and verbatim comment text — but never who said what. This isn't a policy, it's a technical fact: the API doesn't send names, the CSV export has no name column, the UI has no per-person view. The data isn't hidden; it's not there.

What admins actually see

When an admin loads the Reports page, the network request that fetches responses returns payloads like this:

{
  "id": "resp_01HW...",
  "user": { "id": "usr_01HW..." },
  "aiTool": { "id": "tool_01HW...", "name": "Copilot", ... },
  "answers": [ ... ratings, time saved, comment ... ],
  "createdAt": "2025-04-12T09:14:00Z"
}

Notice what's not there: user.name, user.email. The user.id is a React list-key internal to the frontend — it's never displayed anywhere in the UI. An admin clicking around Reports sees this for each comment:

"Copilot's autocomplete keeps suggesting deprecated Python 3 APIs."
— Copilot · 2/5

Not "— Sarah Chen". The comment stays, the tool stays, the rating stays. The human attribution doesn't exist on that page.

Why we didn't just hide the name in the UI

The tempting shortcut is to keep names in the API payload and just not display them in the UI. That's not a privacy guarantee; that's a suggestion. A motivated admin with DevTools can inspect the raw response and read every name that was ever sent to their browser.

So the filtering happens at the API layer. The Prisma query that populates the reports endpoint explicitly selects only user.id, not name, not email. A curl'd request with an admin's own token gets the same name-stripped payload the browser does. An admin exporting to CSV gets no name column because the data isn't in the export's source.

This is the difference between hiding information and not having information. Only the second one survives the clever intern with a Postman collection.
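
To make the distinction concrete, here is a small Python sketch of whitelisting at the API layer. It is an illustration of the principle only: the real filtering lives in the Prisma query described above, and the function name, field whitelist, and sample row here are hypothetical.

```python
import json

# Hypothetical whitelist: the only user field an admin payload may carry.
ADMIN_USER_FIELDS = ("id",)


def to_admin_payload(row):
    """Project a full response row onto the fields admins are allowed to see.

    Fields are copied by whitelist, so anything added to the row later
    stays out of the payload unless someone deliberately lists it here.
    """
    return {
        "id": row["id"],
        "user": {field: row["user"][field] for field in ADMIN_USER_FIELDS},
        "aiTool": {"id": row["aiTool"]["id"], "name": row["aiTool"]["name"]},
        "answers": row["answers"],
        "createdAt": row["createdAt"],
    }


# Illustrative row as it might sit in the database, name and email included.
db_row = {
    "id": "resp_example",
    "user": {"id": "usr_example", "name": "Sarah Chen",
             "email": "sarah@example.com"},
    "aiTool": {"id": "tool_example", "name": "Copilot"},
    "answers": [{"rating": 2,
                 "comment": "Autocomplete keeps suggesting deprecated APIs."}],
    "createdAt": "2025-04-12T09:14:00Z",
}

payload = to_admin_payload(db_row)
wire = json.dumps(payload)  # what would actually travel to the browser
print(wire)
```

Because the name never enters the serialised payload, there is nothing for DevTools, curl, or a CSV export to find.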

The honest limits of the guarantee

Aggregate-only is not pure anonymity. Three caveats worth naming:

1. The database still knows. Every response has a user_id foreign key — that's how your personal dashboard shows your responses, how the system avoids surveying you twice in a day, and how we produce the mini-scoreboard on pulse submission. The user-to-response link is a technical fact at the row level. What changed is that it never leaves the database: no API surfaces it to an admin, no UI renders it, no export includes it.

2. Free-text comments can identify you. If you write "as the only person on the finance team using Jasper…" in the free-text field, the words themselves give you away, even with no name attached. Our guidance to end users: keep the comment about the tool, not about your specific situation. Admins see the text regardless of attribution; distinctiveness is its own signal.

3. Admins can create single-person groups. Group-level reports are scoped by group membership. If an admin puts one person in a group by themselves, the "aggregate" of that group is effectively that individual's data. We allow this because small teams are legitimate (a one-person marketing function is a real thing), but we make it deliberate and visible: members can see which groups they're in on their own profile. It's not a hidden back-door; it's an audit trail.

Each of these is a known honest limit. We'd rather state them than pretend the guarantee is airtight.

What this bought us

Higher response rates, more honest comments, and an end-user communication that we can actually stand behind when a nervous team member asks about it. The member-facing tour now says "your boss sees patterns, never your name" — and the sentence is factually true of the product as shipped, not a marketing aspiration.

The other thing it bought us is cultural. Asking a team to answer a pulse in three seconds is a small request. Doing so in an environment where anyone can look up their contributions is a much larger one. Aggregate-only lets us keep the ask small.

What we'd change if we were starting over

Two honest regrets. First, we'd have built aggregate-only first and made the per-person view the exception, not the other way round. We spent a few weeks with a product where admins could see names, and the retrofit cost us a careful audit of every endpoint that returned response data.

Second, we'd have written this post earlier. Trust postures are easier to defend when they're documented before the first hard question arrives, not after.

In short

The GAiGE does not let your admins see who gave which answer. Not because of a policy, but because the product is built so the names don't reach the browser. The guarantee has honest limits (which we've named), and the posture is non-negotiable: it's the reason the answers are honest in the first place.

Member-facing explanation in plain English: Your 2 cents drives real change. The methodology behind the numbers we do show: The 2.5× rule.

By Colin Cardwell

Methodology 21 Apr 2026 6 min read

Why we normalise every metric to 0-100: the case for health scores over raw numbers

A 3.5× ROI and a 72% satisfaction score are both numbers. They are not, in any useful sense, the same kind of number.

Spend ten minutes in any AI ROI dashboard and you'll notice the same problem. The page shows you half a dozen metrics — ROI multiplier, hours saved per week, adoption rate, satisfaction, maybe a Net Promoter Score — and you have no idea which ones are good, which ones are in trouble, and which ones you should care about most. Every metric has its own scale, its own distribution, and its own sense of "what's normal".

The GAiGE solves this the same way an electrocardiogram does: we normalise every headline metric onto a common 0-100 scale before drawing it. A score of 80 on ROI means the same thing as 80 on satisfaction, which means the same thing as 80 on adoption. Strong. Worth keeping. Tell the team.

This post is about why we made that call, what the curves actually look like, and where the model breaks down.

The problem with raw numbers

Raw metrics are precise and incomparable. That's a worse combination than it sounds.

A 2.5× ROI sounds impressive until you realise the industry median for a tool like Copilot is closer to 4×. A 65% satisfaction sounds mediocre until you realise it maps to roughly 3.25 out of 5 on a survey where the pragmatic ceiling is about 4.3. A 72% adoption sounds strong until you notice that 72% is what you'd get if nearly three out of every ten seats sit completely idle, which is probably not what "strong" should mean.

The reader now has to hold three separate mental rubrics at once, and toggle between them on every glance. That cognitive tax is small in isolation and ruinous at scale — because it means your leadership team will, in practice, ignore the metrics they don't intuit and over-rotate on the ones they do. Usually that's the dollar figure, which is the one most vulnerable to methodology disagreement.

What a health score gives you instead

A health score is a normalised, opinionated mapping of a raw metric onto a 0-100 scale where:

0-40 means "needs attention" — red.
40-70 means "okay, watch it" — amber.
70-90 means "strong" — green.
90-100 means "excellent, tell the CFO" — bright green.

The bands are deliberately coarse. You do not need a dashboard that can distinguish a 73.2 from a 74.9; you need one that can tell you, at a glance, whether to celebrate, plan a training push, or kill the tool.
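
As a sketch, the banding is a four-way lookup. The one thing the list above leaves open is which band owns a boundary value such as exactly 40 or 70; the Python below assumes the boundary belongs to the higher band, which is our illustrative choice rather than a documented rule. The function name is hypothetical.

```python
def health_label(score):
    """Map a 0-100 health score to its band label.

    Assumption: a score sitting exactly on a boundary (40, 70, 90)
    is placed in the higher band.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"health score out of range: {score}")
    if score < 40:
        return "Needs attention"  # red
    if score < 70:
        return "Okay"             # amber
    if score < 90:
        return "Strong"           # green
    return "Excellent"            # bright green


for s in (12, 40, 73.2, 74.9, 95):
    print(s, health_label(s))
```

Note that 73.2 and 74.9 come out with the same label, which is the coarseness the bands are meant to provide.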

Every gauge in the app shows two numbers: the raw value (2.8×, 142 hours, 4.2/5) and the health score (78/100). The big needle tracks the health score, so the colour-coded dial is always comparable across metrics. The raw number is there for the auditor.

Why piecewise-linear, not linear

The obvious next question: how do we pick which raw values land on 40, 70, and 90? A naive answer is "draw a straight line between min and max". We don't do that, because a straight line is almost always wrong for ROI-style metrics.

Take ROI. On a pure linear scale from 0× to 10×, a 2× ROI would map to 20/100 — which would tell you "this tool is in trouble". Except: a 2× ROI is completely fine. You got double your money back. That's not failing. On the other end, the difference between a 15× ROI and a 20× ROI is, in practice, noise — both mean "this tool is a runaway winner"; neither needs a further 20 points of dashboard celebration.

So we use a piecewise-linear curve with explicit anchor points. For ROI:

1× → 20/100 (break-even, but under-delivering)
2× → 45/100 (fine, but not compelling)
5× → 75/100 (strong)
10× → 95/100 (excellent)
20×+ → 100/100 (celebrate; the gain flattens)

The anchors aren't arbitrary. They reflect how a reasonable leadership team would describe the same number in words. 2× ROI is "fine". 5× ROI is "we should expand this". 10× ROI is "this is one of the best AI bets we've made". The curve encodes that judgement, so the dashboard renders the judgement automatically.
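
For anyone who wants to reproduce the ROI curve, here is a minimal Python sketch of piecewise-linear interpolation between those anchors. The function names are hypothetical, and one anchor is an assumption on our part for the sketch: the list above starts at 1×, so the code adds a 0× → 0 point to cover tools that lose money.

```python
# Anchor points (raw ROI multiplier, health score).
# The (0.0, 0.0) entry is an assumption added for this sketch;
# the remaining five come from the list above.
ROI_ANCHORS = [
    (0.0, 0.0),
    (1.0, 20.0),
    (2.0, 45.0),
    (5.0, 75.0),
    (10.0, 95.0),
    (20.0, 100.0),
]


def piecewise_linear(x, anchors):
    """Interpolate linearly between sorted (x, y) anchors, clamping at the ends."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    if x >= anchors[-1][0]:
        return anchors[-1][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise AssertionError("unreachable when anchors are sorted")


def roi_health(roi_multiplier):
    return piecewise_linear(roi_multiplier, ROI_ANCHORS)


for roi in (1, 2, 3, 7.5, 15, 25):
    print(f"{roi}x -> {roi_health(roi):.1f}/100")
```

The gain flattening shows up directly: going from 2× to 5× is worth 30 points, while going from 10× to 20× is worth 5.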

The four curves we ship

Every GAiGE dashboard has four headline gauges. Each is driven by its own piecewise curve.

ROI. As above. Capped at 20× to prevent a single enthusiastic response from dragging the whole strip off the scale.

Hours saved per user per week. 0h → 0, 1h → 35, 2h → 60, 3h → 80, 5h+ → 100. An hour a week per person is genuinely meaningful. Five hours a week, most of a working day reclaimed, is as good as it gets before we start suspecting the pulse is miscalibrated.

Satisfaction. Linear on a 1-5 scale, rounded to 0-100. Satisfaction is the one metric where the raw number is already intuitive (everyone knows what 4.2/5 means), so we don't fight it.

Adoption. 30% → 30, 60% → 65, 80% → 85, 95%+ → 98. Adoption is the metric most prone to surface flattery: 80% sounds great until you realise you're paying for the other 20% of seats, which nobody touches. The curve is deliberately punishing in the 90-100 band so that only near-total activation counts as "excellent".
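
The hours, satisfaction, and adoption curves can be sketched the same way in Python. Treat this as an illustration under stated assumptions rather than the shipped implementation: the function names are hypothetical, the adoption curve is assumed to start at 0% → 0 (the anchors above begin at 30%), and satisfaction is assumed to map a rating of 1 to 0 and a rating of 5 to 100.

```python
def piecewise_linear(x, anchors):
    """Interpolate linearly between sorted (x, y) anchors, clamping at the ends."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    if x >= anchors[-1][0]:
        return anchors[-1][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    raise AssertionError("unreachable when anchors are sorted")


# Hours saved per user per week, anchors as listed above.
HOURS_ANCHORS = [(0, 0), (1, 35), (2, 60), (3, 80), (5, 100)]

# Adoption in percent. The (0, 0) anchor is an assumption for this sketch.
ADOPTION_ANCHORS = [(0, 0), (30, 30), (60, 65), (80, 85), (95, 98)]


def hours_health(hours_per_user_week):
    return piecewise_linear(hours_per_user_week, HOURS_ANCHORS)


def adoption_health(adoption_percent):
    return piecewise_linear(adoption_percent, ADOPTION_ANCHORS)


def satisfaction_health(rating):
    """Linear rescale of a 1-5 rating, rounded to a whole score.

    Assumption: 1 maps to 0 and 5 maps to 100.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return round((rating - 1) / 4 * 100)


print(hours_health(4))           # 90.0
print(adoption_health(72))       # 77.0
print(adoption_health(100))      # 98, the curve never awards a full 100
print(satisfaction_health(4.2))  # 80
```

Keeping the anchors as plain data makes them easy to audit and easy to reshape for a team that disagrees with the defaults.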

The label is the point

Behind the 0-100 number sits a four-label scale: Needs attention. Okay. Strong. Excellent. Every gauge shows one of those four words directly under the number. The word does most of the work.

This matters because nobody makes a decision off "our ROI is 62.4/100". People make decisions off "our ROI is okay, and our satisfaction is strong, and our adoption needs attention". Three words, one conclusion: we have a reach problem, not a quality problem.

The health score is the math; the label is the translation. We keep both on the page because the math is auditable and the label is decidable. You can push back on the curve. You can't push back on the word — which is exactly what makes it useful in a leadership meeting.

Where this model breaks down

Three places it's worth calling out.

1. The anchors are opinions. We picked them based on what we've seen across the AI-adoption consulting work that seeded this product. A different vertical might disagree — a law firm might reasonably argue that 2× ROI on a premium research tool is the ceiling, and 5× is fantasy. Pro plans let you reshape the curves; the defaults are defaults, not laws of physics.

2. Normalisation hides volatility. A metric bouncing between 72 and 82 reads as "stable strong" on the health-score scale, but the underlying raw number might have swung 40% week over week. We mitigate this by showing sparklines of the raw metric beneath the gauge, so the health score never disguises real turbulence.

3. Composite gauges are tempting and wrong. It is trivially easy to mush four health scores into one "overall AI health" number. We refuse to do it. Collapsing ROI, hours saved, satisfaction, and adoption into a single figure destroys the diagnostic value, because the pattern across the four is precisely what the individual scores were designed to reveal. Four numbers beat one number every time, so long as the four are comparable.

In short

Raw metrics are precise and incomparable. Health scores are coarser and comparable. For a dashboard read in thirty seconds by a leadership team that needs to make a decision, coarser-and-comparable wins every time. We show both numbers — the raw for defensibility, the normalised for decisions — and let the label do the heavy lifting.

Curious how the raw ROI number gets computed in the first place? The 2.5× rule post covers the upstream side of this — how we turn a handful of pulses into a defensible multiplier before it hits the curve.

By Colin Cardwell

Methodology 21 Apr 2026 6 min read

Why asking your team works better than usage data alone for AI ROI

Usage data tells you who logged in. It doesn't tell you whether anything actually helped.

There's a standard argument in AI tool measurement that goes: "don't trust self-reported surveys, they're biased — trust the usage logs, they're objective." It has an appealing ring of empiricism. It is also, as a principle for measuring AI ROI, half-right at best.

The GAiGE is built around asking your team short, contextual questions. We get the "but surveys are subjective" objection often enough that this post is worth writing. Here's our honest answer.

What usage data is good at

Usage data — the output of the vendor's admin panel, or your own network logs, or MDM inventory — is genuinely great for a narrow set of questions:

Did anyone log in? Binary signal of activation.
How often does an account get used? Session frequency per seat.
Which features get touched? Granular event data.
What's the usage curve over time? Adoption velocity, seasonality, decline patterns.

All of that is valuable. None of it answers the question your CFO is actually asking, which is: was any of this worth the money?

What usage data misses

Three big gaps. They're different from each other, but each is enough on its own to make usage data an incomplete picture.

1. Usage ≠ value. A user can "use" a tool every day and get no meaningful output from it. Drafting work in ChatGPT and then scrapping the draft still counts as a session. Running a search in an AI assistant and not trusting the answer still counts. The vendor's usage meter clicks up; your team's productivity doesn't.

2. Value ≠ usage. The inverse, and more common than you'd think. A tool can deliver enormous value through occasional high-leverage use. A senior engineer who fires up Copilot twice a week to solve the gnarly bit of code they'd otherwise spend hours on produces minimal usage data and massive real value. The usage meter reads "low engagement — consider cutting this seat"; reality reads "this is our most profitable licence".

3. Usage data can't tell you about quality, confidence, fit, or fear. Is the output reliable? Does the user trust it? Are they worried about sharing data with it? Is the tool actively replacing a workflow, or layered awkwardly on top of one? These are the questions that predict renewal, expansion, and whether your AI investment will still look healthy in 18 months. Usage logs have nothing to say on any of them.

What surveys are good at

Asking people — properly — is the only way to get the signals usage data can't reach:

Did it actually help?
Did it make the output better, not just faster?
Do you trust what it gave you?
Would you miss it if it was gone?
What's getting in the way of using it more?

These answers are subjective by definition — no log file will ever produce them. That's not a weakness of surveys, it's the job they do.

Where surveys fail (if you don't design them carefully)

The standard critiques of surveys are real and worth taking seriously:

Response bias. Happy users respond more, or unhappy users respond more — either way you get a skewed picture.
Recall error. Asking "how has AI been this month?" is asking someone to construct a narrative from memory, and memory is compressed and biased.
Survey fatigue. A 20-question engagement survey once a year gets ignored. Or worse, fake-filled.
Gaming. If people think the answers will influence what tools their team gets to keep, they'll respond strategically.

Each of these has a fix. The fixes are what make the difference between useful survey data and useless survey data.

The fixes we use in The GAiGE

Ask in the moment, not in retrospect. The Chrome extension delivers pulses within seconds of someone actually using a tool. "Did that just save you time?" is a vastly different question than "on average, over the last month, how much time would you say this tool has saved you?" The first is a recall problem of seconds. The second is a recall problem of weeks.

Keep it short. 30 seconds max, 1-3 questions, no interstitial. Response rates on short, in-context pulses run 70-90% in orgs we've measured. Response rates on traditional annual engagement surveys run 40-60% in a good year. Short + timely beats long + ignored.

Aggregate hard. No individual response is ever shown to admins. Your boss sees "the team rated Copilot 4.2/5", never "Sarah rated it 2/5". Gaming goes away when the person answering knows their specific answer can't be used to affect their specific situation.

Correct for response bias in the math. Our 2.5× cap on extrapolation and the response-rate flag on every report mean we can't silently project an enthusiast's answers across the whole team. If the response rate is 40%, we tell you, and you should discount the aggregate accordingly.

The combined signal

Neither source is enough on its own. Both together give you something much better than either alone. Three examples:

High usage, low satisfaction. Classic "sticky but not loved" pattern. The tool has wedged itself into a workflow people can't escape, but no one's enthusiastic about it. Renewal risk when something better shows up; training opportunity now.

Low usage, high satisfaction. The high-leverage expert user pattern. The tool isn't touched every day but when it is, it solves expensive problems. Don't cut the seat based on activity alone — you'd lose the value.

High usage, high satisfaction, high time-saved. Your winner. Buy more licences.
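
The three patterns above can be sketched as a small classifier. This is an illustration only: the thresholds, field names, and the `classify` function are hypothetical, since the post does not define what counts as "high", and the time-saved dimension of the third pattern is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical cut-offs; the post does not say what counts as "high".
HIGH_USAGE_SESSIONS_PER_WEEK = 3.0
HIGH_SATISFACTION_OUT_OF_5 = 4.0


@dataclass
class ToolSignal:
    sessions_per_seat_per_week: float  # from usage logs
    satisfaction_out_of_5: float       # from aggregated survey responses


def classify(signal: ToolSignal) -> str:
    """Map a usage/satisfaction pair onto the patterns described above."""
    high_usage = signal.sessions_per_seat_per_week >= HIGH_USAGE_SESSIONS_PER_WEEK
    high_satisfaction = signal.satisfaction_out_of_5 >= HIGH_SATISFACTION_OUT_OF_5
    if high_usage and high_satisfaction:
        return "winner"
    if high_usage:
        return "sticky but not loved"
    if high_satisfaction:
        return "high-leverage expert use"
    # This quadrant is not one of the three examples in the post.
    return "low usage, low satisfaction"


print(classify(ToolSignal(5.0, 2.8)))  # sticky but not loved
print(classify(ToolSignal(0.5, 4.6)))  # high-leverage expert use
```

Neither input alone produces these labels: the first tool looks healthy on usage data, and the second looks like a candidate for cutting.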

In short

The "usage data is objective, surveys are subjective" framing is wrong because it assumes the two are competing. They're not. Usage data tells you what happened. Surveys tell you whether it mattered. Every meaningful AI ROI conversation needs both.

Questions about our methodology? The 2.5× rule post is the longer read on how we turn pulses into defensible numbers.

By Colin Cardwell

Field Guides 21 Apr 2026 5 min read

Seven questions every AI tool vendor should answer before you renew

Your AI tool vendor has a smoother renewal conversation than you do. That's because they've had the conversation more times.

Every AI tool vendor's renewal playbook looks similar: show up a month out, walk through a pre-baked "success story" deck, ask if there's anything to iron out, and put the contract in front of you before the quarter closes. The deck is glossy. The numbers are flattering. The default action is to renew.

It's your job to make that default harder. Below are seven questions we've found separate the tools your team genuinely uses from the ones you're paying for out of inertia. Ask them at renewal — in writing if you can — and see which vendors answer crisply and which ones pivot to a demo.

1. What's our real adoption rate?

Not "licensed seats". Not "signups". The percentage of our licensed seats that have an active, sustained usage pattern — ideally measured weekly over the last quarter.

Any vendor's admin portal can show you this, but the numbers they'll put on their deck will be the softest definition they can get away with (logged in once, or activated account). Pin them down on "at least one meaningful session per week for the last 8 weeks" and watch what happens to the number.
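
To see how much the definition moves the number, here is a minimal sketch comparing a soft definition ("ever logged in") with the stricter one above. The data shape, function names, and sample figures are assumptions for illustration, and what counts as a "meaningful" session is left to whoever produces the counts.

```python
def ever_active_rate(weekly_sessions: dict[str, list[int]], licensed_seats: int) -> float:
    """Soft definition: share of licensed seats with any session at all."""
    active = sum(1 for counts in weekly_sessions.values() if any(c > 0 for c in counts))
    return active / licensed_seats


def sustained_adoption_rate(
    weekly_sessions: dict[str, list[int]], licensed_seats: int, weeks: int = 8
) -> float:
    """Strict definition: at least one meaningful session in each of the last `weeks` weeks."""
    sustained = sum(
        1
        for counts in weekly_sessions.values()
        if len(counts) >= weeks and all(c >= 1 for c in counts[-weeks:])
    )
    return sustained / licensed_seats


# Hypothetical data: 4 licensed seats, 3 of which ever logged in.
# Each list holds meaningful-session counts per week, oldest first.
sessions = {
    "seat-1": [2, 1, 3, 1, 1, 2, 1, 4],
    "seat-2": [3, 0, 0, 1, 0, 0, 0, 0],
    "seat-3": [0, 0, 0, 0, 0, 0, 0, 1],
}

print(ever_active_rate(sessions, 4))         # 0.75
print(sustained_adoption_rate(sessions, 4))  # 0.25
```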

2. What's our hours-saved per active user per week?

Normalised. Total hours saved is impressive but meaningless: 500 hours saved in a 10-person team is stellar, 500 hours in a 500-person team is a rounding error. The comparable number is hours saved per active user per week.

As a rough rule of thumb from everything we've seen at AiGILE: under 1 hour/user/week is disappointing, 2-3 is healthy, 5+ is genuinely transformative and you should be buying more licences.
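
A rough sketch of the normalisation and the rule of thumb. The function names are hypothetical, the 10-week period in the example is an assumption (the post does not say over what period the 500 hours accrued), and the post leaves the 1-2 and 3-5 ranges unlabelled, so the code does too.

```python
def hours_saved_per_active_user_per_week(
    total_hours_saved: float, active_users: int, weeks: int
) -> float:
    """Normalise a total into the comparable per-user, per-week figure."""
    if active_users <= 0 or weeks <= 0:
        raise ValueError("active_users and weeks must be positive")
    return total_hours_saved / (active_users * weeks)


def rule_of_thumb_label(hours_per_user_per_week: float) -> str:
    """Apply the rough bands from the post; gaps between bands stay unlabelled."""
    if hours_per_user_per_week < 1:
        return "disappointing"
    if 2 <= hours_per_user_per_week <= 3:
        return "healthy"
    if hours_per_user_per_week >= 5:
        return "genuinely transformative"
    return "between bands"


# Same 500 hours, assumed to accrue over 10 weeks, in two team sizes.
small_team = hours_saved_per_active_user_per_week(500, active_users=10, weeks=10)
large_team = hours_saved_per_active_user_per_week(500, active_users=500, weeks=10)

print(small_team, rule_of_thumb_label(small_team))  # 5.0 genuinely transformative
print(large_team, rule_of_thumb_label(large_team))  # 0.1 disappointing
```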

3. Which features are being used, which are ignored?

Good vendors can tell you. Great vendors can show you a breakdown by seat. This matters for two reasons:

If 80% of your usage is one feature that other tools also do, you're paying for a premium suite to get commodity value. There may be a cheaper tool.
If a feature the vendor charges extra for is sitting unused, you have negotiating leverage at renewal. "You're charging us for feature X and the data shows nobody's touching it — either prove its value or drop the price."
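
If the vendor hands over raw event data, both checks are a few lines of work. A minimal sketch, assuming a flat list of feature names (one entry per usage event) and a list of separately charged features; the feature names and function names are made up for illustration.

```python
from collections import Counter


def feature_shares(events: list[str]) -> dict[str, float]:
    """Share of all usage events attributable to each feature, largest first."""
    counts = Counter(events)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.most_common()}


def unused_features(events: list[str], paid_features: list[str]) -> list[str]:
    """Features you are charged for that never appear in the event data."""
    return sorted(set(paid_features) - set(events))


# Hypothetical event log: 10 events across two features.
events = ["chat"] * 8 + ["summarise"] * 2

print(feature_shares(events))                            # {'chat': 0.8, 'summarise': 0.2}
print(unused_features(events, ["summarise", "agents"]))  # ['agents']
```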

4. How does our usage compare to similar organisations on your platform?

Benchmarks are the single piece of data the vendor has that you don't. They see across hundreds of customers. Make them tell you where you sit. Are your power-users above or below average? Is your adoption curve trending like healthy customers' or like churn-risk customers'?

Good vendors have this data ready. If they push back with "we don't share aggregate data", it's either genuinely unavailable (unlikely) or they don't want you to know you're underperforming (more likely).

5. What onboarding and enablement are you committing to this renewal?

Renewals aren't passive. The tools that thrive in your org thrive because someone — often the vendor's CS team, ideally — keeps reintroducing them to new hires, new teams, new use cases. Most vendors give you white-glove onboarding in year one and leave you to fend for yourself afterwards.

Ask what they're committing to in year two onwards. Quarterly training? Monthly lunch-and-learns? A dedicated CS contact? Feature-release walkthroughs? If they can't name a specific deliverable, you'll be paying the same price for significantly less support.

6. What's your roadmap for the issues our team has flagged?

This assumes you know what your team has flagged — which is where a measurement tool like The GAiGE or a well-run quarterly retro earns its keep. If you can walk into the renewal with a list of specific, dated complaints from your actual users, the vendor has to engage with them.

Watch carefully how they respond. "That's on the roadmap for Q3" is fine. "That's interesting feedback, we'll look into it" is a brush-off. "No one else has raised that" is a lie — or it's a real differentiator in how your team uses the tool, in which case it's absolutely on the vendor to solve.

7. If we cancelled today, what would we lose?

The single sharpest question. Makes the vendor articulate the specific value that would disappear from your team's week if the tool vanished overnight.

A good vendor will have an answer involving specific workflows and specific outcomes: "Your engineering team is saving ~6 hours/week on code review. Your marketing team generated 40 campaign variants last month they wouldn't have otherwise."

A struggling vendor will give you a feature list. Features are inputs. Outcomes are what you're paying for.

A note on attitude

None of this is adversarial. Good vendors want these questions — they're selling into senior buyers who already ask them, and if you don't, they have to waste time guessing what actually matters to you. Asking them clearly and early improves the renewal conversation for everyone.

What it does do is pull the conversation out of the vendor's pre-baked deck and into your reality. That's the whole game.

If you're evaluating how your team's answers compare to vendor claims, The GAiGE is built for exactly this job. Happy to show you what it looks like against your current AI stack.

By Colin Cardwell

Methodology 20 Apr 2026 6 min read

The 2.5× rule: why most AI ROI numbers are quietly lying to you

If you can't defend the number, don't report the number.

Every AI vendor has an ROI calculator. Most of them produce results that would embarrass a mid-career finance person. "Save 14 hours per user per week!" Not if the user only works 37 hours a week and spends a third of them in meetings, they won't. The gap between those numbers and reality is where trust in AI measurement goes to die.

The GAiGE exists because we'd rather be honestly useful than impressively wrong. This post is a walk-through of how we calculate the numbers you'll see in the product — and the decisions we've made to keep them defensible. If you're evaluating whether to trust our reports, this is the page to read. If you're a skeptic, we're hoping to earn about eighty percent of your trust by the end of it, and leave you with useful questions for the last twenty.

Where most AI ROI numbers go wrong

Four common failure modes, in roughly decreasing order of frequency:

Unbounded extrapolation. A small number of users report saving an outlier amount of time. The calculator multiplies that across the whole company. Twelve months of "savings" are invented in a spreadsheet.
Survivorship bias in the respondents. Only enthusiasts reply. Their answers get treated as representative. Everyone who quietly ignores the tool is invisible to the number.
Conflict of interest. The party calculating the ROI also wants the renewal to go through. Guess which way the ambiguous decisions break.
Wrong unit of measurement. "Hours saved" with no hourly rate, or with an hourly rate pulled from thin air. Impressive-looking numbers that nobody can turn back into dollars.

These aren't strawmen. We've seen all four in vendor decks, and we've caught ourselves drifting toward a couple of them during product design. Naming them helps.

The 2.5× rule

Here's the cap that gives this post its name.

When a user responds to a pulse and reports saving, say, "3 hours this week" on a specific tool, we have a number to extrapolate from. If they answer four pulses in a month, we have four numbers. The temptation is to take the average, multiply by 52, and call it an annual savings figure. Nobody working with real data thinks this is a good idea.

What we actually do:

We set a 2.5× ceiling on the ratio between reported hours saved and expected baseline hours of work. In plain English: if someone claims to have saved 20 hours a week on a task that realistically only took 8 hours before AI, the most we will count is 8 × 2.5 = 20 hours. A claim of 30 hours against the same baseline would be cut to 20, and we don't extrapolate beyond the capped figure. The cap's the cap.
We never report extrapolated numbers without a clearly-labelled reported figure next to them. The reader always sees both. If the two diverge a lot, the reader knows to be careful.
We factor in the org's response rate. A 4.5/5 average from 80% of your team is a different fact from a 4.5/5 from 15%. We surface both.
We require minimum aggregation thresholds. Fewer than five respondents in a segment and we don't report on it. Too few data points is how you get noise masquerading as insight.
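
The four rules above can be sketched in a few lines. This is an illustration under stated assumptions, not The GAiGE's actual implementation: the function names, the dictionary layout, and the use of a simple mean are all hypothetical. Only the 2.5× ratio and the five-respondent threshold come from the post.

```python
from typing import Optional

CAP_RATIO = 2.5       # ceiling on reported hours saved relative to baseline hours
MIN_RESPONDENTS = 5   # segments with fewer respondents are not reported


def cap_reported_hours(reported_hours: float, baseline_hours: float) -> float:
    """Clip a self-reported saving to CAP_RATIO times the baseline."""
    return min(reported_hours, CAP_RATIO * baseline_hours)


def segment_summary(
    reported_hours: list[float], baseline_hours: float, seats: int
) -> Optional[dict]:
    """Summarise one segment, or return None if it is too small to report."""
    respondents = len(reported_hours)
    if respondents < MIN_RESPONDENTS:
        return None
    capped = [cap_reported_hours(h, baseline_hours) for h in reported_hours]
    return {
        "respondents": respondents,
        "response_rate": respondents / seats,           # always surfaced
        "mean_reported": sum(reported_hours) / respondents,  # raw figure, shown alongside
        "mean_capped": sum(capped) / respondents,       # figure after the 2.5x cap
    }


# Hypothetical segment: 10 seats, 5 respondents, one outlier claim of 30 hours.
summary = segment_summary([2, 3, 4, 30, 1], baseline_hours=8, seats=10)
print(summary)
```

Returning both the reported and the capped mean keeps the divergence between them visible, which is the signal a reader needs to know when to be careful.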

Is 2.5× the right cap? It's defensible. It's approximately what the literature on self-reported time savings in consulting engagements converges on for "outlier but plausible" results. It's also a number we can justify in a room — which is the point. If you want to argue for 2× or 3×, we'll happily have that conversation and adjust the model. What we won't do is claim a uniformly applied 10× because it makes the deck look better.

The blended hourly rate — yours, not ours

The other number that turns hours into dollars is the hourly rate. We could make one up. Most ROI calculators effectively do — they multiply by "the average knowledge worker salary" and hope you don't notice.

Our approach is boring: you set it. Every GAiGE organisation configures a blended hourly rate at setup, fully-loaded (salary plus on-costs, divided by productive hours per year). It's your number. If your auditor has a different view, you can argue that with them. We just do the multiplication.
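
For anyone working the rate out before setup, a minimal sketch of the fully-loaded calculation described above. The role groups, figures, and the `blended_hourly_rate` function are hypothetical, and it assumes the same productive hours per person across groups.

```python
def blended_hourly_rate(
    role_groups: list[tuple[int, float, float]],
    productive_hours_per_person_per_year: float,
) -> float:
    """Fully-loaded blended rate: (salary + on-costs) / productive hours.

    Each role group is (headcount, annual salary per person, annual on-costs per person).
    """
    total_cost = sum(
        headcount * (salary + on_costs) for headcount, salary, on_costs in role_groups
    )
    total_hours = sum(headcount for headcount, _, _ in role_groups) * (
        productive_hours_per_person_per_year
    )
    if total_hours <= 0:
        raise ValueError("need at least one person and positive productive hours")
    return total_cost / total_hours


# Hypothetical single-role example: 150,000 salary + 37,500 on-costs, 1,500 productive hours.
print(blended_hourly_rate([(1, 150_000, 37_500)], 1_500))  # 125.0

# Hypothetical two-role blend, weighted by headcount.
mixed = blended_hourly_rate([(2, 100_000, 20_000), (1, 160_000, 40_000)], 1_600)
print(round(mixed, 2))  # 91.67
```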

Same principle applies to the tool cost side of the equation. You enter what you actually pay per seat per month, including any negotiated discounts and committed spend. We don't pull list prices from the vendor's website. Your numbers, your truth.

Aggregation, because honest answers require anonymity

One more principle — and it's a product decision as much as a methodology one. Pulse responses are always aggregated before any human inside your organisation sees them. Your CTO sees "the team rated Copilot 4.1/5". They never see "Sarah rated Copilot 2/5 on Tuesday, and wrote 'honestly I prefer Claude'."

Why does this belong in a methodology post? Because the quality of the data depends on it. If your team suspects their answers are attributable — even just a little — the honest ones stop responding, and you're left with the corporate-approved vibes. We've watched this happen to engagement surveys for twenty years. We don't intend to repeat the mistake with AI.

A worked example

A 120-person firm. They pay for 80 ChatGPT Enterprise seats at $60/month each. Blended hourly rate entered as $110.

Over 8 weeks, 74 of those 80 users answer at least one pulse (response rate: 92.5% — healthy). The average self-reported hours saved is 2.7 per user per week. That number passes the 2.5× check against the baseline of "around 7 hours of content-writing or summarisation work per week."

Do the math:

Annualised hours saved: 74 users × 2.7 hrs × 48 weeks ≈ 9,590 hrs
Dollar value at $110: ≈ $1.05M
Annual tool cost: 80 × $60 × 12 = $57,600
Ratio: roughly 18× return
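
The same arithmetic as a short script, so each input can be changed and the ratio re-run. The variable and function names are illustrative; the figures are the ones from the example. Note that the hours are multiplied across the 74 respondents, not all 80 seats.

```python
def annualised_roi(
    respondents: int,
    seats: int,
    hours_saved_per_user_per_week: float,
    working_weeks: int,
    blended_hourly_rate: float,
    seat_cost_per_month: float,
) -> dict:
    """Reproduce the worked example step by step."""
    annual_hours = respondents * hours_saved_per_user_per_week * working_weeks
    dollar_value = annual_hours * blended_hourly_rate
    annual_tool_cost = seats * seat_cost_per_month * 12
    return {
        "response_rate": respondents / seats,
        "annual_hours": annual_hours,
        "dollar_value": dollar_value,
        "annual_tool_cost": annual_tool_cost,
        "ratio": dollar_value / annual_tool_cost,
    }


result = annualised_roi(
    respondents=74,
    seats=80,
    hours_saved_per_user_per_week=2.7,
    working_weeks=48,
    blended_hourly_rate=110,
    seat_cost_per_month=60,
)

print(f"Response rate:    {result['response_rate']:.1%}")       # 92.5%
print(f"Annualised hours: {result['annual_hours']:,.0f}")       # 9,590
print(f"Dollar value:     ${result['dollar_value']:,.0f}")      # $1,054,944
print(f"Annual tool cost: ${result['annual_tool_cost']:,.0f}")  # $57,600
print(f"Ratio:            {result['ratio']:.1f}x")              # 18.3x
```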

That ratio is indicative — headline-safe for a board paper, with the methodology attached. It's also not a number we asked you to take on faith; every component is shown, and you can stress-test any of them.

Where you should push back

Three places where this methodology has genuine limits, and we'd rather be open about them than quiet:

Self-reported time is self-reported. People overestimate savings on tasks they enjoy and underestimate on tasks they dread. The 2.5× cap bounds the overestimate; nothing fully fixes the rest. Pair our numbers with your intuition.
Non-responders are a real problem. A 90% response rate is gold. A 50% response rate is concerning — the other 50% may be the people struggling the most. We show response rate prominently so you can't miss this.
ROI is a proxy, not the goal. Sometimes a tool costs more than it saves, and you keep it because it opens up something you couldn't do before. The GAiGE tells you the numbers; you decide what the numbers mean.

In short

Cap the extrapolation. Use your numbers, not our numbers. Always show response rate. Always aggregate. Be honest about the limits. Present the reader a number they can defend, not a number that will embarrass them in six months.

If that sounds boring — it is, slightly. The glamorous AI ROI claims are the ones that don't survive contact with the CFO. The durable ones tend to look a bit like this.

Questions about the methodology? Our team would rather hear them than not — drop us a line.

By Colin Cardwell
