Why AI vendor impact surveys can't replace independent measurement

OpenAI is shipping Impact Surveys in ChatGPT. Anthropic, Microsoft, Google and GitHub will follow within the year. Here's why that's good news for the category, and why vendor-led measurement still can't answer your CFO's question.

The AI vendor measurement era just started. OpenAI is shipping Impact Surveys inside ChatGPT. Anthropic, Microsoft, Google and GitHub will follow within twelve months. Every AI vendor with a hand in the productivity story is racing to add outcome measurement to their own dashboard.

That's good news. For a year, the measurement conversation has been stuck on activity data. "We have 412 daily active users." "You've sent 18,000 prompts." Activity is not value. Vendors finally moving from "are people using us?" to "are people getting value from us?" is a healthy shift for the category.

It's also where the problem starts.

When the company selling you the AI tool is also the company grading whether it works, you have a problem the financial-services industry recognised decades ago. Self-marked tests are not tests. The grade is biased, the methodology is opaque, and the comparison to the next tool is non-existent. The fix in finance was independent audit. The fix for AI Impact Measurement is the same.

This piece is about why that matters, what "independent" actually means in practice, and how to evaluate the vendor-side surveys you're about to be drowning in.

The vendor survey arms race

OpenAI's Impact Surveys appear inside ChatGPT after a user completes certain workflows. They ask questions such as "how much time did this save you today?" and "how confident are you in the answer you got?" Aggregated, they produce a per-organisation impact report.

The launch is genuinely useful. It pushes the conversation past "but how many seats do we have?" toward "but what is the team actually getting?" Microsoft Copilot has had something similar in beta for months. Anthropic's Claude for Work team has signalled the same direction. GitHub Copilot's enterprise dashboard now includes self-reported productivity questions.

Every vendor with a productivity story is moving from activity measurement to outcome measurement. The race is on.

Within twelve months, every AI tool in your stack will produce its own "impact dashboard". Each one will:

Ask different questions
Use different scales
Sample at different cadences
Define ROI differently
Aggregate at different unit levels (per-user, per-team, per-prompt)
Publish (or withhold) its methodology

If you use five AI tools, you'll soon have five impact dashboards. None of them comparable. None of them independent. None of them complete.

Why vendor-led measurement is biased

We are not anti-vendor. Vendor-side measurement has real value when it's understood for what it is. The issue is treating it as a neutral score when it isn't.

Three sources of bias matter more than the rest.

Methodology bias

The vendor designs the questions. The vendor decides the scale. The vendor decides what counts as a "positive outcome". A vendor whose hero metric is "time saved" will ask about time saved. A vendor whose hero metric is "user confidence" will ask about confidence. None of those are wrong individually. But they are not the same question, and the answers will not roll up to a comparable number.

A concrete example. If a vendor asks "how much time did this save you?" on a 0-60 minute scale, the answers are capped at 60 and anchored to that range, so the average will not match what the same question would produce on a 0-180 minute scale. Buyers don't see the scale. Buyers see the average and assume comparability.

Sample bias

A vendor only sees its own users. If 30% of your team has churned away from a tool, that 30% does not get to answer the impact survey. The signal you see is from people who stayed. Survivors over-index on satisfaction. The team members the tool failed to serve are silent.

This is not a small effect. For an AI tool with a typical 60-day stickiness curve, impact-survey respondents are roughly the top quartile of users by engagement. A 30% "time saved" number from that top quartile becomes a 12% number across the whole seated population if the remaining three quarters average about 6%. The vendor isn't lying. The vendor is just sampling the wrong people for the question you're actually asking.
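
The arithmetic behind that drop can be sketched as a population-weighted average. This is a minimal illustration, not a measurement method. The 25% respondent share and the 30% figure come from the paragraph above. The 6% average for non-respondents is an assumption, chosen because it is the value that makes a 30% top-quartile figure blend to 12%. The function name blended_time_saved is hypothetical.

```python
def blended_time_saved(groups):
    """Population-weighted average of 'time saved' across user groups.

    groups: list of (share_of_seated_users, time_saved_pct) pairs.
    The shares must sum to 1.0.
    """
    total_share = sum(share for share, _ in groups)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1.0")
    return sum(share * saved for share, saved in groups)


# Top quartile by engagement: the people who answer the vendor survey.
respondents = (0.25, 30.0)
# Everyone else with a seat. The 6% figure is an assumption, not data.
non_respondents = (0.75, 6.0)

whole_population = blended_time_saved([respondents, non_respondents])
print(f"Survey respondents only: {respondents[1]:.0f}%")
print(f"Whole seated population: {whole_population:.0f}%")
```

Change the non-respondent figure and the blended number moves with it. The survey itself cannot tell you which value is right, because those users were never asked.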

Narrative bias

Vendors will not publish negative impact numbers. They cannot. Their commercial reality is that low scores threaten the renewal. The published impact figures will be the company-blessed read of the data. The disconfirming evidence will quietly fail to make the slide.

This is not unique to AI vendors. It is true of every market category where vendors measure their own product. But in AI it matters more because the prices are higher, the contracts are newer, and the buying committees have less experience triangulating.

The comparability problem

The deeper issue is structural. Even setting bias aside, vendor surveys cannot be compared across tools.

A common buyer scenario. You're the Head of AI Strategy at a 300-person company. You have ChatGPT Enterprise for the marketing team, GitHub Copilot for engineering, Microsoft Copilot for finance and ops, and Anthropic Claude for legal. Four AI tools, four vendors, four impact surveys.

Your CFO walks in and asks the obvious question. "Which of these is paying off?"

You cannot answer that question from the four vendor surveys. Here's why:

Tool	Sample scope	Question scale	ROI definition
ChatGPT	Active users, post-workflow	Time saved (0-60 min)	(Time × rate) / seat cost
GitHub Copilot	Per-developer, weekly	Confidence (1-5)	"Acceptance rate"
Microsoft Copilot	Per-user, monthly NPS	Satisfaction (-100 to +100)	Self-reported task minutes
Claude (Work)	Per-team, quarterly	Multi-factor (not public)	Not published

Four vendors. Four methodologies. The numbers do not roll up.
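
To see why the numbers do not roll up, take the ROI column alone. The sketch below applies the first row's definition as written in the table, (time × rate) / seat cost, and sets it beside an acceptance rate. Every input (hours saved, hourly rate, seat cost, acceptance rate) is invented for illustration, and the function name roi_time_based is hypothetical.

```python
def roi_time_based(hours_saved, hourly_rate, seat_cost):
    """ROI as defined in the first table row: (time x rate) / seat cost."""
    if seat_cost <= 0:
        raise ValueError("seat_cost must be positive")
    return (hours_saved * hourly_rate) / seat_cost


# Hypothetical monthly figures for one seat of a time-based tool.
tool_a_roi = roi_time_based(hours_saved=4.0, hourly_rate=50.0, seat_cost=25.0)

# Hypothetical share of suggestions accepted for a second tool.
tool_b_acceptance = 0.31

print(f"Tool A: {tool_a_roi:.1f}x return per seat")
print(f"Tool B: {tool_b_acceptance:.0%} of suggestions accepted")
```

The first result is a multiple of money spent. The second is a share of suggestions. Nothing converts one into the other without further assumptions about what an accepted suggestion is worth.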

Worse, the buying decision is rarely "which one of these is the worst?" It is "given a finite budget for next year, where do we double down, where do we hold, where do we cut?" That decision needs comparable measurement across the whole AI stack. Vendor surveys, by design, cannot provide it.

What "independent" actually means

Independence isn't a slogan. It has four operational meanings, and any AI Impact Measurement vendor (including us) should be willing to answer all four directly.

1. Who designs the questions

Independent measurement uses the same question, on the same scale, for every tool. The question is designed by an organisation whose commercial interest is in the measurement being trustworthy, not in any specific tool looking good. The methodology is published. The scale is fixed. The wording is open to review.

2. Who owns the data

If the AI vendor is the one collecting the answers, the AI vendor is the one deciding what to publish, what to retain, what to share with their sales team, and how the aggregate looks on a slide deck. Independent measurement holds the data outside the vendor's environment. The customer owns the raw responses. The vendor of the AI tool sees the same aggregate the customer does, with no privileged read.

3. Who publishes the methodology

You should be able to read, in plain English, exactly how every score in the report is calculated. What constitutes "active". What counts as "time saved". How wasted spend is computed. How satisfaction is normalised across tools. If you cannot read the methodology, the score is opinion, not measurement.

4. Who is incentivised by which outcome

This is the load-bearing one. A vendor selling you Tool X benefits if Tool X looks good. A vendor selling you measurement of every tool benefits if the measurement is trustworthy enough that you keep using it. Those are different incentives, and they pull in different directions.

A CFO framework: four questions to ask any AI impact data

Independence is a feature, not a marketing claim. Here is a fast diagnostic any CFO can run on any AI impact data crossing their desk.

1. Who chose the questions?
If the same company that sells the tool also designed the survey, treat the result as one input, not the answer. There may still be useful signal. You just cannot rely on it as a like-for-like comparison.

2. Who chose the scale?
A 5-point scale and an NPS that runs from -100 to +100 produce different numbers from the same underlying sentiment. If the vendor cannot tell you why they chose their scale (or worse, will not tell you), the number is not portable.
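
One way to see the effect: the standard Net Promoter Score takes answers on a 0-10 scale and reports the percentage of promoters (9-10) minus the percentage of detractors (0-6), while a 5-point survey typically reports a mean. The sketch below uses two invented response sets and an assumed linear mapping from 0-10 onto 1-5. Neither comes from any vendor's survey, and the function names are hypothetical.

```python
def nps(scores):
    """Net Promoter Score from 0-10 answers: % promoters minus % detractors."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def mean_on_five_point(scores):
    """Mean of 0-10 answers after an assumed linear rescale onto 1-5."""
    return sum(1 + 4 * s / 10 for s in scores) / len(scores)


# Two invented teams with the same average answer.
team_a = [8, 8, 8, 8]
team_b = [10, 10, 10, 2]

for name, scores in (("Team A", team_a), ("Team B", team_b)):
    print(f"{name}: 5-point mean {mean_on_five_point(scores):.1f}, "
          f"NPS {nps(scores):+.0f}")
```

Both teams average 4.2 out of 5, yet Team A scores an NPS of 0 and Team B scores 50. The answers are equally positive on average. The difference comes entirely from the choice of scale and aggregation.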

3. Who is in the sample?
A vendor's "impact survey" almost always polls active users of the vendor's tool. The leavers, the people who never adopted, the silently disengaged: those are missing from the denominator. The question to ask is "what percentage of seated users are in your sample, and what do you know about the rest?"

4. Who got to see the result before you did?
If the vendor saw the result first and decided what to surface, you are reading a curated view. If you saw the raw responses first and the vendor sees the aggregate at the same time you do, you are reading measurement.

Run those four questions over the next vendor impact dashboard that lands on your desk. You will find that one or two of the answers reliably embarrass the source.

Vendor surveys are part of the answer, not the whole answer

Read this piece carefully and you'll notice we did not say vendor impact surveys are bad. They are not. OpenAI shipping Impact Surveys is a positive development. Microsoft, Anthropic and Google adding their own will be too.

The point is that vendor surveys are an input to AI Impact Measurement, not a replacement for it. (See also our piece on why subjective and objective signals beat either one alone.)

The job of an independent, cross-vendor measurement layer is to:

Use one methodology across every tool in the stack
Sample the whole seated population, not just the most engaged
Publish the question wording and the scale openly
Hold the raw data outside any single vendor's environment
Present the result apples-to-apples so the CFO's renewal decision is data-driven

That's what The GAiGE is built to do. We're an independent measurement company. We don't sell ChatGPT, Copilot, Claude or Gemini. We measure all of them the same way, so the question "is this paying off?" has a real answer.

Where to go next

If this article describes a problem you already had (the spreadsheet of impact numbers from five vendors that don't add up to one decision), there is a faster way to see the alternative.

Explore an AI Impact Report (no signup) to see what independent, cross-vendor measurement looks like across a sample organisation's stack. Or start a free trial and we'll measure your real AI tools side by side, on one methodology, no card required.

The vendor survey era has just started. The independence era starts the day you ask the four questions above and don't accept "trust us" as an answer.