Why we normalise every metric to 0-100: the case for health scores over raw numbers

A 3.5× ROI and a 72% satisfaction score are both numbers. They are not, in any useful sense, the same kind of number.

Spend ten minutes in any AI ROI dashboard and you'll notice the same problem. The page shows you half a dozen metrics — ROI multiplier, hours saved per week, adoption rate, satisfaction, maybe a Net Promoter Score — and you have no idea which ones are good, which ones are in trouble, and which ones you should care about most. Every metric has its own scale, its own distribution, and its own sense of "what's normal".

The GAiGE solves this the same way an electrocardiogram does: we normalise every headline metric onto a common 0-100 scale before drawing it. A score of 80 on ROI means the same thing as 80 on satisfaction, which means the same thing as 80 on adoption. Strong. Worth keeping. Tell the team.

This post is about why we made that call, what the curves actually look like, and where the model breaks down.

The problem with raw numbers

Raw metrics are precise and incomparable. That's a worse combination than it sounds.

A 2.5× ROI sounds impressive until you realise the industry median for a tool like Copilot is closer to 4×. A 65% satisfaction sounds mediocre until you realise it maps to roughly 3.25 out of 5 on a survey where the pragmatic ceiling is about 4.3. A 72% adoption sounds strong until you notice that it means nearly three out of every ten seats sit completely idle, which is probably not what "strong" should mean.

The reader now has to hold three separate mental rubrics at once, and toggle between them on every glance. That cognitive tax is small in isolation and ruinous at scale — because it means your leadership team will, in practice, ignore the metrics they don't intuit and over-rotate on the ones they do. Usually that's the dollar figure, which is the one most vulnerable to methodology disagreement.

What a health score gives you instead

A health score is a normalised, opinionated mapping of a raw metric onto a 0-100 scale where:

Below 40 means "needs attention": red.
40 to just under 70 means "okay, watch it": amber.
70 to just under 90 means "strong": green.
90-100 means "excellent, tell the CFO": bright green.

The bands are deliberately coarse. You do not need a dashboard that can distinguish a 73.2 from a 74.9; you need one that can tell you, at a glance, whether to celebrate, plan a training push, or kill the tool.
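The banding is simple enough to sketch in a few lines of Python. This is an illustration, not the GAiGE source: the names `BAND_FLOORS`, `BANDS` and `band_for` are made up for the example, and treating a score that sits exactly on a boundary (40, 70, 90) as belonging to the higher band is an assumption.

```python
from bisect import bisect_right

# Lower bounds of the amber, green and bright-green bands.
# Assumption: a score exactly on a boundary falls into the higher band.
BAND_FLOORS = [40, 70, 90]
BANDS = [
    ("Needs attention", "red"),
    ("Okay", "amber"),
    ("Strong", "green"),
    ("Excellent", "bright green"),
]


def band_for(score: float) -> tuple[str, str]:
    """Return the (label, colour) pair for a 0-100 health score."""
    if not 0 <= score <= 100:
        raise ValueError(f"health score out of range: {score}")
    # bisect_right counts how many band floors the score has reached.
    return BANDS[bisect_right(BAND_FLOORS, score)]


print(band_for(73.2))  # ('Strong', 'green')
print(band_for(74.9))  # ('Strong', 'green')
print(band_for(39.9))  # ('Needs attention', 'red')
```

A 73.2 and a 74.9 land on the same word and the same colour, which is the whole point of keeping the bands coarse.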

Every gauge in the app shows two numbers: the raw value (2.8×, 142 hours, 4.2/5) and the health score (78/100). The big needle tracks the health score, so the colour-coded dial is always comparable across metrics. The raw number is there for the auditor.

Why piecewise-linear, not linear

The obvious next question: how do we pick where 0, 40, 70, 90, and 100 fall? A naive answer is "draw a straight line between min and max". We don't do that, because a straight line is almost always wrong for ROI-style metrics.

Take ROI. On a pure linear scale from 0× to 10×, a 2× ROI would map to 20/100, which would tell you "this tool is in trouble". Except a 2× ROI is completely fine. You got double your money back. That's not failing. At the other end, stretch the linear scale to 20× and it puts 25 points between a 15× ROI and a 20× ROI, a difference that is, in practice, noise. Both mean "this tool is a runaway winner"; the higher one does not need a further 25 points of dashboard celebration.

So we use a piecewise-linear curve with explicit anchor points. For ROI:

1× → 20/100 (break-even, but under-delivering)
2× → 45/100 (fine, but not compelling)
5× → 75/100 (strong)
10× → 95/100 (excellent)
20×+ → 100/100 (celebrate; the gain flattens)

The anchors aren't arbitrary. They reflect how a reasonable leadership team would describe the same number in words. 2× ROI is "fine". 5× ROI is "we should expand this". 10× ROI is "this is one of the best AI bets we've made". The curve encodes that judgement, so the dashboard renders the judgement automatically.
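Here is a minimal Python sketch of that curve next to the naive straight line. The anchor values are the ones listed above; the function names are illustrative, and the `(0.0, 0.0)` anchor is an assumption, since the list above starts at 1×.

```python
# (raw ROI multiple, health score) anchors from the list above.
# Assumption: an ROI of 0x scores 0; the post does not specify the curve below 1x.
ROI_ANCHORS = [(0.0, 0.0), (1.0, 20.0), (2.0, 45.0), (5.0, 75.0), (10.0, 95.0), (20.0, 100.0)]


def piecewise_score(raw, anchors=ROI_ANCHORS):
    """Interpolate linearly between anchors; clamp outside the first and last."""
    if raw <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if raw <= x1:
            return y0 + (raw - x0) * (y1 - y0) / (x1 - x0)
    return anchors[-1][1]


def naive_linear_score(raw, lo=0.0, hi=10.0):
    """Straight line from lo to hi, clamped to 0-100, for comparison."""
    clamped = min(max(raw, lo), hi)
    return 100.0 * (clamped - lo) / (hi - lo)


for roi in (2.0, 3.5, 15.0, 25.0):
    print(roi, naive_linear_score(roi), piecewise_score(roi))
```

At 2× the straight line says 20 and the curve says 45; at 15× the curve is already at 97.5, and anything past 20× stays pinned at 100.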

The four curves we ship

Every GAiGE dashboard has four headline gauges. Each is driven by its own piecewise curve.

ROI. As above. Capped at 20× to prevent a single enthusiastic response from dragging the whole strip off the scale.

Hours saved per user per week. 0h → 0, 1h → 35, 2h → 60, 3h → 80, 5h+ → 100. An hour a week per person is genuinely meaningful. Five hours a week, more than half a working day reclaimed, is as good as it gets before we start suspecting the pulse is miscalibrated.

Satisfaction. Linear on the 1-5 survey scale, converted to 0-100 and rounded. Satisfaction is the one metric where the raw number is already intuitive (everyone knows what 4.2/5 means), so we don't fight it.

Adoption. 30% → 30, 60% → 65, 80% → 85, 95%+ → 98. Adoption is the metric most prone to surface flattery: 80% sounds great until you realise you are paying for the other 20% of seats, which nobody touches. The curve is deliberately punishing in the 90-100 band so that only near-total activation counts as "excellent".
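The remaining three curves can be sketched the same way. Treat this as an illustration under stated assumptions rather than the shipped code: the function names are invented, the `(0, 0)` adoption anchor is assumed because the list starts at 30%, and the satisfaction formula assumes the score is simply the rating as a percentage of 5 (which matches the "65% is roughly 3.25 out of 5" correspondence earlier in this post). If the real mapping pins a rating of 1 to zero, the formula would be `(rating - 1) / 4` instead.

```python
def interpolate(raw, anchors):
    """Piecewise-linear lookup, clamped to the first and last anchor."""
    if raw <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if raw <= x1:
            return y0 + (raw - x0) * (y1 - y0) / (x1 - x0)
    return anchors[-1][1]


# Hours saved per user per week -> health score, as listed above.
HOURS_ANCHORS = [(0.0, 0.0), (1.0, 35.0), (2.0, 60.0), (3.0, 80.0), (5.0, 100.0)]

# Adoption percentage -> health score. Assumption: 0% adoption scores 0.
# The curve tops out at 98, so even 100% adoption never reads as a perfect 100.
ADOPTION_ANCHORS = [(0.0, 0.0), (30.0, 30.0), (60.0, 65.0), (80.0, 85.0), (95.0, 98.0)]


def hours_score(hours_per_week):
    return interpolate(hours_per_week, HOURS_ANCHORS)


def adoption_score(percent_active):
    return interpolate(percent_active, ADOPTION_ANCHORS)


def satisfaction_score(rating):
    """Assumed mapping: the 1-5 rating as a percentage of 5, rounded."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5: {rating}")
    return round(100.0 * rating / 5)


print(hours_score(1.5))         # 47.5
print(adoption_score(80.0))     # 85.0
print(adoption_score(100.0))    # 98.0
print(satisfaction_score(4.2))  # 84
```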

The label is the point

Behind the 0-100 number sits a four-label scale: Needs attention. Okay. Strong. Excellent. Every gauge shows one of those four words directly under the number. The word does most of the work.

This matters because nobody makes a decision off "our ROI is 62.4/100". People make decisions off "our ROI is okay, and our satisfaction is strong, and our adoption needs attention". Three words, one conclusion: we have a reach problem, not a quality problem.

The health score is the math; the label is the translation. We keep both on the page because the math is auditable and the label is decidable. You can push back on the curve. You can't push back on the word — which is exactly what makes it useful in a leadership meeting.

Where this model breaks down

Three places it's worth calling out.

1. The anchors are opinions. We picked them based on what we've seen across the AI-adoption consulting work that seeded this product. A different vertical might disagree — a law firm might reasonably argue that 2× ROI on a premium research tool is the ceiling, and 5× is fantasy. Pro plans let you reshape the curves; the defaults are defaults, not laws of physics.

2. Normalisation hides volatility. A metric bouncing between 75 and 85 reads as "stable strong" on the health-score scale, but the underlying raw number might have swung 40% week over week. We mitigate this by showing sparklines of the raw metric beneath the gauge, so the health score never disguises real turbulence.

3. Composite gauges are tempting and wrong. It is trivially easy to mush four health scores into one "overall AI health" number. We refuse to do it. Collapsing ROI, hours saved, satisfaction, and adoption into a single figure destroys the diagnostic value: the pattern across the four scores is precisely what they were designed to reveal. Four numbers beat one number every time, so long as the four are comparable.
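Points 2 and 3 are easy to check with a little arithmetic. The sketch below uses the ROI anchors from earlier in this post (with an assumed `(0.0, 0.0)` floor) and entirely made-up readings; the function and variable names are illustrative only.

```python
# ROI anchors from earlier in the post; (0.0, 0.0) is an assumed floor.
ROI_ANCHORS = [(0.0, 0.0), (1.0, 20.0), (2.0, 45.0), (5.0, 75.0), (10.0, 95.0), (20.0, 100.0)]


def roi_score(raw):
    """Piecewise-linear ROI health score, clamped at both ends."""
    if raw <= ROI_ANCHORS[0][0]:
        return ROI_ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(ROI_ANCHORS, ROI_ANCHORS[1:]):
        if raw <= x1:
            return y0 + (raw - x0) * (y1 - y0) / (x1 - x0)
    return ROI_ANCHORS[-1][1]


def swing(prev_raw, curr_raw):
    """Return (raw change in percent, health-score change in points)."""
    raw_pct = 100.0 * (curr_raw - prev_raw) / prev_raw
    score_delta = roi_score(curr_raw) - roi_score(prev_raw)
    return round(raw_pct, 1), round(score_delta, 1)


def average(scores):
    """The composite we refuse to ship: a plain mean of the four gauges."""
    return sum(scores.values()) / len(scores)


# Point 2: a made-up jump from 5x to 7x ROI in one week.
print(swing(5.0, 7.0))  # (40.0, 8.0)

# Point 3: two made-up dashboards with identical averages.
even = {"roi": 62, "hours_saved": 62, "satisfaction": 62, "adoption": 62}
reach_problem = {"roi": 60, "hours_saved": 78, "satisfaction": 90, "adoption": 20}
print(average(even), average(reach_problem))      # 62.0 62.0
print(min(reach_problem, key=reach_problem.get))  # adoption
```

A 40% raw swing moves the needle eight points and never leaves the "strong" band, which is why the sparkline has to sit under the gauge. And both dashboards average 62, yet only one of them has an adoption score of 20 that someone needs to act on.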

In short

Raw metrics are precise and incomparable. Health scores are coarser and comparable. For a dashboard read in thirty seconds by a leadership team that needs to make a decision, coarser-and-comparable wins every time. We show both numbers — the raw for defensibility, the normalised for decisions — and let the label do the heavy lifting.

Curious how the raw ROI number gets computed in the first place? The 2.5× rule post covers the upstream side of this — how we turn a handful of pulses into a defensible multiplier before it hits the curve.