If the AI answers differently every time, can visibility be measured? How serious measurement works
The objection always comes from the most technical person in the room, and it's the best objection there is against AI visibility services: "I've asked ChatGPT the same thing twice and it recommended different companies. If the answer changes every time, what on earth are you measuring?"
Whoever asks that has grasped something half the industry would rather ignore: language models aren't deterministic. The same question, on the same day, can produce different answers. There's no fixed "ranking" to consult, no position 3 to capture. Anyone selling you "you're in 2nd place in ChatGPT" as if it were a league table is selling you a snapshot of something that moves.
And yet, the conclusion "so it can't be measured" is wrong, and the best proof is that half of science works by measuring things that change every time you look at them. Let's take it step by step.
Why the answers vary (the part the sceptic already knows)
Three sources of variation, named so we can refer back to them:
- The model rolls dice. Models generate text by choosing among probable options, with deliberate randomness involved. Two identical runs can take different paths, especially in lists ("name five accounting firms in Seville"): names with a strong presence in the sources come up almost always; marginal ones drift in and out.
- The context contaminates. The same question with different histories, accounts or locations produces different answers. What you see in your ChatGPT isn't what the client's customer sees in theirs.
- The ground moves. The models get updated, the search engines they use change their index, and March's answer can be unrecognisable in June without anyone having touched anything.
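The first of those sources, the dice roll, is easy to see in a toy simulation. This is a minimal sketch and not how any real model works internally: it assumes each firm has a made-up "presence" weight and that a five-firm style answer is built by weighted random picks. The firm names, the weights and both function names are hypothetical.

```python
import random

def sample_list(weights, k, rng):
    """Pick k distinct names; each pick is random but weighted by presence."""
    pool = dict(weights)
    picked = []
    for _ in range(k):
        names = list(pool)
        choice = rng.choices(names, weights=[pool[n] for n in names])[0]
        picked.append(choice)
        del pool[choice]  # a name cannot appear twice in one list
    return picked

def appearance_rates(weights, k, runs, seed=0):
    """Share of simulated answers in which each name made the k-item list."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    counts = {name: 0 for name in weights}
    for _ in range(runs):
        for name in sample_list(weights, k, rng):
            counts[name] += 1
    return {name: counts[name] / runs for name in weights}

# Hypothetical "presence in the sources" weights, invented for illustration.
WEIGHTS = {"Firm A": 50, "Firm B": 30, "Firm C": 10, "Firm D": 5,
           "Firm E": 3, "Firm F": 1, "Firm G": 1}

rates = appearance_rates(WEIGHTS, k=3, runs=2000)
for name, rate in rates.items():
    print(f"{name}: appears in {rate:.0%} of simulated answers")
```

Under these assumptions the heavily weighted names show up in nearly every simulated answer while the light ones drift in and out, which is the pattern described in the first bullet.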
The sceptic's partial conclusion: a ChatGPT screenshot proves almost nothing. Correct. Granted. Now here's what doesn't follow from it.
The election poll: this is how you measure what varies
Nobody knows how a specific voter will vote, and yet polls estimate election results within reasonable margins. How? Not by asking one person once, but by asking many people many times, and looking at frequencies instead of cases. Serious AI visibility measurement works exactly the same way:
Sampling instead of a snapshot. You don't ask once: you send a battery of prompts (the questions the client's real audience would ask) repeatedly and periodically, to several AIs. If across 40 relevant questions sent this week your client appears in 12 answers, that's an appearance frequency of 30%: a data point. Whether an individual answer varies stops mattering, just as the poll doesn't care that one specific respondent changes their mind. In fact, the variability is the reason to measure this way, not the obstacle. Which questions make up a good battery is a craft in itself; we develop it in how to choose the prompts you monitor for a client.
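Counting appearances is simple enough to sketch. The example below assumes each answer has already been reduced to a pair of (AI name, set of brands mentioned); that record shape, the function name and the labels "AI-1", "AI-2", "Client" and "Rival" are all hypothetical, and the sample data is invented to match the 12-out-of-40 case.

```python
from collections import Counter

def appearance_frequency(records, brand):
    """records: iterable of (ai_name, set_of_brands_mentioned) pairs.

    Returns (overall_frequency, {ai_name: frequency}) for `brand`.
    """
    totals, hits = Counter(), Counter()
    for ai, mentioned in records:
        totals[ai] += 1
        if brand in mentioned:
            hits[ai] += 1
    if not totals:
        raise ValueError("need at least one answer to compute a frequency")
    overall = sum(hits.values()) / sum(totals.values())
    per_ai = {ai: hits[ai] / totals[ai] for ai in totals}
    return overall, per_ai

# Invented week of data: 40 answers, the client named in 12 of them.
records = (
    [("AI-1", {"Client", "Rival"})] * 9 + [("AI-1", {"Rival"})] * 11
    + [("AI-2", {"Client"})] * 3 + [("AI-2", {"Rival"})] * 17
)

overall, per_ai = appearance_frequency(records, "Client")
print(f"Overall: {overall:.0%}")
for ai, freq in per_ai.items():
    print(f"{ai}: {freq:.0%}")
```

Splitting the count per AI is a design choice worth keeping: the same overall frequency can hide a strong presence in one AI and a near absence in another.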
Trends instead of moments. An isolated measurement of "you appear in 30%" says little: it could be noise. Twelve weeks of measurements drawing a curve that goes from 10% to 30% while you work the sources — that's a signal. The serious question is never "what did ChatGPT say on Tuesday?" but "is the appearance frequency going up, down or holding since we started?". The same goes for the competitor: if they appear in 70% of the category's answers and your client in 15%, that gap is too big to be chance.
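Both ideas in that paragraph, the curve and the gap, can be put into numbers. What follows is a sketch under stated assumptions: the twelve weekly frequencies are invented, the trend is summarised with an ordinary least-squares slope, and the gap is checked with a textbook two-proportion z statistic that treats every answer as an independent draw (repeated prompts are not strictly independent, so read the result as a rough guide). The function names are hypothetical.

```python
import math

def weekly_slope(frequencies):
    """Least-squares slope: average change in frequency per week."""
    n = len(frequencies)
    if n < 2:
        raise ValueError("need at least two weeks to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(frequencies) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(frequencies))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx

def gap_z_score(hits_a, n_a, hits_b, n_b):
    """Two-proportion z statistic for the gap between two appearance rates."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("both rates are 0% or 100%; the statistic is undefined")
    return (hits_a / n_a - hits_b / n_b) / se

# Invented twelve-week series drifting from 10% to 30%.
weeks = [0.10, 0.12, 0.10, 0.15, 0.17, 0.15,
         0.20, 0.22, 0.25, 0.24, 0.28, 0.30]
slope = weekly_slope(weeks)
print(f"Trend: {slope * 100:.1f} percentage points per week")

# Competitor in 28 of 40 answers (70%), client in 6 of 40 (15%).
z = gap_z_score(28, 40, 6, 40)
print(f"Gap z statistic: {z:.2f}")
```

With these made-up numbers the slope comes out a little under two percentage points per week and the z statistic is close to 5, far beyond the value of about 2 usually taken as the edge of chance.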
Intervals instead of certainties. Even so, the measurement is coarse-grained, and it's worth presenting it as such: "you appear in around 25-35% of your category's answers" is defensible; "you appear in exactly 28.4%" is false precision. Small movements between weeks are noise; sustained movements and large gaps are information. The honest report distinguishes between the two in front of the client, not in the small print.
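One common way to turn a count into an honest range is the Wilson score interval for a proportion. The article doesn't say which method any particular tool uses, so take this as an illustrative sketch: it assumes a 95% confidence level (z = 1.96) and independent answers, and the function name is hypothetical.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# One week: 12 appearances in 40 answers.
low, high = wilson_interval(12, 40)
print(f"One week:   {low:.0%} to {high:.0%}")

# Four weeks pooled: 48 appearances in 160 answers.
low4, high4 = wilson_interval(48, 160)
print(f"Four weeks: {low4:.0%} to {high4:.0%}")
```

A single week of 40 answers gives a wide range, roughly 18% to 45%; pooling four weeks narrows it to roughly 23% to 38%. That is why the width of the range you report should follow from how much data sits behind it.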
This has an immediate practical consequence: measuring well by hand isn't viable. A decent battery is 40-75 questions, across 3-4 AIs, every week, with a record of who appears and what's said — per client. Do it with screenshots and you've got a part-time job that produces worse data. It's the kind of work that gets delegated to tools: we do it with Surfeo, which runs the full battery every week against the 4 AIs and turns the result into frequencies and trends ready for the report. But the methodology matters more than the brand: any measurement that isn't repeated, periodic sampling is a screenshot with pretensions.
What such a measurement can promise (and what it can't)
You win the sceptic over by finishing the sentence they started:
It can: tell you how often the client appears in their category's answers, which AIs mention them and which don't, what's said about them when they appear, who occupies the space when they don't, and, most valuable of all, whether all that improves or worsens over the months, which in the end is the only thing that justifies a retainer.
It can't: guarantee what a specific user will see in a specific conversation, nor promise a stable "position", nor attribute each improvement to each action with surgical precision. Whoever promises that hasn't understood the instrument — or is counting on the client not understanding it.
This boundary between the measurable and the promisable is exactly the one your sales proposal should draw: committing to measure well and to work the sources, not to results nobody controls. How to translate that into concrete objectives without burning your fingers is in realistic AI visibility goals, and how it turns into a monthly document the client understands is in the anatomy of an AI visibility report.
And a final note for the sceptic who's made it this far: the variability that motivates their objection is also the best reason to measure. If the answers were fixed, looking at them once a year would do. Because they change, with every model update and with every shift in the sources, whoever doesn't monitor finds out about the changes when they've already cost them clients. The silent decline of branded searches is the textbook example; we cover it in how to tell if the AI is answering on your client's behalf.
Frequently asked questions
How many questions and how many repetitions are needed for the data to be reliable?
More relevant questions and more frequent runs give more resolution, with diminishing returns. As a practical reference, a battery of 40-75 prompts per client run weekly against 3-4 AIs reliably detects the trends that matter to a business. With 5 questions once a month, by contrast, the noise eats the signal.
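The diminishing returns come from the square root in the usual margin-of-error formula. A rough sketch, assuming a true appearance rate of 30%, independent answers and the normal approximation (which is crude for very small samples); the function name and the 30% figure are illustrative assumptions.

```python
import math

def margin_of_error(n, p=0.3, z=1.96):
    """Approximate 95% margin of error for a proportion p measured on n answers."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)

# 5 answers, one 40-prompt battery, one 75-prompt battery, 75 prompts x 4 AIs.
for n in (5, 40, 75, 300):
    print(f"{n:>3} answers: about +/- {margin_of_error(n):.0%}")
```

With 5 answers the margin is about 40 points, so a 30% reading is compatible with almost anything. At 40 answers it drops to about 14 points, at 75 to about 10, and at 300 to about 5. Quadrupling the sample only halves the margin, which is the diminishing return.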
Why measure several AIs if my client only talks about ChatGPT?
Because the answers don't match: in our study of 9,865 Spanish SMEs, 91% appeared in only 1 of the 4 main AIs. Being well placed in ChatGPT says nothing about Gemini or Perplexity, and the client's audience is spread across all of them. Measuring only one is like running a poll in just one neighbourhood.
Isn't this the same as the rank tracking we've always had?
It shares the spirit — measuring presence systematically — but not the mechanics: a ranking is a public, stable list that you consult; here there's no list, you have to generate it by asking many times and counting. That's why the reports talk about frequencies and trends, not positions. Whoever carries over the "position 3" mindset to AI without adapting it will end up promising things the medium doesn't allow.
How do I explain all this to a non-technical client without losing them?
With the election poll: "we can't know what the AI will tell one specific person, just as a poll doesn't know how your neighbour will vote; we can know how often it recommends you and whether that frequency improves with the work". Two sentences, and the conversation moves from magic to statistics.
The theory is easier to grasp with a data point in front of you: take the free AI visibility test with any website and see the first snapshot — the trend starts there.