If the AI answers differently every time, can visibility be measured? How serious measurement works

Tags: agencies, measurement, methodology, scepticism

The objection always comes from the most technical person in the room, and it's the best objection there is against AI visibility services: "I've asked ChatGPT the same thing twice and it recommended different companies. If the answer changes every time, what on earth are you measuring?"

Whoever asks that has grasped something half the industry would rather ignore: language models aren't deterministic. The same question, on the same day, can produce different answers. There's no fixed "ranking" to consult, no position 3 to capture. Anyone selling you "you're in 2nd place in ChatGPT" as if it were a league table is selling you a snapshot of something that moves.

And yet, the conclusion "so it can't be measured" is wrong, and the best proof is that half of science works by measuring things that change every time you look at them. Let's take it step by step.

Why the answers vary (the part the sceptic already knows)

Three sources of variation, to have them named:

  1. The model rolls dice. Models generate text by choosing among probable options, with deliberate randomness involved. Two identical runs can take different paths, especially in lists ("name five accounting firms in Seville"): names with a strong presence in the sources come up almost always; marginal ones drift in and out.
  2. The context contaminates. The same question with different histories, accounts or locations produces different answers. What you see in your ChatGPT isn't what the client's customer sees in theirs.
  3. The ground moves. The models get updated, the search engines they use change their index, and March's answer can be unrecognisable in June without anyone having touched anything.
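The first source of variation can be illustrated with a small simulation. This is only a sketch: the firm names and inclusion probabilities below are invented for illustration, and a real model does not pick names independently like this. It simply shows how repeated runs separate names that come up almost always from names that drift in and out.

```python
import random

# Hypothetical inclusion probabilities: how likely each firm is to be named
# in any single answer. These numbers are invented for illustration.
INCLUSION_PROB = {
    "Firm A": 0.95,  # strong presence in the sources
    "Firm B": 0.90,
    "Firm C": 0.30,  # marginal presence
    "Firm D": 0.10,
}

def simulate_answer(rng: random.Random) -> set[str]:
    """Simulate one answer: each firm is named independently with its probability."""
    return {name for name, p in INCLUSION_PROB.items() if rng.random() < p}

def appearance_rates(runs: int, seed: int = 0) -> dict[str, float]:
    """Ask the same 'question' many times and count how often each firm is named."""
    rng = random.Random(seed)
    counts = {name: 0 for name in INCLUSION_PROB}
    for _ in range(runs):
        for name in simulate_answer(rng):
            counts[name] += 1
    return {name: counts[name] / runs for name in counts}

rates = appearance_rates(runs=1000)
for name, rate in rates.items():
    print(f"{name}: named in {rate:.0%} of simulated answers")
```

Any single simulated answer can look different from the next, yet the rates over many runs stay close to the underlying probabilities.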

The sceptic's partial conclusion: a ChatGPT screenshot proves almost nothing. Correct. Granted. Now here's what doesn't follow from it.

The election poll: this is how you measure what varies

Nobody knows how a specific voter will vote, and yet polls estimate election results within reasonable margins. How? Not by asking one person once, but by asking many people many times, and looking at frequencies instead of cases. Serious AI visibility measurement works exactly the same way:

Sampling instead of a snapshot. You don't ask once: you send a battery of prompts (the questions the client's real audience would ask) repeatedly and periodically, to several AIs. If across 40 relevant questions sent this week your client appears in 12 answers, that's an appearance frequency of 30%: a data point. Whether an individual answer varies stops mattering, just as the poll doesn't care that one specific respondent changes their mind. In fact, the variability is the reason to measure this way, not the obstacle. Which questions make up a good battery is a craft in itself; we cover it in "How to choose the prompts you'll monitor for a client (and how many is enough)".
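As a rough sketch of the counting step, the snippet below computes an appearance frequency from a list of collected answers. The function name, the brand "Acme Asesores" and the toy answers are hypothetical, and a plain substring match is assumed; real answers may need fuzzier matching on brand variants.

```python
def appearance_frequency(answers: list[str], brand: str) -> float:
    """Share of answers that mention the brand (case-insensitive substring match)."""
    if not answers:
        raise ValueError("no answers to count")
    hits = sum(1 for text in answers if brand.lower() in text.lower())
    return hits / len(answers)

# Toy data matching the example in the text: 40 answers, 12 mentioning the client.
answers = (
    ["We recommend Acme Asesores and two other firms."] * 12
    + ["Other firms only."] * 28
)
freq = appearance_frequency(answers, "Acme Asesores")
print(f"{freq:.0%}")  # 30%
```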

Trends instead of moments. An isolated measurement of "you appear in 30%" says little: it could be noise. Twelve weeks of measurements drawing a curve that goes from 10% to 30% while you work the sources — that's a signal. The serious question is never "what did ChatGPT say on Tuesday?" but "is the appearance frequency going up, down or holding since we started?". The same goes for the competitor: if they appear in 70% of the category's answers and your client in 15%, that gap is too big to be chance.
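One simple way to turn "is it going up, down or holding?" into a number is a least-squares slope over the weekly frequencies. This is a sketch under assumptions: the twelve weekly values are invented to resemble the 10% to 30% curve described above, and a straight-line fit is only one of several reasonable choices.

```python
def weekly_slope(frequencies: list[float]) -> float:
    """Least-squares slope of frequency against week index (change per week)."""
    n = len(frequencies)
    if n < 2:
        raise ValueError("need at least two weeks")
    mean_x = (n - 1) / 2
    mean_y = sum(frequencies) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(frequencies))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Invented twelve-week series that rises from about 10% to about 30%.
weeks = [0.10, 0.12, 0.10, 0.15, 0.17, 0.20, 0.18, 0.22, 0.25, 0.27, 0.28, 0.30]
slope = weekly_slope(weeks)
print(f"{slope * 100:.1f} percentage points per week")
```

Individual weeks in the series go down as well as up; the slope summarises the direction over the whole period rather than any single reading.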

Intervals instead of certainties. Even so, the measurement is coarse-grained, and it's worth presenting it as such: "you appear in around 25-35% of your category's answers" is defensible; "you appear in exactly 28.4%" is false precision. Small movements between weeks are noise; sustained movements and large gaps are information. The honest report distinguishes both in front of the client, not in the small print.
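To put a range around a single frequency, one common option is the Wilson score interval for a proportion. The sketch below is not necessarily how any particular tool computes its ranges; it assumes each answer is an independent yes/no observation, which is a simplification, and uses z = 1.96 for roughly 95% coverage.

```python
import math

def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(12, 40)      # one week, 40 answers
low4, high4 = wilson_interval(48, 160)   # same rate over 160 answers
print(f"12/40:  {low:.0%} to {high:.0%}")
print(f"48/160: {low4:.0%} to {high4:.0%}")
```

Under these assumptions, 12 appearances in 40 answers gives a range of roughly 18% to 45%, and the same rate over 160 answers narrows it to roughly 23% to 38%. That is the practical argument for reporting ranges and for pooling more answers before quoting a tight one.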

This has an immediate practical consequence: measuring well by hand isn't viable. A decent battery is 40-75 questions, across 3-4 AIs, every week, with a record of who appears and what's said — per client. Do it with screenshots and you've got a part-time job that produces worse data. It's the kind of work that gets delegated to tools: we do it with Surfeo, which runs the full battery every week against the 4 AIs and turns the result into frequencies and trends ready for the report. But the methodology matters more than the brand: any measurement that isn't repeated, periodic sampling is a screenshot with pretensions.
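The workload behind those figures is simple arithmetic, sketched below. The function name is hypothetical; the inputs are the 40-75 questions and 3-4 AIs quoted above, with one run per week assumed.

```python
def weekly_queries(prompts: int, engines: int) -> int:
    """Number of prompt-to-AI requests needed for one weekly run for one client."""
    return prompts * engines

low_load = weekly_queries(40, 3)   # smallest battery on the fewest AIs
high_load = weekly_queries(75, 4)  # largest battery on the most AIs
print(f"Per client per week: {low_load} to {high_load} answers to record")
print(f"Per client per year: {low_load * 52} to {high_load * 52}")
```

That is 120 to 300 answers to collect and log per client every week, before counting who appears and what is said in each one.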

What such a measurement can promise (and what it can't)

You win the sceptic over by finishing the sentence they started:

It can: tell you how often the client appears in their category's answers, in which AIs yes and in which no, what's said about them when they appear, who occupies the space when they don't, and — most valuable of all — whether all that improves or worsens over the months, which in the end is the only thing that justifies a retainer.

It can't: guarantee what a specific user will see in a specific conversation, nor promise a stable "position", nor attribute each improvement to each action with surgical precision. Whoever promises that hasn't understood the instrument — or is counting on the client not understanding it.

This boundary between the measurable and the promisable is exactly the one your sales proposal should draw: committing to measure well and to work the sources, not to results nobody controls. How to translate that into concrete objectives without burning your fingers is covered in "Realistic AI visibility goals by sector: what to expect at 3, 6 and 12 months", and how it turns into a monthly document the client understands is covered in "Anatomy of a good AI visibility report for clients: sections, charts and what to leave out".

And a final note for the sceptic who's made it this far: the variability that motivates their objection is also the best reason to measure. If the answers were fixed, looking at them once a year would do. Because they change (with every model update, with every shift in the sources), whoever doesn't monitor finds out about the changes when they've already cost them clients. The silent decline of branded searches is the textbook example; see "Your branded search is falling: how to tell if the AI is answering for you".

Frequently asked questions

How many questions and how many repetitions are needed for the data to be reliable?

More relevant questions and more frequency give more resolution, with diminishing returns. As a practical reference, a battery of 40-75 prompts per client run weekly against 3-4 AIs reliably detects the trends that matter to a business. With 5 questions once a month, by contrast, the noise eats the signal.
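A back-of-the-envelope way to see why small samples drown in noise is the standard error of a proportion. The sketch below assumes a simple binomial model with independent answers and a true appearance rate of 30%; both are illustrative assumptions, not measured values.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion under a simple binomial model."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1 - p) / n)

se_small = standard_error(0.30, 5)    # 5 answers in a month
se_large = standard_error(0.30, 160)  # e.g. 40 prompts x 4 AIs in one week
print(f"n=5:   +/- {se_small * 100:.0f} points (one standard error)")
print(f"n=160: +/- {se_large * 100:.0f} points (one standard error)")
```

With five answers, one standard error is about 20 percentage points, which is as large as most movements you would hope to detect. With 160 it drops to under 4 points, and because the error shrinks with the square root of the sample size, each additional prompt helps less than the previous one.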

Why measure several AIs if my client only talks about ChatGPT?

Because the answers don't match: in our study of 9,865 Spanish SMEs, 91% appeared in only 1 of the 4 main AIs. Being well placed in ChatGPT says nothing about Gemini or Perplexity, and the client's audience is spread across all of them. Measuring only one is like running a poll in just one neighbourhood.

Isn't this the same as the rank tracking we've always had?

It shares the spirit — measuring presence systematically — but not the mechanics: a ranking is a public, stable list that you consult; here there's no list, you have to generate it by asking many times and counting. That's why the reports talk about frequencies and trends, not positions. Whoever carries over the "position 3" mindset to AI without adapting it will end up promising things the medium doesn't allow.

How do I explain all this to a non-technical client without losing them?

With the election poll: "we can't know what the AI will tell one specific person, just as a poll doesn't know how your neighbour will vote; we can know how often it recommends you and whether that frequency improves with the work". Two sentences, and the conversation moves from magic to statistics.


The theory is easier to grasp with a data point in front of you: take the free AI visibility test with any website and see the first snapshot — the trend starts there.

Pablo Marín


Founder of Surfeo and Made AI. Audits the visibility of SMEs in ChatGPT, Gemini, Perplexity and Claude with real data: more than 9,000 businesses analysed across 30 sectors and 10 Spanish cities. Writes about GEO, AEO and SEO for AI from practice, not from theory.

