Have you ever asked an AI tool what you should buy, read the answer, and felt a quiet sense of relief?
Like: “Okay, this sounds confident, it’s laid out nicely, and it probably knows more than I do.”
That moment is exactly why we ran this experiment.
AI tools are no longer just helping people write emails or summarise articles. They are now part of the buying journey. People ask them which tool to choose, which product is better, what to avoid, and what actually works in the real world. And most of the time, they trust the first answer they see.
The problem is not that AI gives answers.
The problem is that people assume those answers are:
- Consistent across tools
- Based on facts
- Safe enough to act on
What makes this dangerous is not obvious misinformation. It’s AI hallucination: situations where AI fills gaps with plausible-sounding detail, confident phrasing, or outdated assumptions that feel factual but aren’t fully grounded.
We kept seeing the same pattern. One AI would recommend something strongly. Another would suggest the opposite. A third would add warnings that the first two never mentioned. All of them sounded sure. All of them sounded reasonable. And at no point was it obvious which one, if any, was actually right.
So we decided to stop debating this in theory.
Instead, we asked a simple question:
What happens when you ask multiple AI search engines the same buying question and treat their answers like a real buyer would?
No prompt engineering.
No follow-up nudges.
No “act like an expert” tricks.
Just one question, asked six times, to six different AI engines.
What we found was not a set of small differences in tone or phrasing. It was something deeper. The answers did not just vary; they often disagreed on facts, risks, and what mattered most in the decision.
That gap between confidence and consistency is what this experiment is about.
The Buying Question We Asked (and Why It’s a Stress Test)
- It was a real buying question, the kind someone asks when they are close to making a decision, not just browsing or learning terms. This matters because buying questions cannot be answered with a neat summary.
- They demand accuracy. A small mistake, like a feature that no longer exists or a limitation that is ignored, can change the entire decision.
- The question also requires up-to-date information. Prices change. Products evolve. What worked last year may no longer apply. AI systems that rely on older patterns or generic knowledge tend to struggle here, even if they sound confident.
- Most importantly, the question forces prioritisation. There is no single “best” answer unless you know what the buyer cares about.
This combination is what makes the question a stress test.
If an AI can handle this well, it is genuinely useful for buying research. If it cannot, the cracks become visible quickly.
The 6 AI Engines We Tested
We chose six AI engines that people actually use when they are trying to figure out what to buy.
The objective was to reflect how a real buyer moves today. Someone might start with one tool, cross-check with another, and then trust the answer that sounds complete.
The six engines we tested were:
- ChatGPT
- Microsoft Copilot
- Gemini
- Claude
- Perplexity
- Meta AI
Note: We did not tune the experience. No custom prompts. No follow-up questions. No attempts to push the tools in a certain direction.
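For anyone who wants to replicate the setup, the sketch below shows its shape: one fixed question, sent verbatim to each engine, once, with no system prompt and no follow-ups. The `ask_engine` helper is a hypothetical placeholder for however you reach each tool (web UI, official API, or browser automation); we used each tool's default experience, so this is an illustration of the protocol, not our exact tooling.

```python
# Minimal sketch of the experiment protocol (assumed structure, not our exact tooling).
# The question is sent verbatim to every engine, once, with no prompt engineering.

QUESTION = (
    "What is the best CRM for a small B2B team with under 20 sales reps, "
    "and what problems should we expect in the first year of using it?"
)

ENGINES = ["ChatGPT", "Microsoft Copilot", "Gemini", "Claude", "Perplexity", "Meta AI"]


def ask_engine(engine: str, question: str) -> str:
    """Hypothetical placeholder: call the engine however you normally would.

    We used each tool's default consumer experience; this function only marks
    where that call would happen in an automated replication.
    """
    raise NotImplementedError(f"Wire up {engine} here")


def run_experiment() -> dict[str, str]:
    # One question, asked once per engine; answers stored as-is for later review.
    return {engine: ask_engine(engine, QUESTION) for engine in ENGINES}
```

Keeping the setup this bare is the point: any difference in the answers comes from the engines, not from how they were prompted.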
How We Evaluated the Answers
We looked at each answer the way a real person would when they are close to making a decision.
We focused on three simple things.
1. Basic accuracy
- Were the recommendations grounded in reality?
- Did the tools mentioned actually fit a small B2B team?
- Were prices, use cases, and limitations described in a way that felt current and believable?
2. Decision usefulness
An answer can sound smart and still be useless. We looked for whether the response helped narrow choices, explained tradeoffs, and warned about things that could go wrong in the first year. If an answer just listed tools without helping us choose, it failed this test.
3. How uncertainty was handled
Buying decisions are messy. A good answer should admit that. We paid attention to whether the AI acknowledged limits, asked for missing context, or explained where its advice might break down.
We did not fact-check every sentence while reading. That is intentional. Most buyers do not. They skim, they trust the tone, and they move forward. So we judged the answers in that same mindset.
If an answer felt safe enough to act on without further checking, that was a signal. If it felt polished but vague, that was also a signal.
With that lens in place, the patterns became much easier to see.
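If you want to apply the same lens yourself, the sketch below records the three checks as a simple per-engine score card. The field names are our own shorthand for the criteria above, not a formal rubric, and the example values are illustrative.

```python
# A lightweight way to record the reading lens above, per engine (illustrative only).
from dataclasses import dataclass


@dataclass
class AnswerReview:
    engine: str
    accurate: bool            # recommendations grounded in reality and reasonably current
    decision_useful: bool     # narrowed choices, explained tradeoffs, flagged first-year risks
    admits_uncertainty: bool  # acknowledged limits or asked for missing context
    notes: str = ""

    def safe_to_act_on(self) -> bool:
        # "Felt safe enough to act on" only if all three checks hold.
        return self.accurate and self.decision_useful and self.admits_uncertainty


# Example: logging one impression after reading a single answer (hypothetical engine and values).
review = AnswerReview(
    engine="example-engine",
    accurate=True,
    decision_useful=True,
    admits_uncertainty=False,
    notes="Cited sources, but never asked about our sales motion.",
)
print(review.safe_to_act_on())  # False: confidence without caveats is a signal, not a green light
```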
What Each AI Engine Recommended (Using the Same Question)
Before looking at where AI search breaks down, it’s worth seeing what actually came back when we asked the same buying question across six different AI engines.
The question was:
“What is the best CRM for a small B2B team with under 20 sales reps, and what problems should we expect in the first year of using it?”
This is a question that founders, sales leaders, and operators ask every day.
Here’s what each engine recommended at a high level.
- ChatGPT leaned strongly toward Pipedrive as the default choice for small, sales-led teams, with HubSpot and Zoho positioned as context-dependent alternatives. It went deep on first-year problems, especially adoption, messy data, and over-customisation.
- Perplexity also recommended Pipedrive as the top option, backed by reasons around ease of use and fast setup. It supported this with references and added structured mitigation steps for common first-year issues.
- Gemini framed the decision as a balance between adoption and future scale, highlighting HubSpot and Pipedrive as leading options. It focused heavily on behavioural problems like shadow spreadsheets and over-engineering rather than tool limitations.
- Claude narrowed the choice quickly to HubSpot versus Pipedrive, depending on whether the team leaned more inbound or outbound. It acknowledged that teams often need to rebuild workflows within the first year as reality sets in.
- Meta AI offered a broader list that included HubSpot, Pipedrive, Zoho, ClickUp, and Monday. The advice stayed high-level, with generic setup and adoption challenges mentioned, but little prioritisation.
- Copilot grouped HubSpot, Pipedrive, and Zoho as safe options, then listed a wide range of first-year risks: low adoption, unclear goals, budget creep, and integration gaps.
At first glance, all of this looks reasonable.
Most engines converged on the same few CRMs. Most warned about similar first-year problems. If you were a buyer skimming these answers, you would probably feel reassured rather than alarmed.
That surface-level agreement is exactly why this experiment is interesting. Because once you look past the overlap, the differences in framing, assumptions, and risk handling start to matter much more than the shared recommendations.
Where AI Helped
AI was genuinely useful at the surface level.
It did a good job of summarising a crowded category. Instead of throwing every CRM into the mix, most answers narrowed the field to a few familiar options. That alone reduces mental load for someone who doesn’t want to research ten tools from scratch.
It also helped frame options in simple terms:
- Sales-led teams versus inbound-led teams.
- Ease of use versus long-term scale.
- Speed of setup versus depth of features.
These are real tradeoffs, and AI handled that framing well.
Another clear strength was simplifying complexity. CRMs are messy products with a lot of overlapping features. The answers stripped that down into ideas people can understand quickly, like adoption, pipeline visibility, and data hygiene. For an early-stage buyer, that kind of clarity is genuinely helpful.
If the goal was to get oriented, AI did its job.
Where AI Failed
The problems appear when you look past the polish.
- The biggest issue was uniformity of confidence. Many answers sounded aligned, but they were aligned in tone, not in logic. The same tools were recommended for different reasons, sometimes with opposing assumptions about how the team sells or grows, and the confident delivery made it easy to miss how much was being assumed.
- Important caveats were also uneven. Some answers went deep on first-year problems like adoption or over-customisation. Others mentioned the same risks in passing, without explaining how serious they can be. A buyer reading only one response would walk away with a very different sense of risk than a buyer reading another.
- Sources were another weak point. Even when references were present, it was not always clear what claims they supported. In some cases, the advice relied on general patterns rather than verifiable details. This is fine for learning, but risky for decisions.
- Most importantly, none of the answers made it obvious where they could be wrong. They spoke with authority, even though the right choice clearly depends on context that was never asked for.
That gap between how helpful the answers feel and how dependable they actually are is where AI search starts to break down.
The 4 Ways AI Search Breaks During Buying Decisions
1. Confidence Replaces Caution
The biggest risk wasn’t bad advice. It was advice that sounded finished before it was verified.
Every engine spoke with authority. Even when assumptions were being made, the tone stayed calm and decisive. Very few answers paused to say, “This depends” or “we might be missing something important.” As a result, the advice felt finished, even when it wasn’t.
In buying decisions, that confidence carries weight. People move forward not because the answer is perfect, but because it feels complete. An AI that admits uncertainty might feel less helpful in the moment, but it is actually safer.
2. Citations Exist, but Don’t Prove the Claim
In several cases, sources were general articles, review pages, or vendor sites that did not directly back up the specific recommendation being made. The presence of a link created trust, even when the link itself did not justify the conclusion.
For a buyer, this is subtle. Few people click through every source. Fewer still check whether the source really supports the statement. The result is borrowed credibility: the answer feels researched even when the key details remain unverified.
3. Engines Disagree on Facts, Not Just Opinions
It would be normal for AI engines to disagree on preferences. What is “best” often depends on taste or priorities.
What stood out here was disagreement on things that should be more stable:
- Which tools are appropriate for small teams.
- How serious the first-year risks are.
These were not framed as opinions. They were stated as facts.
This is where AI hallucination becomes commercially dangerous. When assumptions, partial data, or outdated signals are presented as facts, buyers don’t see uncertainty; they see authority.
That means two people asking the same question in different tools could walk away with incompatible understandings of the same decision. Both would feel informed. Only one, or neither, would be right.
4. “Good Enough” Answers Push Bad Decisions
AI search is optimised for speed and satisfaction. It aims to give an answer that feels good enough to move on.
For learning scenarios, that is fine. However, for buying scenarios, it is risky. The convenience becomes the deciding factor. People choose not because the answer is correct, but because it is easy to accept.
That is the quiet failure at the heart of AI search for buying. It does not usually mislead loudly. It nudges people forward just enough to decide before they have seen the full picture.
Is AI Search Actually Broken (or Just Misused?)
After reading all six answers, it’s tempting to say that user behaviour is the actual problem. However, that explanation doesn’t fully hold.
AI search is designed to feel final. The structure, the tone, and the speed all suggest that the work has been done for you. When an answer is clear, confident, and neatly packaged, it signals closure. Most users respond to that signal by moving on.
That’s where the real issue sits.
The systems are not broken because they give bad information all the time. They are broken because they blur the line between guidance and judgment.
- In this experiment, none of the engines asked clarifying questions before recommending tools.
- None of the six paused to say that the wrong choice could be costly if your sales motion or team structure is different.
- The responsibility to spot those gaps was pushed entirely onto the reader.
So, yes, AI search is being misused. However, it is also designed in a way that encourages misuse.
How to Use AI Search Safely for Buying Research
AI is great at helping you get started, but not very good at telling you when to stop. It can outline options, surface common issues, and give you language to work with. What it cannot reliably do is decide for you.
If you’re using AI while evaluating a purchase, a few habits make a big difference.
- First, treat the answer as a draft, not a verdict. Assume something important is missing. Even when the response feels complete, there are always constraints it did not ask about, like team maturity, budget ceilings, compliance needs, or internal politics.
- Second, use AI to compare, not to choose. Ask it what typically goes wrong with a tool, what teams regret after six months, or what situations a product is a bad fit for. Those questions tend to produce more useful signals than asking for “the best” option.
- Third, verify only the parts that matter. You don’t need to fact-check everything. Focus on the claims that would change your decision. Pricing tiers, feature limits, integration depth, and support quality should be confirmed from the primary sources.
- Finally, notice how confident the answer sounds. When advice is delivered without any uncertainty, that’s a sign to slow down, not speed up. Good decisions usually involve friction. AI removes that friction too well.
If used this way, AI search becomes a helpful assistant instead of a silent decision-maker.
What This Means Going Forward
AI search is already shaping how people discover options, shortlist products, and form opinions before they ever visit a website or talk to sales.
What this experiment shows is that AI can help; however, the risk is in how easily its answers feel complete. For buyers, the takeaway is simple. Use AI to get oriented, not to decide. The moment money, time, or long-term commitment is involved, AI answers should trigger better questions, not final choices.
If AI systems are the platforms where decisions are built, then visibility inside those systems matters as much as ranking in traditional search. Being mentioned, and supported by credible signals, is no longer optional.
This is where a Generative Engine Optimization Tool becomes relevant, not as a growth hack, but as a way to understand how your product is represented when AI summarises your category. AI search is powerful, but it is not neutral. It compresses complexity, smooths over uncertainty, and rewards answers that sound finished.
Visible brands understand how AI shapes decisions and design for that reality, instead of trusting confidence as a proxy for truth.
For buying questions like this, ReSO audits brand inclusion across buyer-intent AI queries, mapping where brands are referenced, omitted, or distorted inside AI-generated answers. Book a call with us to assess your brand’s AI visibility.