Sam Altman recently said that OpenAI’s latest and greatest AI model, GPT-4.5, still hallucinates 37% of the time. What he didn’t tell you is any of the context around how he got that number. On face value, it’s alarming. In reality, Machine & Partners’ Edmundo Ortega argues that it’s actually really impressive.
Let’s unpack hallucinations with an expert who builds with AI every day – and put some of the hype to bed for good.
Where do these hallucination metrics come from?
The details matter here. The benchmark that OpenAI used to produce that 37% figure was deliberately difficult and highly specific.
They packed their prompts with arcane, obscure, trivia-level facts—questions like “What position was Jean-Luc Johnson assigned to in Montreal in 1997?” They literally crowdsourced questions from domain experts in an attempt to stump the model.
“It’s a credit to OpenAI that they created a very hard benchmark to succeed against,” Ed said. “But it’s also deceiving.”
The answers to these questions are recorded and publicly available, but often in just one or two locations on the entire internet. So looking at it another way: GPT-4.5 got 63% of these answers right. That’s why Ed says the results were actually pretty incredible.
The 37% statistic doesn’t mean ChatGPT is inventing data over a third of the time, every time you use it. It means that when presented with esoteric trivia, the model doesn’t always come back with the right response.
The broader takeaway is this: Benchmarks do not always reflect how people use AI tools in real life. You won’t experience a 37% hallucination rate when you ask AI to summarize a report, draft a proposal, or turn a bullet list into an email. In those cases, Ed says hallucination rates are typically in the low single digits. And some of those remaining errors will be the result of poor prompting.
Why hallucinations happen
Ed believes a world without AI hallucinations is unlikely. Hallucination is a feature, not a bug – something inherent to how these models work rather than something we can simply fix.
Let’s break down how it works: An LLM doesn’t store facts like a database. It’s not retrieving and connecting information like a search engine. It’s predicting the next most probable word, based on everything that came before it. That probability is calculated across thousands of candidate words and word fragments, each ranked by how likely it is to come next given your prompt.
“You could argue that everything a model does is a hallucination. It's just that 95% of the time it's right.”
For example, if you type: “To be or not to…”, the model will almost certainly continue with “be,” because that’s what it has seen 99% of the time in its training data. That’s a high-probability prediction.
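If you want to see that ranking for yourself, here’s a minimal sketch of next-token prediction. It assumes the Hugging Face transformers library and the small open “gpt2” model purely as convenient stand-ins for whatever model you actually use – neither is something Ed specifies.

```python
# Minimal next-token probability sketch. Assumes `pip install torch transformers`
# and uses the small open "gpt2" model purely for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To be or not to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # scores for every token in the vocabulary

# Turn the scores for the next position into probabilities and rank them.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob.item():.1%}")
```

You’ll almost certainly see “ be” dominate that list – the high-probability case Ed describes. Swap in an obscure prompt and the probability mass spreads thin across many weak candidates.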
But as you stray into lower confidence territory, you’re more likely to get a wrong answer. Here’s a prompt Ed gave recently that got a hallucination back:
“What was Bill Murray wearing at the 2017 Academy Awards?”
The model responded confidently that Bill Murray wore a plaid cummerbund and a hat. But he wasn’t even at the Academy Awards that year.
So why did the model say that?
- The 2017 Oscars were widely documented because of the La La Land-Moonlight mix-up, but Bill Murray’s outfit wasn’t, because he wasn’t there. The model couldn’t find a true connection.
- But it could see that Bill Murray has attended the Academy Awards in the past and is often photographed in quirky outfits.
- And because Ed’s prompt implied he was there, the AI took his phrasing as fact.
Ed’s prompt was a leading question. And without any internal truth-checking system, the AI took Ed’s apparent confidence as gospel and gave him an answer that seemed to make sense based on pattern recognition. It chose a lower-confidence option because there simply wasn’t much to go on.
“It'll make up a source that sounds good just because you asked for one,” says Ed.
Now, should the AI have a built-in system that recognizes and acknowledges when it can’t return a good answer because it only has low-probability options? Yes, and Ed says some AI developers are already working on that. But in the meantime, we need to reassess how we’re communicating with it.
Your part in hallucinations
Ed says we have a tendency to lead AI into corners. Here are three things you may be doing when you prompt that increase the likelihood of a hallucination:
- If you state things as fact to AI, it will believe you. Knowing that, you can phrase your prompts less as assumptions or statements and more as open questions (see the sketch after this list).
- AI doesn’t have the same subjective benchmarks as us. When you ask AI to write a ‘really good article,’ it has no basis for what ‘really good’ means. It can’t intuit your goals.
- AI will rarely volunteer its own uncertainty. AI usually won’t tell you “I don’t know” – this is one of the flaws in the models that we mentioned above. So you need to be aware of its blind spots and prompt around them.
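To make that first point concrete, here’s a minimal sketch contrasting a leading prompt with a neutral one, reusing the Bill Murray example from earlier. The exact wording is just an illustration, not a recommended template.

```python
# Two versions of the same question. The strings are the point here;
# send them with whichever chat client you normally use.

leading_prompt = (
    "What was Bill Murray wearing at the 2017 Academy Awards?"
)  # Presents his attendance as fact, so the model tries to satisfy it.

neutral_prompt = (
    "Did Bill Murray attend the 2017 Academy Awards? "
    "If he did, what was he wearing? If you're not sure, say so."
)  # Asks the question first, and explicitly allows uncertainty.
```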
Ed also says you might want to lower your expectations a tad. If you need highly specific information on 1800s gothic French poetry, realize that you might be the only one asking – and that the AI doesn’t have enough context to answer you confidently.
Some of the situations we’re labeling ‘hallucinations’ are actually more like reasonable margins of error. So instead of always putting it back on AI, consider that you may be leading each other astray.
Staying in the single-digit hallucination range
While ‘acceptance’ is Ed’s biggest piece of advice, here are some tips he shared for keeping your outcomes in the highest-probability zone:
- Understand that it will always comply rather than correct you. Asking questions like “in what book did Sherlock Holmes punch someone in the face?” implies a fact that may not be true. It’s not going to tell you you’re wrong, so it will find the closest possible option to satisfy you.
- Make ‘no’ an answer. Prompting the AI with “only answer if you’re confident” or “respond with ‘I don’t know’ if you’re unsure” helps curb its overconfidence.
- Break down big prompts into smaller sub-prompts. When you ask a model to write a 1,000-word report from scratch, you’re giving it way too much room to guess. By breaking the task down into smaller components, it’s easier to track where things go off the rails and keep the AI focused on your goal.
- Feed it sources you’ve vetted. If you’re worried the AI doesn’t have enough quality data to pull from, supplement it. Upload documents and data, share links. This reduces the likelihood it’s pulling from lower-probability outcomes (see the sketch after this list).
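Here’s a minimal sketch that combines the last two tips – grounding the model in a source you’ve vetted and making ‘I don’t know’ an acceptable answer. It assumes the official OpenAI Python SDK; the model name, file name, and question are hypothetical placeholders, and any chat-style API follows the same basic shape.

```python
# A minimal sketch: ground the model in a vetted document and make
# "I don't know" an acceptable answer. Assumes `pip install openai` and
# an OPENAI_API_KEY in your environment; the file name, model name,
# and question below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

source_text = open("q3_report.txt").read()  # a document you've vetted yourself

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whichever chat model you have access to
    messages=[
        {
            "role": "system",
            "content": (
                "Answer only using the document provided. "
                "If the document doesn't contain the answer, reply exactly: I don't know."
            ),
        },
        {
            "role": "user",
            "content": f"Document:\n{source_text}\n\nQuestion: What was revenue growth in Q3?",
        },
    ],
)

print(response.choices[0].message.content)
```

The same idea extends to breaking work into sub-prompts: ask for an outline in one grounded request, then draft each section in its own follow-up so you can spot exactly where things start to drift.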