Tech giant OpenAI has hit an unexpected roadblock with its latest artificial intelligence models. The company’s new reasoning models, o3 and o4-mini, are showing a concerning spike in hallucination rates—essentially making up information that isn’t true—compared to their predecessors.
Hallucination Rates Surge in o3 and o4-mini
According to OpenAI’s internal testing, the o3 model hallucinated in 33% of cases on the company’s PersonQA benchmark, roughly double the rate of earlier models such as o1 (16%) and o3-mini (14.8%). Even more troubling, the o4-mini model hallucinated in 48% of test cases.
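To make figures like “33% of cases” concrete, here is a minimal, hypothetical Python sketch of how a hallucination rate can be tallied over a question-answering benchmark: grade each answer for unsupported claims and report the flagged fraction. This is only an illustration of the underlying arithmetic, not OpenAI’s PersonQA methodology; the QARecord, is_hallucinated, and hallucination_rate names are invented for this example.

# Hypothetical illustration only -- not OpenAI's PersonQA grading pipeline.
from dataclasses import dataclass

@dataclass
class QARecord:
    question: str
    model_answer: str
    reference_facts: set  # facts the answer is allowed to assert

def is_hallucinated(record: QARecord) -> bool:
    # Naive placeholder grader: flag the answer if it asserts any claim
    # not found in the reference set. Real graders are far more careful.
    claims = {c.strip().lower() for c in record.model_answer.split(";") if c.strip()}
    return any(c not in record.reference_facts for c in claims)

def hallucination_rate(records: list) -> float:
    # Fraction of answers flagged as containing unsupported claims.
    return sum(is_hallucinated(r) for r in records) / len(records) if records else 0.0

# Example: 1 of 3 answers asserts an unsupported claim, giving a 33% rate
# (comparable in form, not substance, to the figure reported for o3).
sample = [
    QARecord("Where was Ada Lovelace born?", "london", {"london"}),
    QARecord("Who wrote Dracula?", "bram stoker", {"bram stoker"}),
    QARecord("When was Python released?", "1989", {"1991"}),
]
print(f"Hallucination rate: {hallucination_rate(sample):.0%}")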
Why Is This Happening?
OpenAI has admitted that it does not fully understand why hallucination rates have increased. The company suggests that the reinforcement learning methods used to develop these models might be amplifying problems that older techniques managed to avoid. Researchers have also found that o3 sometimes fabricates actions, such as claiming to have run code on hardware that doesn’t exist.
Industry Concerns
AI experts warn that higher hallucination rates could undermine trust in AI systems, especially in fields like medicine, finance, and legal research, where accuracy is critical. Sarah Chen, a leading tech analyst, stated, “These hallucination rates could potentially undermine years of work building public trust in AI systems.”
Despite Issues, Models Excel in Other Areas
Despite these setbacks, the o3 model achieved an impressive 69.1% score on the SWE-bench coding benchmark, with o4-mini close behind at 68.1%. Both models also perform strongly on mathematical tasks, making them valuable tools for developers and researchers.
What’s Next for OpenAI?
OpenAI is now focusing additional research on understanding and reducing hallucinations in its models. The company remains committed to improving AI reliability while maintaining its advances in reasoning and problem-solving capabilities.