According to a recent study, even the best AI models regularly hallucinate, generating false or misleading information. All generative AI models, from Google's Gemini to Anthropic's Claude to OpenAI's latest GPT-4o, have this problem, though the kind and frequency of errors differ depending on the training data.
To assess these hallucinations, researchers from Cornell, the universities of Washington and Waterloo, and the nonprofit AI2 compared model outputs against reliable sources on a variety of subjects, including geography, history, health, and law. According to the data, no model performed consistently well across all topics, and the models that hallucinated least did so in part because they declined to answer questions they might otherwise have gotten wrong.
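The paper describes its own grading pipeline in detail; purely as an illustration of what "comparing model outputs against reliable sources" can look like in code, here is a minimal Python sketch. The FactQuestion record and the ask_model and matches_reference helpers are hypothetical stand-ins, not the researchers' actual tooling.

```python
from dataclasses import dataclass

@dataclass
class FactQuestion:
    topic: str              # e.g. "geography", "health", "law"
    question: str
    reference_answer: str   # answer taken from an authoritative source

def ask_model(question: str) -> str:
    """Stand-in for a call to whichever model is being evaluated."""
    raise NotImplementedError

def matches_reference(answer: str, reference: str) -> bool:
    """Stand-in for the grading step: does the model's claim agree with the source?"""
    raise NotImplementedError

def evaluate(questions: list[FactQuestion]) -> dict[str, float]:
    """Tally correct answers, refusals, and hallucinations over a question set."""
    correct = refused = hallucinated = 0
    for q in questions:
        answer = ask_model(q.question)
        if not answer.strip() or "i don't know" in answer.lower():
            refused += 1        # declining to answer is not counted as a hallucination
        elif matches_reference(answer, q.reference_answer):
            correct += 1
        else:
            hallucinated += 1   # a confident answer contradicted by the source
    total = max(len(questions), 1)
    return {
        "accuracy": correct / total,
        "refusal_rate": refused / total,
        "hallucination_rate": hallucinated / total,
    }
```

Under this kind of tally, a model that refuses more often can show a lower hallucination rate without actually answering more questions correctly, which is exactly the trade-off the study observed.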
The research underscores the persistent difficulty of relying on AI-generated material: even the most advanced models produced accurate, hallucination-free text in only about 35% of cases. Whereas previous studies frequently focused on questions whose answers are easily found on Wikipedia, this investigation also included harder subjects not well covered there, such as culture, finance, and medicine. The researchers evaluated more than a dozen well-known models, including Google's Gemini 1.5 Pro, Meta's Llama 3, and OpenAI's GPT-4o.
The study found that although AI models have advanced, their hallucination rates have not decreased noticeably. OpenAI's models were among the least likely to produce inaccurate results, yet even they had more trouble answering questions about finance and celebrities than questions about geography and computer science.
Models lacking web search capabilities struggled with questions not covered by Wikipedia, and smaller models outperformed larger ones on hallucination rate. These results cast doubt on the progress AI vendors claim to have made.
The research suggests that hallucinations will remain a problem for some time and that the benchmarks used to assess these models may not be sufficient. As an interim fix, the researchers propose programming models to decline to answer more often; Claude 3 Haiku, which improved its accuracy by not responding to roughly 28% of the questions, is one example. On the other hand, it is unclear whether people will tolerate a model that consistently refuses to answer.
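The paper does not prescribe how such selective answering should be built; one common pattern, sketched below with hypothetical names (answer_or_abstain, ask_model_with_confidence), is to attach an uncertainty estimate to each answer and refuse whenever it falls below a threshold.

```python
def ask_model_with_confidence(question: str) -> tuple[str, float]:
    """Stand-in for any scheme that pairs an answer with an uncertainty estimate,
    for example asking the model to rate its own confidence, or sampling it
    several times and measuring how often the answers agree."""
    raise NotImplementedError

def answer_or_abstain(question: str, threshold: float = 0.75) -> str:
    """Return the model's answer only when its estimated confidence clears the
    threshold; otherwise refuse, trading coverage for fewer hallucinations."""
    answer, confidence = ask_model_with_confidence(question)
    if confidence < threshold:
        return "I'm not confident enough to answer that reliably."
    return answer
```

Raising the threshold pushes a system toward the behavior Claude 3 Haiku showed in the study: more refusals, fewer confidently wrong answers.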
The researchers also support regulation that ensures human experts are involved in verifying AI-generated content, as well as more concentrated efforts to reduce hallucinations, perhaps through human-in-the-loop fact-checking and improved citation practices. They believe there is a lot of potential to improve fact-checking tools and to offer corrections for content tainted by hallucinations.