theregister.com

AI models hallucinate, and doctors are OK with that

The tendency of AI models to hallucinate – that is, to confidently make stuff up – isn't sufficient to disqualify them from use in healthcare settings. So, researchers have set out to enumerate the risks and formulate a plan to do no harm while still allowing medical professionals to consult with unreliable software assistants.

No fewer than 25 technology and medical experts from respected academic and healthcare organizations, not to mention a web search ad giant – MIT, Harvard Medical School, University of Washington, Carnegie Mellon University, Seoul National University Hospital, Google, Columbia University, and Johns Hopkins University – have taken it upon themselves to catalog and analyze medical hallucinations in mainstream foundation models, with an eye toward formulating better rules for working with AI in healthcare settings.

Their work, published in a preprint paper titled "Medical Hallucinations in Foundation Models and Their Impact on Healthcare" and in a supporting GitHub repository, argues that harm mitigation strategies need to be developed.

These hallucinations use domain-specific terms and appear to present coherent logic, which can make them difficult to recognize

The authors start from the premise that foundation models – huge neural networks trained on a ton of people's work and other data – from the likes of Anthropic, Google, Meta, and OpenAI present "significant opportunities, from enhancing clinical decision support to transforming medical research and improving healthcare quality and safety."

And given that starting point – and the affiliation of at least one researcher with a major AI vendor – it's perhaps unsurprising that the burn-it-with-fire scenario is not considered.

Rather, the authors set out to create a taxonomy of medical hallucinations, which, they claim, differ from erroneous AI answers in less consequential contexts.

"Medical hallucinations exhibit two distinct features compared to their general purpose counterparts," the authors explain. "First, they arise within specialized tasks such as diagnostic reasoning, therapeutic planning, or interpretation of laboratory findings, where inaccuracies have immediate implications for patient care. Second, these hallucinations frequently use domain-specific terms and appear to present coherent logic, which can make them difficult to recognize without expert scrutiny."

The taxonomy, rendered visually in the paper as a pie chart, includes: Factual Errors; Outdated References; Spurious Correlations; Fabricated Sources or Guidelines; and Incomplete Chains of Reasoning.

The authors also looked at the frequency with which such hallucinations appear. Among various tests, the boffins evaluated the clinical reasoning abilities of five general-purpose LLMs – o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and claude-3.5-sonnet – on three targeted tasks: ordering events chronologically; lab data interpretation; and differential diagnosis generation, the process of assessing symptoms and exploring possible diagnoses. Models were rated on a scale of No Risk (0) to Catastrophic (5).

The results were not great, though some models fared better than others: "Diagnosis Prediction consistently exhibited the lowest overall hallucination rates across all models, ranging from 0 percent to 22 percent," the paper says. "Conversely, tasks demanding precise factual recall and temporal integration – Chronological Ordering (0.25 - 24.6 percent) and Lab Data Understanding (0.25 - 18.7 percent) – presented significantly higher hallucination frequencies."

The findings, the authors say, challenge the assumption that diagnostic tasks require complex inference that LLMs are less able to handle.

"Instead, our results suggest that current LLM architectures may possess a relative strength in pattern recognition and diagnostic inference within medical case reports, but struggle with the more fundamental tasks of accurately extracting and synthesizing detailed factual and temporal information directly from clinical text," they explain.

Among the general-purpose models, Anthropic's Claude-3.5 and OpenAI's o1 had the lowest hallucination rates across the three tested tasks. These findings, the researchers argue, suggest high-performing models show promise for diagnostic inference. But the continued occurrence of errors rated Significant (2) or Considerable (3) means even the best-performing models require careful monitoring and a human in the loop for clinical tasks.

The researchers also conducted a survey of 75 medical practitioners about their use of AI tools. And there's no going back, it seems: "40 used these tools daily, 9 used them several times per week, 13 used them a few times a month, and 13 reported rare or no usage," the paper says, adding that 30 respondents expressed high levels of trust in AI model output.

That lack of skepticism from 40 percent of the survey participants (30 of the 75) is all the more surprising considering that "91.8 percent have encountered medical hallucination in their clinical practice" and that "84.7 percent have considered that hallucination they have experienced could potentially affect patient health."

We're left to wonder whether newly hired medical personnel would be afforded an error rate to match that of the hallucinating AI models.

The researchers conclude by emphasizing that regulations are urgently needed and that legal liability for errors needs to be clarified.

"If an AI model outputs misleading diagnostic information, questions arise as to whether liability should fall on the AI developer for potential shortcomings in training data, the healthcare provider for over-reliance on opaque outputs, or the institution for inadequate oversight," the authors say.

Given the Trump administration's rollback of Biden-era AI safety rules, the researchers' call "for ethical guidelines and robust frameworks to ensure patient safety and accountability" may not be answered on a federal level. ®
