
Something Extremely Scary Happens When Advanced AI Tries to Give Medical Advice to Real World Patients


Image by Getty / Futurism

Last week, Google AI pioneer Jad Tarifi sparked controversy when he told Business Insider that it no longer makes sense to get a medical degree — since, in his telling, artificial intelligence will render such an education obsolete by the time you're a practicing doctor.

Companies have long touted the tech as a way to free up overworked doctors' time and even assist with specialized tasks, such as scanning medical imagery for tumors. Hospitals have already been rolling out AI to help with administrative work.

But given the current state of AI — from widespread hallucinations to "deskilling" experienced by doctors over-relying on it — there's reason to believe that med students should stick it out.

If anything, in fact, the latest research suggests we need human healthcare professionals now more than ever.

As PsyPost reports, researchers have found that frontier AI models fail spectacularly when the familiar formats of medical exams are even slightly altered, greatly undermining their ability to help patients in the real world — and raising the possibility that, instead, they could cause great harm by providing garbled medical advice in high-stakes health scenarios.

As detailed in a paper published in the journal JAMA Network Open, things quickly fell apart for models including OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet when the wording of questions in a benchmark test was only slightly adjusted.

The idea was to probe how large language models arrive at their answers: by predicting the probability of each subsequent word, rather than through any human-level understanding of complex medical terms.
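To make that mechanism concrete, here is a minimal sketch of next-token prediction using the small, open GPT-2 model via the Hugging Face transformers library. The prompt is invented for illustration, and the frontier models in the study (GPT-4o, Claude 3.5 Sonnet) cannot be inspected this way; this only shows the word-probability machinery the researchers were probing.

```python
# Minimal sketch: a language model assigns a probability to each possible next token.
# Uses the open GPT-2 model as a stand-in for the closed frontier models in the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical medical-style prompt, for illustration only.
prompt = "The first-line treatment for uncomplicated hypertension is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token only.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(token_id))!r}: {prob.item():.3f}")
```

The output is simply the five most probable continuations and their probabilities; nothing in the process checks whether a continuation is medically sound, which is why small changes in question wording can shift the predicted words so dramatically.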

"We have AI models achieving near perfect accuracy on benchmarks like multiple-choice based medical licensing exam questions," Stanford University PhD student and coauthor Suhana Bedi told PsyPost. "But this doesn’t reflect the reality of clinical practice. We found that less than five percent of papers evaluate LLMs on real patient data, which can be messy and fragmented."

The results left a lot to be desired. According to Bedi, "most models (including reasoning models) struggled" when it came to "Administrative and Clinical Decision Support tasks."
