ZDNET's key takeaways
People can't tell AI-generated responses from doctors' responses.
However, people trust AI responses more than those from doctors.
Integrating AI into clinical practice requires a nuanced approach.

There's a crisis-level shortage of doctors in the US. In the October issue of the prestigious New England Journal of Medicine, Harvard Medical School professor Isaac Kohane described how many large hospitals in Massachusetts, the state with the most doctors per capita, are refusing to admit new patients. The situation is only going to get worse, statistics suggest, wrote Kohane. As a result: "Whether out of desperation, frustration, or curiosity, large numbers of patients are already using AI to obtain medical advice, including second opinions -- sometimes with dramatic therapeutic consequences."

Also: Can AI outdiagnose doctors? Microsoft's tool is 4 times better for complex cases

The medical community is both interested in and somewhat concerned about the growing tendency for people to seek medical advice from ChatGPT and other generative AI systems. The concern appears warranted: people seem likely to trust a bot for medical advice more than they trust doctors, even when the bot's advice is of "low quality."

Testing how people view AI-generated medical advice

In a study published in June in The New England Journal of Medicine, titled "People Overtrust AI-Generated Medical Advice despite Low Accuracy," Shruthi Shekar and collaborators at MIT's Media Lab, Stanford University, Cornell University, Beth Israel Deaconess Medical Center in Boston, and IBM tested people's responses to medical advice from OpenAI's older GPT-3 model.

Shekar and team extracted 150 medical questions from an internet health site, HealthTap, and generated answers to them using GPT-3. A group of doctors was recruited to rate the AI answers for accuracy, assigning each a "yes," "no," or "maybe" for correctness. Shekar and team then curated three data sets of 30 question/answer pairs each: 30 with actual physicians' responses, 30 with "high-accuracy" AI responses, meaning those mostly rated correct by the doctors, and 30 with "low-accuracy" AI responses, those mostly rated "no" or "maybe."

They conducted three experiments. In the first, 100 subjects recruited online from the website Prolific were presented with 10 question/answer pairs randomly selected from the 90, without knowing whether they came from doctors or AI. The researchers asked each person to rate, on a scale of 1 to 5, how well they understood each question/response pair and how certain they were that its source was a person or AI.

Also: This one feature could make GPT-5 a true game changer (if OpenAI gets it right)

In a second experiment, a different group of 100 rated whether they thought the answers were "valid" and answered multiple-choice questions about whether they would be inclined to follow the medical advice given. This group wasn't told anything about doctors versus AI.

In the third and final experiment, another group of 100 was given a random sample of 10 questions and asked the same questions as before. This time, however, they were told at the start of the experiment that what they were about to examine came from AI, from a doctor, or from "a doctor assisted by AI."
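The curation step described above can be made concrete with a short sketch. The Python below is not the study's code; the data structure, field names, and the majority-vote cutoff for "mostly rated correct" are assumptions for illustration, since the article doesn't give the authors' exact threshold.

```python
# Sketch (not the study's code): split GPT-3 answers into "high-accuracy" and
# "low-accuracy" sets based on doctors' "yes"/"no"/"maybe" ratings, then sample
# 30 of each, mirroring the curation described in the article.
import random
from collections import Counter

def split_by_doctor_ratings(ai_pairs, n_per_set=30, seed=0):
    """ai_pairs: list of dicts such as
    {"question": "...", "ai_answer": "...", "ratings": ["yes", "no", "maybe", ...]}
    (an assumed structure, not the paper's actual schema)."""
    high, low = [], []
    for pair in ai_pairs:
        votes = Counter(pair["ratings"])
        # "mostly rated correct" approximated here as a strict majority of "yes" votes
        if votes["yes"] > len(pair["ratings"]) / 2:
            high.append(pair)
        else:
            low.append(pair)
    rng = random.Random(seed)
    # raises ValueError if fewer than n_per_set pairs fall in either bucket
    return rng.sample(high, n_per_set), rng.sample(low, n_per_set)
```

The resulting 30 high-accuracy and 30 low-accuracy AI pairs, plus 30 doctor-written pairs, form the 90-item pool the experiments draw from.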
Also: Stop using AI for these 9 work tasks - here's why

The labels were assigned at random, meaning that some answers written by AI might have been presented to subjects as written by doctors, or by doctors using AI.

People can't tell it's AI

The authors then analyzed the subjects' performance in each experiment. In experiment one, participants did a poor job of guessing whether a question/answer pair was human- or AI-sourced, little better than chance, in fact:

When participants were asked to determine the source of the medical response provided to them (doctor-written or AI-generated), there was an average source determination accuracy of 50% for doctors' responses, 53% for high-accuracy AI responses, and 50% for low-accuracy AI responses.

People were also very confident even when they were wrong. Although subjects guessed poorly, Shekar and team reported that they were highly confident their determinations of AI or human were accurate. "The level of confidence when participants guessed correctly and incorrectly was not significantly different," they noted.

In the second experiment, the subjects judged the AI-generated responses "to be significantly more valid than the doctors' responses," and even the "low-accuracy AI-generated responses performed very comparably with the doctors' responses." Remember, the low-accuracy AI responses were the ones doctors had deemed wrong, or at least possibly inaccurate.

Also: You can use Google's Math Olympiad-winning Deep Think AI model now - for a price

The same thing happened with trustworthiness: subjects rated the AI responses "significantly more trustworthy" than the doctors' responses, and they showed "a relatively equal tendency to follow the advice provided across all three response types," meaning high-accuracy AI, doctors, and low-accuracy AI.

People can be led to believe AI is a doctor

In the third test, with random labels suggesting a response came from AI, a doctor, or a doctor assisted by AI, the label pointing to a doctor as the source heavily influenced the subjects. "In the presence of the label 'This response to each medical question was given by a %(doctor),' participants tended to rate high-accuracy AI-generated responses as significantly more trustworthy" than when responses were labeled as coming from AI.

Even doctors can be fooled, it turns out. In a follow-up test, Shekar and team asked doctors to evaluate the question/answer pairs, both with and without being told which were AI-generated and which weren't. With labels indicating the source, the doctors "evaluated the AI-generated responses as significantly lower in accuracy." When they didn't know the source, "there was no significant difference in their evaluation in terms of accuracy," which, the authors write, shows that doctors have their own biases.

Also: Even OpenAI CEO Sam Altman thinks you shouldn't trust AI for therapy

In sum, people, even doctors, can't reliably tell AI from a human when it comes to medical advice, and, on average, lay people are inclined to trust AI responses more than doctors' responses, even when the AI responses are of low quality, meaning even when the advice is wrong, and even more so if they are led to believe the response actually comes from a doctor.
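To make "little better than chance" concrete, here is a minimal Python sketch, not the study's analysis code, of how source-determination accuracy can be compared with the 50% chance baseline the authors cite. The guesses and true labels are invented toy data.

```python
# Sketch: compare participants' source-determination accuracy against chance.
# All data below is made up for illustration; the study reported 50-53% accuracy.
from math import comb

# 1 = "AI-generated", 0 = "doctor-written"; one entry per rated question/answer pair
true_source = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0] * 10   # 100 toy pairs
guesses     = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1] * 10   # 100 toy participant guesses

n = len(true_source)
correct = sum(g == t for g, t in zip(guesses, true_source))
print(f"accuracy: {correct / n:.0%}")

# One-sided exact binomial test: the probability of getting at least this many
# pairs right by guessing at random (p = 0.5 per pair)
p_value = sum(comb(n, k) for k in range(correct, n + 1)) / 2 ** n
print(f"P(at least {correct}/{n} correct by chance) = {p_value:.3f}")
```

At the roughly 50% accuracy the study reports, a test like this would find nothing distinguishable from random guessing, which is the basis for the "little better than chance" characterization.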
The danger of believing AI advice

Shekar and team see a big concern in all this:

Participants' inability to differentiate between the quality of AI-generated responses and doctors' responses, regardless of accuracy, combined with their high evaluation of low-accuracy AI responses, which were deemed comparable with, if not superior to, doctors' responses, presents a concerning threat […] a dangerous scenario where inaccurate AI medical advice might be deemed as trustworthy as a doctor's response. When unaware of the response's source, participants are willing to trust, be satisfied, and even act upon advice provided in AI-generated responses, similarly to how they would respond to advice given by a doctor, even when the AI-generated response includes inaccurate information.

Shekar and team conclude that "expert oversight is crucial to maximize AI's unique capabilities while minimizing risks," including transparency about where advice is coming from. The results also mean that "integrating AI into medical information delivery requires a more nuanced approach than previously considered."

The conclusions are complicated, however, by an irony: people in the third experiment rated a response less favorably if they thought it came from a doctor "assisted by AI," which, they write, complicates "the ideal solution of combining AI's comprehensive responses with physician trust."

Let's examine how AI can help

To be sure, there is evidence that bots can be helpful in tasks such as diagnosis when used by doctors. A study published in the scholarly journal Nature Medicine in December, conducted by researchers at the Stanford Center for Biomedical Informatics Research and collaborating institutions, tested how physicians fared at diagnosing conditions in a simulated setting, meaning not with real patients, using either the help of GPT-4 or traditional physician resources.

The study was very positive for AI. "Physicians using the LLM scored significantly higher compared to those using conventional resources," wrote lead author Ethan Goh and team.

Also: Google upgrades AI Mode with Canvas and 3 other new features - how to try them

Putting the research together: if people tend to trust AI, and if AI has been shown to help doctors in some cases, the next stage may be for the entire field of medicine to grapple with how AI can help or hurt in practice. As Harvard professor Kohane argues in his opinion piece, what is ultimately at stake is the quality of care and whether AI can or cannot help: "In the case of AI, shouldn't we be comparing health outcomes achieved with patients' use of these programs with outcomes in our current primary-care-doctor–depleted system?"