Can You Trust AI for Medical Advice? New Study Uncovers the Risky Truth
According to a new study published in NPJ Digital Medicine, researchers in Spain set out to investigate whether large language models are reliable when it comes to giving health advice. They tested seven LLMs, including OpenAI's ChatGPT and GPT-4 and Meta's Llama 3, on 150 medical questions and found that results varied considerably across models. Search engines, too, often returned incomplete or incorrect results when asked health-related questions. Even though AI-powered chatbots are increasingly in demand, there has been little rigorous research showing that LLMs give reliable medical answers. The study found that LLM accuracy depends on question phrasing, retrieval bias, and reasoning, and that the models can still produce misinformation.
For
the study, the researchers assessed four search engines: Google,
Yahoo!, DuckDuckGo and Bing, and seven LLMs including ChatGPT, GPT-4,
Flan-T5, Llama3 and MedLlama3. The results showed that ChatGPT, GPT-4,
Llama3 and MedLlama3 had the upper hand in most evaluations, while
Flan-T5 lagged behind the pack. For search engines, the researchers
analyzed the top 20 ranked results. A passage extraction model identified relevant snippets, and a reading comprehension model determined whether each snippet contained a definitive yes/no answer. The researchers also simulated two user behaviors: lazy users stopped searching as soon as they found the first clear answer, while diligent users cross-referenced three sources before settling on an answer. Lazy users ended up with the more accurate answers, suggesting that top-ranked results are correct most of the time.
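To make the two simulated behaviors concrete, here is a minimal sketch of how they could be modeled over ranked snippet answers; the function names and example data are illustrative assumptions, not the study's actual code.

```python
# Illustrative simulation of the two user behaviors over ranked search snippets.
# In the study, a passage extraction model and a reading comprehension model
# produce the per-snippet yes/no labels; here they are assumed as input.
from collections import Counter
from typing import List, Optional

def lazy_user(answers: List[Optional[str]]) -> Optional[str]:
    """Stop at the first snippet that gives a definitive yes/no answer."""
    for ans in answers:                      # answers are ordered by search rank
        if ans in ("yes", "no"):
            return ans
    return None                              # no definitive answer found

def diligent_user(answers: List[Optional[str]], k: int = 3) -> Optional[str]:
    """Cross-reference the first k definitive answers and take the majority."""
    definitive = [a for a in answers if a in ("yes", "no")][:k]
    if not definitive:
        return None
    return Counter(definitive).most_common(1)[0][0]

# Hypothetical top-ranked answers for one health question (None = no clear answer).
ranked = ["yes", None, "yes", "no", None, "yes"]
print(lazy_user(ranked))       # -> "yes"
print(diligent_user(ranked))   # -> "yes" (majority of the first three definitive answers)
```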
For the large language models, the researchers used different prompting strategies: asking a question without any context, using friendly wording, and using expert wording. They also provided the LLMs with sample question-and-answer pairs, which helped some models but had no effect on others. A retrieval-augmented generation setup was tested as well, in which the LLMs were given search engine results before generating their own responses. Model performance was assessed through accuracy, the common errors in the responses, and how much retrieval augmentation improved the results.
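As a rough illustration of these strategies, the sketch below shows how the prompt variants and the retrieval-augmented setup might be assembled; the templates and wording are assumptions for illustration, not the prompts used in the paper.

```python
# Hypothetical prompt templates for the strategies described above.
NO_CONTEXT = "{question} Answer yes or no."
FRIENDLY = "Hi! I was wondering, {question} Could you tell me, yes or no?"
EXPERT = ("You are a medical expert. Based on current medical consensus, "
          "answer the following question with yes or no: {question}")

# Few-shot variant: prepend sample Q&A pairs before the real question.
FEW_SHOT = (
    "Q: Can vitamin C cure the common cold? A: no\n"
    "Q: Does regular exercise help lower blood pressure? A: yes\n"
    "Q: {question} A:"
)

def rag_prompt(question, snippets):
    """Retrieval-augmented variant: prepend search engine snippets to the question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Use the following search results to answer.\n{context}\n\n"
            f"Question: {question}\nAnswer yes or no:")

question = "Can zinc supplements shorten a cold?"
print(EXPERT.format(question=question))
print(rag_prompt(question, ["Some trials suggest zinc lozenges may shorten colds.",
                            "Evidence on zinc and cold duration is mixed."]))
```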
The results showed that the search engines answered 50-70% of queries accurately, while the LLMs reached roughly 80% accuracy. LLM responses varied with how questions were framed; the expert prompt (using an expert tone) was the most effective overall but sometimes produced less definitive answers. Bing gave the most reliable answers, though it was not significantly better than Yahoo!, Google, or DuckDuckGo. Many search engine results were irrelevant or off-topic, but precision improved to 80-90% once results were filtered for relevance. Smaller LLMs improved after search engine snippets were added to their prompts, yet poor-quality retrieval worsened LLM accuracy, especially for COVID-19-related queries.
An error analysis of the LLMs revealed three major failure modes on health-related queries: misreading the medical consensus, misinterpreting the question, and giving ambiguous answers. The study also showed that LLM performance varied by dataset, with questions from a 2020 dataset answered more accurately than those from a 2021 dataset.
