Helicobacter pylori infects about half of the global population and is a major cause of peptic ulcer disease and gastric cancer. Improving patient education can increase screening participation, enhance treatment adherence, and help reduce gastric cancer incidence. Recently, large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek-R1 have been explored as tools for producing patient-facing educational materials; however, their performance compared to expert gastroenterologists remains under evaluation. This narrative review analyzed seven peer-reviewed studies (2024–2025) assessing LLMs’ ability to answer H. pylori-related questions or generate educational content, evaluated against physician- and patient-rated benchmarks across six domains: accuracy, completeness, readability, comprehension, safety, and user satisfaction. LLMs demonstrated high accuracy, with mean accuracies typically ranging from approximately 77% to 95% across different models and studies, and with most models achieving values above 90%, comparable to or exceeding that of general gastroenterologists and approaching senior specialist levels. However, their responses were often judged as incomplete, described as “correct but insufficient.” The reading level of the responses was above the recommended sixth-grade level, though comprehension remained acceptable. Occasional inaccuracies in treatment advice raised safety concerns. Experts and medical trainees rated LLM outputs positively, while patients found them less clear and helpful. Overall, LLMs demonstrate strong potential to provide accurate and scalable H. pylori education for patients; however, heterogeneity between LLM versions (e.g., GPT-3.5, GPT-4, GPT-4o, and various proprietary or open-source architectures) and prompting strategies results in variable performance across studies. Enhancing completeness, simplifying language, and ensuring clinical safety are key to their effective integration into gastroenterology patient education.
Educational Materials for Helicobacter pylori Infection: A Comparative Evaluation of Large Language Models Versus Human Experts / Ortu, Giulia; Merola, Elettra; Pes, Giovanni Mario; Dore, Maria Pina. - In: AI. - ISSN 2673-2688. - 6:12(2025). [10.3390/ai6120311]
Educational Materials for Helicobacter pylori Infection: A Comparative Evaluation of Large Language Models Versus Human Experts
Ortu, Giulia; Merola, Elettra; Pes, Giovanni Mario; Dore, Maria Pina
2025-01-01


