LLMs are not the answer to medical diagnosis without help from other technologies, or doctors

Inflect Health
5 min read · Nov 13, 2024


Joshua Tamayo-Sarver, MD, PhD, FACEP, FAMIA

In recent years, Large Language Models (LLMs) have captured the imagination of the healthcare community with their potential to revolutionize clinical diagnosis. However, a closer examination of the fundamental differences between LLMs and human clinical reasoning reveals significant limitations that may prevent these models from achieving the level of reliability required for real-world medical decision-making.

The Nature of Clinical Diagnosis

Clinical diagnosis is a complex process that relies on expert decision-making, often following the rapid clinical decisions in context model (Tamayo-Sarver et al., 2005). This model involves two primary components:

  1. Pattern Recognition: Clinicians quickly identify familiar patterns based on their extensive training and experience.
  2. Pattern Verification: Experts then verify if the recognized pattern is correct or sufficiently close.

If the pattern is deemed sufficiently correct, the clinician proceeds with their diagnosis. However, if the pattern is not sufficiently correct, the process becomes more nuanced:

  • The clinician may attempt to modify the pattern and reassess its accuracy through additional testing and case consultation.
  • If modification is not possible or successful, the clinician switches to hypothetico-deductive reasoning, a more systematic approach to problem-solving.

This flexibility in reasoning is crucial in complex or atypical cases, where pattern recognition alone may be insufficient (Hruska et al., 2016).
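
As a rough illustration of that flow, the toy Python sketch below mirrors the control logic: recognize a pattern, verify it, try to refine it with more information, and only then switch reasoning modes. Every pattern, function, and threshold in it is an invented stand-in for a cognitive step, not part of the cited model or of any clinical software.

```python
# Illustrative only: toy patterns and placeholder functions, not a clinical tool.
KNOWN_PATTERNS = [
    {"diagnosis": "poorly controlled diabetes",
     "findings": {"polyuria", "elevated A1c"}},
    {"diagnosis": "community-acquired pneumonia",
     "findings": {"fever", "productive cough", "infiltrate on x-ray"}},
]

def recognize_pattern(findings):
    """Fast, experience-based match: the best-overlapping known pattern."""
    return max(KNOWN_PATTERNS, key=lambda p: len(p["findings"] & findings))

def verify_pattern(pattern, findings):
    """'Is this close enough?' Here, a toy overlap threshold."""
    return len(pattern["findings"] & findings) / len(pattern["findings"]) >= 0.75

def hypothetico_deductive_workup(findings):
    """Stand-in for the systematic fallback: generate and test hypotheses."""
    return "broaden differential, order targeted tests, re-evaluate"

def clinician_diagnose(findings, extra_findings=frozenset()):
    pattern = recognize_pattern(findings)            # 1. pattern recognition
    if verify_pattern(pattern, findings):            # 2. pattern verification
        return pattern["diagnosis"]
    revised = findings | set(extra_findings)         # modify: more tests, consultation
    if verify_pattern(pattern, revised):
        return pattern["diagnosis"]
    return hypothetico_deductive_workup(revised)     # switch reasoning modes

print(clinician_diagnose({"polyuria", "elevated A1c"}))   # pattern fits
print(clinician_diagnose({"fatigue", "weight loss"}))     # falls through to systematic work-up
```

The important feature is the last line of clinician_diagnose: when pattern matching and refinement both fail, the process changes method entirely rather than retrying the same one.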

The Functioning of Large Language Models

LLMs, by their very nature, operate exclusively through pattern recognition (Strasser, 2023). This limitation is inherent to their mathematical foundation:

  1. Pattern Recognition: The model identifies patterns in the input data based on its training.
  2. Pattern Verification: The model checks if the recognized pattern meets certain criteria or thresholds.

If the pattern is deemed sufficiently correct, the model proceeds with its output. If not, the model simply repeats the pattern recognition process, potentially with slight variations, but without the ability to switch to a fundamentally different reasoning approach (Reese et al., 2024; Tamayo-Sarver et al., 2005).
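
A toy sketch of that loop, in the same hypothetical style as the one above, makes the contrast visible. The lookup table and functions are invented for illustration; no real model or API is involved.

```python
# Illustrative only: `fake_llm` is a lookup table standing in for pattern recognition.
FAMILIAR_CASES = {"polyuria, elevated A1c": "poorly controlled diabetes"}

def fake_llm(case_text):
    # Pattern recognition: return a match if the case resembles "training data",
    # otherwise the closest (possibly wrong) familiar answer.
    return FAMILIAR_CASES.get(case_text, "most similar familiar diagnosis")

def verified(diagnosis):
    # Pattern verification: a stand-in confidence check.
    return diagnosis != "most similar familiar diagnosis"

def llm_diagnose(case_text, max_attempts=3):
    for _ in range(max_attempts):
        candidate = fake_llm(case_text)        # pattern recognition
        if verified(candidate):                # pattern verification
            return candidate
        case_text += " (try again)"            # slight variation, same mechanism
    return candidate                           # no hypothetico-deductive fallback exists

print(llm_diagnose("polyuria, elevated A1c"))             # familiar pattern
print(llm_diagnose("unusual constellation of symptoms"))  # still forced into a pattern
```

Note that the loop's only recourse is to rerun the same pattern-matching step on a slightly varied input; there is no branch that corresponds to a change of reasoning method.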

The Gap Between LLMs and Clinical Diagnosis

This fundamental difference in approach creates a significant gap between LLMs and human clinical reasoning:

  1. Lack of Contextual Understanding: While LLMs can process vast amounts of text, they struggle to truly understand the complex, multifaceted context of a patient’s condition (Lorè & Heydari, 2023; Thirunavukarasu et al., 2023). For example, an LLM can recognize that a patient with the right symptoms and test results matches the pattern of poorly controlled diabetes. It is much harder for the model to know how to sensitively uncover the underlying problem: the patient is rationing their medication because the money is going to their child’s tutoring.
  2. Inability to Perform Hypothetico-Deductive Reasoning: When pattern recognition fails, LLMs cannot switch to the more systematic reasoning approach that human clinicians employ (Reese et al., 2024). Consider an LLM asked to diagnose a patient whose uncommon symptoms fit no familiar disease pattern. A human clinician would systematically generate and test hypotheses, considering less obvious conditions and ordering targeted tests; the LLM, unable to move beyond its initial pattern recognition to that step-by-step approach, may instead produce irrelevant or incomplete diagnoses.
  3. Inconsistency in Performance: Studies have shown that LLMs’ performance in clinical tasks can vary significantly, with accuracy rates ranging from 64% to 98% depending on the specific task and model used (Sorin et al., 2023).
  4. Risk of Data Leakage and Contamination: Many existing medical evaluation benchmarks for LLMs face the risk of data leakage or contamination, potentially leading to overly optimistic performance estimates (Yan et al., 2024).

The Future of AI in Clinical Diagnosis

While LLMs show promise in certain aspects of healthcare, such as information extraction and guideline-based question-answering, their limitations in clinical diagnosis are significant (Sorin et al., 2023). The path forward may involve:

  1. Hybrid Systems: Combining LLMs with other AI systems capable of hypothetico-deductive reasoning to more closely emulate the expert decision-making process (Jia et al., 2024; Mumtaz et al., 2024); a rough sketch of this idea follows the list.
  2. Clinical Intelligence Data Layer: Developing a comprehensive data infrastructure that can provide context and support for AI-assisted diagnosis.
  3. Continued Human Oversight: Recognizing that AI systems, including LLMs, will likely remain tools to augment human clinicians rather than replace them entirely.
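
As a purely hypothetical illustration of the hybrid idea in the first item, the sketch below wires a pattern-based suggester to a systematic fallback module behind a confidence threshold. Every class, name, and number is an assumption made up for this example, not a description of any existing system.

```python
# Hypothetical wiring only: invented classes and thresholds, no real product or API implied.
from dataclasses import dataclass

@dataclass
class Suggestion:
    diagnosis: str
    confidence: float

class PatternLLM:
    """Stand-in for an LLM: fast pattern-based suggestions with a confidence score."""
    def suggest(self, case: str) -> Suggestion:
        if "classic presentation" in case:
            return Suggestion("poorly controlled diabetes", 0.95)
        return Suggestion("uncertain", 0.40)

class HypothesisEngine:
    """Stand-in for a systematic module that enumerates and tests hypotheses."""
    def work_up(self, case: str) -> Suggestion:
        return Suggestion("differential generated; targeted tests proposed", 0.80)

def hybrid_diagnose(case: str, llm: PatternLLM, engine: HypothesisEngine,
                    threshold: float = 0.90) -> Suggestion:
    suggestion = llm.suggest(case)             # pattern recognition first
    if suggestion.confidence < threshold:
        suggestion = engine.work_up(case)      # systematic fallback when the pattern is weak
    return suggestion                          # clinician review would follow in either path

print(hybrid_diagnose("classic presentation of hyperglycemia", PatternLLM(), HypothesisEngine()))
print(hybrid_diagnose("atypical, poorly fitting symptoms", PatternLLM(), HypothesisEngine()))
```

The hard parts are, of course, everything this sketch waves away: calibrating the confidence score, building a genuinely capable hypothesis-testing module, and preserving the clinician review that the third item insists on.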

In conclusion, while LLMs have demonstrated impressive capabilities in certain healthcare applications, their fundamental limitations in replicating the full spectrum of clinical reasoning processes suggest that they are unlikely to achieve the level of reliability required for autonomous clinical diagnosis in the foreseeable future. The healthcare community must remain cautious and critical in evaluating the true potential of these technologies, focusing on how they can best support and augment human clinical expertise rather than attempting to replace it.

Citations

Hruska, P., Hecker, K. G., Coderre, S., McLaughlin, K., Cortese, F., Doig, C., Beran, T., Wright, B., & Krigolson, O. (2016). Hemispheric activation differences in novice and expert clinicians during clinical decision making. Advances in Health Sciences Education: Theory and Practice, 21(5), 921–933. https://doi.org/10.1007/s10459-015-9648-3

Jia, M., Duan, J., Song, Y., & Wang, J. (2024). medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs (arXiv:2406.14326). arXiv. http://arxiv.org/abs/2406.14326

Lorè, N., & Heydari, B. (2023). Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing (arXiv:2309.05898). arXiv. https://doi.org/10.48550/arXiv.2309.05898

Mumtaz, U., Ahmed, A., & Mumtaz, S. (2024). LLMs-Healthcare: Current Applications and Challenges of Large Language Models in various Medical Specialties (arXiv:2311.12882). arXiv. https://doi.org/10.48550/arXiv.2311.12882

Reese, J. T., Danis, D., Caufield, J. H., Groza, T., Casiraghi, E., Valentini, G., Mungall, C. J., & Robinson, P. N. (2024). On the limitations of large language models in clinical diagnosis. medRxiv, 2023.07.13.23292613. https://doi.org/10.1101/2023.07.13.23292613

Sorin, V., Glicksberg, B. S., Barash, Y., Konen, E., Nadkarni, G., & Klang, E. (2023, November 4). Applications of Large Language Models (LLMs) in Breast Cancer Care. medRxiv. https://doi.org/10.1101/2023.11.04.23298081

Strasser, A. (2023). On pitfalls (and advantages) of sophisticated large language models (arXiv:2303.17511). arXiv. https://doi.org/10.48550/arXiv.2303.17511

Tamayo-Sarver, J. H., Dawson, N. V., Hinze, S. W., Cydulka, R. K., Wigton, R. S., & Baker, D. W. (2005). Rapid Clinical Decisions in Context: A Theoretical Model to Understand Physicians’ Decision-Making With an Application to Racial/Ethnic Treatment Disparities. In J. Jacobs Kronenfeld (Ed.), Health Care Services, Racial and Ethnic Minorities and Underserved Populations: Patient and Provider Perspectives (Vol. 23, pp. 183–213). Emerald Group Publishing Limited. https://doi.org/10.1016/S0275-4959(05)23009-0

Thirunavukarasu, A. J., Ting, D. S. J., Elangovan, K., Gutierrez, L., Tan, T. F., & Ting, D. S. W. (2023). Large language models in medicine. Nature Medicine, 29(8), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8

Yan, W., Liu, H., Wu, T., Chen, Q., Wang, W., Chai, H., Wang, J., Zhao, W., Zhang, Y., Zhang, R., & Zhu, L. (2024). ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World (arXiv:2406.13890). arXiv. https://doi.org/10.48550/arXiv.2406.13890
