The sheer volume of data accompanying a modern cancer diagnosis has become a significant hurdle for clinicians. As biomarker testing expands and patients survive longer, pathology reports have evolved into dense, longitudinal documents that often span multiple institutions and dozens of pages. This complexity creates a high-pressure environment where physicians must synthesize vast amounts of information quickly to determine the best course of treatment.
A new study from Northwestern Medicine suggests that AI can outperform physicians at summarizing complex cancer pathology reports, offering a potential solution to this clinical burden. Researchers found that several large language models (LLMs) were more consistent and comprehensive than human physicians in capturing critical genomic and molecular details—the very data points that often dictate which targeted therapies a patient receives.
The research, titled “Toward Automating the Summarization of Cancer Pathology Reports Using Large Language Models to Improve Clinical Usability,” was published in JCO Clinical Cancer Informatics. By testing six different open-source models, the team demonstrated that AI can act as a high-fidelity safety net, ensuring that actionable genetic markers are not overlooked during the synthesis of a patient’s medical history.
The challenge of ‘information overload’ in oncology
In the current era of precision medicine, a pathology report is no longer a simple description of a tumor. It is a complex map of histopathological findings (microscopic characteristics), immunohistochemical results (protein expression), and molecular data. For a patient with lung cancer, these reports may be updated repeatedly as the disease evolves or as new sequencing technologies become available.

Dr. Mohamed Abazeed, chair and professor of radiation oncology at Northwestern University Feinberg School of Medicine and senior author of the study, noted that the burden of synthesizing these reports is growing rapidly as care becomes more complex. According to Abazeed, the goal is not to replace the physician, but to provide a tool that augments clinical decision-making by ensuring critical details are consistently captured.
When a clinician is under significant time pressure, the risk of missing a single actionable genetic marker increases. In oncology, such an omission can lead to the selection of a less effective treatment or the failure to utilize a life-saving targeted drug. This is why the study focused on “completeness”—the ability of a summary to reflect all relevant data points from the original, lengthy report.
Comparing human precision against open-source AI
To test the efficacy of these tools, Northwestern investigators analyzed 94 de-identified pathology reports from lung cancer patients. The researchers utilized open-source models—systems that can be downloaded and run locally to maintain data privacy—rather than consumer-facing chatbots. The study evaluated six specific models:
- Meta: Llama 3.0, 3.1, and 3.2
- Google: Gemma 9B
- Mistral: Mistral 7.2B
- DeepSeek: DeepSeek-R1
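Because these models run locally, the core engineering task is constructing a prompt that forces a structured, clinically organized summary. The sketch below is purely illustrative—the section headings and wording are assumptions, not the study's actual prompt—but it shows the general shape of asking a local LLM for a fixed-format pathology summary.

```python
# Hypothetical sketch of a structured-summary prompt for a locally hosted
# open-source LLM. The section headings are illustrative assumptions,
# not the prompt used in the Northwestern study.

SECTIONS = ["Histopathology", "Immunohistochemistry", "Molecular/Genomic Findings"]

def build_summary_prompt(report_text: str) -> str:
    """Ask the model for a summary organized under fixed clinical headings."""
    headings = "\n".join(f"- {s}" for s in SECTIONS)
    return (
        "You are summarizing a de-identified lung cancer pathology report.\n"
        "Produce a concise summary with exactly these sections:\n"
        f"{headings}\n"
        "Include every actionable molecular marker mentioned in the report.\n\n"
        f"REPORT:\n{report_text}"
    )

# A local runtime (e.g. llama.cpp, or Hugging Face transformers'
# pipeline("text-generation", model=...)) would then consume this prompt,
# keeping patient data on-premises rather than sending it to a cloud API.
```

Keeping the prompt as a pure function also makes the pipeline auditable: the exact text sent to the model for any given report can be logged and reviewed.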
A panel of oncologists then reviewed the AI-generated summaries alongside summaries previously written by physicians, grading them on accuracy, conciseness, potential clinical risk, and completeness. The results indicated that the AI models were consistently rated as more complete, particularly regarding molecular and genomic findings.
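The study's grading was done by expert oncologist review, not by software. Still, as a rough illustration of what the "completeness" criterion measures, one naive automated proxy is the fraction of reference findings from the full report that survive into a summary (the function and example findings below are hypothetical):

```python
# Illustrative only: the study used oncologist review, not automated scoring.
# This naive proxy measures what fraction of known reference findings
# appear verbatim (case-insensitively) in a candidate summary.

def completeness(summary: str, reference_findings: list[str]) -> float:
    text = summary.lower()
    found = sum(1 for f in reference_findings if f.lower() in text)
    return found / len(reference_findings) if reference_findings else 1.0

findings = ["EGFR exon 19 deletion", "PD-L1 80%", "adenocarcinoma"]
ai_summary = "Adenocarcinoma with EGFR exon 19 deletion; PD-L1 80%."
print(completeness(ai_summary, findings))  # 1.0: every reference finding is captured
```

A summary that drops the EGFR mutation would score 2/3 here—exactly the kind of omission the oncologist panel was checking for.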
| Top Performing Models | Key Strength | Clinical Application |
|---|---|---|
| DeepSeek-R1 | High Completeness | Genomic Data Synthesis |
| Llama 3.1 | High Accuracy | Structured Summarization |
| Gemma 9B / Mistral 7.2B | Competitive Performance | General Report Parsing |
Among the group, DeepSeek and Llama 3.1 emerged as the strongest performers. Study co-author Troy Teo, an instructor of radiation oncology at Feinberg, emphasized that reliable synthesis allows clinicians to review findings more efficiently and standardizes documentation, which ultimately allows physicians to dedicate more time to direct patient care.
A ‘second opinion’ for high-stakes decisions
The research team views AI not as a primary diagnostician, but as a support layer. By acting as a “second opinion” tool, the AI can highlight missing information or flag a specific protein expression that a human reviewer might have glossed over in a 30-page document.

Dr. Yirong Liu, a radiation oncology resident at McGaw Medical Center of Northwestern and the study’s first author, pointed out that patients with the most complex cancers stand to benefit the most. Because these patients often undergo repeated biopsies and sequencing over several years, their medical records become fragmented across different institutions. Ensuring that a key pathological finding is captured across this longitudinal timeline is critical for treatment accuracy.
The potential for risk reduction is significant. If an AI can reliably flag an actionable mutation that was buried in a previous report from another hospital, it directly impacts the patient’s therapeutic trajectory.
Next steps toward clinical implementation
Although the results are promising, the Northwestern team is proceeding with caution. They are currently developing an application powered by Llama 3.1 that would allow physicians to upload pathology reports and receive a structured summary for review. However, the authors have explicitly stated that the app requires further testing and extensive validation studies before it can be deployed in a live clinical setting.
The transition from a research study to a bedside tool involves solving challenges related to “hallucinations”—where AI might invent data—and ensuring absolute adherence to patient privacy laws. The focus remains on a “human-in-the-loop” system where the AI proposes a summary and the physician verifies and signs off on it.
The next phase for the research team involves conducting further validation studies to refine the accuracy of the Llama 3.1-based application and establishing a rigorous framework for clinical safety testing.
