ChatGPT-5 & Pediatric Pneumothorax: Chest X-Ray Diagnosis Accuracy

By Priyanka Patel, Tech Editor

ChatGPT-5 Shows Promise, But Falls Short in Pediatric Pneumothorax Detection

A new study reveals that while ChatGPT-5 demonstrates high specificity in identifying pneumothorax in pediatric chest X-rays, its limited sensitivity – particularly for small pneumothoraces – prevents its use as a standalone diagnostic tool.

The rise of artificial intelligence in healthcare has sparked both excitement and caution. Recent research has focused on the potential of large language models (LLMs) like ChatGPT to assist in medical image analysis. However, a new study published this month reveals significant limitations in the current capabilities of ChatGPT-5 when it comes to detecting pneumothorax – a collapsed lung – in children. Researchers found the model exhibited consistently high specificity (over 96%) and accurate lateral localization (also over 96%), but a modest overall accuracy (77-79%) and, critically, a low sensitivity (57-61%).

This pattern, observed across various prompting strategies, suggests that ChatGPT-5 adopts a conservative diagnostic approach, tending to avoid false positives at the expense of missing actual cases. “The model’s performance was substantially worse for small pneumothoraces,” the study authors noted, “underscoring its current suboptimal capability for identifying subtle radiographic signs in pediatric patients.”
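The trade-off the researchers describe falls directly out of how these metrics are defined. The short sketch below uses hypothetical counts (chosen only to mirror the reported ~97% specificity and ~59% sensitivity, not the study's actual data) to show how a conservative reader can score high specificity and decent overall accuracy while still missing many true cases:

```python
# Illustrative only: how high specificity can coexist with low sensitivity.
# The counts are hypothetical, picked to mirror the reported pattern,
# not taken from the study's dataset.
def sensitivity(tp, fn):
    """True positive rate: share of real pneumothoraces the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: share of normal films correctly cleared."""
    return tn / (tn + fp)

# A conservative reader rarely calls "positive", so it produces few false
# positives (high specificity) but misses many real cases (low sensitivity).
tp, fn = 59, 41   # of 100 true pneumothoraces, 41 are missed
tn, fp = 97, 3    # of 100 normal films, only 3 false alarms

print(f"sensitivity: {sensitivity(tp, fn):.2f}")              # 0.59
print(f"specificity: {specificity(tn, fp):.2f}")              # 0.97
print(f"accuracy:    {(tp + tn) / (tp + fn + tn + fp):.2f}")  # 0.78
```

Note that overall accuracy (here 0.78) can look respectable even when nearly half the true positives are missed, which is why the authors weight sensitivity so heavily in judging clinical readiness.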

LLMs Lag Behind Specialized AI in Pneumothorax Detection

The study’s findings place ChatGPT-5’s performance in context with previous research on multimodal LLMs. Earlier work with ChatGPT-4o, for example, showed an overall accuracy of 69% for chest and abdominal X-rays, but a significantly lower pneumothorax detection rate of just 41%. Other LLMs, such as Gemini 2.0 and Claude 3.5, also demonstrated varying performance, with accuracy declining sharply when assessing pediatric or small pneumothorax cases. One study found ChatGPT-4o achieved 82% accuracy for large pneumothoraces in adults, but only 20-42% accuracy in pediatric cases.

In contrast, specialized AI models, such as convolutional neural networks (CNNs), have demonstrated much higher accuracy in chest X-ray interpretation. Models like MobileLungNetV2 have achieved over 96% accuracy, precision, recall, and specificity. Commercial CNN tools, like Lunit INSIGHT CXR, have shown AUROC values between 0.88 and 0.99 for detecting major thoracic findings. However, even these specialized models struggle with smaller lesions, echoing the limitations observed in the ChatGPT-5 evaluation. This recurring issue highlights the inherent difficulty in interpreting subtle visual cues.

A Hybrid Approach May Hold the Key

Researchers suggest that the future of AI in medical imaging lies in a hybrid approach, combining the strengths of CNNs and LLMs. CNNs excel at pixel-level analysis and localization, while LLMs are adept at semantic comprehension and integrating information within a clinical context. “LLMs are probably best positioned as cognitive integrators rather than standalone detectors,” the study concludes.

Recent work supports this idea. One study demonstrated that ChatGPT could effectively stratify pneumothorax cases by difficulty, enabling a deep learning model to achieve near-FDA-grade performance. This illustrates how LLM-assisted preprocessing can enhance the performance of visual analysis models.
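A triage pipeline of that kind might look roughly like the following. This is a hedged sketch of the general idea, not any published system: `llm_grade_difficulty` and `cnn_detect` are hypothetical stand-ins for a multimodal LLM prompt and a trained CNN classifier, respectively.

```python
# Hypothetical hybrid triage: an LLM first grades case difficulty, then easy
# cases are routed to a fast CNN detector while hard (subtle) cases are
# escalated to a human reader. Both model calls are placeholder stubs.
from dataclasses import dataclass

@dataclass
class Report:
    case_id: str
    finding: str   # "pneumothorax" | "no pneumothorax" | "needs review"
    source: str    # which stage produced the call: "cnn" or "human"

def llm_grade_difficulty(image_meta: dict) -> str:
    # Placeholder: a real system would prompt a multimodal LLM with the
    # image and clinical context; here we read a stored label.
    return image_meta["difficulty"]

def cnn_detect(image_meta: dict) -> str:
    # Placeholder for a CNN classifier's prediction.
    return image_meta["cnn_call"]

def triage(image_meta: dict) -> Report:
    if llm_grade_difficulty(image_meta) == "hard":
        # Subtle/small pneumothoraces bypass the CNN and go to a radiologist.
        return Report(image_meta["id"], "needs review", "human")
    return Report(image_meta["id"], cnn_detect(image_meta), "cnn")

easy = {"id": "cxr-001", "difficulty": "easy", "cnn_call": "no pneumothorax"}
hard = {"id": "cxr-002", "difficulty": "hard", "cnn_call": "pneumothorax"}
print(triage(easy))   # CNN handles the routine film
print(triage(hard))   # subtle case escalated for human review
```

The design choice here matches the study's framing: the LLM acts as a cognitive integrator deciding *how* a case should be read, rather than as the detector itself.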

Study Limitations and Future Directions

The study acknowledges several limitations. Its retrospective, single-center design may limit generalizability. The analysis focused solely on pneumothorax and PA view chest X-rays, excluding cases with complicating factors like multiple pathologies or suboptimal imaging. The study population also had a relatively older pediatric age distribution, potentially affecting the findings’ applicability to younger children. Furthermore, the use of expert consensus as the reference standard, rather than direct comparison with clinicians, is a noted limitation.

Despite these limitations, the research provides valuable insights into the current capabilities – and limitations – of LLMs in pediatric pneumothorax detection. The authors emphasize the need for further research, including head-to-head comparisons between AI and human readers, and investigations into the impact of image quality and patient age.

Currently, ChatGPT and similar LLMs lack regulatory certification as medical devices, precluding their use as primary diagnostic tools. However, as AI technology continues to evolve, a hybrid CNN-LLM paradigm promises to improve the interpretability, robustness, and workflow adaptability of clinical AI, ultimately enhancing patient care.
