CXReasonAgent: Evidence-Grounded
Diagnostic Reasoning Agent for Chest X-rays

Hyungyung Lee, Hangyul Yoon, Edward Choi

KAIST AI Graduate School

Contact: Hyungyung Lee (ttumyche@kaist.ac.kr)

Overview

Chest X-ray interpretation is a multi-step diagnostic reasoning process that involves identifying anatomical regions, deriving measurements or spatial observations from the image, and applying diagnostic criteria. For diagnostic assistants to be reliable in clinical practice, their reasoning must therefore be grounded in verifiable diagnostic evidence derived from the image.

However, recent studies show that large vision-language models (LVLMs) often generate plausible but ungrounded responses that are not faithfully supported by diagnostic evidence in the image. In addition, LVLMs typically present reasoning only through textual explanations, making it difficult to verify how conclusions are derived from the image. Moreover, extending LVLMs to support diverse diagnostic tasks often requires costly retraining.

To address these limitations, we introduce CXReasonAgent, a diagnostic agent that integrates a large language model (LLM) with clinically grounded diagnostic tools. Instead of directly generating answers, the agent calls diagnostic tools that extract image-derived diagnostic evidence, including quantitative measurements and spatial observations, along with visual evidence presented on the image. The agent then produces responses grounded in this explicit diagnostic evidence.

To evaluate evidence-grounded diagnostic reasoning, we introduce CXReasonDial, a multi-turn dialogue benchmark containing 1,946 dialogues across 12 diagnostic tasks. The benchmark evaluates whether model responses are correctly grounded in diagnostic evidence across dialogue turns, reflecting the iterative nature of real clinical reasoning.

Key Contributions

  - CXReasonAgent, a diagnostic agent that integrates an LLM with clinically grounded diagnostic tools, producing responses grounded in verifiable, image-derived diagnostic evidence without costly retraining.
  - CXReasonDial, a multi-turn dialogue benchmark of 1,946 dialogues across 12 diagnostic tasks for evaluating evidence-grounded diagnostic reasoning.

How It Works

CXReasonAgent performs evidence-grounded diagnostic reasoning by combining an LLM with clinically grounded diagnostic tools. Given a user query and a chest X-ray, the agent identifies the requested diagnostic task, calls the appropriate tool to obtain image-derived evidence, and generates a response grounded in the returned evidence. This design supports reliable, verifiable, and coherent multi-turn diagnostic interactions.

Figure: CXReasonAgent framework

Step 1. Interpret the Query and Plan Tool Use

The agent first interprets the user query to identify the requested diagnostic task and the type of evidence needed. Queries may ask for diagnostic evidence such as measurements or spatial observations, or for visual evidence that presents this information directly on the image. Based on this interpretation, the agent selects the appropriate diagnostic tool.

Step 2. Execute Clinically Grounded Diagnostic Tools

The selected tool analyzes the chest X-ray and returns image-derived evidence. Depending on the query, the tool may provide quantitative measurements, spatial observations, diagnostic criteria, conclusions, or annotated visual evidence shown directly on the image. These tools are implemented with CheXStruct, a deterministic pipeline built from clinically grounded criteria defined in collaboration with radiologists.

Step 3. Generate Evidence-Grounded Responses

The agent then generates its response using the evidence returned by the tools, without directly relying on the image itself. This makes the reasoning process more transparent and verifiable, and helps maintain coherent evidence-grounded reasoning across multi-turn interactions.
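The three steps above can be sketched as a minimal tool-calling loop. Everything here is illustrative: the keyword-based `plan` router, the `inspiration_tool` stub, and the mocked rib count are hypothetical stand-ins; the actual agent uses an LLM for planning and runs CheXStruct-based tools on the real image.

```python
# Minimal sketch of the plan -> execute -> respond loop (hypothetical names).
from dataclasses import dataclass

@dataclass
class Evidence:
    measurements: dict   # image-derived quantitative evidence
    criterion: str       # clinically grounded criterion applied
    conclusion: str      # conclusion drawn from the criterion

def inspiration_tool(image) -> Evidence:
    # Stand-in for a deterministic tool that would count posterior ribs
    # visible above the diaphragm; the count is mocked here.
    ribs = 10
    return Evidence(
        measurements={"posterior_ribs_visible": ribs},
        criterion=">= 9 posterior ribs visible suggests adequate inspiration",
        conclusion="adequate" if ribs >= 9 else "inadequate",
    )

TOOLS = {"inspiration_adequacy": inspiration_tool}  # task -> diagnostic tool

def plan(query: str) -> str:
    # Step 1 (stub): identify the requested diagnostic task from the query.
    return "inspiration_adequacy" if "inspiration" in query.lower() else "unknown"

def answer(query: str, image=None) -> str:
    task = plan(query)                          # Step 1: interpret and plan
    if task not in TOOLS:
        return "No diagnostic tool available for this query."
    ev = TOOLS[task](image)                     # Step 2: execute the tool
    # Step 3: respond from the returned evidence, not the raw image.
    return (f"Inspiration is {ev.conclusion}: "
            f"{ev.measurements['posterior_ribs_visible']} posterior ribs visible "
            f"({ev.criterion}).")

print(answer("Is the inspiration adequate on this chest X-ray?"))
```

Because the response is assembled only from the `Evidence` object, every number and criterion it cites can be traced back to a tool call.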

Dialogue Examples

Figure: Dialogue examples comparing CXReasonAgent with conventional LVLMs

The examples illustrate two diagnostic scenarios: assessing inspiration adequacy and evaluating cardiomegaly using the cardiothoracic ratio (CTR). CXReasonAgent grounds its responses in image-derived diagnostic evidence and presents visual overlays for verification. In contrast, conventional LVLMs either generate unsupported estimates or cannot provide visual evidence for verification.
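As a concrete illustration of the CTR criterion above, the ratio can be computed from the widths of the heart and thorax. This is only a sketch: the `(x_min, y_min, x_max, y_max)` box format and the mock coordinates are assumptions, not CheXStruct's actual measurement procedure.

```python
# Hedged sketch: cardiothoracic ratio (CTR) from mock bounding boxes.
def width(box):
    x_min, _, x_max, _ = box
    return x_max - x_min

def cardiothoracic_ratio(heart_box, thorax_box):
    # CTR = maximal cardiac width / maximal thoracic width
    return width(heart_box) / width(thorax_box)

heart = (150, 200, 310, 330)    # hypothetical detections (pixels)
thorax = (60, 120, 420, 400)
ctr = cardiothoracic_ratio(heart, thorax)
# A CTR above 0.5 on a PA radiograph is the conventional cardiomegaly cutoff.
print(f"CTR = {ctr:.2f} -> {'cardiomegaly' if ctr > 0.5 else 'within normal limits'}")
# prints: CTR = 0.44 -> within normal limits
```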

Demo

We provide an interactive demo of CXReasonAgent that allows users to explore evidence-grounded diagnostic reasoning through multi-turn interactions with chest X-rays.

How to Access the Demo

When you click the demo link, a page will appear with a “Visit Site” button. Click “Visit Site” to open the demo interface.

How to Use the Demo

Once you enter the demo interface:

  1. Select one of the provided sample chest X-rays, or upload your own image

After selecting an image:

  1. Type your question in the chat box
  2. Press Enter to start the conversation

Note: The first response for a newly uploaded image may take a few seconds while the image is being processed.

Example Questions

You may try questions such as:

General diagnostic questions

  - "Is the inspiration adequate on this X-ray?"
  - "Does this patient show cardiomegaly?"

Diagnostic evidence questions

  - "What is the cardiothoracic ratio?"

Visual evidence requests

  - "Show the cardiothoracic ratio measurement on the image."

Supported Image Formats

The demo currently supports the following image formats:

.jpg / .jpeg / .png

Usage Limits

Citation

@article{lee2026cxreasonagent,
  title={CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays},
  author={Lee, Hyungyung and Yoon, Hangyul and Choi, Edward},
  journal={arXiv preprint arXiv:2602.23276},
  year={2026}
}