Skip to content

Latest commit

 

History

History
205 lines (203 loc) · 342 KB

File metadata and controls

205 lines (203 loc) · 342 KB

Medical Image Analysis

Title Date Abstract Comment CodeRepository
A Self supervised learning framework for imbalanced medical imaging datasets 2026-04-02
Show

Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging . 3) We evaluate eight representative SSL methods in 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.

None
Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification 2026-04-02
Show

Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.

None
OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images 2026-04-01
Show

Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named "OkanNet", which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of $7,023$ MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving $96.49%$ Accuracy and $0.963$ Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of $88.10%$; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately $3.2$ times faster ($311$ seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.

7 pag...

7 pages, 3 figures, 1 table

None
Optimized Weighted Voting System for Brain Tumor Classification Using MRI Images 2026-03-30
Show

The accurate classification of brain tumors from MRI scans is essential for effective diagnosis and treatment planning. This paper presents a weighted ensemble learning approach that combines deep learning and traditional machine learning models to improve classification performance. The proposed system integrates multiple classifiers, including ResNet101, DenseNet121, Xception, CNN-MRI, and ResNet50 with edge-enhanced images, SVM, and KNN with HOG features. A weighted voting mechanism assigns higher influence to models with better individual accuracy, ensuring robust decision-making. Image processing techniques such as Balance Contrast Enhancement, K-means clustering, and Canny edge detection are applied to enhance feature extraction. Experimental evaluations on the Figshare and Kaggle MRI datasets demonstrate that the proposed method achieves state-of-the-art accuracy, outperforming existing models. These findings highlight the potential of ensemble-based learning for improving brain tumor classification, offering a reliable and scalable framework for medical image analysis.

None
MEDIC-AD: Towards Medical Vision-Language Model's Clinical Intelligence 2026-03-28
Show

Lesion detection, symptom tracking, and visual explainability are central to real-world medical image analysis, yet current medical Vision-Language Models (VLMs) still lack mechanisms that translate their broad knowledge into clinically actionable outputs. To bridge this gap, we present MEDIC-AD, a clinically oriented VLM that strengthens these three capabilities through a stage-wise framework. First, learnable anomaly-aware tokens () encourage the model to focus on abnormal regions and build more discriminative lesion centered representations. Second, inter image difference tokens () explicitly encode temporal changes between studies, allowing the model to distinguish worsening, improvement, and stability in disease burden. Finally, a dedicated explainability stage trains the model to generate heatmaps that highlight lesion-related regions, offering clear visual evidence that is consistent with the model's reasoning. Through our staged design, MEDIC-AD steadily boosts performance across anomaly detection, symptom tracking, and anomaly segmentation, achieving state-of-the-art results compared with both closed source and medical-specialized baselines. Evaluations on real longitudinal clinical data collected from real hospital workflows further show that MEDIC-AD delivers stable predictions and clinically faithful explanations in practical patient-monitoring and decision-support workflows

None
TTE-CAM: Built-in Class Activation Maps for Test-Time Explainability in Pretrained Black-Box CNNs 2026-03-27
Show

Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at https://github.com/kdjoumessi/Test-Time-Explainability

Unloc...

Unlocking Test-Time Explainability from Pretrained Black-Box CNNs

Code Link
Adapting Frozen Mono-modal Backbones for Multi-modal Registration via Contrast-Agnostic Instance Optimization 2026-03-27
Show

Deformable image registration remains a central challenge in medical image analysis, particularly under multi-modal scenarios where intensity distributions vary significantly across scans. While deep learning methods provide efficient feed-forward predictions, they often fail to generalize robustly under distribution shifts at test time. A straightforward remedy is full network fine-tuning, yet for modern architectures such as Transformers or deep U-Nets, this adaptation is prohibitively expensive in both memory and runtime when operating in 3D. Meanwhile, the naive fine-tuning struggles more with potential degradation in performance in the existence of drastic domain shifts. In this work, we propose a registration framework that integrates a frozen pretrained \textbf{mono-modal} registration model with a lightweight adaptation pipeline for \textbf{multi-modal} image registration. Specifically, we employ style transfer based on contrast-agnostic representation generation and refinement modules to bridge modality and domain gaps with instance optimization at test time. This design is orthogonal to the choice of backbone mono-modal model, thus avoids the computational burden of full fine-tuning while retaining the flexibility to adapt to unseen domains. We evaluate our approach on the Learn2Reg 2025 LUMIR validation set and observe consistent improvements over the pretrained state-of-the-art mono-modal backbone. In particular, the method ranks second on the multi-modal subset, third on the out-of-domain subset, and achieves fourth place overall in Dice score. These results demonstrate that combining frozen mono-modal models with modality adaptation and lightweight instance optimization offers an effective and practical pathway toward robust multi-modal registration.

MICCA...

MICCAI Learn2Reg Challenge

None
Adapting Segment Anything Model 3 for Concept-Driven Lesion Segmentation in Medical Images: An Experimental Study 2026-03-26
Show

Accurate lesion segmentation is essential in medical image analysis, yet most existing methods are designed for specific anatomical sites or imaging modalities, limiting their generalizability. Recent vision-language foundation models enable concept-driven segmentation in natural images, offering a promising direction for more flexible medical image analysis. However, concept-prompt-based lesion segmentation, particularly with the latest Segment Anything Model 3 (SAM3), remains underexplored. In this work, we present a systematic evaluation of SAM3 for lesion segmentation. We assess its performance using geometric bounding boxes and concept-based text and image prompts across multiple modalities, including multiparametric MRI, CT, ultrasound, dermoscopy, and endoscopy. To improve robustness, we incorporate additional prior knowledge, such as adjacent-slice predictions, multiparametric information, and prior annotations. We further compare different fine-tuning strategies, including partial module tuning, adapter-based methods, and full-model optimization. Experiments on 13 datasets covering 11 lesion types demonstrate that SAM3 achieves strong cross-modality generalization, reliable concept-driven segmentation, and accurate lesion delineation. These results highlight the potential of concept-based foundation models for scalable and practical medical image segmentation. Code and trained models will be released at: https://github.com/apple1986/lesion-sam3

31 pages, 8 figures Code Link
PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI 2026-03-25
Show

MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.

18 pa...

18 pages, 9 figures, 6 tables

None
Causal Transfer in Medical Image Analysis 2026-03-25
Show

Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We studied spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organised them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.

None
MedAugment: Universal Automatic Data Augmentation Plug-in for Medical Image Analysis 2026-03-25
Show

Data augmentation (DA) has been widely leveraged in computer vision to alleviate data shortage, while its application in medical imaging faces multiple challenges. The prevalent DA approaches in medical image analysis encompass conventional DA, synthetic DA, and automatic DA. However, these approaches may result in experience-driven design and intensive computation costs. Here, we propose a suitable yet general automatic DA method for medical images termed MedAugment. We propose pixel and spatial augmentation spaces and exclude the operations that can break medical details and features. Besides, we propose a sampling strategy by sampling a limited number of operations from the two spaces. Moreover, we present a hyperparameter mapping relationship to produce a rational augmentation level and make the MedAugment fully controllable using a single hyperparameter. These configurations settle the differences between natural and medical images. Extensive experimental results on four classification and four segmentation datasets demonstrate the superiority of MedAugment. Compared with existing approaches, the proposed MedAugment prevents producing color distortions or structural alterations while involving negligible computational overhead. Our method can serve as a plugin without an extra training stage, offering significant benefits to the community and medical experts lacking a deep learning foundation. The code is available at https://github.com/NUS-Tim/MedAugment.

Knowl...

Knowledge-Based Systems Accepted

Code Link
Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification 2026-03-25
Show

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs' general capability across diverse vision-language tasks.

CVPR 2026(Findings) None
Bi-CRCL: Bidirectional Conservative-Radical Complementary Learning with Pre-trained Foundation Models for Class-incremental Medical Image Analysis 2026-03-24
Show

Class-incremental learning (CIL) in medical image-guided diagnosis requires retaining prior diagnostic knowledge while adapting to newly emerging disease categories, which is critical for scalable clinical deployment. This problem is particularly challenging due to heterogeneous data and privacy constraints that prevent memory replay. Although pretrained foundation models (PFMs) have advanced general-domain CIL, their potential in medical imaging remains underexplored, where domain-specific adaptation is essential yet difficult due to anatomical complexity and inter-institutional heterogeneity. To address this gap, we conduct a systematic benchmark of recent PFM-based CIL methods and propose Bidirectional Conservative-Radical Complementary Learning (Bi-CRCL), a dual-learner framework inspired by complementary learning systems. Bi-CRCL integrates a conservative learner that preserves prior knowledge through stability-oriented updates and a radical learner that rapidly adapts to new categories via plasticity-oriented learning. A bidirectional interaction mechanism enables forward transfer and backward consolidation, allowing continual integration of new knowledge while mitigating catastrophic forgetting. During inference, outputs from both learners are adaptively fused for robust predictions. Experiments on five medical imaging datasets demonstrate consistent improvements over state-of-the-art methods under diverse settings, including cross-dataset shifts and varying task configurations.

prepr...

preprint; under review

None
CoRe: Joint Optimization with Contrastive Learning for Medical Image Registration 2026-03-24
Show

Medical image registration is a fundamental task in medical image analysis, enabling the alignment of images from different modalities or time points. However, intensity inconsistencies and nonlinear tissue deformations pose significant challenges to the robustness of registration methods. Recent approaches leveraging self-supervised representation learning show promise by pre-training feature extractors to generate robust anatomical embeddings, that farther used for the registration. In this work, we propose a novel framework that integrates equivariant contrastive learning directly into the registration model. Our approach leverages the power of contrastive learning to learn robust feature representations that are invariant to tissue deformations. By jointly optimizing the contrastive and registration objectives, we ensure that the learned representations are not only informative but also suitable for the registration task. We evaluate our method on abdominal and thoracic image registration tasks, including both intra-patient and inter-patient scenarios. Experimental results demonstrate that the integration of contrastive learning directly into the registration framework significantly improves performance, surpassing strong baseline methods.

Preprint None
FixationFormer: Direct Utilization of Expert Gaze Trajectories for Chest X-Ray Classification 2026-03-24
Show

Expert eye movements provide a rich, passive source of domain knowledge in radiology, offering a powerful cue for integrating diagnostic reasoning into computer-aided analysis. However, direct integration into CNN-based systems, which historically have dominated the medical image analysis domain, is challenging: gaze recordings are sequential, temporally dense yet spatially sparse, noisy, and variable across experts. As a consequence, most existing image-based models utilize reduced representations such as heatmaps. In contrast, gaze naturally aligns with transformer architectures, as both are sequential in nature and rely on attention to highlight relevant input regions. In this work, we introduce FixationFormer, a transformer-based architecture that represents expert gaze trajectories as sequences of tokens, thereby preserving their temporal and spatial structure. By modeling gaze sequences jointly with image features, our approach addresses sparsity and variability in gaze data while enabling a more direct and fine-grained integration of expert diagnostic cues through explicit cross-attention between the image and gaze token sequences. We evaluate our method on three publicly available benchmark chest X-ray datasets and demonstrate that it achieves state-of-the-art classification performance, highlighting the value of representing gaze as a sequence in transformer-based medical image analysis.

None
URA-Net: Uncertainty-Integrated Anomaly Perception and Restoration Attention Network for Unsupervised Anomaly Detection 2026-03-24
Show

Unsupervised anomaly detection plays a pivotal role in industrial defect inspection and medical image analysis, with most methods relying on the reconstruction framework. However, these methods may suffer from over-generalization, enabling them to reconstruct anomalies well, which leads to poor detection performance. To address this issue, instead of focusing solely on normality reconstruction, we propose an innovative Uncertainty-Integrated Anomaly Perception and Restoration Attention Network (URA-Net), which explicitly restores abnormal patterns to their corresponding normality. First, unlike traditional image reconstruction methods, we utilize a pre-trained convolutional neural network to extract multi-level semantic features as the reconstruction target. To assist the URA-Net learning to restore anomalies, we introduce a novel feature-level artificial anomaly synthesis module to generate anomalous samples for training. Subsequently, a novel uncertainty-integrated anomaly perception module based on Bayesian neural networks is introduced to learn the distributions of anomalous and normal features. This facilitates the estimation of anomalous regions and ambiguous boundaries, laying the foundation for subsequent anomaly restoration. Then, we propose a novel restoration attention mechanism that leverages global normal semantic information to restore detected anomalous regions, thereby obtaining defect-free restored features. Finally, we employ residual maps between input features and restored features for anomaly detection and localization. The comprehensive experimental results on two industrial datasets, MVTec AD and BTAD, along with a medical image dataset, OCT-2017, unequivocally demonstrate the effectiveness and superiority of the proposed method.

Accep...

Accepted by IEEE TCSVT

None
Cerebra: A Multidisciplinary AI Board for Multimodal Dementia Characterization and Risk Assessment 2026-03-24
Show

Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra's potential for interpretable, robust decision support in clinical care.

None
Cycle Inverse-Consistent TransMorph: A Balanced Deep Learning Framework for Brain MRI Registration 2026-03-23
Show

Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. Experimental results demonstrate that CICTM achieves consistently strong performance across multiple evaluation criteria while maintaining stable and physically plausible deformation fields. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.

None
Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging 2026-03-23
Show

Deep learning models have gained increasing adoption in medical image analysis. However, these models often produce overconfident predictions, which can compromise clinical accuracy and reliability. Bridging the gap between high-performance and awareness of uncertainty remains a crucial challenge in biomedical imaging applications. This study focuses on developing a unified deep learning framework for enhancing feature integration, interpretability, and reliability in prediction. We introduced a cross-guided channel spatial attention architecture that fuses feature representations extracted from EfficientNetB4 and ResNet34. Bidirectional attention approach enables the exchange of information across networks with differing receptive fields, enhancing discriminative and contextual feature learning. For quantitative predictive uncertainty assessment, Monte Carlo (MC)-Dropout is integrated with conformal prediction. This provides statistically valid prediction sets with entropy-based uncertainty visualization. The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images. The proposed framework achieved strong classification performance with an AUC of 99.75% for COVID-19, 100% for Tuberculosis, 99.3% for Pneumonia chest X-rays, and 98.69% for retinal OCT images. Uncertainty-aware inference yields calibrated prediction sets with interpretable examples of uncertainty, showing transparency. The results demonstrate that bidirectional cross-attention with uncertainty quantification can improve performance and transparency in medical image classification.

18 pa...

18 pages, 8 figures, 5 tables

None
OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging 2026-03-23
Show

Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.

Accep...

Accepted by CVPR 2026 (Main)

None
HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis 2026-03-22
Show

We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position~$q$ (feature content), momentum~$p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC,2018 (89.38%), ISIC,2017 (88.40%), TN3K (87.05%), and ACDC (92.40%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85%) and PathMNIST (96.65%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior$,{>},$boundary$,{>},$exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.

Code Link
When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound 2026-03-22
Show

Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.

None
Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs 2026-03-21
Show

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing the hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.

None
DCG-Net: Dual Cross-Attention with Concept-Value Graph Reasoning for Interpretable Medical Diagnosis 2026-03-20
Show

Deep learning models have achieved strong performance in medical image analysis, but their internal decision processes remain difficult to interpret. Concept Bottleneck Models (CBMs) partially address this limitation by structuring predictions through human-interpretable clinical concepts. However, existing CBMs typically overlook the contextual dependencies among concepts. To address these issues, we propose an end-to-end interpretable framework \emph{DCG-Net} that integrates multimodal alignment with structured concept reasoning. DCG-Net introduces a Dual Cross-Attention module that replaces cosine similarity matching with bidirectional attention between visual tokens and canonicalized textual concept-value prototypes, enabling spatially localized evidence attribution. To capture the relational structure inherent to clinical concepts, we develop a Parametric Concept Graph initialized with Positive Pointwise Mutual Information priors and refined through sparsity-controlled message passing. This formulation models inter-concept dependencies in a manner consistent with clinical domain knowledge. Experiments on white blood cell morphology and skin lesion diagnosis demonstrate that DCG-Net achieves state-of-the-art classification performance while producing clinically interpretable diagnostic explanations.

None
Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation 2026-03-17
Show

In semantic segmentation, even state-of-the-art deep learning models fall short of the performance required in certain high-stakes applications such as medical image analysis. In these cases, performance can be improved by allowing a model to abstain from making predictions when confidence is low, an approach known as selective prediction. While well-known in the classification literature, selective prediction has been underexplored in the context of semantic segmentation. This paper tackles the problem by focusing on image-level abstention, which involves producing a single confidence estimate for the entire image, in contrast to previous approaches that focus on pixel-level uncertainty. Assuming the Dice coefficient as the evaluation metric for segmentation, two main contributions are provided in this paper: (i) In the case of known marginal posterior probabilities, we derive the optimal confidence estimator, which is observed to be intractable for typical image sizes. Then, an approximation computable in linear time, named Soft Dice Confidence (SDC), is proposed and proven to be tightly bounded to the optimal estimator. (ii) When only an estimate of the marginal posterior probabilities are known, we propose a plug-in version of the SDC and show it outperforms all previous methods, including those requiring additional tuning data. These findings are supported by experimental results on both synthetic data and real-world data from six medical imaging tasks, including out-of-distribution scenarios, positioning the SDC as a reliable and efficient tool for selective prediction in semantic segmentation.

48 pages, 11 figures None
Standardizing Medical Images at Scale for AI 2026-03-16
Show

Deep learning has achieved remarkable success in medical image analysis, yet its performance remains highly sensitive to the heterogeneity of clinical data. Differences in imaging hardware, staining protocols, and acquisition conditions produce substantial domain shifts that degrade model generalization across institutions. Here we present a physics-based data preprocessing framework based on the PhyCV (Physics-Inspired Computer Vision) family of algorithms, which standardizes medical images through deterministic transformations derived from optical physics. The framework models images as spatially varying optical fields that undergo a virtual diffractive propagation followed by coherent phase detection. This process suppresses non-semantic variability such as color and illumination differences while preserving diagnostically relevant texture and structural features. When applied to histopathological images from the Camelyon17-WILDS benchmark, PhyCV preprocessing improves out-of-distribution breast-cancer classification accuracy from 70.8% (Empirical Risk Minimization baseline) to 90.9%, matching or exceeding data-augmentation and domain-generalization approaches at negligible computational cost. Because the transform is physically interpretable, parameterizable, and differentiable, it can be deployed as a fixed preprocessing stage or integrated into end-to-end learning. These results establish PhyCV as a generalizable data refinery for medical imaging-one that harmonizes heterogeneous datasets through first-principles physics, improving robustness, interpretability, and reproducibility in clinical AI systems.

None
Halfway to 3D: Ensembling 2.5D and 3D Models for Robust COVID-19 CT Diagnosis 2026-03-16
Show

We propose a deep learning framework for COVID-19 detection and disease classification from chest CT scans that integrates both 2.5D and 3D representations to capture complementary slice-level and volumetric information. The 2.5D branch processes multi-view CT slices (axial, coronal, sagittal) using a DINOv3 vision transformer to extract robust visual features, while the 3D branch employs a ResNet-18 architecture to model volumetric context and is pretrained with Variance Risk Extrapolation (VREx) followed by supervised contrastive learning to improve cross-source robustness. Predictions from both branches are combined through logit-level ensemble inference. Experiments on the PHAROS-AIF-MIH benchmark demonstrate the effectiveness of the proposed approach: for binary COVID-19 detection, the ensemble achieves 94.48% accuracy and a 0.9426 Macro F1-score, outperforming both individual models, while for multi-class disease classification the 2.5D DINOv3 model achieves the best performance with 79.35% accuracy and a 0.7497 Macro F1-score. These results highlight the benefit of combining pretrained slice-based representations with volumetric modeling for robust multi-source medical imaging analysis. Code is available at https://github.com/HySonLab/PHAROS-AIF-MIH

Code Link
TopoCL: Topological Contrastive Learning for Medical Imaging 2026-03-15
Show

Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.

None
How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images 2026-03-15
Show

Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks, particularly in zero-shot settings where generalization is critical, remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. In this work, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle visual grounding from semantic grounding, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. We introduce new quantitative metrics and conduct detailed qualitative analyses. Our study across eight state-of-the-art (SOTA) medical MLLMs validates that they often fail to ground their predictions in clinically relevant image regions. We note that this finding is specific to medical image analysis; in contrast, prior work has shown that MLLMs are capable of grounding their predictions in the correct image regions when applied to natural scene images. Motivated by these findings, we propose VGRefine, a simple yet effective inference-time method that refines attention distribution to improve visual grounding in medical settings. Our approach achieves SOTA performance across 6 diverse Med-VQA benchmarks (over 110K VQA samples from 8 imaging modalities) without requiring additional training or external expert models. Overall, our work, for the first time, systematically validates inadequate visual grounding as one of the key contributing factors for medical MLLMs' under-performance. Additional experiments are included in the Supp.

Publi...

Published as a conference paper at ICLR 2026

None
Toward Clinically Ready Foundation Models in Medical Image Analysis: Adaptation Mechanisms and Deployment Trade-offs 2026-03-15
Show

Foundation models (FMs) have demonstrated strong transferability across medical imaging tasks, yet their clinical utility depends critically on how pretrained representations are adapted to domain-specific data, supervision regimes, and deployment constraints. Prior surveys primarily emphasize architectural advances and application coverage, while the mechanisms of adaptation and their implications for robustness, calibration, and regulatory feasibility remain insufficiently structured. This review introduces a strategy-centric framework for FM adaptation in medical image analysis (MIA). We conceptualize adaptation as a post-pretraining intervention and organize existing approaches into five mechanisms: parameter-, representation-, objective-, data-centric, and architectural/sequence-level adaptation. For each mechanism, we analyze trade-offs in adaptation depth, label efficiency, domain robustness, computational cost, auditability, and regulatory burden. We synthesize evidence across classification, segmentation, and detection tasks, highlighting how adaptation strategies influence clinically relevant failure modes rather than only aggregate benchmark performance. Finally, we examine how adaptation choices interact with validation protocols, calibration stability, multi-institutional deployment, and regulatory oversight. By reframing adaptation as a process of controlled representational change under clinical constraints, this review provides practical guidance for designing FM-based systems that are robust, auditable, and compatible with clinical deployment.

None
ITKIT: Feasible CT Image Analysis based on SimpleITK and MMEngine 2026-03-15
Show

CT images are widely used in clinical diagnosis and treatment, and their data have formed a de facto standard - DICOM. It is clear and easy to use, and can be efficiently utilized by data-driven analysis methods such as deep learning. In the past decade, many program frameworks for medical image analysis have emerged in the open-source community. ITKIT analyzed the characteristics of these frameworks and hopes to provide a better choice in terms of ease of use and configurability. ITKIT offers a complete pipeline from DICOM to 3D segmentation inference. Its basic practice only includes some essential steps, enabling users with relatively weak computing capabilities to quickly get started using the CLI according to the documentation. For advanced users, the OneDL-MMEngine framework provides a flexible model configuration and deployment entry. This paper conducted 12 typical experiments to verify that ITKIT can meet the needs of most basic scenarios.

None
Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision 2026-03-14
Show

Foundation models have transformed vision and language by learning general-purpose representations from large-scale unlabeled data, yet 3D medical imaging lacks analogous approaches. Existing self-supervised methods rely on low-level reconstruction or contrastive objectives that fail to capture the anatomical semantics critical for medical image analysis, limiting transfer to downstream tasks. We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. MASS's key insight is that automatically generated class-agnostic masks provide sufficient structural supervision for learning semantically rich representations. By training on thousands of diverse mask proposals spanning anatomical structures and pathological findings, MASS learns what semantically defines medical structures: the holistic combination of appearance, shape, spatial context, and anatomical relationships. We demonstrate effectiveness across data regimes: from small-scale pretraining on individual datasets (20-200 scans) to large-scale multi-modal pretraining on 5K CT, MRI, and PET volumes, all without annotations. MASS demonstrates: (i) few-shot segmentation on novel structures, (ii) matching full supervision with only 20-40% labeled data while outperforming self-supervised baselines by over 20 in Dice score in low-data regimes, and (iii) frozen-encoder classification on unseen pathologies that matches full supervised training with thousands of samples. Mask-guided self-supervised pretraining captures broadly generalizable knowledge, opening a path toward 3D medical imaging foundation models without expert annotations. Code is available: https://github.com/Stanford-AIMI/MASS.

CVPR 2026 Code Link
Addressing Data Scarcity in 3D Trauma Detection through Self-Supervised and Semi-Supervised Learning with Vertex Relative Position Encoding 2026-03-12
Show

Accurate detection and localization of traumatic injuries in abdominal CT scans remains a critical challenge in emergency radiology, primarily due to severe scarcity of annotated medical data. This paper presents a label-efficient approach combining self-supervised pre-training with semi-supervised detection for 3D medical image analysis. We employ patch-based Masked Image Modeling (MIM) to pre-train a 3D U-Net encoder on 1,206 CT volumes without annotations, learning robust anatomical representations. The pretrained encoder enables two downstream clinical tasks: 3D injury detection using VDETR with Vertex Relative Position Encoding, and multi-label injury classification. For detection, semi-supervised learning with 2,000 unlabeled volumes and consistency regularization achieves 56.57% validation mAP@0.50 and 45.30% test mAP@0.50 with only 144 labeled training samples, representing a 115% improvement over supervised-only training. For classification, expanding to 2,244 labeled samples yields 94.07% test accuracy across seven injury categories using only a frozen encoder, demonstrating immediately transferable self-supervised features. Our results validate that self-supervised pre-training combined with semi-supervised learning effectively addresses label scarcity in medical imaging, enabling robust 3D object detection with limited annotations.

9 pag...

9 pages, 6 figures, 6 tables. The code is available at https://github.com/shivasmic/3d-trauma-detection-ssl

Code Link
FL-MedSegBench: A Comprehensive Benchmark for Federated Learning on Medical Image Segmentation 2026-03-12
Show

Federated learning (FL) offers a privacy-preserving paradigm for collaborative medical image analysis without sharing raw data. However, the absence of standardized benchmarks for medical image segmentation hinders fair and comprehensive evaluation of FL methods. To address this gap, we introduce FL-MedSegBench, the first comprehensive benchmark for federated learning on medical image segmentation. Our benchmark encompasses nine segmentation tasks across ten imaging modalities, covering both 2D and 3D formats with realistic clinical heterogeneity. We systematically evaluate eight generic FL (gFL) and five personalized FL (pFL) methods across multiple dimensions: segmentation accuracy, fairness, communication efficiency, convergence behavior, and generalization to unseen domains. Extensive experiments reveal several key insights: (i) pFL methods, particularly those with client-specific batch normalization (\textit{e.g.}, FedBN), consistently outperform generic approaches; (ii) No single method universally dominates, with performance being dataset-dependent; (iii) Communication frequency analysis shows normalization-based personalization methods exhibit remarkable robustness to reduced communication frequency; (iv) Fairness evaluation identifies methods like Ditto and FedRDN that protect underperforming clients; (v) A method's generalization to unseen domains is strongly tied to its ability to perform well across participating clients. We will release an open-source toolkit to foster reproducible research and accelerate clinically applicable FL solutions, providing empirically grounded guidelines for real-world clinical deployment. The source code is available at https://github.com/meiluzhu/FL-MedSegBench.

19 pages,4 figures Code Link
ActiveFreq: Integrating Active Learning and Frequency Domain Analysis for Interactive Segmentation 2026-03-12
Show

Interactive segmentation is commonly used in medical image analysis to obtain precise, pixel-level labeling, typically involving iterative user input to correct mislabeled regions. However, existing approaches often fail to fully utilize user knowledge from interactive inputs and achieve comprehensive feature extraction. Specifically, these methods tend to treat all mislabeled regions equally, selecting them randomly for refinement without evaluating each region's potential impact on segmentation quality. Additionally, most models rely solely on spatial domain features, overlooking frequency domain information that could enhance feature extraction and improve performance. To address these limitations, we propose ActiveFreq, a novel interactive segmentation framework that integrates active learning and frequency domain analysis to minimize human intervention while achieving high-quality labeling. ActiveFreq introduces AcSelect, an autonomous module that prioritizes the most informative mislabeled regions, ensuring maximum performance gain from each click. Moreover, we develop FreqFormer, a segmentation backbone incorporating a Fourier transform module to map features from the spatial to the frequency domain, enabling richer feature extraction. Evaluations on the ISIC-2017 and OAI-ZIB datasets demonstrate that ActiveFreq achieves high performance with reduced user interaction, achieving 3.74 NoC@90 on ISIC-2017 and 9.27 NoC@90 on OAI-ZIB, with 23.5% and 12.8% improvements over previous best results, respectively. Under minimal input conditions, such as two clicks, ActiveFreq reaches mIoU scores of 85.29% and 75.76% on ISIC-2017 and OAI-ZIB, highlighting its efficiency and accuracy in interactive medical segmentation.

16 pa...

16 pages, 8 figures, published in Knowledge-Based Systems

None
VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs 2026-03-11
Show

Vision-language pretraining has driven significant progress in medical image analysis. However, current methods typically supervise visual encoders using one-hot labels or free-form text, neither of which effectively captures the complex semantic relationships among clinical findings. In this study, we introduce VIVID-Med, a novel framework that leverages a frozen large language model (LLM) as a structured semantic teacher to pretrain medical vision transformers (ViTs). VIVID-Med translates clinical findings into verifiable JSON field-state pairs via a Unified Medical Schema (UMS), utilizing answerability-aware masking to focus optimization. It then employs Structured Prediction Decomposition (SPD) to partition cross-attention into orthogonality-regularized query groups, extracting complementary visual aspects. Crucially, the LLM is discarded post-training, yielding a lightweight, deployable ViT-only backbone. We evaluated VIVID-Med across multiple settings: on CheXpert linear probing, it achieves a macro-AUC of 0.8588, outperforming BiomedCLIP by +6.65 points while using 500x less data. It also demonstrates robust zero-shot cross-domain transfer to NIH ChestX-ray14 (0.7225 macro-AUC) and strong cross-modality generalization to CT, achieving 0.8413 AUC on LIDC-IDRI lung nodule classification and 0.9969 macro-AUC on OrganAMNIST 11-organ classification. VIVID-Med offers a highly efficient, scalable alternative to deploying resource-heavy vision-language models in clinical settings.

10 pages, 4 figures None
Vision-Language Models Encode Clinical Guidelines for Concept-Based Medical Reasoning 2026-03-09
Show

Concept Bottleneck Models (CBMs) are a prominent framework for interpretable AI that map learned visual features to a set of meaningful concepts for task-specific downstream predictions. Their sequential structure enhances transparency by connecting model predictions to the underlying concepts that support them. In medical imaging, where transparency is essential, CBMs offer an appealing foundation for explainable model design. However, discrete concept representations often overlook broader clinical context such as diagnostic guidelines and expert heuristics, reducing reliability in complex cases. We propose MedCBR, a concept-based reasoning framework that integrates clinical guidelines with vision-language and reasoning models. Labeled clinical descriptors are transformed into guideline-conformant text, and a concept-based model is trained with a multitask objective combining multimodal contrastive alignment, concept supervision, and diagnostic classification to jointly ground image features, concepts, and pathology. A reasoning model then converts these predictions into structured clinical narratives that explain the diagnosis, emulating expert reasoning based on established guidelines. MedCBR achieves superior diagnostic and concept-level performance, with AUROCs of 94.2% on ultrasound and 84.0% on mammography. Further experiments on non-medical datasets achieve 86.1% accuracy. Our framework enhances interpretability and forms an end-to-end bridge from medical image analysis to decision-making.

CVPR 2026 Findings None
Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation 2026-03-05
Show

Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs -- using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.

Medic...

Medical Image Analysis 2025. arXiv admin note: substantial text overlap with arXiv:2403.11803

None
LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation 2026-03-05
Show

Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.

None
A Unified Framework for Joint Detection of Lacunes and Enlarged Perivascular Spaces 2026-03-04
Show

Cerebral small vessel disease (CSVD) markers, specifically enlarged perivascular spaces (EPVS) and lacunae, present a unique challenge in medical image analysis due to their radiological mimicry. Standard segmentation networks struggle with feature interference and extreme class imbalance when handling these divergent targets simultaneously. To address these issues, we propose a morphology-decoupled framework where Zero-Initialized Gated Cross-Task Attention exploits dense EPVS context to guide sparse lacune detection. Furthermore, biological and topological consistency are enforced via a mixed-supervision strategy integrating Mutual Exclusion and Centerline Dice losses. Finally, we introduce an Anatomically-Informed Inference Calibration mechanism to dynamically suppress false positives based on tissue semantics. Extensive 5-folds cross-validation on the VALDO 2021 dataset (N=40) demonstrates state-of-the-art performance, notably surpassing task winners in lacunae detection precision (71.1%, p=0.01) and F1-score (62.6%, p=0.03). Furthermore, evaluation on the external EPAD cohort (N=1762) confirms the model's robustness for large-scale population studies. Code will be released upon acceptance.

None
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset 2026-03-04
Show

The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also release our trained models, code, and dataset, available at: https://github.com/StanfordMIMI/Merlin.

Nature (2026) Code Link
Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing 2026-03-02
Show

Semantic segmentation takes pivotal roles in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline Gen4Seg to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performances. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

Submi...

Submitted to IEEE TPAMI, under review

None
GuiDINO: Rethinking Vision Foundation Model in Medical Image Segmentation 2026-03-01
Show

Foundation vision models are increasingly adopted in medical image analysis. Due to domain shift, these pretrained models misalign with medical image segmentation needs without being fully fine-tuned or lightly adapted. We introduce GuiDINO, a framework that repositions native foundation model to acting as a visual guidance generator for downstream segmentation. GuiDINO extracts visual feature representation from DINOv3 and converts them into a spatial guide mask via a lightweight TokenBook mechanism, which aggregates token-prototype similarities. This guide mask gates feature activations in multiple segmentation backbones, thereby injecting foundation-model priors while preserving the inductive biases and efficiency of medical dedicated architectures. Training relies on a guide supervision objective loss that aligns the guide mask to ground-truth regions, optionally augmented by a boundary-focused hinge loss to sharpen fine structures. GuiDINO also supports parameter-efficient adaptation through LoRA on the DINOv3 guide backbone. Across diverse medical datasets and nnUNet-style inference, GuiDINO consistently improves segmentation quality and boundary robustness, suggesting a practical alternative to fine-tuning and offering a new perspective on how foundation models can best serve medical vision. Code is available at https://github.com/Hi-FishU/GuiDINO

12 pa...

12 pages, 2 figures, 3 tables

Code Link
Differential privacy representation geometry for medical image analysis 2026-03-01
Show

Differential privacy (DP)'s effect in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.

None
The Texture-Shape Dilemma: Boundary-Safe Synthetic Generation for 3D Medical Transformers 2026-03-01
Show

Vision Transformers (ViTs) have revolutionized medical image analysis, yet their data-hungry nature clashes with the scarcity and privacy constraints of clinical archives. Formula-Driven Supervised Learning (FDSL) has emerged as a promising solution to this bottleneck, synthesizing infinite annotated samples from mathematical formulas without utilizing real patient data. However, existing FDSL paradigms rely on simple geometric shapes with homogeneous intensities, creating a substantial gap by neglecting tissue textures and noise patterns inherent in modalities like CT and MRI. In this paper, we identify a critical optimization conflict termed boundary aliasing: when high-frequency synthetic textures are naively added, they corrupt the image gradient signals necessary for learning structural boundaries, causing the model to fail in delineating real anatomical margins. To bridge this gap, we propose a novel Physics-inspired Spatially-Decoupled Synthesis framework. Our approach orthogonalizes the synthesis process: it first constructs a gradient-shielded buffer zone based on boundary distance to ensure stable shape learning, and subsequently injects physics-driven spectral textures into the object core. This design effectively reconciles robust shape representation learning with invariance to acquisition noise. Extensive experiments on the BTCV and MSD datasets demonstrate that our method significantly outperforms previous FDSL, as well as SSL methods trained on real-world medical datasets, by 1.43% on BTCV and up to 1.51% on MSD task, offering a scalable, annotation-free foundation for medical ViTs. The code will be made publicly available upon acceptance.

None
TAP-SLF: Parameter-Efficient Adaptation of Vision Foundation Models for Multi-Task Ultrasound Image Analysis 2026-02-28
Show

Executing multiple tasks simultaneously in medical image analysis, including segmentation, classification, detection, and regression, often introduces significant challenges regarding model generalizability and the optimization of shared feature representations. While Vision Foundation Models (VFMs) provide powerful general representations, full fine-tuning on limited medical data is prone to overfitting and incurs high computational costs. Moreover, existing parameter-efficient fine-tuning approaches typically adopt task-agnostic adaptation protocols, overlooking both task-specific mechanisms and the varying sensitivity of model layers during fine-tuning. In this work, we propose Task-Aware Prompting and Selective Layer Fine-Tuning (TAP-SLF), a unified framework for multi-task ultrasound image analysis. TAP-SLF incorporates task-aware soft prompts to encode task-specific priors into the input token sequence and applies LoRA to selected specific top layers of the encoder. This strategy updates only a small fraction of the VFM parameters while keeping the pre-trained backbone frozen. By combining task-aware prompts with selective high-layer fine-tuning, TAP-SLF enables efficient VFM adaptation to diverse medical tasks within a shared backbone. Results on the FMC_UIA 2026 Challenge test set, where TAP-SLF wins fifth place, combined with evaluations on the officially released training dataset using an 8:2 train-test split, demonstrate that task-aware prompting and selective layer tuning are effective strategies for efficient VFM adaptation.

4 pag...

4 pages, 2 figures, 4 tables; Submitted to ISBI FMC UIA 2026; Our code is publicly available at https://github.com/huiwanHW/Florence-2-adaptation

Code Link
General vs Domain-Specific CNNs: Understanding Pretraining Effects on Brain MRI Tumor Classification 2026-02-27
Show

The accurate identification of brain tumors from magnetic resonance imaging (MRI) is essential for timely diagnosis and effective therapeutic intervention. While deep convolutional neural networks (CNNs), particularly those pre-trained on extensive datasets, have shown considerable promise in medical image analysis, a key question arises when working with limited data: do models pre-trained on specialized medical image repositories outperform those pre-trained on diverse, general-domain datasets? This research presents a comparative analysis of three distinct pre-trained CNN architectures for brain tumor classification: RadImageNet DenseNet121, which leverages pre-training on medical-domain data, alongside two modern general-purpose networks, EfficientNetV2S and ConvNeXt-Tiny. All models were trained and fine-tuned under uniform experimental conditions using a modestly sized brain MRI dataset to maintain consistency in evaluation. The experimental outcomes indicate that ConvNeXt-Tiny delivered the best performance, achieving 93% test accuracy, followed by EfficientNetV2S at 85%. In contrast, RadImageNet DenseNet121 attained only 68% accuracy and exhibited higher loss, indicating limited generalization capability despite its domain-specific pre-training. These observations imply that pre-training on medical-domain data does not necessarily guarantee superior performance in data-scarce scenarios. Conversely, contemporary general-purpose CNNs with deeper architectures, pre-trained on large-scale diverse datasets, may offer more effective transfer learning for specialized diagnostic tasks in medical imaging.

None
Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning 2026-02-27
Show

Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image only, metadata-only and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.

None
SGDC: Structurally-Guided Dynamic Convolution for Medical Image Segmentation 2026-02-26
Show

Spatially variant dynamic convolution provides a principled approach of integrating spatial adaptivity into deep neural networks. However, mainstream designs in medical segmentation commonly generate dynamic kernels through average pooling, which implicitly collapses high-frequency spatial details into a coarse, spatially-compressed representation, leading to over-smoothed predictions that degrade the fidelity of fine-grained clinical structures. To address this limitation, we propose a novel Structure-Guided Dynamic Convolution (SGDC) mechanism, which leverages an explicitly supervised structure-extraction branch to guide the generation of dynamic kernels and gating signals for structure-aware feature modulation. Specifically, the high-fidelity boundary information from this auxiliary branch is fused with semantic features to enable spatially-precise feature modulation. By replacing context aggregation with pixel-wise structural guidance, the proposed design effectively prevents the information loss introduced by average pooling. Experimental results show that SGDC achieves state-of-the-art performance on ISIC 2016, PH2, ISIC 2018, and CoNIC datasets, delivering superior boundary fidelity by reducing the Hausdorff Distance (HD95) by 2.05, and providing consistent IoU gains of 0.99%-1.49% over pooling-based baselines. Moreover, the mechanism exhibits strong potential for extension to other fine-grained, structure-sensitive vision tasks, such as small-object detection, offering a principled solution for preserving structural integrity in medical image analysis. To facilitate reproducibility and encourage further research, the implementation code for both our SGE and SGDC modules has been is publicly released at https://github.com/solstice0621/SGDC.

Code Link
Axial-Centric Cross-Plane Attention for 3D Medical Image Classification 2026-02-25
Show

Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.

Submi...

Submitted to MICCAI 2026

None
Federated Learning for Cross-Modality Medical Image Segmentation via Augmentation-Driven Generalization 2026-02-24
Show

Artificial intelligence has emerged as a transformative tool in medical image analysis, yet developing robust and generalizable segmentation models remains difficult due to fragmented, privacy-constrained imaging data siloed across institutions. While federated learning (FL) enables collaborative model training without centralizing data, cross-modality domain shifts pose a critical challenge, particularly when models trained on one modality fail to generalize to another. Many existing solutions require paired multimodal data per patient or rely on complex architectures, both of which are impractical in real clinical settings. In this work, we consider a realistic FL scenario where each client holds single-modality data (CT or MRI), and systematically investigate augmentation strategies for cross-modality generalization. Using abdominal organ segmentation and whole-heart segmentation as representative multi-class and binary segmentation benchmarks, we evaluate convolution-based spatial augmentation, frequency-domain manipulation, domain-specific normalization, and global intensity nonlinear (GIN) augmentation. Our results show that GIN consistently outperforms alternatives in both centralized and federated settings by simulating cross-modality appearance variations while preserving anatomical structure. For the pancreas, Dice score improved from 0.073 to 0.437, a 498% gain. Our federated approach achieves 93-98% of centralized training accuracy, demonstrating strong cross-modality generalization without compromising data privacy, pointing toward feasible federated AI deployment across diverse healthcare systems.

Submi...

Submitted to IEEE JBHI

None
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease 2026-02-22
Show

Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

Accepted by CVPR2026 None
Dual-Kernel Adapter: Expanding Spatial Horizons for Data-Constrained Medical Image Analysis 2026-02-21
Show

Adapters have become a widely adopted strategy for efficient fine-tuning of large pretrained models, particularly in resource-constrained settings. However, their performance under extreme data scarcity, common in medical imaging due to high annotation costs, privacy regulations, and fragmented datasets, remains underexplored. In this work, we present the first comprehensive study of adapter-based fine-tuning for large pretrained models in low-data medical imaging scenarios. We find that, contrary to their promise, conventional adapters can degrade performance under severe data constraints, performing even worse than simple linear probing when trained on less than 1% of the corresponding training data. Through systematic analysis, we identify a sharp reduction in Effective Receptive Field (ERF) as a key factor behind this degradation. Motivated by these findings, we propose the Dual-Kernel Adapter (DKA), a lightweight module that expands spatial context via large-kernel convolutions while preserving local detail with small-kernel counterparts. Extensive experiments across diverse classification and segmentation benchmarks show that DKA significantly outperforms existing adapter methods, establishing new leading results in both data-constrained and data-rich regimes.

None
Soft-CAM: Making black box models self-explainable for medical image analysis 2026-02-20
Show

Convolutional neural networks (CNNs) are widely used for high-stakes applications like medicine, often surpassing human performance. However, most explanation methods rely on post-hoc attribution, approximating the decision-making process of already trained black-box models. These methods are often sensitive, unreliable, and fail to reflect true model reasoning, limiting their trustworthiness in critical applications. In this work, we introduce SoftCAM, a straightforward yet effective approach that makes standard CNN architectures inherently interpretable. By removing the global average pooling layer and replacing the fully connected classification layer with a convolution-based class evidence layer, SoftCAM preserves spatial information and produces explicit class activation maps that form the basis of the model's predictions. Evaluated on three medical datasets, SoftCAM maintains classification performance while significantly improving both the qualitative and quantitative explanation compared to existing post-hoc methods. Our results demonstrate that CNNs can be inherently interpretable without compromising performance, advancing the development of self-explainable deep learning for high-stakes decision-making. The code is available at https://github.com/kdjoumessi/SoftCAM

Accep...

Accepted at the Medical Imaging with Deep Learning Conference (MIDL 2026)

Code Link
Investigating Demographic Bias in Brain MRI Segmentation: A Comparative Study of Deep-Learning and Non-Deep-Learning Methods 2026-02-19
Show

Deep-learning-based segmentation algorithms have substantially advanced the field of medical image analysis, particularly in structural delineations in MRIs. However, an important consideration is the intrinsic bias in the data. Concerns about unfairness, such as performance disparities based on sensitive attributes like race and sex, are increasingly urgent. In this work, we evaluate the results of three different segmentation models (UNesT, nnU-Net, and CoTr) and a traditional atlas-based method (ANTs), applied to segment the left and right nucleus accumbens (NAc) in MRI images. We utilize a dataset including four demographic subgroups: black female, black male, white female, and white male. We employ manually labeled gold-standard segmentations to train and test segmentation models. This study consists of two parts: the first assesses the segmentation performance of models, while the second measures the volumes they produce to evaluate the effects of race, sex, and their interaction. Fairness is quantitatively measured using a metric designed to quantify fairness in segmentation performance. Additionally, linear mixed models analyze the impact of demographic variables on segmentation accuracy and derived volumes. Training on the same race as the test subjects leads to significantly better segmentation accuracy for some models. ANTs and UNesT show notable improvements in segmentation accuracy when trained and tested on race-matched data, unlike nnU-Net, which demonstrates robust performance independent of demographic matching. Finally, we examine sex and race effects on the volume of the NAc using segmentations from the manual rater and from our biased models. Results reveal that the sex effects observed with manual segmentation can also be observed with biased models, whereas the race effects disappear in all but one model.

17 pa...

17 pages, 2 figures, Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:035

None
HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis 2026-02-18
Show

Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets exhibit that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.

Accep...

Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Code Link
Effective and Robust Multimodal Medical Image Analysis 2026-02-17
Show

Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend MAIL network to design Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors. Code: https://github.com/misti1203/MAIL-Robust-MAIL.

Accep...

Accepted at Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Code Link
Learning to Select Like Humans: Explainable Active Learning for Medical Imaging 2026-02-10
Show

Medical image analysis requires substantial labeled data for model training, yet expert annotation is expensive and time-consuming. Active learning (AL) addresses this challenge by strategically selecting the most informative samples for the annotation purpose, but traditional methods solely rely on predictive uncertainty while ignoring whether models learn from clinically meaningful features a critical requirement for clinical deployment. We propose an explainability-guided active learning framework that integrates spatial attention alignment into a sample acquisition process. Our approach advocates for a dual-criterion selection strategy combining: (i) classification uncertainty to identify informative examples, and (ii) attention misalignment with radiologist-defined regions-of-interest (ROIs) to target samples where the model focuses on incorrect features. By measuring misalignment between Grad-CAM attention maps and expert annotations using \emph{Dice similarity}, our acquisition function judiciously identifies samples that enhance both predictive performance and spatial interpretability. We evaluate the framework using three expert-annotated medical imaging datasets, namely, BraTS (MRI brain tumors), VinDr-CXR (chest X-rays), and SIIM-COVID-19 (chest X-rays). Using only 570 strategically selected samples, our explainability-guided approach consistently outperforms random sampling across all the datasets, achieving 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID. Grad-CAM visualizations confirm that the models trained by our dual-criterion selection focus on diagnostically relevant regions, demonstrating that incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability.

Accep...

Accepted for publication IEEE Conference on Artificial Intelligence 2026, Granada, Spain

None
Anatomy-Preserving Latent Diffusion for Generation of Brain Segmentation Masks with Ischemic Infarct 2026-02-10
Show

The scarcity of high-quality segmentation masks remains a major bottleneck for medical image analysis, particularly in non-contrast CT (NCCT) neuroimaging, where manual annotation is costly and variable. To address this limitation, we propose an anatomy-preserving generative framework for the unconditional synthesis of multi-class brain segmentation masks, including ischemic infarcts. The proposed approach combines a variational autoencoder trained exclusively on segmentation masks to learn an anatomical latent representation, with a diffusion model operating in this latent space to generate new samples from pure noise. At inference, synthetic masks are obtained by decoding denoised latent vectors through the frozen VAE decoder, with optional coarse control over lesion presence via a binary prompt. Qualitative results show that the generated masks preserve global brain anatomy, discrete tissue semantics, and realistic variability, while avoiding the structural artifacts commonly observed in pixel-space generative models. Overall, the proposed framework offers a simple and scalable solution for anatomy-aware mask generation in data-scarce medical imaging scenarios.

None
Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation 2026-02-10
Show

Semi-supervised learning relaxes the need of large pixel-wise labeled datasets for image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method's state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on github upon acceptance.

Accep...

Accepted by ESWA 2026

None
GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification 2026-02-10
Show

Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic intervention.However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a "blackbox" nature, hindering their clinical integration. To mitigate these limitations, we propose GAFRNet, a robust and interpretable Graph Attention and FuzzyRule Network specifically engineered for histopathology image classification with scarce supervision. GAFRNet constructs a similarity-driven graph representation to model intersample relationships and employs a multihead graph attention mechanism to capture complex relational features across heterogeneous tissue structures.Concurrently, a differentiable fuzzy-rule module encodes intrinsic topological descriptorsincluding node degree, clustering coefficient, and label consistencyinto explicit, human-understandable diagnostic logic. This design establishes transparent "IF-THEN" mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.

None
3D Transport-based Morphometry (3D-TBM) for medical image analysis 2026-02-06
Show

Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.

None
Attention-Driven Framework for Non-Rigid Medical Image Registration 2026-02-06
Show

Deformable medical image registration is a fundamental task in medical image analysis with applications in disease diagnosis, treatment planning, and image-guided interventions. Despite significant advances in deep learning based registration methods, accurately aligning images with large deformations while preserving anatomical plausibility remains a challenging task. In this paper, we propose a novel Attention-Driven Framework for Non-Rigid Medical Image Registration (AD-RegNet) that employs attention mechanisms to guide the registration process. Our approach combines a 3D UNet backbone with bidirectional cross-attention, which establishes correspondences between moving and fixed images at multiple scales. We introduce a regional adaptive attention mechanism that focuses on anatomically relevant structures, along with a multi-resolution deformation field synthesis approach for accurate alignment. The method is evaluated on two distinct datasets: DIRLab for thoracic 4D CT scans and IXI for brain MRI scans, demonstrating its versatility across different anatomical structures and imaging modalities. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods on the IXI and DIRLab datasets. The proposed method maintains a favorable balance between registration accuracy and computational efficiency, making it suitable for clinical applications. A comprehensive evaluation using normalized cross-correlation (NCC), mean squared error (MSE), structural similarity (SSIM), Jacobian determinant, and target registration error (TRE) indicates that attention-guided registration improves alignment accuracy while ensuring anatomically plausible deformations.

None
CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis 2026-02-06
Show

High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.

None
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining 2026-02-05
Show

Recent progress in large-scale CLIP-like vision-language models(VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image--caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.

None
Towards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis 2026-02-05
Show

Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the "domain gap" and "feature absence" problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code an weights are available at: https://github.com/BudhaTronix/Weakly-Supervised-Tumour-Detection

Accep...

Accepted for AIBio at ECAI 2025

Code Link
Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models 2026-02-05
Show

This study explores the integration of multiple Explainable AI (XAI) techniques to enhance the interpretability of deep learning models for brain tumour detection. A custom Convolutional Neural Network (CNN) was developed and trained on the BraTS 2021 dataset, achieving 91.24% accuracy in distinguishing between tumour and non-tumour regions. This research combines Gradient-weighted Class Activation Mapping (GRAD-CAM), Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) to provide comprehensive insights into the model's decision-making process. This multi-technique approach successfully identified both full and partial tumours, offering layered explanations ranging from broad regions of interest to pixel-level details. GRAD-CAM highlighted important spatial regions, LRP provided detailed pixel-level relevance and SHAP quantified feature contributions. The integrated approach effectively explained model predictions, including cases with partial tumour visibility thus showing superior explanatory power compared to individual XAI methods. This research enhances transparency and trust in AI-driven medical imaging analysis by offering a more comprehensive perspective on the model's reasoning. The study demonstrates the potential of integrated XAI techniques in improving the reliability and interpretability of AI systems in healthcare, particularly for critical tasks like brain tumour detection.

None
OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis 2026-02-04
Show

Radiological analysis increasingly benefits from pretrained visual representations that can support heterogeneous downstream tasks across imaging modalities. In this work, we introduce OmniRad, a self-supervised radiological foundation model pretrained on 1.2 million medical images, designed with radiology-inspired principles emphasizing representation reuse and cross-task transferability. We evaluate the pretrained encoder under multiple downstream adaptation regimes, including lightweight task-specific adapters with a frozen backbone as well as full end-to-end fine-tuning for classification, allowing us to assess both representation quality and task-specific performance. OmniRad is evaluated on a broad suite of public benchmarks spanning classification and segmentation across multiple modalities. On the MedMNISTv2 collection, OmniRad improves classification F1 by up to 2.05% over competing foundation models. For dense prediction, OmniRad attains mean Dice score improvements across six MedSegBench datasets when using frozen representations. Qualitative analyses and latent-space visualizations suggest improved feature clustering and modality-related separation.

19 pa...

19 pages, 4 figures, 12 tables

None
HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis 2026-02-03
Show

Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at https://github.com/francescodisalvo05/hyperbolic-cross-branch-consistency .

Accep...

Accepted to Transactions on Machine Learning Research (TMLR)

Code Link
Efficient Deep Learning for Medical Imaging: Bridging the Gap Between High-Performance AI and Clinical Deployment 2026-01-31
Show

Deep learning has revolutionized medical image analysis, playing a vital role in modern clinical applications. However, the deployment of large-scale models in real-world clinical settings remains challenging due to high computational costs, latency constraints, and patient data privacy concerns associated with cloud-based processing. To address these bottlenecks, this review provides a comprehensive synthesis of efficient and lightweight deep learning architectures specifically tailored for the medical domain. We categorize the landscape of modern efficient models into three primary streams: Convolutional Neural Networks (CNNs), Lightweight Transformers, and emerging Linear Complexity Models. Furthermore, we examine key model compression strategies (including pruning, quantization, knowledge distillation, and low-rank factorization) and evaluate their efficacy in maintaining diagnostic performance while reducing hardware requirements. By identifying current limitations and discussing the transition toward on-device intelligence, this review serves as a roadmap for researchers and practitioners aiming to bridge the gap between high-performance AI and resource-constrained clinical environments.

None
Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding 2026-01-30
Show

Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework(Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific Cross-Attention fusion modules to address asymmetric cues required by bi-directional reasoning and reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.

15 pa...

15 pages, 4 figures, Submitted to ICPR 2026 (under review)

None
Testing of Deep Learning Model in Real World Clinical Setting: A Case Study in Obstetric Ultrasound 2026-01-29
Show

Despite the rapid development of AI models in medical image analysis, their validation in real-world clinical settings remains limited. To address this, we introduce a generic framework designed for deploying image-based AI models in such settings. Using this framework, we deployed a trained model for fetal ultrasound standard plane detection, and evaluated it in real-time sessions with both novice and expert users. Feedback from these sessions revealed that while the model offers potential benefits to medical practitioners, the need for navigational guidance was identified as a key area for improvement. These findings underscore the importance of early deployment of AI models in real-world settings, leading to insights that can guide the refinement of the model and system based on actual user feedback.

5 pages; ISBI 2026 None
Interpretable and backpropagation-free Green Learning for efficient multi-task echocardiographic segmentation and classification 2026-01-27
Show

Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry "black boxes" that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XG-Boost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.

Jyun-...

Jyun-Ping Kao and Jiaxing Yang contributed equally to this work. C.-C. Jay Kuo and Jonghye Woo are the senior authors

None
The role of self-supervised pretraining in differentially private medical image analysis 2026-01-27
Show

Differential privacy (DP) provides formal protection for sensitive data but typically incurs substantial losses in diagnostic performance. Model initialization has emerged as a critical factor in mitigating this degradation, yet the role of modern self-supervised learning under full-model DP remains poorly understood. Here, we present a large-scale evaluation of initialization strategies for differentially private medical image analysis, using chest radiograph classification as a representative benchmark with more than 800,000 images. Using state-of-the-art ConvNeXt models trained with DP-SGD across realistic privacy regimes, we compare non-domain-specific supervised ImageNet initialization, non-domain-specific self-supervised DINOv3 initialization, and domain-specific supervised pretraining on MIMIC-CXR, the largest publicly available chest radiograph dataset. Evaluations are conducted across five external datasets spanning diverse institutions and acquisition settings. We show that DINOv3 initialization consistently improves diagnostic utility relative to ImageNet initialization under DP, but remains inferior to domain-specific supervised pretraining, which achieves performance closest to non-private baselines. We further demonstrate that initialization choice strongly influences demographic fairness, cross-dataset generalization, and robustness to data scale and model capacity under privacy constraints. The results establish initialization strategy as a central determinant of utility, fairness, and generalization in differentially private medical imaging.

None
Radiomics in Medical Imaging: Methods, Applications, and Challenges 2026-01-24
Show

Radiomics enables quantitative medical image analysis by converting imaging data into structured, high-dimensional feature representations for predictive modeling. Despite methodological developments and encouraging retrospective results, radiomics continue to face persistent challenges related to feature instability, limited reproducibility, validation bias, and restricted clinical translation. Existing reviews largely focus on application-specific outcomes or isolated pipeline components, with limited analysis of how interdependent design choices across acquisition, preprocessing, feature engineering, modeling, and evaluation collectively affect robustness and generalizability. This survey provides an end-to-end analysis of radiomics pipelines, examining how methodological decisions at each stage influence feature stability, model reliability, and translational validity. This paper reviews radiomic feature extraction, selection, and dimensionality reduction strategies; classical machine and deep learning-based modeling approaches; and ensemble and hybrid frameworks, with emphasis on validation protocols, data leakage prevention, and statistical reliability. Clinical applications are discussed with a focus on evaluation rigor rather than reported performance metrics. The survey identifies open challenges in standardization, domain shift, and clinical deployment, and outlines future directions such as hybrid radiomics-artificial intelligence models, multimodal fusion, federated learning, and standardized benchmarking.

None
Stylizing ViT: Anatomy-Preserving Instance Style Transfer for Domain Generalization 2026-01-24
Show

Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing-vit .

Accep...

Accepted at 23rd IEEE International Symposium on Biomedical Imaging (IEEE ISBI 2026)

Code Link
StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors 2026-01-23
Show

Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, well-trained medical segmentation models on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models-crucial to medical image analysis-remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under black-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model's performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g. LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets and five mainstream segmentation models. The results demonstrate the effectiveness, stealthiness, and harmlessness of our method on the original model's segmentation performance. For example, when applied to the SAM model, StealthMark consistently achieved ASR above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is made available at: https://github.com/Qinkaiyu/StealthMark.

15 pa...

15 pages,7 figures. Accepted to IEEE Transactions on Image Processing (TIP) 2026

Code Link
Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals 2026-01-23
Show

Performance uncertainty quantification is essential for reliable validation and eventual clinical translation of medical imaging artificial intelligence (AI). Confidence intervals (CIs) play a central role in this process by indicating how precise a reported performance estimate is. Yet, due to the limited amount of work examining CI behavior in medical imaging, the community remains largely unaware of how many diverse CI methods exist and how they behave in specific settings. The purpose of this study is to close this gap. To this end, we conducted a large-scale empirical analysis across a total of 24 segmentation and classification tasks, using 19 trained models per task group, a broad spectrum of commonly used performance metrics, multiple aggregation strategies, and several widely adopted CI methods. Reliability (coverage) and precision (width) of each CI method were estimated across all settings to characterize their dependence on study characteristics. Our analysis revealed five principal findings: 1) the sample size required for reliable CIs varies from a few dozens to several thousands of cases depending on study parameters; 2) CI behavior is strongly affected by the choice of performance metric; 3) aggregation strategy substantially influences the reliability of CIs, e.g. they require more observations for macro than for micro; 4) the machine learning problem (segmentation versus classification) modulates these effects; 5) different CI methods are not equally reliable and precise depending on the use case. These results form key components for the development of future guidelines on reporting performance uncertainty in medical imaging AI.

None
A Lightweight Medical Image Classification Framework via Self-Supervised Contrastive Learning and Quantum-Enhanced Feature Modeling 2026-01-23
Show

Intelligent medical image analysis is essential for clinical decision support but is often limited by scarce annotations, constrained computational resources, and suboptimal model generalization. To address these challenges, we propose a lightweight medical image classification framework that integrates self-supervised contrastive learning with quantum-enhanced feature modeling. MobileNetV2 is employed as a compact backbone and pretrained using a SimCLR-style self-supervised paradigm on unlabeled images. A lightweight parameterized quantum circuit (PQC) is embedded as a quantum feature enhancement module, forming a hybrid classical-quantum architecture, which is subsequently fine-tuned on limited labeled data. Experimental results demonstrate that, with only approximately 2-3 million parameters and low computational cost, the proposed method consistently outperforms classical baselines without self-supervised learning or quantum enhancement in terms of Accuracy, AUC, and F1-score. Feature visualization further indicates improved discriminability and representation stability. Overall, this work provides a practical and forward-looking solution for high-performance medical artificial intelligence under resource-constrained settings.

None
Efficient Multi-scale Masked Autoencoders with Hybrid-Attention Mechanism for Breast Lesion Classification 2026-01-23
Show

Self-supervised learning (SSL) with Vision Transformers (ViT) has shown immense potential in medical image analysis. However, the quadratic complexity ($\mathcal{O}(N^2)$) of standard self-attention poses a severe barrier for high-resolution biomedical tasks, effectively excluding resource-constrained research labs from utilizing state-of-the-art models. To address this computational bottleneck without sacrificing diagnostic accuracy, we propose \textbf{MIRAM}, a Multi-scale Masked Autoencoder that leverages a \textbf{hybrid-attention mechanism}. Our architecture uniquely decouples semantic learning from detail reconstruction using a dual-decoder design: a standard transformer decoder captures global semantics at low resolution, while a linear-complexity decoder (comparing Linformer, Performer, and Nyströmformer) handles the computationally expensive high-resolution reconstruction. This reduces the complexity of the upscaling stage from quadratic to linear ($\mathcal{O}(N)$), enabling high-fidelity training on consumer-grade GPUs. We validate our approach on the CBIS-DDSM mammography dataset. Remarkably, our \textbf{Nyströmformer-based variant} achieves a classification accuracy of \textbf{61.0%}, outperforming both standard MAE (58.9%) and MoCo-v3 (60.2%) while requiring significantly less memory. These results demonstrate that hybrid-attention architectures can democratize high-resolution medical AI, making powerful SSL accessible to researchers with limited hardware resources.

None
A Lightweight Brain-Inspired Machine Learning Framework for Coronary Angiography: Hybrid Neural Representation and Robust Learning Strategies 2026-01-22
Show

Background: Coronary angiography (CAG) is a cornerstone imaging modality for assessing coronary artery disease and guiding interventional treatment decisions. However, in real-world clinical settings, angiographic images are often characterized by complex lesion morphology, severe class imbalance, label uncertainty, and limited computational resources, posing substantial challenges to conventional deep learning approaches in terms of robustness and generalization.Methods: The proposed framework is built upon a pretrained convolutional neural network to construct a lightweight hybrid neural representation. A selective neural plasticity training strategy is introduced to enable efficient parameter adaptation. Furthermore, a brain-inspired attention-modulated loss function, combining Focal Loss with label smoothing, is employed to enhance sensitivity to hard samples and uncertain annotations. Class-imbalance-aware sampling and cosine annealing with warm restarts are adopted to mimic rhythmic regulation and attention allocation mechanisms observed in biological neural systems.Results: Experimental results demonstrate that the proposed lightweight brain-inspired model achieves strong and stable performance in binary coronary angiography classification, yielding competitive accuracy, recall, F1-score, and AUC metrics while maintaining high computational efficiency.Conclusion: This study validates the effectiveness of brain-inspired learning mechanisms in lightweight medical image analysis and provides a biologically plausible and deployable solution for intelligent clinical decision support under limited computational resources.

None
Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans 2026-01-21
Show

Polycystic Ovary Syndrome (PCOS) is the most familiar endocrine illness in women of reproductive age. Many Bangladeshi women suffer from PCOS disease in their older age. The aim of our research is to identify effective vision-based medical image analysis techniques and evaluate hybrid models for the accurate detection of PCOS. We introduced two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: "infected" (PCOS-positive) and "noninfected" (healthy ovaries). In the initial stage, our first hybrid model, 'DenConST' (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, 'DenConREST' (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy. Among all evaluated models, DenConREST showed the best performance. This research highlights an efficient solution for PCOS detection from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.

None
Does medical specialization of VLMs enhance discriminative power?: A comprehensive investigation through feature distribution analysis 2026-01-21
Show

This study investigates the feature representations produced by publicly available open source medical vision-language models (VLMs). While medical VLMs are expected to capture diagnostically relevant features, their learned representations remain underexplored, and standard evaluations like classification accuracy do not fully reveal if they acquire truly discriminative, lesion-specific features. Understanding these representations is crucial for revealing medical image structures and improving downstream tasks in medical image analysis. This study aims to investigate the feature distributions learned by medical VLMs and evaluate the impact of medical specialization. We analyze the feature distribution of multiple image modalities extracted by some representative medical VLMs across lesion classification datasets on multiple modalities. These distributions were compared them with non-medical VLMs to assess the domain-specific medical training. Our experiments showed that medical VLMs can extract discriminative features that are effective for medical classification tasks. Moreover, it was found that non-medical VLMs with recent improvement with contextual enrichment such as LLM2CLIP produce more refined feature representations. Our results imply that enhancing text encoder is more crucial than training intensively on medical images when developing medical VLMs. Notably, non-medical models are particularly vulnerable to biases introduced by overlaied text strings on images. These findings underscore the need for careful consideration on model selection according to downstream tasks besides potential risks in inference due to background biases such as textual information in images.

A sho...

A short version paper of this research has been accepted for The IEEE International Symposium on Biomedical Imaging (ISBI) 2026

None
From Volumes to Slices: Computationally Efficient Contrastive Learning for Sequential Abdominal CT Analysis 2026-01-21
Show

The requirement for expert annotations limits the effectiveness of deep learning for medical image analysis. Although 3D self-supervised methods like volume contrast learning (VoCo) are powerful and partially address the labeling scarcity issue, their high computational cost and memory consumption are barriers. We propose 2D-VoCo, an efficient adaptation of the VoCo framework for slice-level self-supervised pre-training that learns spatial-semantic features from unlabeled 2D CT slices via contrastive learning. The pre-trained CNN backbone is then integrated into a CNN-LSTM architecture to classify multi-organ injuries. In the RSNA 2023 Abdominal Trauma dataset, 2D-VoCo pre-training significantly improves mAP, precision, recall, and RSNA score over training from scratch. Our framework provides a practical method to reduce the dependency on labeled data and enhance model performance in clinical CT analysis. We release the code for reproducibility. https://github.com/tkz05/2D-VoCo-CT-Classifier

Code Link
Partial Decoder Attention Network with Contour-weighted Loss Function for Data-Imbalance Medical Image Segmentation 2026-01-20
Show

Image segmentation is pivotal in medical image analysis, facilitating clinical diagnosis, treatment planning, and disease evaluation. Deep learning has significantly advanced automatic segmentation methodologies by providing superior modeling capability for complex structures and fine-grained anatomical regions. However, medical images often suffer from data imbalance issues, such as large volume disparities among organs or tissues, and uneven sample distributions across different anatomical structures. This imbalance tends to bias the model toward larger organs or more frequently represented structures, while overlooking smaller or less represented structures, thereby affecting the segmentation accuracy and robustness. To address these challenges, we proposed a novel contour-weighted segmentation approach, which improves the model's capability to represent small and underrepresented structures. We developed PDANet, a lightweight and efficient segmentation network based on a partial decoder mechanism. We evaluated our method using three prominent public datasets. The experimental results show that our methodology excelled in three distinct tasks: segmenting multiple abdominal organs, brain tumors, and pelvic bone fragments with injuries. It consistently outperformed nine state-of-the-art methods. Moreover, the proposed contour-weighted strategy improved segmentation for other comparison methods across the three datasets, yielding average enhancements in Dice scores of 2.32%, 1.67%, and 3.60%, respectively. These results demonstrate that our contour-weighted segmentation method surpassed current leading approaches in both accuracy and robustness. As a model-independent strategy, it can seamlessly fit various segmentation frameworks, enhancing their performance. This flexibility highlighted its practical importance and potential for broad use in medical image analysis.

None
Unsupervised Deformable Image Registration with Local-Global Attention and Image Decomposition 2026-01-20
Show

Deformable image registration is a critical technology in medical image analysis, with broad applications in clinical practice such as disease diagnosis, multi-modal fusion, and surgical navigation. Traditional methods often rely on iterative optimization, which is computationally intensive and lacks generalizability. Recent advances in deep learning have introduced attention-based mechanisms that improve feature alignment, yet accurately registering regions with high anatomical variability remains challenging. In this study, we proposed a novel unsupervised deformable image registration framework, LGANet++, which employs a novel local-global attention mechanism integrated with a unique technique for feature interaction and fusion to enhance registration accuracy, robustness, and generalizability. We evaluated our approach using five publicly available datasets, representing three distinct registration scenarios: cross-patient, cross-time, and cross-modal CT-MR registration. The results demonstrated that our approach consistently outperforms several state-of-the-art registration methods, improving registration accuracy by 1.39% in cross-patient registration, 0.71% in cross-time registration, and 6.12% in cross-modal CT-MR registration tasks. These results underscore the potential of LGANet++ to support clinical workflows requiring reliable and efficient image registration. The source code is available at https://github.com/huangzyong/LGANet-Registration.

Code Link
Data-Efficient Meningioma Segmentation via Implicit Spatiotemporal Mixing and Sim2Real Semantic Injection 2026-01-19
Show

The performance of medical image segmentation is increasingly defined by the efficiency of data utilization rather than merely the volume of raw data. Accurate segmentation, particularly for complex pathologies like meningiomas, demands that models fully exploit the latent information within limited high-quality annotations. To maximize the value of existing datasets, we propose a novel dual-augmentation framework that synergistically integrates spatial manifold expansion and semantic object injection. Specifically, we leverage Implicit Neural Representations (INR) to model continuous velocity fields. Unlike previous methods, we perform linear mixing on the integrated deformation fields, enabling the efficient generation of anatomically plausible variations by interpolating within the deformation space. This approach allows for the extensive exploration of structural diversity from a small set of anchors. Furthermore, we introduce a Sim2Real lesion injection module. This module constructs a high-fidelity simulation domain by transplanting lesion textures into healthy anatomical backgrounds, effectively bridging the gap between synthetic augmentation and real-world pathology. Comprehensive experiments on a hybrid dataset demonstrate that our framework significantly enhances the data efficiency and robustness of state-of-the-art models, including nnU-Net and U-Mamba, offering a potent strategy for high-performance medical image analysis with limited annotation budgets.

None
Beyond Knowledge Silos: Task Fingerprinting for Democratization of Medical Imaging AI 2026-01-19
Show

The field of medical imaging AI is currently undergoing rapid transformations, with methodical research increasingly translated into clinical practice. Despite these successes, research suffers from knowledge silos, hindering collaboration and progress: Existing knowledge is scattered across publications and many details remain unpublished, while privacy regulations restrict data sharing. In the spirit of democratizing of AI, we propose a framework for secure knowledge transfer in the field of medical image analysis. The key to our approach is dataset "fingerprints", structured representations of feature distributions, that enable quantification of task similarity. We tested our approach across 71 distinct tasks and 12 medical imaging modalities by transferring neural architectures, pretraining, augmentation policies, and multi-task learning. According to comprehensive analyses, our method outperforms traditional methods for identifying relevant knowledge and facilitates collaborative model training. Our framework fosters the democratization of AI in medical imaging and could become a valuable tool for promoting faster scientific advancement.

None
From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models 2026-01-19
Show

Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U-Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1-2 orders of magnitude faster and were 10 times smaller than competing transformer-based approaches. Models are available here: https://github.com/jbanusco/BrainFM4Challenges.

Work ...

Work presented at the SSL3D Challenge (1st place, ResEnc-L track) and FOMO Challenge (1st place, Methods track) on Brain MRI Foundation Models at MICCAI 2025

Code Link
Seeing Isn't Always Believing: Analysis of Grad-CAM Faithfulness and Localization Reliability in Lung Cancer CT Classification 2026-01-19
Show

Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to "trust" a model's explanation.

7 pages None
Fine-Tuning Cycle-GAN for Domain Adaptation of MRI Images 2026-01-18
Show

Magnetic Resonance Imaging (MRI) scans acquired from different scanners or institutions often suffer from domain shifts owing to variations in hardware, protocols, and acquisition parameters. This discrepancy degrades the performance of deep learning models trained on source domain data when applied to target domain images. In this study, we propose a Cycle-GAN-based model for unsupervised medical-image domain adaptation. Leveraging CycleGANs, our model learns bidirectional mappings between the source and target domains without paired training data, preserving the anatomical content of the images. By leveraging Cycle-GAN capabilities with content and disparity loss for adaptation tasks, we ensured image-domain adaptation while maintaining image integrity. Several experiments on MRI datasets demonstrated the efficacy of our model in bidirectional domain adaptation without labelled data. Furthermore, research offers promising avenues for improving the diagnostic accuracy of healthcare. The statistical results confirm that our approach improves model performance and reduces domain-related variability, thus contributing to more precise and consistent medical image analysis.

14 pa...

14 pages, 9 figures, 2 tables

None
A Hierarchical Benchmark of Foundation Models for Dermatology 2026-01-18
Show

Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model's ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we calculated frozen embeddings and trained lightweight adapter models using a five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a "granularity gap" in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.

None
Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images 2026-01-14
Show

Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.

None
Fair Foundation Models for Medical Image Analysis: Challenges and Perspectives 2026-01-14
Show

Ensuring equitable Artificial Intelligence (AI) in healthcare demands systems that make unbiased decisions across all demographic groups, bridging technical innovation with ethical principles. Foundation Models (FMs), trained on vast datasets through self-supervised learning, enable efficient adaptation across medical imaging tasks while reducing dependency on labeled data. These models demonstrate potential for enhancing fairness, though significant challenges remain in achieving consistent performance across demographic groups. Our review indicates that effective bias mitigation in FMs requires systematic interventions throughout all stages of development. While previous approaches focused primarily on model-level bias mitigation, our analysis reveals that fairness in FMs requires integrated interventions throughout the development pipeline, from data documentation to deployment protocols. This comprehensive framework advances current knowledge by demonstrating how systematic bias mitigation, combined with policy engagement, can effectively address both technical and institutional barriers to equitable AI in healthcare. The development of equitable FMs represents a critical step toward democratizing advanced healthcare technologies, particularly for underserved populations and regions with limited medical infrastructure and computational resources.

None
Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging 2026-01-13
Show

Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent

Code Link
MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning 2026-01-12
Show

Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple sub-image evidence into interleaved multimodal reasoning within a unified, executable interface for agentic training. Using MedVistaGym, we train MedVistaGym-R1 to interleave tool use with agentic reasoning through trajectory sampling and end-to-end reinforcement learning. Across six medical VQA benchmarks, MedVistaGym-R1-8B exceeds comparably sized tool-augmented baselines by 19.10% to 24.21%, demonstrating that structured agentic training--not tool access alone--unlocks effective tool-integrated reasoning for medical image analysis.

None
Enhanced Leukemic Cell Classification Using Attention-Based CNN and Data Augmentation 2026-01-03
Show

We present a reproducible deep learning pipeline for leukemic cell classification, focusing on system architecture, experimental robustness, and software design choices for medical image analysis. Acute lymphoblastic leukemia (ALL) is the most common childhood cancer, requiring expert microscopic diagnosis that suffers from inter-observer variability and time constraints. The proposed system integrates an attention-based convolutional neural network combining EfficientNetV2-B3 with Squeeze-and-Excitation mechanisms for automated ALL cell classification. Our approach employs comprehensive data augmentation, focal loss for class imbalance, and patient-wise data splitting to ensure robust and reproducible evaluation. On the C-NMC 2019 dataset (12,528 original images from 62 patients), the system achieves a 97.89% F1-score and 97.89% accuracy on the test set, with statistical validation through 100-iteration Monte Carlo experiments confirming significant improvements (p < 0.001) over baseline methods. The proposed pipeline outperforms existing approaches by up to 4.67% while using 89% fewer parameters than VGG16 (15.2M vs. 138M). The attention mechanism provides interpretable visualizations of diagnostically relevant cellular features, demonstrating that modern attention-based architectures can improve leukemic cell classification while maintaining computational efficiency suitable for clinical deployment.

9 pag...

9 pages, 5 figures, 4 tables. Submitted to VISAPP 2025

None
Expert-Guided Explainable Few-Shot Learning with Active Sample Selection for Medical Image Analysis 2026-01-02
Show

Medical image analysis faces two critical challenges: scarcity of labeled data and lack of model interpretability, both hindering clinical AI deployment. Few-shot learning (FSL) addresses data limitations but lacks transparency in predictions. Active learning (AL) methods optimize data acquisition but overlook interpretability of acquired samples. We propose a dual-framework solution: Expert-Guided Explainable Few-Shot Learning (EGxFSL) and Explainability-Guided AL (xGAL). EGxFSL integrates radiologist-defined regions-of-interest as spatial supervision via Grad-CAM-based Dice loss, jointly optimized with prototypical classification for interpretable few-shot learning. xGAL introduces iterative sample acquisition prioritizing both predictive uncertainty and attention misalignment, creating a closed-loop framework where explainability guides training and sample selection synergistically. On the BraTS (MRI), VinDr-CXR (chest X-ray), and SIIM-COVID-19 (chest X-ray) datasets, we achieve accuracies of 92%, 76%, and 62%, respectively, consistently outperforming non-guided baselines across all datasets. Under severe data constraints, xGAL achieves 76% accuracy with only 680 samples versus 57% for random sampling. Grad-CAM visualizations demonstrate guided models focus on diagnostically relevant regions, with generalization validated on breast ultrasound confirming cross-modality applicability.

Accep...

Accepted for publication in IEEE Journal of Biomedical and Health Informatics, 2025

None
Med-2D SegNet: A Light Weight Deep Neural Network for Medical 2D Image Segmentation 2026-01-02
Show

Accurate and efficient medical image segmentation is crucial for advancing clinical diagnostics and surgical planning, yet remains a complex challenge due to the variability in anatomical structures and the demand for low-complexity models. In this paper, we introduced Med-2D SegNet, a novel and highly efficient segmentation architecture that delivers outstanding accuracy while maintaining a minimal computational footprint. Med-2D SegNet achieves state-of-the-art performance across multiple benchmark datasets, including KVASIR-SEG, PH2, EndoVis, and GLAS, with an average Dice similarity coefficient (DSC) of 89.77% across 20 diverse datasets. Central to its success is the compact Med Block, a specialized encoder design that incorporates dimension expansion and parameter reduction, enabling precise feature extraction while keeping model parameters to a low count of just 2.07 million. Med-2D SegNet excels in cross-dataset generalization, particularly in polyp segmentation, where it was trained on KVASIR-SEG and showed strong performance on unseen datasets, demonstrating its robustness in zero-shot learning scenarios, even though we acknowledge that further improvements are possible. With top-tier performance in both binary and multi-class segmentation, Med-2D SegNet redefines the balance between accuracy and efficiency, setting a new benchmark for medical image analysis. This work paves the way for developing accessible, high-performance diagnostic tools suitable for clinical environments and resource-constrained settings, making it a step forward in the democratization of advanced medical technology.

None
Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model 2026-01-02
Show

Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.

8 pages, 6 figures None
MetaFormer-driven Encoding Network for Robust Medical Semantic Segmentation 2026-01-01
Show

Semantic segmentation is crucial for medical image analysis, enabling precise disease diagnosis and treatment planning. However, many advanced models employ complex architectures, limiting their use in resource-constrained clinical settings. This paper proposes MFEnNet, an efficient medical image segmentation framework that incorporates MetaFormer in the encoding phase of the U-Net backbone. MetaFormer, an architectural abstraction of vision transformers, provides a versatile alternative to convolutional neural networks by transforming tokenized image patches into sequences for global context modeling. To mitigate the substantial computational cost associated with self-attention, the proposed framework replaces conventional transformer modules with pooling transformer blocks, thereby achieving effective global feature aggregation at reduced complexity. In addition, Swish activation is used to achieve smoother gradients and faster convergence, while spatial pyramid pooling is incorporated at the bottleneck to improve multi-scale feature extraction. Comprehensive experiments on different medical segmentation benchmarks demonstrate that the proposed MFEnNet approach attains competitive accuracy while significantly lowering computational cost compared to state-of-the-art models. The source code for this work is available at https://github.com/tranleanh/mfennet.

10 pa...

10 pages, 5 figures, MCT4SD 2025

Code Link
Bridging the Perception-Cognition Gap:Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis 2025-12-30
Show

Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.

None
Graph Neural Networks with Transformer Fusion of Brain Connectivity Dynamics and Tabular Data for Forecasting Future Tobacco Use 2025-12-29
Show

Integrating non-Euclidean brain imaging data with Euclidean tabular data, such as clinical and demographic information, poses a substantial challenge for medical imaging analysis, particularly in forecasting future outcomes. While machine learning and deep learning techniques have been applied successfully to cross-sectional classification and prediction tasks, effectively forecasting outcomes in longitudinal imaging studies remains challenging. To address this challenge, we introduce a time-aware graph neural network model with transformer fusion (GNN-TF). This model flexibly integrates both tabular data and dynamic brain connectivity data, leveraging the temporal order of these variables within a coherent framework. By incorporating non-Euclidean and Euclidean sources of information from a longitudinal resting-state fMRI dataset from the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA), the GNN-TF enables a comprehensive analysis that captures critical aspects of longitudinal imaging data. Comparative analyses against a variety of established machine learning and deep learning models demonstrate that GNN-TF outperforms these state-of-the-art methods, delivering superior predictive accuracy for predicting future tobacco usage. The end-to-end, time-aware transformer fusion structure of the proposed GNN-TF model successfully integrates multiple data modalities and leverages temporal dynamics, making it a valuable analytic tool for functional brain imaging studies focused on clinical outcome prediction.

22 pages, 4 figures None
Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation 2025-12-28
Show

3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy requirements.To address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.

14 pages, 3 figures None
Plasticine: A Traceable Diffusion Model for Medical Image Translation 2025-12-20
Show

Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.

Accep...

Accepted by IEEE Transactions on Artificial Intelligence

None
OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models 2025-12-20
Show

The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.

None
Establishing Stochastic Object Models from Noisy Data via Ambient Measurement-Integrated Diffusion 2025-12-16
Show

Task-based measures of image quality (IQ) are critical for evaluating medical imaging systems, which must account for randomness including anatomical variability. Stochastic object models (SOMs) provide a statistical description of such variability, but conventional mathematical SOMs fail to capture realistic anatomy, while data-driven approaches typically require clean data rarely available in clinical tasks. To address this challenge, we propose AMID, an unsupervised Ambient Measurement-Integrated Diffusion with noise decoupling, which establishes clean SOMs directly from noisy measurements. AMID introduces a measurement-integrated strategy aligning measurement noise with the diffusion trajectory, and explicitly models coupling between measurement and diffusion noise across steps, an ambient loss is thus designed base on it to learn clean SOMs. Experiments on real CT and mammography datasets show that AMID outperforms existing methods in generation fidelity and yields more reliable task-based IQ evaluation, demonstrating its potential for unsupervised medical imaging analysis.

None
Incentivizing Tool-augmented Thinking with Images for Medical Image Analysis 2025-12-16
Show

Recent reasoning based medical MLLMs have made progress in generating step by step textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on fine-grained visual regions to achieve precise grounding and diagnosis. We introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when additional visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought. In contrast to prior approaches limited by the performance ceiling of specialized tools, Ophiuchus integrates the model's inherent grounding and perception capabilities with external tools, thereby fostering higher-level reasoning. The core of our method is a three-stage training strategy: cold-start training with tool-integrated reasoning data to achieve basic tool selection and adaptation for inspecting key regions; self-reflection fine-tuning to strengthen reflective reasoning and encourage revisiting tool outputs; and Agentic Tool Reinforcement Learning to directly optimize task-specific rewards and emulate expert-like diagnostic behavior. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our approach illuminates a path toward medical AI agents that can genuinely "think with images" through tool-integrated reasoning. Datasets, codes, and trained models will be released publicly.

None
XAI-Driven Diagnosis of Generalization Failure in State-Space Cerebrovascular Segmentation Models: A Case Study on Domain Shift Between RSNA and TopCoW Datasets 2025-12-16
Show

The clinical deployment of deep learning models in medical imaging is severely hindered by domain shift. This challenge, where a high-performing model fails catastrophically on external datasets, is a critical barrier to trustworthy AI. Addressing this requires moving beyond simple performance metrics toward deeper understanding, making Explainable AI (XAI) an essential diagnostic tool in medical image analysis. We present a rigorous, two-phase approach to diagnose the generalization failure of state-of-the-art State-Space Models (SSMs), specifically UMamaba, applied to cerebrovascular segmentation. We first established a quantifiable domain gap between our Source (RSNA CTA Aneurysm) and Target (TopCoW Circle of Willis CT) datasets, noting significant differences in Z-resolution and background noise. The model's Dice score subsequently plummeted from 0.8604 (Source) to 0.2902 (Target). In the second phase, which is our core contribution, we utilized Seg-XRes-CAM to diagnose the cause of this failure. We quantified the model's focus by measuring the overlap between its attention maps and the Ground Truth segmentations, and between its attention maps and its own Prediction Mask. Our analysis proves the model failed to generalize because its attention mechanism abandoned true anatomical features in the Target domain. Quantitative metrics confirm the model's focus shifted away from the Ground Truth vessels (IoU0.101 at 0.3 threshold) while still aligning with its own wrong predictions (IoU0.282 at 0.3 threshold). This demonstrates the model learned spurious correlations, confirming XAI is a powerful diagnostic tool for identifying dataset bias in emerging architectures.

None
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis 2025-12-15
Show

Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.

Code Link
TWLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading 2025-12-15
Show

Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel-level supervision and providing an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.

None
MAISI: Medical AI for Synthetic Imaging 2025-12-12
Show

Medical imaging analysis faces challenges such as data scarcity, high annotation costs, and privacy concerns. This paper introduces the Medical AI for Synthetic Imaging (MAISI), an innovative approach using the diffusion model to generate synthetic 3D computed tomography (CT) images to address those challenges. MAISI leverages the foundation volume compression network and the latent diffusion model to produce high-resolution CT images (up to a landmark volume dimension of 512 x 512 x 768 ) with flexible volume dimensions and voxel spacing. By incorporating ControlNet, MAISI can process organ segmentation, including 127 anatomical structures, as additional conditions and enables the generation of accurately annotated synthetic images that can be used for various downstream tasks. Our experiment results show that MAISI's capabilities in generating realistic, anatomically accurate images for diverse regions and conditions reveal its promising potential to mitigate challenges using synthetic data.

WACV2...

WACV25 accepted. https://github.com/NVIDIA-Medtech/NV-Generate-CTMR

Code Link
MedXAI: A Retrieval-Augmented and Self-Verifying Framework for Knowledge-Guided Medical Image Analysis 2025-12-10
Show

Accurate and interpretable image-based diagnosis remains a fundamental challenge in medical AI, particularly un- der domain shifts and rare-class conditions. Deep learning mod- els often struggle with real-world distribution changes, exhibit bias against infrequent pathologies, and lack the transparency required for deployment in safety-critical clinical environments. We introduce MedXAI (An Explainable Framework for Med- ical Imaging Classification), a unified expert knowledge based framework that integrates deep vision models with clinician- derived expert knowledge to improve generalization, reduce rare- class bias, and provide human-understandable explanations by localizing the relevant diagnostic features rather than relying on technical post-hoc methods (e.g., Saliency Maps, LIME). We evaluate MedXAI across heterogeneous modalities on two challenging tasks: (i) Seizure Onset Zone localization from resting-state fMRI, and (ii) Diabetic Retinopathy grading. Ex periments on ten multicenter datasets show consistent gains, including a 3% improvement in cross-domain generalization and a 10% improvmnet in F1 score of rare class, substantially outperforming strong deep learning baselines. Ablations confirm that the symbolic components act as effective clinical priors and regularizers, improving robustness under distribution shift. MedXAI delivers clinically aligned explanations while achieving superior in-domain and cross-domain performance, particularly for rare diseases in multimodal medical AI.

https...

https://cmsworkshops.com/Asilomar2025/Papers/Uploads/FinalPapers/Original/1527/20251130102314_899554_1527.pdf

None
Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook 2025-12-10
Show

Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses \textbf{10} representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications.The codebase is publicly available on https://github.com/myyy777/LNMBench.

Code Link
Co-Seg++: Mutual Prompt-Guided Collaborative Learning for Versatile Medical Segmentation 2025-12-09
Show

Medical image analysis is critical yet challenged by the need of jointly segmenting organs or tissues, and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulated different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co-Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio-sequential prompt encoder (SSP-Encoder) to capture long-range spatial and sequential relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi-task collaborative decoder (MTC-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co-Seg++ outperforms state-of-the-arts in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg-Plus.

Accepted by TMI Code Link
Mitigating Individual Skin Tone Bias in Skin Lesion Classification through Distribution-Aware Reweighting 2025-12-09
Show

Skin color has historically been a focal point of discrimination, yet fairness research in machine learning for medical imaging often relies on coarse subgroup categories, overlooking individual-level variations. Such group-based approaches risk obscuring biases faced by outliers within subgroups. This study introduces a distribution-based framework for evaluating and mitigating individual fairness in skin lesion classification. We treat skin tone as a continuous attribute rather than a categorical label, and employ kernel density estimation (KDE) to model its distribution. We further compare twelve statistical distance metrics to quantify disparities between skin tone distributions and propose a distance-based reweighting (DRW) loss function to correct underrepresentation in minority tones. Experiments across CNN and Transformer models demonstrate: (i) the limitations of categorical reweighting in capturing individual-level disparities, and (ii) the superior performance of distribution-based reweighting, particularly with Fidelity Similarity (FS), Wasserstein Distance (WD), Hellinger Metric (HM), and Harmonic Mean Similarity (HS). These findings establish a robust methodology for advancing fairness at individual level in dermatological AI systems, and highlight broader implications for sensitive continuous attributes in medical image analysis.

None
DIST-CLIP: Arbitrary Metadata and Image Guided MRI Harmonization via Disentangled Anatomy-Contrast Representations 2025-12-08
Show

Deep learning holds immense promise for transforming medical image analysis, yet its clinical generalization remains profoundly limited. A major barrier is data heterogeneity. This is particularly true in Magnetic Resonance Imaging, where scanner hardware differences, diverse acquisition protocols, and varying sequence parameters introduce substantial domain shifts that obscure underlying biological signals. Data harmonization methods aim to reduce these instrumental and acquisition variability, but existing approaches remain insufficient. When applied to imaging data, image-based harmonization approaches are often restricted by the need for target images, while existing text-guided methods rely on simplistic labels that fail to capture complex acquisition details or are typically restricted to datasets with limited variability, failing to capture the heterogeneity of real-world clinical environments. To address these limitations, we propose DIST-CLIP (Disentangled Style Transfer with CLIP Guidance), a unified framework for MRI harmonization that flexibly uses either target images or DICOM metadata for guidance. Our framework explicitly disentangles anatomical content from image contrast, with the contrast representations being extracted using pre-trained CLIP encoders. These contrast embeddings are then integrated into the anatomical content via a novel Adaptive Style Transfer module. We trained and evaluated DIST-CLIP on diverse real-world clinical datasets, and showed significant improvements in performance when compared against state-of-the-art methods in both style translation fidelity and anatomical preservation, offering a flexible solution for style transfer and standardizing MRI data. Our code and weights will be made publicly available upon publication.

None
MetaChest: Generalized few-shot learning of pathologies from chest X-rays 2025-12-07
Show

The limited availability of annotated data presents a major challenge for applying deep learning methods to medical image analysis. Few-shot learning methods aim to recognize new classes from only a small number of labeled examples. These methods are typically studied under the standard few-shot learning setting, where all classes in a task are new. However, medical applications such as pathology classification from chest X-rays often require learning new classes while simultaneously leveraging knowledge of previously known ones, a scenario more closely aligned with generalized few-shot classification. Despite its practical relevance, few-shot learning has been scarcely studied in this context. In this work, we present MetaChest, a large-scale dataset of 479,215 chest X-rays collected from four public databases. MetaChest includes a meta-set partition specifically designed for standard few-shot classification, as well as an algorithm for generating multi-label episodes. We conduct extensive experiments evaluating both a standard transfer learning approach and an extension of ProtoNet across a wide range of few-shot multi-label classification tasks. Our results demonstrate that increasing the number of classes per episode and the number of training examples per class improves classification performance. Notably, the transfer learning approach consistently outperforms the ProtoNet extension, despite not being tailored for few-shot learning. We also show that higher-resolution images improve accuracy at the cost of additional computation, while efficient model architectures achieve comparable performance to larger models with significantly reduced resource requirements.

None
MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging 2025-12-05
Show

Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.

None
LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation 2025-12-04
Show

Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative approach models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.

None
Performance Evaluation of Transfer Learning Based Medical Image Classification Techniques for Disease Detection 2025-12-04
Show

Medical image classification plays an increasingly vital role in identifying various diseases by classifying medical images, such as X-rays, MRIs and CT scans, into different categories based on their features. In recent years, deep learning techniques have attracted significant attention in medical image classification. However, it is usually infeasible to train an entire large deep learning model from scratch. To address this issue, one of the solutions is the transfer learning (TL) technique, where a pre-trained model is reused for a new task. In this paper, we present a comprehensive analysis of TL techniques for medical image classification using deep convolutional neural networks. We evaluate six pre-trained models (AlexNet, VGG16, ResNet18, ResNet34, ResNet50, and InceptionV3) on a custom chest X-ray dataset for disease detection. The experimental results demonstrate that InceptionV3 consistently outperforms other models across all the standard metrics. The ResNet family shows progressively better performance with increasing depth, whereas VGG16 and AlexNet perform reasonably well but with lower accuracy. In addition, we also conduct uncertainty analysis and runtime comparison to assess the robustness and computational efficiency of these models. Our findings reveal that TL is beneficial in most cases, especially with limited data, but the extent of improvement depends on several factors such as model architecture, dataset size, and domain similarity between source and target tasks. Moreover, we demonstrate that with a well-trained feature extractor, only a lightweight feedforward model is enough to provide efficient prediction. As such, this study contributes to the understanding of TL in medical image classification, and provides insights for selecting appropriate models based on specific requirements.

None
Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation 2025-12-03
Show

Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration, over the Dice plus cross-entropy loss baseline, but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach not only enhances the trustworthiness of segmentation predictions but also shows potential for safer integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses

15 pa...

15 pages, 6 figures, IEEE TMI submission. This version originally appeared in error as arXiv:2403.06759(v2)

Code Link
Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions 2025-12-03
Show

Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse -- representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano's inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.

None
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation 2025-12-03
Show

Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.

10 pa...

10 pages. Under Review

Code Link
Hierarchical Attention for Sparse Volumetric Anomaly Detection in Subclinical Keratoconus 2025-12-03
Show

The detection of weak, spatially distributed anomalies in volumetric medical imaging remains a major challenge. The subtle, non-adjacent nature of early disease signals is often lost due to suboptimal architectural inductive biases: 2D/3D CNNs impose strong locality, while ViTs diffuse unconstrained global attention. This conflict leaves the optimal inductive structure for robust, sparse volumetric pattern recognition unresolved. This study presents a controlled comparison of sixteen modern deep learning architectures spanning 2D/3D convolutional, hybrid, and volumetric transformer families for subclinical keratoconus (SKC) detection from 3D anterior segment OCT volumes. We demonstrate that hierarchical attention models offer a superior and more parameter-efficient inductive bias, surpassing the performance of both 2D and 3D CNNs and ViTs. Our results show 21-23% higher sensitivity and specificity in the sparse anomaly (subclinical) regime. Mechanistic analyses reveal that this advantage stems from precise spatial scale alignment: hierarchical windowing produces effective receptive fields matched to the intermediate, multi-slice extent of subclinical abnormalities. This avoids excessive CNN locality and diffuse global attention. Attention-distance measurements confirm a key insight into architectural adaptation: the required spatial integration length shifts significantly based on the signal strength, with subclinical cases necessitating longer integration compared to both healthy and manifest disease states. Representational similarity and auxiliary age/sex prediction tasks further support the generalizability of these inductive principles. The findings provide design guidance for future volumetric anomaly detection systems, establishing hierarchical attention as a principled and effective approach for early pathological change analysis in 3D medical imaging.

16 pa...

16 pages, 7 figures, 6 tables

None
Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control 2025-12-02
Show

Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, think with image paradigm directed region analysis planning, strategic template selection for report generation, quality assessment and feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.

None
Tissue-mask supported inter-subject whole-body image registration in the UK Biobank -- A method benchmarking study 2025-12-02
Show

The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.

None
Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation 2025-12-01
Show

Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

30 pa...

30 pages, 9 figures, journal paper

None
Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies 2025-12-01
Show

Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level "trend-driven" computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level "trend-driven" computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating "trend-driven" blocks, achieving an average relative improvement of $\sim3%$. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.

Submi...

Submitted to Medical Image Analysis. Journal Extension of arXiv:2407.19274

Code Link
Robust Rigid and Non-Rigid Medical Image Registration Using Learnable Edge Kernels 2025-12-01
Show

Medical image registration is crucial for various clinical and research applications including disease diagnosis or treatment planning which require alignment of images from different modalities, time points, or subjects. Traditional registration techniques often struggle with challenges such as contrast differences, spatial distortions, and modality-specific variations. To address these limitations, we propose a method that integrates learnable edge kernels with learning-based rigid and non-rigid registration techniques. Unlike conventional layers that learn all features without specific bias, our approach begins with a predefined edge detection kernel, which is then perturbed with random noise. These kernels are learned during training to extract optimal edge features tailored to the task. This adaptive edge detection enhances the registration process by capturing diverse structural features critical in medical imaging. To provide clearer insight into the contribution of each component in our design, we introduce four variant models for rigid registration and four variant models for non-rigid registration. We evaluated our approach using a dataset provided by the Medical University across three setups: rigid registration without skull removal, with skull removal, and non-rigid registration. Additionally, we assessed performance on two publicly available datasets. Across all experiments, our method consistently outperformed state-of-the-art techniques, demonstrating its potential to improve multi-modal image alignment and anatomical structure analysis.

None
FedHK-MVFC: Federated Heat Kernel Multi-View Clustering 2025-12-01
Show

In the realm of distributed artificial intelligence (AI) and privacy-focused medical applications, this paper proposes a multi-view clustering framework that links quantum field theory with federated healthcare analytics. The method uses heat kernel coefficients from spectral analysis to convert Euclidean distances into geometry-aware similarity measures that capture the structure of diverse medical data. The framework is presented through the heat kernel distance (HKD) transformation, which has convergence guarantees. Two algorithms have been developed: The first, Heat Kernel-Enhanced Multi-View Fuzzy Clustering (HK-MVFC), is used for central analysis. The second, Federated Heat Kernel Multi-View Fuzzy Clustering (FedHK-MVFC), is used for secure, privacy-preserving learning across hospitals. FedHK-MVFC uses differential privacy and secure aggregation to enable HIPAA-compliant collaboration. Tests on synthetic cardiovascular patient datasets demonstrate increased clustering accuracy, reduced communication, and retained efficiency compared to centralized methods. After being validated on 10,000 synthetic patient records across two hospitals, the methods proved useful for collaborative phenotyping involving electrocardiogram (ECG) data, cardiac imaging data, and behavioral data. The proposed methods' theoretical contributions include update rules with proven convergence, adaptive view weighting, and privacy-preserving protocols. These contributions establish a new standard for geometry-aware federated learning in healthcare, translating advanced mathematics into practical solutions for analyzing sensitive medical data while ensuring rigor and clinical relevance.

53 pa...

53 pages, 11 figures, and 9 tables

None
Deep Learning-Based Computer Vision Models for Early Cancer Detection Using Multimodal Medical Imaging and Radiogenomic Integration Frameworks 2025-11-30
Show

Early cancer detection remains one of the most critical challenges in modern healthcare, where delayed diagnosis significantly reduces survival outcomes. Recent advancements in artificial intelligence, particularly deep learning, have enabled transformative progress in medical imaging analysis. Deep learning-based computer vision models, such as convolutional neural networks (CNNs), transformers, and hybrid attention architectures, can automatically extract complex spatial, morphological, and temporal patterns from multimodal imaging data including MRI, CT, PET, mammography, histopathology, and ultrasound. These models surpass traditional radiological assessment by identifying subtle tissue abnormalities and tumor microenvironment variations invisible to the human eye. At a broader scale, the integration of multimodal imaging with radiogenomics linking quantitative imaging features with genomics, transcriptomics, and epigenetic biomarkers has introduced a new paradigm for personalized oncology. This radiogenomic fusion allows the prediction of tumor genotype, immune response, molecular subtypes, and treatment resistance without invasive biopsies.

None
MIMM-X: Disentangling Spurious Correlations for Medical Image Analysis 2025-11-28
Show

Deep learning models can excel on medical tasks, yet often experience spurious correlations, known as shortcut learning, leading to poor generalization in new environments. Particularly in medical imaging, where multiple spurious correlations can coexist, misclassifications can have severe consequences. We propose MIMM-X, a framework that disentangles causal features from multiple spurious correlations by minimizing their mutual information. It enables predictions based on true underlying causal relationships rather than dataset-specific shortcuts. We evaluate MIMM-X on three datasets (UK Biobank, NAKO, CheXpert) across two imaging modalities (MRI and X-ray). Results demonstrate that MIMM-X effectively mitigates shortcut learning of multiple spurious correlations.

None
From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation 2025-11-27
Show

Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.

None
GACELLE: GPU-accelerated tools for model parameter estimation and image reconstruction 2025-11-27
Show

Quantitative MRI (qMRI) offers tissue-specific biomarkers that can be tracked over time or compared across populations; however, its adoption in clinical research is hindered by significant computational demands of parameter estimation. Images acquired at high spatial resolution or requiring fitting multiple parameters often require lengthy processing time, constraining their use in routine pipelines and slowing methodological innovation and clinical translation. We present GACELLE, an open source, GPU-accelerated framework for high-throughput qMRI analysis. GACELLE provides a stochastic gradient descent optimiser and a stochastic sampler in MATLAB, enabling fast parameter mapping, improved estimation robustness via spatial regularisation, and uncertainty quantification. GACELLE prioritises accessibility: users only need to provide a forward signal model, while GACELLE's backend manages computational parallelisation, automatic parameter updates, and memory-batching. The stochastic solver performs fully vectorised Markov chain Monte Carlo with identical likelihood on CPU and GPU, ensuring reproducibility across hardware. Benchmarking demonstrates up to 451-fold acceleration for the stochastic gradient descent solver and 14,380-fold acceleration for stochastic sampling compared to CPU-based estimation, without compromising accuracy. We demonstrated GACELLE's versatility on three representative qMRI models and on an image reconstruction task. Across these applications, GACELLE improves parameter precision, enhances test-retest reproducibility, and reduces noise in quantitative maps. By combining speed, usability and flexibility, GACELLE provides a generalisable optimisation framework for medical image analysis. It lowers the computational barrier for qMRI, paving the way for reproducible biomarker development, large-scale imaging studies, and clinical translation.

None
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model 2025-11-27
Show

Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.

47 pa...

47 pages, 42 figures, 13 tables

None
Histomorphology-Guided Prototypical Multi-Instance Learning for Breast Cancer WSI Classification 2025-11-27
Show

Histomorphology is crucial in cancer diagnosis. However, existing whole slide image (WSI) classification methods struggle to effectively incorporate histomorphology information, limiting their ability to capture key pathological features. Particularly when the number of instances within a bag is large and their features are complex, it becomes challenging to accurately identify instances decisive for the bag label, making these methods prone to interference from ambiguous instances. To address this limitation, we propose a novel Histomorphology-Guided Prototypical Multi-Instance Learning (HGPMIL) framework that explicitly learns histomorphology-guided prototypical representations by incorporating tumor cellularity, cellular morphology, and tissue architecture. Specifically, our approach consists of three key components: (1) estimating the importance of tumor-related histomorphology information at patch-level based on medical prior knowledge; (2) generating representative prototypes through histomorphology-prototypical clustering; and (3) enabling WSI classification through histomorphology-guided prototypical aggregation. HGPMIL adjusts the decision boundary by incorporating histomorphological importance to reduce instance label uncertainty, thereby reversely optimizing the bag-level boundary. Experimental results demonstrate its effectiveness, achieving high diagnostic accuracy for molecular subtyping, cancer subtyping and survival analysis. The code will be made available at https://github.com/Badgewho/HMDMIL.

11 pages,8 figures Code Link
Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions 2025-11-27
Show

Endovascular procedures have revolutionized vascular disease treatment, yet their manual execution is challenged by the demands for high precision, operator fatigue, and radiation exposure. Robotic systems have emerged as transformative solutions to mitigate these inherent limitations. A pivotal moment has arrived, where a confluence of pressing clinical needs and breakthroughs in AI creates an opportunity for a paradigm shift toward Embodied Intelligence (EI), enabling robots to navigate complex vascular networks and adapt to dynamic physiological conditions. Data-driven approaches, leveraging advanced computer vision, medical image analysis, and machine learning, drive this evolution by enabling real-time vessel segmentation, device tracking, and anatomical landmark detection. Reinforcement learning and imitation learning further enhance navigation strategies and replicate expert techniques. This review systematically analyzes the integration of EI into endovascular robotics, identifying profound systemic challenges such as the heterogeneity in validation standards and the gap between human mimicry and machine-native capabilities. Based on this analysis, a conceptual roadmap is proposed that reframes the ultimate objective away from systems that supplant clinical decision-making. This vision of augmented intelligence, where the clinician's role evolves into that of a high-level supervisor, provides a principled foundation for the future of the field.

20 pages, 6 figures None
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis 2025-11-27
Show

Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5% across multiple medical VQA benchmarks, validating MedEyes's potential in building interpretable medical AI systems.

This ...

This paper has been accepted by AAAI 2026

None
A Fast and Efficient Modern BERT based Text-Conditioned Diffusion Model for Medical Image Segmentation 2025-11-26
Show

In recent times, denoising diffusion probabilistic models (DPMs) have proven effective for medical image generation and denoising, and as representation learners for downstream segmentation. However, segmentation performance is limited by the need for dense pixel-wise labels, which are expensive, time-consuming, and require expert knowledge. We propose FastTextDiff, a label-efficient diffusion-based segmentation model that integrates medical text annotations to enhance semantic representations. Our approach uses ModernBERT, a transformer capable of processing long clinical notes, to tightly link textual annotations with semantic content in medical images. Trained on MIMIC-III and MIMIC-IV, ModernBERT encodes clinical knowledge that guides cross-modal attention between visual and textual features. This study validates ModernBERT as a fast, scalable alternative to Clinical BioBERT in diffusion-based segmentation pipelines and highlights the promise of multi-modal techniques for medical image analysis. By replacing Clinical BioBERT with ModernBERT, FastTextDiff benefits from FlashAttention 2, an alternating attention mechanism, and a 2-trillion-token corpus, improving both segmentation accuracy and training efficiency over traditional diffusion-based models.

15 pa...

15 pages, 3 figures, Accepted in Slide 3 10th International Conference on Computer Vision & Image Processing (CVIP 2026)

None
DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification 2025-11-26
Show

This paper presents a comprehensive comparative model analysis on a novel gastrointestinal medical imaging dataset, comprised of 4,000 endoscopic images spanning four critical disease classes: Diverticulosis, Neoplasm, Peritonitis, and Ureters. Leveraging state-of-the-art deep learning techniques, the study confronts common endoscopic challenges such as variable lighting, fluctuating camera angles, and frequent imaging artifacts. The best performing models, VGG16 and MobileNetV2, each achieved a test accuracy of 96.5%, while Xception reached 94.24%, establishing robust benchmarks and baselines for automated disease classification. In addition to strong classification performance, the approach includes explainable AI via Grad-CAM visualization, enabling identification of image regions most influential to model predictions and enhancing clinical interpretability. Experimental results demonstrate the potential for robust, accurate, and interpretable medical image analysis even in complex real-world conditions. This work contributes original benchmarks, comparative insights, and visual explanations, advancing the landscape of gastrointestinal computer-aided diagnosis and underscoring the importance of diverse, clinically relevant datasets and model explainability in medical AI research.

7 pag...

7 pages, 4 figures, 2 tables. Accepted at DASET 2026

None
Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data 2025-11-26
Show

Foundation models for promptable segmentation, including SAM, SAM 2, and the recently released SAM 3, have renewed interest in zero-shot segmentation of medical imaging. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM 2 is widely used for annotation in 3D medical workflows, SAM 3 introduces a new perception backbone, detector-tracker pipeline, and concept-level prompting that may alter its behavior under spatial prompts. We present the first controlled comparison of SAM 2 and SAM 3 for zero-shot segmentation of 3D medical volumes and videos under purely visual prompting, with concept mechanisms disabled. We assess whether SAM 3 can serve as an out-of-the-box replacement for SAM 2 without customization. We benchmark both models on 16 public datasets (CT, MRI, 3D and cine ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. Prompts are restricted to the first frame and use four modes: single-click, multi-click, bounding box, and dense mask. This design standardizes preprocessing, prompt placement, propagation rules, and metric computation to disentangle prompt interpretation from propagation. Prompt-frame analysis shows that SAM 3 provides substantially stronger initialization than SAM 2 for click prompting across most structures. In full-volume analysis, SAM 3 retains this advantage for complex, vascular, and soft-tissue anatomies, emerging as the more versatile general-purpose segmenter. While SAM 2 remains competitive for compact, rigid organs under strong spatial guidance, it frequently fails on challenging targets where SAM 3 succeeds. Overall, our results suggest that SAM 3 is the superior default choice for most medical segmentation tasks, particularly those involving sparse user interaction or complex anatomical topology.

None
RefTr: Recurrent Refinement of Confluent Trajectories for 3D Vascular Tree Centerline Graphs 2025-11-25
Show

Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which forms the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

None
Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models 2025-11-25
Show

Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt- Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.

Accep...

Accepted by 32nd International Conference on MultiMedia Modeling (MMM 2026)

None
KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA) 2025-11-24
Show

Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.

None
Enhancing Medical Image Analysis through Geometric and Photometric transformations 2025-11-24
Show

Medical image analysis suffers from a lack of labeled data due to several challenges including patient privacy and lack of experts. Although some AI models only perform well with large amounts of data, we will move to data augmentation where there is a solution to improve the performance of our models and increase the dataset size through traditional or advanced techniques. In this paper, we evaluate the effectiveness of data augmentation techniques on two different medical image datasets. In the first step, we applied some transformation techniques to the skin cancer dataset containing benign and malignant classes. Then, we trained the convolutional neural network (CNN) on the dataset before and after augmentation, which significantly improved test accuracy from 90.74% to 96.88% and decreased test loss from 0.7921 to 0.1468 after augmentation. In the second step, we used the Mixup technique by mixing two random images and their corresponding masks using the retina and blood vessels dataset, then we trained the U-net model and obtained the Dice coefficient which increased from 0 before augmentation to 0.4163 after augmentation. The result shows the effect of using data augmentation to increase the dataset size on the classification and segmentation performance.

None
An Anatomy Aware Hybrid Deep Learning Framework for Lung Cancer Tumor Stage Classification 2025-11-24
Show

Accurate lung cancer tumor staging is crucial for prognosis and treatment planning. However, it remains challenging for end-to-end deep learning approaches, as such approaches often overlook spatial and anatomical information that are central to the tumor-node-metastasis system. The tumor stage depends on multiple quantitative criteria, including the tumor size and its proximity to the nearest anatomical structures, and small variations can alter the staging outcome. We propose a medically grounded hybrid pipeline that performs staging by explicitly measuring the tumor's size and distance properties rather than treating it as a pure image classification task. Our method employs specialized encoder-decoder networks to precisely segment the lung and adjacent anatomy, including the lobes, tumor, mediastinum, and diaphragm. Subsequently, we extract the necessary tumor properties, i.e. measure the largest tumor dimension and calculate the distance between the tumor and neighboring anatomical structures by a quantitative analysis of the segmentation masks. Finally, we apply rule-based tumor staging aligned with the medical guidelines. This novel framework has been evaluated on the Lung-PET-CT-Dx dataset, demonstrating superior performance compared to traditional deep learning models, achieving an overall classification accuracy of 91.36%. We report the per-stage F1-scores of 0.93 (T1), 0.89 (T2), 0.96 (T3), and 0.90 (T4), a critical evaluation aspect often omitted in prior literature. To our knowledge, this is the first study that embeds explicit clinical context into tumor stage classification. Unlike standard convolutional neural networks that operate in an uninterpretable "black box" manner, our method offers both state-of-the-art performance and transparent decision support.

None
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering 2025-11-24
Show

Large vision language models (VLMs) have achieved impressive performance on medical visual question answering benchmarks, yet their reliance on visual information remains unclear. We investigate whether frontier VLMs demonstrate genuine visual grounding when answering Italian medical questions by testing four state-of-the-art models: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. Using 60 questions from the EuropeMedQA Italian dataset that explicitly require image interpretation, we substitute correct medical images with blank placeholders to test whether models truly integrate visual and textual information. Our results reveal striking variability in visual dependency: GPT-4o shows the strongest visual grounding with a 27.9pp accuracy drop (83.2% [74.6%, 91.7%] to 55.3% [44.1%, 66.6%]), while GPT-5-mini, Gemini, and Claude maintain high accuracy with modest drops of 8.5pp, 2.4pp, and 5.6pp respectively. Analysis of model-generated reasoning reveals confident explanations for fabricated visual interpretations across all models, suggesting varying degrees of reliance on textual shortcuts versus genuine visual analysis. These findings highlight critical differences in model robustness and the need for rigorous evaluation before clinical deployment.

Accep...

Accepted at the Workshop on Multimodal Representation Learning for Healthcare (MMRL4H), EurIPS 2025

None
Benchmarking the Spatial Robustness of DNNs via Natural and Adversarial Localized Corruptions 2025-11-24
Show

The robustness of deep neural networks is a crucial factor in safety-critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole-image natural or adversarial corruptions, a comprehensive investigation into the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region-aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst-case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region-aware multi-attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were exploited to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer-based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN-based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.

Accep...

Accepted for publication in Pattern Recognition

None
EEG-VLM: A Hierarchical Vision-Language Model with Multi-Level Feature Alignment and Visually Enhanced Language-Guided Reasoning for EEG Image-Based Sleep Stage Prediction 2025-11-24
Show

Sleep stage classification based on electroencephalography (EEG) is fundamental for assessing sleep quality and diagnosing sleep-related disorders. However, most traditional machine learning methods rely heavily on prior knowledge and handcrafted features, while existing deep learning models still struggle to jointly capture fine-grained time-frequency patterns and achieve clinical interpretability. Recently, vision-language models (VLMs) have made significant progress in the medical domain, yet their performance remains constrained when applied to physiological waveform data, especially EEG signals, due to their limited visual understanding and insufficient reasoning capability. To address these challenges, we propose EEG-VLM, a hierarchical vision-language framework that integrates multi-level feature alignment with visually enhanced language-guided reasoning for interpretable EEG-based sleep stage classification. Specifically, a specialized visual enhancement module constructs high-level visual tokens from intermediate-layer features to extract rich semantic representations of EEG images. These tokens are further aligned with low-level CLIP features through a multi-level alignment mechanism, enhancing the VLM's image-processing capability. In addition, a Chain-of-Thought (CoT) reasoning strategy decomposes complex medical inference into interpretable logical steps, effectively simulating expert-like decision-making. Experimental results demonstrate that the proposed method significantly improves both the accuracy and interpretability of VLMs in EEG-based sleep stage classification, showing promising potential for automated and explainable EEG analysis in clinical settings.

None
MedVision: Dataset and Benchmark for Quantitative Medical Image Analysis 2025-11-24
Show

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. Our benchmarks show that current off-the-shelf VLMs perform poorly on these tasks. However, with supervised fine-tuning on MedVision, we significantly enhance their performance across detection, T/L estimation, and A/D measurement, demonstrating reduced error rates and improved precision. This work provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging. Code and data are available at https://medvision-vlm.github.io.

8 pag...

8 pages, 8 figures, 4 tables

None
Linear Algebraic Approaches to Neuroimaging Data Compression: A Comparative Analysis of Matrix and Tensor Decomposition Methods for High-Dimensional Medical Images 2025-11-22
Show

This paper evaluates Tucker decomposition and Singular Value Decomposition (SVD) for compressing neuroimaging data. Tucker decomposition preserves multi-dimensional relationships, achieving superior reconstruction fidelity and perceptual similarity. SVD excels in extreme compression but sacrifices fidelity. The results highlight Tucker decomposition's suitability for applications requiring the preservation of structural and temporal relationships.

None
Is Complete Labeling Necessary? Understanding Active Learning in Longitudinal Medical Imaging 2025-11-22
Show

Detecting changes in longitudinal medical imaging using deep learning requires a substantial amount of accurately labeled data. However, labeling these images is notably more costly and time-consuming than labeling other image types, as it requires labeling across various time points, where new lesions can be minor, and subtle changes are easily missed. Deep Active Learning (DAL) has shown promise in minimizing labeling costs by selectively querying the most informative samples, but existing studies have primarily focused on static tasks like classification and segmentation. Consequently, the conventional DAL approach cannot be directly applied to change detection tasks, which involve identifying subtle differences across multiple images. In this study, we propose a novel DAL framework, named Longitudinal Medical Imaging Active Learning (LMI-AL), tailored specifically for longitudinal medical imaging. By pairing and differencing all 2D slices from baseline and follow-up 3D images, LMI-AL iteratively selects the most informative pairs for labeling using DAL, training a deep learning model with minimal manual annotation. Experimental results demonstrate that, with less than 8% of the data labeled, LMI-AL can achieve performance comparable to models trained on fully labeled datasets. We also provide a detailed analysis of the method's performance, as guidance for future research. The code is publicly available at https://github.com/HelenMa9998/Longitudinal_AL.

This ...

This paper has been accepted at International Joint Conference on Neural Networks (IJCNN) 2025

Code Link
Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers 2025-11-21
Show

Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.

Accep...

Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020

None
MorphSeek: Fine-grained Latent Representation-Level Policy Optimization for Deformable Image Registration 2025-11-21
Show

Deformable image registration (DIR) remains a fundamental yet challenging problem in medical image analysis, largely due to the prohibitively high-dimensional deformation space of dense displacement fields and the scarcity of voxel-level supervision. Existing reinforcement learning frameworks often project this space into coarse, low-dimensional representations, limiting their ability to capture spatially variant deformations. We propose MorphSeek, a fine-grained representation-level policy optimization paradigm that reformulates DIR as a spatially continuous optimization process in the latent feature space. MorphSeek introduces a stochastic Gaussian policy head atop the encoder to model a distribution over latent features, facilitating efficient exploration and coarse-to-fine refinement. The framework integrates unsupervised warm-up with weakly supervised fine-tuning through Group Relative Policy Optimization, where multi-trajectory sampling stabilizes training and improves label efficiency. Across three 3D registration benchmarks (OASIS brain MRI, LiTS liver CT, and Abdomen MR-CT), MorphSeek achieves consistent Dice improvements over competitive baselines while maintaining high label efficiency with minimal parameter cost and low step-level latency overhead. Beyond optimizer specifics, MorphSeek advances a representation-level policy learning paradigm that achieves spatially coherent and data-efficient deformation optimization, offering a principled, backbone-agnostic, and optimizer-agnostic solution for scalable visual alignment in high-dimensional settings.

None
SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction 2025-11-20
Show

Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

20 pages None
Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks 2025-11-20
Show

The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent. These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.

None
Fairness in Multi-modal Medical Diagnosis with Demonstration Selection 2025-11-20
Show

Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.

10 pa...

10 pages (including 2 pages of references), 4 figures. This work explores fairness in multi-modal medical image reasoning using in-context learning

None
Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs 2025-11-19
Show

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.

Proce...

Proceedings of the 5th Machine Learning for Health (ML4H) Symposium

None
Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation 2025-11-19
Show

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

None
FunnyNodules: A Customizable Medical Dataset Tailored for Evaluating Explainable AI 2025-11-19
Show

Densely annotated medical image datasets that capture not only diagnostic labels but also the underlying reasoning behind these diagnoses are scarce. Such reasoning-related annotations are essential for developing and evaluating explainable AI (xAI) models that reason similarly to radiologists: making correct predictions for the right reasons. To address this gap, we introduce FunnyNodules, a fully parameterized synthetic dataset designed for systematic analysis of attribute-based reasoning in medical AI models. The dataset generates abstract, lung nodule-like shapes with controllable visual attributes such as roundness, margin sharpness, and spiculation. Target class is derived from a predefined attribute combination, allowing full control over the decision rule that links attributes to the diagnostic class. We demonstrate how FunnyNodules can be used in model-agnostic evaluations to assess whether models learn correct attribute-target relations, to interpret over- or underperformance in attribute prediction, and to analyze attention alignment with attribute-specific regions of interest. The framework is fully customizable, supporting variations in dataset complexity, target definitions, class balance, and beyond. With complete ground truth information, FunnyNodules provides a versatile foundation for developing, benchmarking, and conducting in-depth analyses of explainable AI methods in medical image analysis.

None
A new non-parametric Kendall's tau for matrix-valued elliptical observations 2025-11-19
Show

In this article, we first propose generalized row/column matrix Kendall's tau for matrix-variate observations that are ubiquitous in areas such as finance and medical imaging. For a random matrix following a matrix-variate elliptically contoured distribution, we show that the eigenspaces of the proposed row/column matrix Kendall's tau coincide with those of the row/column scatter matrix respectively, with the same descending order of the eigenvalues. We perform eigenvalue decomposition to the generalized row/column matrix Kendall's tau for recovering the loading spaces of the matrix factor model. We also propose to estimate the pair of the factor numbers by exploiting the eigenvalue-ratios of the row/column matrix Kendall's tau. Theoretically, we derive the convergence rates of the estimators for loading spaces, factor scores and common components, and prove the consistency of the estimators for the factor numbers without any moment constraints on the idiosyncratic errors. Thorough simulation studies are conducted to show the higher degree of robustness of the proposed estimators over the existing ones. Analysis of a financial dataset of asset returns and a medical imaging dataset associated with COVID-19 illustrate the empirical usefulness of the proposed method.

None
RN-SDEs: Limited-Angle CT Reconstruction with Residual Null-Space Diffusion Stochastic Differential Equations 2025-11-18
Show

Computed tomography is a widely used imaging modality with applications ranging from medical imaging to material analysis. One major challenge arises from the lack of scanning information at certain angles, leading to distorted CT images with artifacts. This results in an ill-posed problem known as the Limited Angle Computed Tomography (LACT) reconstruction problem. To address this problem, we propose Residual Null-Space Diffusion Stochastic Differential Equations (RN-SDEs), which are a variant of diffusion models that characterize the diffusion process with mean-reverting (MR) stochastic differential equations. To demonstrate the generalizability of RN-SDEs, our experiments are conducted on two different LACT datasets, i.e., ChromSTEM and C4KC-KiTS. Through extensive experiments, we show that by leveraging learned Mean-Reverting SDEs as a prior and emphasizing data consistency using Range-Null Space Decomposition (RNSD) based rectification, RN-SDEs can restore high-quality images from severe degradation and achieve state-of-the-art performance in most LACT tasks. Additionally, we present a quantitative comparison of computational complexity and runtime efficiency, highlighting the superior effectiveness of our proposed approach.

None
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification 2025-11-18
Show

Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.

Accce...

Acccepted in MICCAI Workshop 2025

None
SMOL-MapSeg: Show Me One Label as prompt 2025-11-18
Show

Historical maps offer valuable insights into changes on Earth's surface but pose challenges for modern segmentation models due to inconsistent visual styles and symbols. While deep learning models such as UNet and pre-trained foundation models perform well in domains like autonomous driving and medical imaging, they struggle with the variability of historical maps, where similar concepts appear in diverse forms. To address this issue, we propose On-Need Declarative (OND) knowledge-based prompting, a method that provides explicit image-label pair prompts to guide models in linking visual patterns with semantic concepts. This enables users to define and segment target concepts on demand, supporting flexible, concept-aware segmentation. Our approach replaces the prompt encoder of the Segment Anything Model (SAM) with the OND prompting mechanism and fine-tunes it on historical maps, creating SMOL-MapSeg (Show Me One Label). Unlike existing SAM-based fine-tuning methods that are class-agnostic or restricted to fixed classes, SMOL-MapSeg supports class-aware segmentation across arbitrary datasets. Experiments show that SMOL-MapSeg accurately segments user-defined classes and substantially outperforms baseline models. Furthermore, it demonstrates strong generalization even with minimal training data, highlighting its potential for scalable and adaptable historical map analysis.

None
Zero-Training Task-Specific Model Synthesis for Few-Shot Medical Image Classification 2025-11-18
Show

Deep learning models have achieved remarkable success in medical image analysis but are fundamentally constrained by the requirement for large-scale, meticulously annotated datasets. This dependency on "big data" is a critical bottleneck in the medical domain, where patient data is inherently difficult to acquire and expert annotation is expensive, particularly for rare diseases where samples are scarce by definition. To overcome this fundamental challenge, we propose a novel paradigm: Zero-Training Task-Specific Model Synthesis (ZS-TMS). Instead of adapting a pre-existing model or training a new one, our approach leverages a large-scale, pre-trained generative engine to directly synthesize the entire set of parameters for a task-specific classifier. Our framework, the Semantic-Guided Parameter Synthesizer (SGPS), takes as input minimal, multi-modal task information as little as a single example image (1-shot) and a corresponding clinical text description to directly synthesize the entire set of parameters for a task-specific classifier. The generative engine interprets these inputs to generate the weights for a lightweight, efficient classifier (e.g., an EfficientNet-V2), which can be deployed for inference immediately without any task-specific training or fine-tuning. We conduct extensive evaluations on challenging few-shot classification benchmarks derived from the ISIC 2018 skin lesion dataset and a custom rare disease dataset. Our results demonstrate that SGPS establishes a new state-of-the-art, significantly outperforming advanced few-shot and zero-shot learning methods, especially in the ultra-low data regimes of 1-shot and 5-shot classification. This work paves the way for the rapid development and deployment of AI-powered diagnostic tools, particularly for the long tail of rare diseases where data is critically limited.

None
AtlasMorph: Learning conditional deformable templates for brain MRI 2025-11-17
Show

Deformable templates, or atlases, are images that represent a prototypical anatomy for a population, and are often enhanced with probabilistic anatomical label maps. They are commonly used in medical image analysis for population studies and computational anatomy tasks such as registration and segmentation. Because developing a template is a computationally expensive process, relatively few templates are available. As a result, analysis is often conducted with sub-optimal templates that are not truly representative of the study population, especially when there are large variations within this population. We propose a machine learning framework that uses convolutional registration neural networks to efficiently learn a function that outputs templates conditioned on subject-specific attributes, such as age and sex. We also leverage segmentations, when available, to produce anatomical segmentation maps for the resulting templates. The learned network can also be used to register subject images to the templates. We demonstrate our method on a compilation of 3D brain MRI datasets, and show that it can learn high-quality templates that are representative of populations. We find that annotated conditional templates enable better registration than their unlabeled unconditional counterparts, and outperform other templates construction methods.

None
A comprehensive and easy-to-use multi-domain multi-task medical imaging meta-dataset 2025-11-17
Show

While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization, for usage in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.

None
THIR: Topological Histopathological Image Retrieval 2025-11-17
Show

According to the World Health Organization, breast cancer claimed the lives of approximately 685,000 women in 2020. Early diagnosis and accurate clinical decision making are critical in reducing this global burden. In this study, we propose THIR, a novel Content-Based Medical Image Retrieval (CBMIR) framework that leverages topological data analysis specifically, Betti numbers derived from persistent homology to characterize and retrieve histopathological images based on their intrinsic structural patterns. Unlike conventional deep learning approaches that rely on extensive training, annotated datasets, and powerful GPU resources, THIR operates entirely without supervision. It extracts topological fingerprints directly from RGB histopathological images using cubical persistence, encoding the evolution of loops as compact, interpretable feature vectors. The similarity retrieval is then performed by computing the distances between these topological descriptors, efficiently returning the top-K most relevant matches. Extensive experiments on the BreaKHis dataset demonstrate that THIR outperforms state of the art supervised and unsupervised methods. It processes the entire dataset in under 20 minutes on a standard CPU, offering a fast, scalable, and training free solution for clinical image retrieval.

None
MedGEN-Bench: Contextually entangled benchmark for open-ended multimodal medical generation 2025-11-17
Show

As Vision-Language Models (VLMs) increasingly gain traction in medical applications, clinicians are progressively expecting AI systems not only to generate textual diagnoses but also to produce corresponding medical images that integrate seamlessly into authentic clinical workflows. Despite the growing interest, existing medical visual benchmarks present notable limitations. They often rely on ambiguous queries that lack sufficient relevance to image content, oversimplify complex diagnostic reasoning into closed-ended shortcuts, and adopt a text-centric evaluation paradigm that overlooks the importance of image generation capabilities. To address these challenges, we introduce \textsc{MedGEN-Bench}, a comprehensive multimodal benchmark designed to advance medical AI research. MedGEN-Bench comprises 6,422 expert-validated image-text pairs spanning six imaging modalities, 16 clinical tasks, and 28 subtasks. It is structured into three distinct formats: Visual Question Answering, Image Editing, and Contextual Multimodal Generation. What sets MedGEN-Bench apart is its focus on contextually intertwined instructions that necessitate sophisticated cross-modal reasoning and open-ended generative outputs, moving beyond the constraints of multiple-choice formats. To evaluate the performance of existing systems, we employ a novel three-tier assessment framework that integrates pixel-level metrics, semantic text analysis, and expert-guided clinical relevance scoring. Using this framework, we systematically assess 10 compositional frameworks, 3 unified models, and 5 VLMs.

CVPR ...

CVPR 2026 Under Review

None
X-VMamba: Explainable Vision Mamba 2025-11-16
Show

State Space Models (SSMs), particularly the Mamba architecture, have recently emerged as powerful alternatives to Transformers for sequence modeling, offering linear computational complexity while achieving competitive performance. Yet, despite their effectiveness, understanding how these Vision SSMs process spatial information remains challenging due to the lack of transparent, attention-like mechanisms. To address this gap, we introduce a controllability-based interpretability framework that quantifies how different parts of the input sequence (tokens or patches) influence the internal state dynamics of SSMs. We propose two complementary formulations: a Jacobian-based method applicable to any SSM architecture that measures influence through the full chain of state propagation, and a Gramian-based approach for diagonal SSMs that achieves superior speed through closed-form analytical solutions. Both methods operate in a single forward pass with linear complexity, requiring no architectural modifications or hyperparameter tuning. We validate our framework through experiments on three diverse medical imaging modalities, demonstrating that SSMs naturally implement hierarchical feature refinement from diffuse low-level textures in early layers to focused, clinically meaningful patterns in deeper layers. Our analysis reveals domain-specific controllability signatures aligned with diagnostic criteria, progressive spatial selectivity across the network hierarchy, and the substantial influence of scanning strategies on attention patterns. Beyond medical imaging, we articulate applications spanning computer vision, natural language processing, and cross-domain tasks. Our framework establishes controllability analysis as a unified, foundational interpretability paradigm for SSMs across all domains. Code and analysis tools will be made available upon publication

None
Sequential Attention-based Sampling for Histopathological Analysis 2025-11-16
Show

Deep neural networks are increasingly applied in automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering them computationally infeasible to analyze entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA -- Sequential Attention-based Sampling for Histopathological Analysis -- a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20%) of high-resolution patches to achieve reliable diagnoses. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features. Model implementation is available at: https://github.com/coglabiisc/SASHA.

Code Link
Leveraging Quantum-Based Architectures for Robust Diagnostics 2025-11-15
Show

The objective of this study is to diagnose and differentiate kidney stones, cysts, and tumors using Computed Tomography (CT) images of the kidney. This study leverages a hybrid quantum-classical framework in this regard. We combine a pretrained ResNet50 encoder, with a Quantum Convolutional Neural Network (QCNN) to explore quantum-assisted diagnosis. We pre-process the kidney images using denoising and contrast limited adaptive histogram equalization to enhance feature extraction. We address class imbalance through data augmentation and weighted sampling. Latent features extracted by the encoder are transformed into qubits via angle encoding and processed by a QCNN. The model is evaluated on both 8-qubit and 12-qubit configurations. Both architectures achieved rapid convergence with stable learning curves and high consistency between training and validation performance. The models reached a test accuracy of 0.99, with the 12-qubit configuration providing improvements in overall recall and precision, particularly for Cyst and Tumor detection, where it achieved perfect recall for Cysts and a tumor F1-score of 0.9956. Confusion matrix analysis further confirmed reliable classification behavior across all classes, with very few misclassifications. Results demonstrate that integrating classical pre-processing and deep feature extraction with quantum circuits enhances medical diagnostic performance.

None
AGGRNet: Selective Feature Extraction and Aggregation for Enhanced Medical Image Classification 2025-11-15
Show

Medical image analysis for complex tasks such as severity grading and disease subtype classification poses significant challenges due to intricate and similar visual patterns among classes, scarcity of labeled data, and variability in expert interpretations. Despite the usefulness of existing attention-based models in capturing complex visual patterns for medical image classification, underlying architectures often face challenges in effectively distinguishing subtle classes since they struggle to capture inter-class similarity and intra-class variability, resulting in incorrect diagnosis. To address this, we propose AGGRNet framework to extract informative and non-informative features to effectively understand fine-grained visual patterns and improve classification for complex medical image analysis tasks. Experimental results show that our model achieves state-of-the-art performance on various medical imaging datasets, with the best improvement up to 5% over SOTA models on the Kvasir dataset.

None
A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation 2025-11-15
Show

Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.

Accep...

Accepted at AAAI 2026

None
DCMM-Transformer: Degree-Corrected Mixed-Membership Attention for Medical Imaging 2025-11-15
Show

Medical images exhibit latent anatomical groupings, such as organs, tissues, and pathological regions, that standard Vision Transformers (ViTs) fail to exploit. While recent work like SBM-Transformer attempts to incorporate such structures through stochastic binary masking, they suffer from non-differentiability, training instability, and the inability to model complex community structure. We present DCMM-Transformer, a novel ViT architecture for medical image analysis that incorporates a Degree-Corrected Mixed-Membership (DCMM) model as an additive bias in self-attention. Unlike prior approaches that rely on multiplicative masking and binary sampling, our method introduces community structure and degree heterogeneity in a fully differentiable and interpretable manner. Comprehensive experiments across diverse medical imaging datasets, including brain, chest, breast, and ocular modalities, demonstrate the superior performance and generalizability of the proposed approach. Furthermore, the learned group structure and structured attention modulation substantially enhance interpretability by yielding attention maps that are anatomically meaningful and semantically coherent.

None
Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement 2025-11-14
Show

In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present \textbf{Medverse}, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model are publicly available at https://github.com/jiesihu/Medverse.

Code Link
PI-NAIM: Path-Integrated Neural Adaptive Imputation Model 2025-11-14
Show

Medical imaging and multi-modal clinical settings often face the challange of missing modality in their diagnostic pipelines. Existing imputation methods either lack representational capacity or are computationally expensive. We propose PI-NAIM, a novel dual-path architecture that dynamically routes samples to optimized imputation approaches based on missingness complexity. Our framework integrates: (1) intelligent path routing that directs low missingness samples to efficient statistical imputation (MICE) and complex patterns to powerful neural networks (GAIN with temporal analysis); (2) cross-path attention fusion that leverages missingness-aware embeddings to intelligently combine both branches; and (3) end-to-end joint optimization of imputation accuracy and downstream task performance. Extensive experiments on MIMIC-III and multimodal benchmarks demonstrate state-of-the-art performance, achieving RMSE of 0.108 (vs. baselines' 0.119-0.152) and substantial gains in downstream tasks with an AUROC of 0.812 for mortality prediction. PI-NAIM's modular design enables seamless integration into vision pipelines handling incomplete sensor measurements, missing modalities, or corrupted inputs, providing a unified solution for real-world scenario. The code is publicly available at https://github.com/AfifaKhaled/PI-NAIM-Path-Integrated-Neural-Adaptive-Imputation-Model

Code Link
FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision 2025-11-14
Show

Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practices. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, making boundary preservation a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on the signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.

None
Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation 2025-11-14
Show

Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used for image segmentation, yet they suffer severe degradation under high-intensity noise because gradient-descent evolutions exhibit the well-known zig-zag phenomenon. To overcome this drawback, we propose hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs). This novel framework incorporates an adjustable acceleration field to autonomously regulate curve evolution smoothness, providing dual degrees of freedom for adaptive selection of both initial contours and velocity fields. We rigorously prove that HMCF-ACMs are normal flows and establish their numerical equivalence to wave equations through a level set formulation with signed distance functions. An efficient numerical scheme combining spectral discretization and optimized temporal integration is developed to solve the governing equations, and its stability condition is derived through Fourier analysis. Extensive experiments on natural and medical images validate that HMCF-ACMs achieve superior performance under high-noise conditions, demonstrating reduced parameter sensitivity, enhanced noise robustness, and improved segmentation accuracy compared to PMCF-ACMs.

None
DENTEX: Dental Enumeration and Tooth Pathosis Detection Benchmark for Panoramic X-ray 2025-11-13
Show

Panoramic X-rays are frequently used in dentistry for treatment planning, but their interpretation can be both time-consuming and prone to error. Artificial intelligence (AI) has the potential to aid in the analysis of these X-rays, thereby improving the accuracy of dental diagnoses and treatment plans. Nevertheless, designing automated algorithms for this purpose poses significant challenges, mainly due to the scarcity of annotated data and variations in anatomical structure. To address these issues, we organized the Dental Enumeration and Diagnosis on Panoramic X-rays Challenge (DENTEX) in association with the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) in 2023. This challenge aims to promote the development of algorithms for multi-label detection of abnormal teeth, using three types of hierarchically annotated data: partially annotated quadrant data, partially annotated quadrant-enumeration data, and fully annotated quadrant-enumeration-diagnosis data, inclusive of four different diagnoses. In this paper, we present a comprehensive analysis of the methods and results from the challenge. Our findings reveal that top performers succeeded through diverse, specialized strategies, from segmentation-guided pipelines to highly-engineered single-stage detectors, using advanced Transformer and diffusion models. These strategies significantly outperformed traditional approaches, particularly for the challenging tasks of tooth enumeration and subtle disease classification. By dissecting the architectural choices that drove success, this paper provides key insights for future development of AI-powered tools that can offer more precise and efficient diagnosis and treatment planning in dentistry. The evaluation code and datasets can be accessed at https://github.com/ibrahimethemhamamci/DENTEX

Code Link
Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from LDCT 2025-11-13
Show

Low-dose chest computed tomography (LDCT) inherently captures both pulmonary and cardiac structures, offering a unique opportunity for joint assessment of lung and cardiovascular health. However, most existing approaches treat these domains as independent tasks, overlooking their physiological interplay and shared imaging biomarkers. We propose an Explainable Cross-Disease Reasoning Framework that enables interpretable cardiopulmonary risk assessment from a single LDCT scan. The framework introduces an agentic reasoning process that emulates clinical diagnostic thinking-first perceiving pulmonary findings, then reasoning through established medical knowledge, and finally deriving a cardiovascular judgment with explanatory rationale. It integrates three synergistic components: a pulmonary perception module that summarizes lung abnormalities, a knowledge-guided reasoning module that infers their cardiovascular implications, and a cardiac representation module that encodes structural biomarkers. Their outputs are fused to produce a holistic cardiovascular risk prediction that is both accurate and physiologically grounded. Experiments on the NLST cohort demonstrate that the proposed framework achieves state-of-the-art performance for CVD screening and mortality prediction, outperforming single-disease and purely image-based baselines. Beyond quantitative gains, the framework provides human-verifiable reasoning that aligns with cardiological understanding, revealing coherent links between pulmonary abnormalities and cardiac stress mechanisms. Overall, this work establishes a unified and explainable paradigm for cardiovascular analysis from LDCT, bridging the gap between image-based prediction and mechanism-based medical interpretation.

None
Dual-Mode Deep Anomaly Detection for Medical Manufacturing: Structural Similarity and Feature Distance 2025-11-13
Show

Automated visual inspection in medical-device manufacturing faces unique challenges, including extremely low defect rates, limited annotated data, hardware restrictions on production lines, and the need for validated, explainable artificial-intelligence systems. This paper presents two attention-guided autoencoder architectures that address these constraints through complementary anomaly-detection strategies. The first employs a multi-scale structural-similarity (4-MS-SSIM) index for inline inspection, enabling interpretable, real-time defect detection on constrained hardware. The second applies a Mahalanobis-distance analysis of randomly reduced latent features for efficient feature-space monitoring and lifecycle verification. Both approaches share a lightweight backbone optimised for high-resolution imagery for typical manufacturing conditions. Evaluations on the Surface Seal Image (SSI) dataset-representing sterile-barrier packaging inspection-demonstrate that the proposed methods outperform reference baselines, including MOCCA, CPCAE, and RAG-PaDiM, under realistic industrial constraints. Cross-domain validation on the MVTec-Zipper benchmark confirms comparable accuracy to state-of-the-art anomaly-detection methods. The dual-mode framework integrates inline anomaly detection and supervisory monitoring, advancing explainable AI architectures toward greater reliability, observability, and lifecycle monitoring in safety-critical manufacturing environments. To facilitate reproducibility, the source code developed for the experiments has been released in the project repository, while the datasets were obtained from publicly available sources.

12 pa...

12 pages, 3 figures, 3 tables

None
MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging 2025-11-13
Show

Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels--representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.

To appear at AAAI-26 None
Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning 2025-11-13
Show

Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean$\pm$std over three seeds and include $95%$ confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size $=4$), length penalty $=1.1$, $no_repeat_ngram_size$ $=3$, and max length $=128$. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.

None
Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation 2025-11-12
Show

Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.

7 figures, 3 tables None
TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks 2025-11-12
Show

The growing number of medical tomography examinations has necessitated the development of automated methods capable of extracting comprehensive imaging features to facilitate downstream tasks such as tumor characterization, while assisting physicians in managing their growing workload. However, 3D medical image classification remains a challenging task due to the complex spatial relationships and long-range dependencies inherent in volumetric data. Training models from scratch suffers from low data regimes, and the absence of 3D large-scale multimodal datasets has limited the development of 3D medical imaging foundation models. Recent studies, however, have highlighted the potential of 2D vision foundation models, originally trained on natural images, as powerful feature extractors for medical image analysis. Despite these advances, existing approaches that apply 2D models to 3D volumes via slice-based decomposition remain suboptimal. Conventional volume slicing strategies, which rely on canonical planes such as axial, sagittal, or coronal, may inadequately capture the spatial extent of target structures when these are misaligned with standardized viewing planes. Furthermore, existing slice-wise aggregation strategies rarely account for preserving the volumetric structure, resulting in a loss of spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.

Prepr...

Preprint submitted to Medical Image Analysis (MedIA)

Code Link
A multimodal AI agent for clinical decision support in ophthalmology 2025-11-12
Show

Artificial intelligence has shown promise in medical imaging, yet most existing systems lack flexibility, interpretability, and adaptability - challenges especially pronounced in ophthalmology, where diverse imaging modalities are essential. We present EyeAgent, the first agentic AI framework for comprehensive and interpretable clinical decision support in ophthalmology. Using a large language model (DeepSeek-V3) as its central reasoning engine, EyeAgent interprets user queries and dynamically orchestrates 53 validated ophthalmic tools across 23 imaging modalities for diverse tasks including classification, segmentation, detection, image/report generation, and quantitative analysis. Stepwise ablation analysis demonstrated a progressive improvement in diagnostic accuracy, rising from a baseline of 69.71% (using only 5 general tools) to 80.79% when the full suite of 53 specialized tools was integrated. In an expert rating study on 200 real-world clinical cases, EyeAgent achieved 93.7% tool selection accuracy and received expert ratings of more than 88% across accuracy, completeness, safety, reasoning, and interpretability. In human-AI collaboration, EyeAgent matched or exceeded the performance of senior ophthalmologists and, when used as an assistant, improved overall diagnostic accuracy by 18.51% and report quality scores by 19%, with the greatest benefit observed among junior ophthalmologists. These findings establish EyeAgent as a scalable and trustworthy AI framework for ophthalmology and provide a blueprint for modular, multimodal, and clinically aligned next-generation AI systems.

28 pages, 5 figures None
Learning with Preserving for Continual Multitask Learning 2025-11-11
Show

Artificial intelligence systems in critical fields like autonomous driving and medical imaging analysis often continually learn new tasks using a shared stream of input data. For instance, after learning to detect traffic signs, a model may later need to learn to classify traffic lights or different types of vehicles using the same camera feed. This scenario introduces a challenging setting we term Continual Multitask Learning (CMTL), where a model sequentially learns new tasks on an underlying data distribution without forgetting previously learned abilities. Existing continual learning methods often fail in this setting because they learn fragmented, task-specific features that interfere with one another. To address this, we introduce Learning with Preserving (LwP), a novel framework that shifts the focus from preserving task outputs to maintaining the geometric structure of the shared representation space. The core of LwP is a Dynamically Weighted Distance Preservation (DWDP) loss that prevents representation drift by regularizing the pairwise distances between latent data representations. This mechanism of preserving the underlying geometric structure allows the model to retain implicit knowledge and support diverse tasks without requiring a replay buffer, making it suitable for privacy-conscious applications. Extensive evaluations on time-series and image benchmarks show that LwP not only mitigates catastrophic forgetting but also consistently outperforms state-of-the-art baselines in CMTL tasks. Notably, our method shows superior robustness to distribution shifts and is the only approach to surpass the strong single-task learning baseline, underscoring its effectiveness for real-world dynamic environments.

25 pa...

25 pages, 16 figures, accepted at AAAI-2026

None
Fast Multi-Organ Fine Segmentation in CT Images with Hierarchical Sparse Sampling and Residual Transformer 2025-11-11
Show

Multi-organ segmentation of 3D medical images is fundamental with meaningful applications in various clinical automation pipelines. Although deep learning has achieved superior performance, the time and memory consumption of segmenting the entire 3D volume voxel by voxel using neural networks can be huge. Classifiers have been developed as an alternative in cases with certain points of interest, but the trade-off between speed and accuracy remains an issue. Thus, we propose a novel fast multi-organ segmentation framework with the usage of hierarchical sparse sampling and a Residual Transformer. Compared with whole-volume analysis, the hierarchical sparse sampling strategy could successfully reduce computation time while preserving a meaningful hierarchical context utilizing multiple resolution levels. The architecture of the Residual Transformer segmentation network could extract and combine information from different levels of information in the sparse descriptor while maintaining a low computational cost. In an internal data set containing 10,253 CT images and the public dataset TotalSegmentator, the proposed method successfully improved qualitative and quantitative segmentation performance compared to the current fast organ classifier, with fast speed at the level of ~2.24 seconds on CPU hardware. The potential of achieving real-time fine organ segmentation is suggested.

EMBC 2025 oral None
ChexFract: From General to Specialized -- Enhancing Fracture Description Generation 2025-11-11
Show

Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.

13 pages, 3 figures None
Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs 2025-11-11
Show

AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose \textbf{Auto-US}, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed \textbf{CUV Dataset} of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed \textbf{CTU-Net}, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73% Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: https://github.com/Bean-Young/Auto-US.

Under Review Code Link
Segmentation of Ischemic Stroke Lesions using Transfer Learning on Multi-sequence MRI 2025-11-10
Show

The accurate understanding of ischemic stroke lesions is critical for efficient therapy and prognosis of stroke patients. Magnetic resonance imaging (MRI) is sensitive to acute ischemic stroke and is a common diagnostic method for stroke. However, manual lesion segmentation performed by experts is tedious, time-consuming, and prone to observer inconsistency. Automatic medical image analysis methods have been proposed to overcome this challenge. However, previous approaches have relied on hand-crafted features that may not capture the irregular and physiologically complex shapes of ischemic stroke lesions. In this study, we present a novel framework for quickly and automatically segmenting ischemic stroke lesions on various MRI sequences, including T1-weighted, T2-weighted, DWI, and FLAIR. The proposed methodology is validated on the ISLES 2015 Brain Stroke sequence dataset, where we trained our model using the Res-Unet architecture twice: first, with pre-existing weights, and then without, to explore the benefits of transfer learning. Evaluation metrics, including the Dice score and sensitivity, were computed across 3D volumes. Finally, a Majority Voting Classifier was integrated to amalgamate the outcomes from each axis, resulting in a comprehensive segmentation method. Our efforts culminated in achieving a Dice score of 80.5% and an accuracy of 74.03%, showcasing the efficacy of our segmentation approach.

Ische...

Ischemic Stroke, Segmentation, Transfer Learning, Magnetic Resonance Imaging, Deep Learning, Res-UNet

None
Sim4Seg: Boosting Multimodal Multi-disease Medical Diagnosis Segmentation with Region-Aware Vision-Language Similarity Masks 2025-11-10
Show

Despite significant progress in pixel-level medical image analysis, existing medical image segmentation models rarely explore medical segmentation and diagnosis tasks jointly. However, it is crucial for patients that models can provide explainable diagnoses along with medical segmentation results. In this paper, we introduce a medical vision-language task named Medical Diagnosis Segmentation (MDS), which aims to understand clinical queries for medical images and generate the corresponding segmentation masks as well as diagnostic results. To facilitate this task, we first present the Multimodal Multi-disease Medical Diagnosis Segmentation (M3DS) dataset, containing diverse multimodal multi-disease medical images paired with their corresponding segmentation masks and diagnosis chain-of-thought, created via an automated diagnosis chain-of-thought generation pipeline. Moreover, we propose Sim4Seg, a novel framework that improves the performance of diagnosis segmentation by taking advantage of the Region-Aware Vision-Language Similarity to Mask (RVLS2M) module. To improve overall performance, we investigate a test-time scaling strategy for MDS tasks. Experimental results demonstrate that our method outperforms the baselines in both segmentation and diagnosis.

AAAI 2026 None
EditGRPO: Reinforcement Learning with Post-Rollout Edits for Clinically Accurate Chest X-Ray Report Generation 2025-11-10
Show

Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models, have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B, EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in clinical metrics across four major datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.

AACL 2025 None
Differential privacy for medical deep learning: methods, tradeoffs, and deployment implications 2025-11-09
Show

Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP-especially at strong privacy budgets-can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.

None
Adapted Foundation Models for Breast MRI Triaging in Contrast-Enhanced and Non-Contrast Enhanced Protocols 2025-11-08
Show

Background: Magnetic resonance imaging (MRI) has high sensitivity for breast cancer detection, but interpretation is time-consuming. Artificial intelligence may aid in pre-screening. Purpose: To evaluate the DINOv2-based Medical Slice Transformer (MST) for ruling out significant findings (Breast Imaging Reporting and Data System [BI-RADS] >=4) in contrast-enhanced and non-contrast-enhanced abbreviated breast MRI. Materials and Methods: This institutional review board approved retrospective study included 1,847 single-breast MRI examinations (377 BI-RADS >=4) from an in-house dataset and 924 from an external validation dataset (Duke). Four abbreviated protocols were tested: T1-weighted early subtraction (T1sub), diffusion-weighted imaging with b=1500 s/mm2 (DWI1500), DWI1500+T2-weighted (T2w), and T1sub+T2w. Performance was assessed at 90%, 95%, and 97.5% sensitivity using five-fold cross-validation and area under the receiver operating characteristic curve (AUC) analysis. AUC differences were compared with the DeLong test. False negatives were characterized, and attention maps of true positives were rated in the external dataset. Results: A total of 1,448 female patients (mean age, 49 +/- 12 years) were included. T1sub+T2w achieved an AUC of 0.77 +/- 0.04; DWI1500+T2w, 0.74 +/- 0.04 (p=0.15). At 97.5% sensitivity, T1sub+T2w had the highest specificity (19% +/- 7%), followed by DWI1500+T2w (17% +/- 11%). Missed lesions had a mean diameter =4, achieving 19% specificity for contrast-enhanced and 17% for non-contrast-enhanced MRI. Further research is warranted before clinical implementation.

23 pa...

23 pages, 6 figures, 4 tables. Originally submitted to Radiology (RAD-25-2541); under consideration for transfer to Radiology: Artificial Intelligence (RSNA Portfolio Journal)

None
TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation 2025-11-08
Show

Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.

None
Towards Better Ultrasound Video Segmentation Foundation Model: An Empirical study on SAM2 Finetuning from Data Perspective 2025-11-07
Show

Ultrasound (US) video segmentation remains a challenging problem due to strong inter- and intra-dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero-shot and prompt-guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data-centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training-set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task-specific fine-tuning, intermediate adaptation, and multi-task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound-specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data-aware adaptation pipelines for SAM2 in ultrasound video analysis.

None
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability 2025-11-06
Show

Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce $\textbf{RadZero}$, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability. A key component of our approach is $\textbf{VL-CABS}$ ($\textbf{V}$ision-$\textbf{L}$anguage $\textbf{C}$ross-$\textbf{A}$ttention $\textbf{B}$ased on $\textbf{S}$imilarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging. Code is available at $\href{https://github.com/deepnoid-ai/RadZero}{https://github.com/deepnoid-ai/RadZero}$.

NeurIPS 2025 Code Link