Image Classification

Title Date Abstract Comment CodeRepository
A Self supervised learning framework for imbalanced medical imaging datasets 2026-04-02

Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self-supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging. 3) We evaluate eight representative SSL methods on 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on RetinaMNIST, 1.88% on TissueMNIST, and 3.1% on DermaMNIST.

None
Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling 2026-04-02

Artificial intelligence (AI) systems accelerate medical workflows and improve diagnostic accuracy in healthcare, serving as second-opinion systems. However, the unpredictability of AI errors poses a significant challenge, particularly in healthcare contexts, where mistakes can have severe consequences. A widely adopted safeguard is to pair predictions with uncertainty estimation, enabling human experts to focus on high-risk cases while streamlining routine verification. Current uncertainty estimation methods, however, remain limited, particularly in quantifying aleatoric uncertainty, which arises from data ambiguity and noise. To address this, we propose a novel approach that leverages disagreement in expert responses to generate targets for training machine learning models. These targets are used in conjunction with standard data labels to estimate two components of uncertainty separately, as given by the law of total variance, via a two-ensemble approach, as well as its lightweight variant. We validate our method on binary image classification, binary and multi-class image segmentation, and multiple-choice question answering. Our experiments demonstrate that incorporating expert knowledge can enhance uncertainty estimation quality by 9% to 50% depending on the task, making this source of information invaluable for the construction of risk-aware AI systems in healthcare applications.

None
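
The abstract above splits uncertainty into two components via the law of total variance. As a rough illustration only, the sketch below applies the textbook decomposition to a plain ensemble of binary classifiers; it does not reproduce the paper's two-ensemble method or its expert-disagreement targets, and the function name and toy probabilities are hypothetical.

```python
import numpy as np

def decompose_binary_uncertainty(member_probs):
    """Split predictive variance of a binary label across an ensemble.

    member_probs: array of shape (n_members, n_samples) holding each
    member's predicted P(y = 1 | x). Hypothetical helper for illustration.
    """
    p = np.asarray(member_probs, dtype=float)
    mean_p = p.mean(axis=0)
    # Expected within-member variance, E[Var(y | member)]: the aleatoric part.
    aleatoric = (p * (1.0 - p)).mean(axis=0)
    # Variance of member means, Var(E[y | member)]: the epistemic part.
    epistemic = p.var(axis=0)
    # Variance of y under the averaged prediction.
    total = mean_p * (1.0 - mean_p)
    return aleatoric, epistemic, total

# Toy values: members disagree on sample 0 and agree on sample 1.
probs = np.array([[0.9, 0.5],
                  [0.7, 0.5],
                  [0.8, 0.5]])
aleatoric, epistemic, total = decompose_binary_uncertainty(probs)
```

For a Bernoulli label the two parts sum exactly to the total, so sample 1 (all members at 0.5) carries only aleatoric uncertainty, while sample 0 carries some of both.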
Cosine-Normalized Attention for Hyperspectral Image Classification 2026-04-02

Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.

None
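
To make the scoring idea in the abstract above concrete, here is a minimal NumPy sketch of attention with unit-normalized queries and keys and a squared cosine score. The softmax normalization, the `temperature` parameter, and the function name are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def squared_cosine_attention(q, k, v, temperature=1.0, eps=1e-8):
    """Single-head attention with squared cosine scores (illustrative).

    q: (n_queries, d), k: (n_keys, d), v: (n_keys, d_v).
    """
    # Project queries and keys onto the unit hypersphere.
    q_unit = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_unit = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    # Squared cosine similarity depends only on the angle between vectors.
    scores = (q_unit @ k_unit.T) ** 2 / temperature
    # Numerically stable softmax over the keys (assumed normalization).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 3))
out, weights = squared_cosine_attention(q, k, v)
# Rescaling the queries leaves the result essentially unchanged.
out_scaled, _ = squared_cosine_attention(10.0 * q, k, v)
```

Because magnitudes are divided out before scoring, rescaling `q` or `k` does not change the attention weights, which is the reduced sensitivity to magnitude that the abstract describes.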
StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold 2026-04-02

Low-rank adaptation (LoRA) has been widely adopted as a parameter-efficient technique for fine-tuning large-scale pre-trained models. However, it still lags behind full fine-tuning in performance, partly due to its insufficient exploitation of the geometric structure underlying low-rank manifolds. In this paper, we propose a geometry-aware extension of LoRA that uses a three-factor decomposition $USV^\top$. Analogous to the structure of singular value decomposition (SVD), it separates the adapter's input and output subspaces, $V$ and $U$, from the scaling factor $S$. Our method constrains $U$ and $V$ to lie on the Stiefel manifold, ensuring their orthonormality throughout the training. To optimize on the Stiefel manifold, we employ a flexible and modular geometric optimization design that converts any Euclidean optimizer to a Riemannian one. It enables efficient subspace learning while remaining compatible with existing fine-tuning pipelines. Empirical results across a wide range of downstream tasks, including commonsense reasoning, math and code generation, image classification, and image generation, demonstrate the superior performance of our approach against the recent state-of-the-art variants of LoRA. Code is available at https://github.com/SonyResearch/stella.

NeurIPS 2025 Spotlight

Code Link
Beyond Logit Adjustment: A Residual Decomposition Framework for Long-Tailed Reranking 2026-04-02

Long-tailed classification, where a small number of frequent classes dominate many rare ones, remains challenging because models systematically favor frequent classes at inference time. Existing post-hoc methods such as logit adjustment address this by adding a fixed classwise offset to the base-model logits. However, the correction required to restore the relative ranking of two classes need not be constant across inputs, and a fixed offset cannot adapt to such variation. We study this problem through Bayes-optimal reranking on a base-model top-k shortlist. The gap between the optimal score and the base score, the residual correction, decomposes into a classwise component that is constant within each class, and a pairwise component that depends on the input and competing labels. When the residual is purely classwise, a fixed offset suffices to recover the Bayes-optimal ordering. We further show that when the same label pair induces incompatible ordering constraints across contexts, no fixed offset can achieve this recovery. This decomposition leads to testable predictions regarding when pairwise correction can improve performance and when it cannot. We develop REPAIR (Reranking via Pairwise residual correction), a lightweight post-hoc reranker that combines a shrinkage-stabilized classwise term with a linear pairwise term driven by competition features on the shortlist. Experiments on five benchmarks spanning image classification, species recognition, scene recognition, and rare disease diagnosis confirm that the decomposition explains where pairwise correction helps and where classwise correction alone suffices.

Preprint None
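
The fixed classwise offset that the abstract above contrasts itself with can be sketched as follows. This shows post-hoc logit adjustment in its common form (subtracting a scaled log-prior), not REPAIR itself; the priors, logits, and the `tau` parameter are made-up illustrative values.

```python
import numpy as np

def logit_adjust(logits, class_priors, tau=1.0):
    """Apply an input-independent classwise offset to base-model logits.

    logits: (n_samples, n_classes); class_priors: (n_classes,) training
    frequencies. Subtracting tau * log(prior) boosts rare classes.
    """
    return logits - tau * np.log(class_priors)

# Hypothetical long-tailed priors: one head class and two tail classes.
priors = np.array([0.90, 0.09, 0.01])
logits = np.array([[2.0, 1.5, 0.2]])
adjusted = logit_adjust(logits, priors)
offset = adjusted - logits  # identical for every input by construction
```

In this toy case the base model ranks the head class first, while the adjusted scores rank the rarest class first. The offset is the same whatever the input, which is exactly the limitation the abstract points to when the needed correction varies across inputs.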
Vision Tiny Recursion Model (ViTRM): Parameter-Efficient Image Classification via Recursive State Refinement 2026-04-01

The success of deep learning in computer vision has been driven by models of increasing scale, from deep Convolutional Neural Networks (CNN) to large Vision Transformers (ViT). While effective, these architectures are parameter-intensive and demand significant computational resources, limiting deployment in resource-constrained environments. Inspired by Tiny Recursive Models (TRM), which show that small recursive networks can solve complex reasoning tasks through iterative state refinement, we introduce the Vision Tiny Recursion Model (ViTRM): a parameter-efficient architecture that replaces the $L$-layer ViT encoder with a single tiny $k$-layer block ($k = 3$) applied recursively $N$ times. Despite using up to $6\times$ and $84\times$ fewer parameters than CNN-based models and ViT respectively, ViTRM maintains competitive performance on CIFAR-10 and CIFAR-100. This demonstrates that recursive computation is a viable, parameter-efficient alternative to architectural depth in vision.

None
Fair Indivisible Payoffs through Shapley Value 2026-04-01

We consider the problem of payoff division in indivisible coalitional games, where the value of the grand coalition is a natural number. This number represents a certain quantity of indivisible objects, such as parliamentary seats, kidney exchanges, or top features contributing to the outcome of a machine learning model. The goal of this paper is to propose a fair method for dividing these objects among players. To achieve this, we define the indivisible Shapley value and study its properties. We demonstrate our proposed technique using three case studies, in particular, we use it to identify key regions of an image in the context of an image classification task.

None
WAON: Large-Scale Japanese Image-Text Pair Dataset for Improving Model Performance on Japanese Cultural Tasks 2026-04-01

Contrastive pre-training on large-scale image-text pair datasets has driven major advances in vision-language representation learning. Recent work shows that pretraining on global data followed by language or culture specific fine-tuning is effective for improving performance in target domains. With the availability of strong open-weight multilingual models such as SigLIP2, this paradigm has become increasingly practical. However, for Japanese, the scarcity of large-scale, high-quality image-text pair datasets tailored to Japanese language and cultural content remains a key limitation. To address this gap, we introduce WAON, the largest Japanese image-text pair dataset constructed from Japanese web content in Common Crawl, containing approximately 155 million examples. Our dataset construction pipeline employs filtering and deduplication to improve dataset quality. To improve the quality and reliability of evaluation on Japanese cultural tasks, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification comprising 374 classes, which addresses issues in the existing benchmark such as category imbalance and label-image mismatches. Our experiments demonstrate that fine-tuning on WAON improves model performance on Japanese cultural benchmarks more efficiently than existing datasets, achieving state-of-the-art results among publicly available models of comparable architecture. We release our dataset, model, and code.

14 pages, 7 figures None
EvalBlocks: A Modular Pipeline for Rapidly Evaluating Foundation Models in Medical Imaging 2026-04-01

Developing foundation models in medical imaging requires continuous monitoring of downstream performance. Researchers are burdened with tracking numerous experiments, design choices, and their effects on performance, often relying on ad-hoc, manual workflows that are inherently slow and error-prone. We introduce EvalBlocks, a modular, plug-and-play framework for efficient evaluation of foundation models during development. Built on Snakemake, EvalBlocks supports seamless integration of new datasets, foundation models, aggregation methods, and evaluation strategies. All experiments and results are tracked centrally and are reproducible with a single command, while efficient caching and parallel execution enable scalable use on shared compute infrastructure. Demonstrated on five state-of-the-art foundation models and three medical imaging classification tasks, EvalBlocks streamlines model evaluation, enabling researchers to iterate faster and focus on model innovation rather than evaluation logistics. The framework is released as open source software at https://github.com/DIAGNijmegen/eval-blocks.

Accepted and published in BVM 2026 proceedings (Springer)

Code Link
Representation Learning with Semantic-aware Instance and Sparse Token Alignments 2026-04-01

Medical contrastive vision-language pre-training (VLP) has demonstrated significant potential in improving performance on downstream tasks. Traditional approaches typically employ contrastive learning, treating paired image-report samples as positives and unpaired ones as negatives. However, in medical datasets, there can be substantial similarities between images or reports from different patients. Rigidly treating all unpaired samples as negatives can disrupt the underlying semantic structure and negatively impact the quality of the learned representations. In this paper, we propose a multi-level alignment framework, Representation Learning with Semantic-aware Instance and Sparse Token Alignments (SISTA), which exploits the semantic correspondence between medical images and radiology reports at two levels, i.e., image-report and patch-word levels. Specifically, we improve conventional contrastive learning by incorporating inter-report similarity to eliminate false negatives and introduce a method to effectively align image patches with relevant word tokens. Experimental results demonstrate the effectiveness of the proposed framework in improving transfer performance across different datasets on three downstream tasks: image classification, image segmentation, and object detection. Notably, our framework achieves significant improvements in fine-grained tasks even with limited labeled data. Codes and pre-trained models will be made available.

Accepted to ICPR 2026

None
Uncertainty Quantification With Multiple Sources 2026-03-31

Weighted conformal prediction (WCP) has been commonly used to quantify prediction uncertainty under covariate shift. However, the effectiveness of WCP relies heavily on the degree of overlap between the training and test covariate distributions. This challenge is exacerbated in multi-source settings with varying covariate distributions, where direct application of WCP can be impractical. In this paper, we address the multi-source setup by leveraging WCP under the assumption of a shared conditional distribution. We investigate two extensions of WCP: (i) a merge-based aggregation of source-specific weighted conformal prediction sets, and (ii) a data-pooling strategy that jointly reweights samples across all sources. Theoretical guarantees are provided for the proposed approaches, and experiments are conducted based on a synthetic regression task and a multi-domain image classification benchmark to validate our proposed methods.

23 pages None
Hierarchical Pre-Training of Vision Encoders with Large Language Models 2026-03-31

The field of computer vision has experienced significant advancements through scalable vision encoders and multimodal pre-training frameworks. However, existing approaches often treat vision encoders and large language models (LLMs) as independent modules, limiting the integration of hierarchical visual features. In this work, we propose HIVE (Hierarchical Pre-Training of Vision Encoders), a novel framework that enhances vision-language alignment by introducing hierarchical cross-attention between the vision encoder and LLM. Unlike conventional methods that flatten image embeddings, HIVE enables structured feature fusion across multiple layers, improving gradient flow and representation learning. To optimize this interaction, we introduce a three-stage training strategy that progressively aligns the vision encoder with the LLM, ensuring stable optimization and effective multimodal fusion. Empirical evaluations demonstrate that HIVE achieves superior performance not only in image classification but also on various vision-language tasks, outperforming self-attention-based methods in benchmarks such as MME, GQA, OK-VQA, and ScienceQA. Our results highlight the benefits of hierarchical feature integration, paving the way for more efficient and expressive vision-language models.

17 pages, 14 figures, accepted to Computer Vision and Pattern Recognition Conference (CVPR) Workshops 2026. 5th MMFM Workshop: What is Next in Multimodal Foundation Models?

None
Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence 2026-03-31

Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: 'improving worst-case explanations' (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and 'recalling high-quality explanations' (deferring explanation generation for uncertain samples under constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.

None
MAPLE: Multi-Path Adaptive Propagation with Level-Aware Embeddings for Hierarchical Multi-Label Image Classification 2026-03-31

Hierarchical multi-label classification (HMLC) is essential for modeling structured label dependencies in remote sensing. Yet existing approaches struggle in multi-path settings, where images may activate multiple taxonomic branches, leading to underuse of hierarchical information. We propose MAPLE (Multi-Path Adaptive Propagation with Level-Aware Embeddings), a framework that integrates (i) hierarchical semantic initialization from graph-aware textual descriptions, (ii) graph-based structure encoding via graph convolutional networks (GCNs), and (iii) adaptive multi-modal fusion that dynamically balances semantic priors and visual evidence. An adaptive level-aware objective automatically selects appropriate losses per hierarchy level. Evaluations on CORINE-aligned remote sensing datasets (AID, DFC-15, and MLRSNet) show consistent improvements of up to +42% in few-shot regimes while adding only 2.6% parameter overhead, demonstrating that MAPLE effectively and efficiently models hierarchical semantics for Earth observation (EO).

REO: Advances in Representation Learning for Earth Observation, accepted workshop paper at EurIPS

None
Big2Small: A Unifying Neural Network Framework for Model Compression 2026-03-31

With the development of foundational models, model compression has become a critical requirement. Various model compression approaches have been proposed such as low-rank decomposition, pruning, quantization, ergodic dynamic systems, and knowledge distillation, which are based on different heuristics. To elevate the field from fragmentation to a principled discipline, we construct a unifying mathematical framework for model compression grounded in measure theory. We further demonstrate that each model compression technique is mathematically equivalent to a neural network subject to a regularization. Building upon this mathematical and structural equivalence, we propose an experimentally-verified data-free model compression framework, termed Big2Small, which translates Implicit Neural Representations (INRs) from data domain to the domain of network parameters. Big2Small trains compact INRs to encode the weights of larger models and reconstruct the weights during inference. To enhance reconstruction fidelity, we introduce Outlier-Aware Preprocessing to handle extreme weight values and a Frequency-Aware Loss function to preserve high-frequency details. Experiments on image classification and segmentation demonstrate that Big2Small achieves competitive accuracy and compression ratios compared to state-of-the-art baselines.

None
Detection of Adversarial Attacks in Robotic Perception 2026-03-31

Deep Neural Networks (DNNs) achieve strong performance in semantic segmentation for robotic perception but remain vulnerable to adversarial attacks, threatening safety-critical applications. While robustness has been studied for image classification, semantic segmentation in robotic contexts requires specialized architectures and detection strategies.

9 pages, 6 figures. Accepted and presented at STE 2025, Transilvania University of Brasov, Romania

None
Deep Learning-Based Anomaly Detection in Spacecraft Telemetry on Edge Devices 2026-03-31

Spacecraft anomaly detection is critical for mission safety, yet deploying sophisticated models on-board presents significant challenges due to hardware constraints. This paper investigates three approaches for spacecraft telemetry anomaly detection -- forecasting & threshold, direct classification, and image classification -- and optimizes them for edge deployment using multi-objective neural architecture optimization on the European Space Agency Anomaly Dataset. Our baseline experiments demonstrate that forecasting & threshold achieves superior detection performance (92.7% Corrected Event-wise F0.5-score (CEF0.5)) [1] compared to alternatives. Through Pareto-optimal architecture optimization, we dramatically reduced computational requirements while maintaining capabilities -- the optimized forecasting & threshold model preserved 88.8% CEF0.5 while reducing RAM usage by 97.1% to just 59 KB and operations by 99.4%. Analysis of deployment viability shows our optimized models require just 0.36-6.25% of CubeSat RAM, making on-board anomaly detection practical even on highly constrained hardware. This research demonstrates that sophisticated anomaly detection capabilities can be successfully deployed within spacecraft edge computing constraints, providing near-instantaneous detection without exceeding hardware limitations or compromising mission safety.

IEEE Space Computing Conference (SCC 2025), Los Angeles, CA, USA, 28 July - 1 August 2025

None
REN: Anatomically-Informed Mixture-of-Experts for Interstitial Lung Disease Diagnosis 2026-03-30

Mixture-of-Experts (MoE) architectures achieve scalable learning by routing inputs to specialized subnetworks through conditional computation. However, conventional MoE designs assume homogeneous expert capability and domain-agnostic routing, assumptions that are fundamentally misaligned with medical imaging, where anatomical structure and regional disease heterogeneity govern pathological patterns. We introduce Regional Expert Networks (REN), the first anatomically-informed MoE framework for medical image classification. REN encodes anatomical priors by training seven specialized experts, each dedicated to a distinct lung lobe or bilateral lung combination, enabling precise modeling of region-specific pathological variation. Multi-modal gating mechanisms dynamically integrate radiomics biomarkers with deep learning (DL) features extracted by convolutional (CNN), Transformer (ViT), and state-space (Mamba) architectures to weight expert contributions at inference. Applied to interstitial lung disease (ILD) classification on a 597-patient, 1,898-scan longitudinal cohort, REN achieves consistently superior performance: the radiomics-guided ensemble attains an average AUC of 0.8646 ± 0.0467, a +12.5% improvement over the SwinUNETR single-model baseline (AUC 0.7685, p=0.031). Lower-lobe experts reach AUCs of 0.88-0.90, outperforming DL baselines (CNN: 0.76-0.79) and mirroring known patterns of basal ILD progression. Evaluated under rigorous patient-level cross-validation, REN demonstrates strong generalizability and clinical interpretability, establishing a scalable, anatomically-guided framework potentially extensible to other structured medical imaging tasks. Code is available on our GitHub https://github.com/NUBagciLab/MoE-REN.

13 pages, 4 figures, 5 tables

Code Link
Understanding SAM's Robustness to Noisy Labels through Gradient Down-weighting 2026-03-30

Sharpness-Aware Minimization (SAM) was introduced to improve generalization by seeking flat minima, yet it also exhibits robustness to label noise, a phenomenon that remains only partially understood. Prior work has mainly attributed this effect to SAM's tendency to prolong the learning of clean samples. In this work, we provide a complementary explanation by analyzing SAM at the element-wise level. We show that when noisy gradients dominate a parameter direction, their influence is reduced by the stronger amplification of clean gradients. This slows the memorization of noisy labels while sustaining clean learning, offering a more complete account of SAM's robustness. Building on this insight, we propose SANER (Sharpness-Aware Noise-Explicit Reweighting), a simple variant of SAM that explicitly magnifies this down-weighting effect. Experiments on benchmark image classification tasks with noisy labels demonstrate that SANER significantly mitigates noisy-label memorization and improves generalization over both SAM and SGD. Moreover, since SANER is designed from the mechanism of SAM, it can also be seamlessly integrated into SAM-like variants, further boosting their robustness.

None
Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes 2026-03-30

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.

None
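
The abstract above summarizes a set of sampled answers by the eigenspectrum of their embedding Gram matrix. The sketch below shows only that spectral step, on made-up embeddings; the unit normalization, the sum-to-one scaling, and the function name are assumptions, and the downstream Gaussian Process Classifier is omitted.

```python
import numpy as np

def gram_eigenspectrum(embeddings):
    """Return the normalized eigenvalues of the cosine Gram matrix.

    embeddings: (n_answers, d) array of answer embeddings (hypothetical
    input; any sentence encoder could produce them).
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    gram = x @ x.T  # pairwise cosine similarities between answers
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eigvals = np.linalg.eigvalsh(gram)[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative round-off
    return eigvals / eigvals.sum()

# Three identical answers versus three mutually orthogonal answers.
consistent = gram_eigenspectrum(np.tile([1.0, 2.0, 3.0], (3, 1)))
scattered = gram_eigenspectrum(np.eye(3))
```

Semantically consistent answers concentrate the spectrum on one eigenvalue, while unrelated answers spread it evenly, which gives a clustering-free signal of the kind the abstract describes.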
Revisiting Adversarial Training under Hyperspectral Image 2026-03-30

Recent studies have shown that deep learning-based hyperspectral image (HSI) classification models are highly vulnerable to adversarial attacks, posing significant security risks. Although most approaches attempt to enhance robustness by optimizing network architectures, these methods often rely on customized designs with limited scalability and struggle to defend against strong attacks. To address this issue, we introduce adversarial training (AT), one of the most effective defense strategies, into the hyperspectral domain. However, unlike conventional RGB image classification, directly applying AT to HSI classification introduces unique challenges due to the high-dimensional spectral signatures and strong inter-band correlations of hyperspectral data, where discriminative information relies on subtle spectral semantics and spectral-spatial consistency that are highly sensitive to adversarial perturbations. Through extensive empirical analyses, we observe that adversarial perturbations and the non-smooth nature of adversarial examples can distort or even eliminate important spectral semantic information. To mitigate this issue, we propose two hyperspectral-specific AT methods, termed AT-HARL and AT-RA. Specifically, AT-HARL exploits spectral characteristic differences and class distribution ratios to design a novel loss function that alleviates semantic distortion caused by adversarial perturbations. Meanwhile, AT-RA introduces spectral data augmentation to enhance spectral diversity while preserving spatial smoothness. Experiments on four benchmark HSI datasets demonstrate that the proposed methods achieve competitive performance compared with state-of-the-art approaches under adversarial attacks.

None
EVA: Bridging Performance and Human Alignment in Hard-Attention Vision Models for Image Classification 2026-03-28

Optimizing vision models purely for classification accuracy can impose an alignment tax, degrading human-like scanpaths and limiting interpretability. We introduce EVA, a neuroscience-inspired hard-attention mechanistic testbed that makes the performance-human-likeness trade-off explicit and adjustable. EVA samples a small number of sequential glimpses using a minimal fovea-periphery representation with a CNN-based feature extractor and integrates variance control and adaptive gating to stabilize and regulate attention dynamics. EVA is trained with the standard classification objective without gaze supervision. On CIFAR-10 with dense human gaze annotations, EVA improves scanpath alignment under established metrics such as DTW and NSS while maintaining competitive accuracy. Ablations show that CNN-based feature extraction drives accuracy but suppresses human-likeness, whereas variance control and gating restore human-aligned trajectories with minimal performance loss. We further validate EVA's scalability on ImageNet-100 and evaluate scanpath alignment on COCO-Search18 without COCO-Search18 gaze supervision or finetuning, where EVA yields human-like scanpaths on natural scenes without additional training. Overall, EVA provides a principled framework for trustworthy, human-interpretable active vision.

None
MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks 2026-03-28

Deep learning typically relies on end-to-end backpropagation for training, a method that inherently suffers from issues such as update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. In contrast, supervised local learning seeks to mitigate these challenges by partitioning the network into multiple local blocks and designing independent auxiliary networks to update each block separately. However, because gradients are propagated solely within individual local blocks, performance degradation occurs, preventing supervised local learning from supplanting end-to-end backpropagation. To address these limitations and facilitate inter-block information flow, we propose the Momentum Auxiliary Network++ (MAN++). MAN++ introduces a dynamic interaction mechanism by employing the Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance communication across the network. The auxiliary network, updated via EMA, effectively bridges the information gap between blocks. Notably, we observed that directly applying EMA parameters can be suboptimal due to feature discrepancies between local blocks. To resolve this issue, we introduce a learnable scaling bias that balances feature differences, thereby further improving performance. We validate MAN++ through extensive experiments on tasks that include image classification, object detection, and image segmentation, utilizing multiple network architectures. The experimental results demonstrate that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage. Consequently, MAN++ offers a novel perspective for supervised local learning and presents a viable alternative to conventional training methods.

Accepted by TPAMI None
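
The EMA-based parameter sharing mentioned in the MAN++ abstract can be illustrated with a minimal sketch. The function name `ema_update`, the flat-list parameter layout, and the `momentum` value are assumptions made for illustration; the learnable scaling bias from the abstract is not modeled, and this is not the authors' implementation.

```python
def ema_update(aux_params, block_params, momentum=0.99):
    """Move auxiliary-network parameters toward an adjacent block's parameters.

    Hypothetical helper: each parameter becomes
    momentum * old_aux + (1 - momentum) * block.
    """
    if len(aux_params) != len(block_params):
        raise ValueError("parameter lists must have the same length")
    return [
        momentum * a + (1.0 - momentum) * b
        for a, b in zip(aux_params, block_params)
    ]


# Usage: repeated updates pull the auxiliary parameters toward the block's.
aux = [1.0, 0.0]
block = [0.0, 1.0]
for _ in range(3):
    aux = ema_update(aux, block, momentum=0.9)
```

A momentum close to 1 makes the auxiliary parameters change slowly, so they track a smoothed history of the neighboring block rather than its latest step.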
Preconditioned Attention: Enhancing Efficiency in Transformers 2026-03-28
Show

Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated with a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices with large condition numbers. This ill-conditioning is a well-known obstacle for gradient-based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better-conditioned matrices that improve optimization. Preconditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling, and language modeling.

AISTATS 2026 None
Live Interactive Training for Video Segmentation 2026-03-27
Show

Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.

CVPR 2026 Code Link
Tunable Soft Equivariance with Guarantees 2026-03-27
Show

Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real-world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre-trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre-trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human-trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

None
Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones 2026-03-27
Show

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

Submitted to International Journal of Computer Vision (IJCV); currently under minor revision

Code Link
Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification 2026-03-27
Show

3D medical image classification is essential for modern clinical workflows. Medical foundation models (FMs) have emerged as a promising approach for scaling to new tasks, yet current research suffers from three critical pitfalls: data-regime bias, suboptimal adaptation, and insufficient task coverage. In this paper, we address these pitfalls and introduce AnyMC3D, a scalable 3D classifier adapted from 2D FMs. Our method scales efficiently to new tasks by adding only lightweight plugins (about 1M parameters per task) on top of a single frozen backbone. This versatile framework also supports multi-view inputs, auxiliary pixel-level supervision, and interpretable heatmap generation. We establish a comprehensive benchmark of 12 tasks covering diverse pathologies, anatomies, and modalities, and systematically analyze state-of-the-art 3D classification techniques. Our analysis reveals key insights: (1) effective adaptation is essential to unlock FM potential, (2) general-purpose FMs can match medical-specific FMs if properly adapted, and (3) 2D-based methods surpass 3D architectures for 3D classification. For the first time, we demonstrate the feasibility of achieving state-of-the-art performance across diverse applications using a single scalable framework (including 1st place in the VLM3D challenge), eliminating the need for separate task-specific models.

1st Place in VLM3D Challenge

None
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models 2026-03-26
Show

Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.

None
End-to-end Feature Alignment: A Simple CNN with Intrinsic Class Attribution 2026-03-26
Show

We present Feature-Align CNN (FA-CNN), a prototype CNN architecture with intrinsic class attribution through end-to-end feature alignment. Our intuition is that the use of unordered operations such as Linear and Conv2D layers causes unnecessary shuffling and mixing of semantic concepts, thereby making raw feature maps difficult to understand. We introduce two new order-preserving layers: the dampened skip connection and the global average pooling classifier head. These layers force the model to maintain an end-to-end feature alignment from the raw input pixels all the way to final class logits. This end-to-end alignment enhances the interpretability of the model by allowing the raw feature maps to intrinsically exhibit class attribution. We prove theoretically that FA-CNN penultimate feature maps are identical to Grad-CAM saliency maps. Moreover, we prove that these feature maps slowly morph layer-by-layer over network depth, showing the evolution of features through network depth toward penultimate class activations. FA-CNN performs well on benchmark image classification datasets. Moreover, we compare the averaged FA-CNN raw feature maps against Grad-CAM and permutation methods in a percent pixels removed interpretability task. We conclude this work with a discussion of future work, including limitations and extensions toward hybrid models.

None
Hyperspectral Trajectory Image for Multi-Month Trajectory Anomaly Detection 2026-03-26
Show

Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine-grained evidence such as abnormal speeds and short-duration events, but their quadratic cost makes multi-month analysis intractable; consequently, no existing approach detects anomalies over multi-month dense GPS trajectories. The field instead relies on scalable sparse stay-point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two-dimensional cyclic structure along within-day and across-day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time-of-day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent-level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi-month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC-PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11-75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure-aware modeling are jointly essential. Code will be made public soon.

None
MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness 2026-03-26
Show

Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.

None
Evaluating Synthetic Images as Effective Substitutes for Experimental Data in Surface Roughness Classification 2026-03-26
Show

Hard coatings play a critical role in industry, with ceramic materials offering outstanding hardness and thermal stability for applications that demand superior mechanical performance. However, deploying artificial intelligence (AI) for surface roughness classification is often constrained by the need for large labeled datasets and costly high-resolution imaging equipment. In this study, we explore the use of synthetic images, generated with Stable Diffusion XL, as an efficient alternative or supplement to experimentally acquired data for classifying ceramic surface roughness. We show that augmenting authentic datasets with generative images yields test accuracies comparable to those obtained using exclusively experimental images, demonstrating that synthetic images effectively reproduce the structural features necessary for classification. We further assess method robustness by systematically varying key training hyperparameters (epoch count, batch size, and learning rate), and identify configurations that preserve performance while reducing data requirements. Our results indicate that generative AI can substantially improve data efficiency and reliability in materials-image classification workflows, offering a practical route to lower experimental cost, accelerate model development, and expand AI applicability in materials engineering.

None
Improving Infinitely Deep Bayesian Neural Networks with Nesterov's Accelerated Gradient Method 2026-03-26
Show

As a representative continuous-depth neural network approach, stochastic differential equation (SDE)-based Bayesian neural networks (BNNs) have attracted considerable attention due to their solid theoretical foundations and strong potential for real-world applications. However, their reliance on numerical SDE solvers inevitably incurs a large number of function evaluations (NFEs), resulting in high computational cost and occasional convergence instability. To address these challenges, we propose a Nesterov-accelerated gradient (NAG) enhanced SDE-BNN model. By integrating NAG into the SDE-BNN framework along with an NFE-dependent residual skip connection, our method accelerates convergence and substantially reduces NFEs during both training and testing. Extensive empirical results show that our model consistently outperforms conventional SDE-BNNs across various tasks, including image classification and sequence modeling, achieving lower NFEs and improved predictive accuracy.

None
Attention-based Pin Site Image Classification in Orthopaedic Patients with External Fixators 2026-03-25
Show

Pin sites represent the interface where a metal pin or wire from the external environment passes through the skin into the internal environment of the limb. These pins or wires connect an external fixator to the bone to stabilize the bone segments in a patient with trauma or deformity. Because these pin sites represent an opportunity for external skin flora to enter the internal environment of the limb, infections of the pin site are common. These pin site infections are painful, annoying, and cause increased morbidity to the patients. Improving the identification and management of pin site infections would greatly enhance the patient experience when external fixators are used. To this end, this paper collects and produces a dataset on pin site wound infections and proposes a deep learning (DL) method to classify pin site images based on their appearance: Group A displayed signs of inflammation or infection, while Group B showed no evident complications. Unlike studies that primarily focus on open wounds, our research includes potential interventions at the metal pin/skin interface. Our attention-based deep learning model addresses this complexity by emphasizing relevant regions and minimizing distractions from the pins. Moreover, we introduce an Efficient Redundant Reconstruction Convolution (ERRC) method to enhance the richness of feature maps while reducing the number of parameters. Our model outperforms baseline methods with an AUC of 0.975 and an F1-score of 0.927, requiring only 5.77 M parameters. These results highlight the potential of DL in differentiating pin sites based only on visual signs of infection, aligning with healthcare professional assessments, while further validation with more data remains essential.

None
A Framework for Generating Semantically Ambiguous Images to Probe Human and Machine Perception 2026-03-25
Show

The classic duck-rabbit illusion reveals that when visual evidence is ambiguous, the human brain must decide what it sees. But where exactly do human observers draw the line between "duck" and "rabbit", and do machine classifiers draw it in the same place? We use semantically ambiguous images as interpretability probes to expose how vision models represent the boundaries between concepts. We present a psychophysically-informed framework that interpolates between concepts in the CLIP embedding space to generate continuous spectra of ambiguous images, allowing us to precisely measure where and how humans and machine classifiers place their semantic boundaries. Using this framework, we show that machine classifiers are more biased towards seeing "rabbit", whereas humans are more aligned with the CLIP embedding used for synthesis, and the guidance scale seems to affect human sensitivity more strongly than machine classifiers. Our framework demonstrates how controlled ambiguity can serve as a diagnostic tool to bridge the gap between human psychophysical analysis, image classification, and generative image models, offering insight into human-model alignment, robustness, model interpretability, and image synthesis methods.

None
Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification 2026-03-25
Show

Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.

Preprint None
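
The idea of mixing text and image prototypes, described in the abstract above as acting like a shrinkage estimator, can be sketched as follows. The convex-combination rule, the `alpha` weight, and the function names are assumptions made for illustration; the abstract does not state the exact mixing formula, and the subspace projection and LDA components are not modeled.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mix_prototypes(text_proto, image_feats, alpha=0.5):
    """Shrink the few-shot image mean toward the text prototype.

    alpha = 1.0 uses only the text embedding; alpha = 0.0 only the image mean.
    """
    dim = len(text_proto)
    image_mean = [sum(f[i] for f in image_feats) / len(image_feats) for i in range(dim)]
    t, m = normalize(text_proto), normalize(image_mean)
    return normalize([alpha * a + (1.0 - alpha) * b for a, b in zip(t, m)])


def classify(query, prototypes):
    """Return the class whose mixed prototype has the highest cosine similarity."""
    q = normalize(query)
    return max(prototypes, key=lambda c: sum(a * b for a, b in zip(q, prototypes[c])))


# Usage with toy 2-D embeddings (made-up numbers, not real CLIP features).
prototypes = {
    "cat": mix_prototypes([1.0, 0.0], [[0.9, 0.1], [0.8, 0.3]]),
    "dog": mix_prototypes([0.0, 1.0], [[0.2, 0.9]]),
}
label = classify([1.0, 0.1], prototypes)
```

With only a handful of shots the image mean is noisy, so weighting it against the text embedding trades a little bias for lower variance, which is the intuition behind the shrinkage view.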
Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection 2026-03-25
Show

Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.

None
LGEST: Dynamic Spatial-Spectral Expert Routing for Hyperspectral Image Classification 2026-03-25
Show

Deep learning methods, including Convolutional Neural Networks, Transformers and Mamba, have achieved remarkable success in hyperspectral image (HSI) classification. Nevertheless, existing methods exhibit inflexible integration of local-global representations, inadequate handling of spectral-spatial scale disparities across heterogeneous bands, and susceptibility to the Hughes phenomenon under high-dimensional sample heterogeneity. To address these challenges, we propose Local-Global Expert Spatial-Spectral Transformer (LGEST), a novel framework that synergistically combines three key innovations. The LGEST first employs a Deep Spatial-Spectral Autoencoder (DSAE) to generate compact yet discriminative embeddings through hierarchical nonlinear compression, preserving 3D neighborhood coherence while mitigating information loss in high-dimensional spaces. Secondly, a Cross-Interactive Mixed Expert Feature Pyramid (CIEM-FPN) leverages cross-attention mechanisms and residual mixture-of-experts layers to dynamically fuse multi-scale features, adaptively weighting spectral discriminability and spatial saliency through learnable gating functions. Finally, a Local-Global Expert System (LGES) processes decomposed features via sparsely activated expert pairs: convolutional sub-experts capture fine-grained textures, while transformer sub-experts model long-range contextual dependencies, with a routing controller dynamically selecting experts based on real-time feature saliency. Extensive experiments on four benchmark datasets demonstrate that LGEST consistently outperforms state-of-the-art methods.

None
Kirchhoff-Inspired Neural Networks for Evolving High-Order Perception 2026-03-25
Show

Deep learning architectures are fundamentally inspired by neuroscience, particularly the structure of the brain's sensory pathways, and have achieved remarkable success in learning informative data representations. Although these architectures mimic the communication mechanisms of biological neurons, their strategies for information encoding and transmission are fundamentally distinct. Biological systems depend on dynamic fluctuations in membrane potential; by contrast, conventional deep networks optimize weights and biases by adjusting the strengths of inter-neural connections, lacking a systematic mechanism to jointly characterize the interplay among signal intensity, coupling structure, and state evolution. To tackle this limitation, we propose the Kirchhoff-Inspired Neural Network (KINN), a state-variable-based network architecture constructed based on Kirchhoff's current law. KINN derives numerically stable state updates from fundamental ordinary differential equations, enabling the explicit decoupling and encoding of higher-order evolutionary components within a single layer while preserving physical consistency, interpretability, and end-to-end trainability. Extensive experiments on partial differential equation (PDE) solving and ImageNet image classification validate that KINN outperforms state-of-the-art existing methods.

None
Machine vision with small numbers of detected photons per inference 2026-03-25
Show

Machine vision, including object recognition and image reconstruction, is a central technology in many consumer devices and scientific instruments. The design of machine-vision systems has been revolutionized by the adoption of end-to-end optimization, in which the optical front end and the post-processing back end are jointly optimized. However, while machine vision currently works extremely well in moderate-light or bright-light situations -- where a camera may detect thousands of photons per pixel and billions of photons per frame -- it is far more challenging in very low-light situations. We introduce photon-aware neuromorphic sensing (PANS), an approach for end-to-end optimization in highly photon-starved scenarios. The training incorporates knowledge of the low photon budget and the stochastic nature of light detection when the average number of photons per pixel is near or less than 1. We report a proof-of-principle experimental demonstration in which we performed low-light image classification using PANS, achieving 73% (82%) accuracy on FashionMNIST with an average of only 4.9 (17) detected photons in total per inference, and 86% (97%) on MNIST with 8.6 (29) detected photons -- orders of magnitude more photon-efficient than conventional approaches. We also report simulation studies showing how PANS could be applied to other classification, event-detection, and image-reconstruction tasks. By taking into account the statistics of measurement results for non-classical states or alternative sensing hardware, PANS could in principle be adapted to enable high-accuracy results in quantum and other photon-starved setups.

98 pages, 34 figures None
Efficient Encrypted Computation in Convolutional Spiking Neural Networks with TFHE 2026-03-25
Show

With the rapid advancement of AI technology, concerns about data privacy have grown, motivating research on machine learning with encrypted computation. Fully Homomorphic Encryption (FHE) is a crucial technology for privacy-preserving computation, but it struggles with continuous non-polynomial functions, as it operates on discrete integers and supports only addition and multiplication. Spiking Neural Networks (SNNs), which use discrete spike signals, naturally complement FHE's characteristics. In this paper, we introduce FHE-DiCSNN, a framework built on the TFHE scheme, utilizing the discrete nature of SNNs for secure and efficient computations. By leveraging bootstrapping techniques, we successfully implement Leaky Integrate-and-Fire (LIF) neuron models on ciphertexts, allowing SNNs of arbitrary depth. Our framework is adaptable to other spiking neuron models, offering a novel approach to homomorphic evaluation of SNNs. Additionally, we integrate convolutional methods inspired by CNNs to enhance accuracy and reduce the simulation time associated with random encoding. Parallel computation techniques further accelerate bootstrapping operations. Experimental results on the MNIST and FashionMNIST datasets validate the effectiveness of FHE-DiCSNN, with a loss of less than 3% compared to plaintext and computation times of under 1 second per prediction. We also apply the model to real medical image classification problems and analyze the parameter optimization and selection.

None
Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection 2026-03-24
Show

Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score--ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.

Code Link
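
The prototype-based scoring described in the abstract above (class-wise mean embeddings, L2 normalization, cosine similarity at inference) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the multi-layer features have already been aggregated into one vector per sample, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def class_prototypes(features, labels):
    """Compute one L2-normalized mean embedding per in-distribution class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
            counts[y] = 0
        sums[y] = [s + x for s, x in zip(sums[y], f)]
        counts[y] += 1
    return {y: l2_normalize([s / counts[y] for s in sums[y]]) for y in sums}


def id_score(feature, prototypes):
    """Highest cosine similarity to any prototype; low values suggest OOD."""
    f = l2_normalize(feature)
    return max(sum(a * b for a, b in zip(f, p)) for p in prototypes.values())


# Usage with toy 2-D features (made-up numbers).
train_feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
train_labels = [0, 0, 1, 1]
protos = class_prototypes(train_feats, train_labels)
near = id_score([5.0, 0.0], protos)    # aligned with the class-0 prototype
far = id_score([-1.0, -1.0], protos)   # points away from both prototypes
```

A threshold on `id_score` would then separate in-distribution from OOD inputs; how that threshold is chosen is left open here.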
ARGENT: Adaptive Hierarchical Image-Text Representations 2026-03-24
Show

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics and is prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. This paper introduces a stronger hyperbolic VLM baseline, ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation). ARGENT improves the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and proposed hierarchical metrics, respectively.

None
PoiCGAN: A Targeted Poisoning Based on Feature-Label Joint Perturbation in Federated Learning 2026-03-24
Show

Federated Learning (FL), as a popular distributed learning paradigm, has shown outstanding performance in improving computational efficiency and protecting data privacy, and is widely applied in industrial image classification. However, due to its distributed nature, FL is vulnerable to threats from malicious clients, with poisoning attacks being a common threat. A major limitation of existing poisoning attack methods is their difficulty in bypassing model performance tests and defense mechanisms based on model anomaly detection. This often results in the detection and removal of poisoned models, which undermines their practical utility. To preserve industrial image classification performance while keeping the attack effective, we propose a targeted poisoning attack, PoiCGAN, based on feature-label collaborative perturbation. Our method modifies the inputs of the discriminator and generator in the Conditional Generative Adversarial Network (CGAN) to influence the training process, generating an ideal poison generator. This generator not only produces specific poisoned samples but also automatically performs label flipping. Experiments across various datasets show that our method achieves an attack success rate 83.97% higher than baseline methods, with a less than 8.87% reduction in the main task's accuracy. Moreover, the poisoned samples and malicious models exhibit high stealthiness.

None
Designing to Forget: Deep Semi-parametric Models for Unlearning 2026-03-24
Show

Recent advances in machine unlearning have focused on developing algorithms to remove specific training samples from a trained model. In contrast, we observe that not all models are equally easy to unlearn. Hence, we introduce a family of deep semi-parametric models (SPMs) that exhibit non-parametric behavior during unlearning. SPMs use a fusion module that aggregates information from each training sample, enabling explicit test-time deletion of selected samples without altering model parameters. Empirically, we demonstrate that SPMs achieve task performance competitive with parametric models in image classification and generation, while being significantly more efficient for unlearning. Notably, on ImageNet classification, SPMs reduce the prediction gap relative to a retrained (oracle) baseline by $11\%$ and achieve over $10\times$ faster unlearning compared to existing approaches on parametric models. The code is available at https://github.com/amberyzheng/spm_unlearning.

CVPR 2026 Code Link
Multiplicative learning from observation-prediction ratios 2026-03-24
Show

Additive parameter updates, as used in gradient descent and its adaptive extensions, underpin most modern machine-learning optimization. Yet, such additive schemes often demand numerous iterations and intricate learning-rate schedules to cope with scale and curvature of loss functions. Here we introduce Expectation Reflection (ER), a multiplicative learning paradigm that updates parameters based on the ratio of observed to predicted outputs, rather than their differences. ER eliminates the need for ad hoc loss functions or learning-rate tuning while maintaining internal consistency. Extending ER to multilayer networks, we demonstrate its efficacy in image classification, achieving optimal weight determination in a single iteration. We further show that ER can be interpreted as a modified gradient descent incorporating an inverse target-propagation mapping. Together, these results position ER as a fast and scalable alternative to conventional optimization methods for neural-network training.

None
Bounding Box Anomaly Scoring for simple and efficient Out-of-Distribution detection 2026-03-24
Show

Out-of-distribution (OOD) detection aims to identify inputs that differ from the training distribution in order to reduce unreliable predictions by deep neural networks. Among post-hoc feature-space approaches, OOD detection is commonly performed by approximating the in-distribution support in the representation space of a pretrained network. Existing methods often reflect a trade-off between compact parametric models, such as Mahalanobis-based scores, and more flexible but reference-based methods, such as k-nearest neighbors. Bounding-box abstraction provides an attractive intermediate perspective by representing in-distribution support through compact axis-aligned summaries of hidden activations. In this paper, we introduce Bounding Box Anomaly Scoring (BBAS), a post-hoc OOD detection method that leverages bounding-box abstraction. BBAS combines graded anomaly scores based on interval exceedances, monitoring variables adapted to convolutional layers, and decoupled clustering and box construction for richer and multi-layer representations. Experiments on image-classification benchmarks show that BBAS provides robust separation between in-distribution and out-of-distribution samples while preserving the simplicity, compactness, and updateability of the bounding-box approach.

45 pages, 4 figures, 17 tables

None
Arc Gradient Descent: A Geometrically Motivated Gradient Descent-based Optimiser with Phase-Aware, User-Controlled Step Dynamics (proof-of-concept) 2026-03-23
Show

The paper presents the formulation, implementation, and evaluation of the ArcGD optimiser. The evaluation is conducted initially on a non-convex benchmark function and subsequently on a real-world ML dataset. The initial comparative study using the Adam optimiser is conducted on a stochastic variant of the highly non-convex and notoriously challenging Rosenbrock function, renowned for its narrow, curved valley, across dimensions ranging from 2D to 1000D and an extreme case of 50,000D. Two configurations were evaluated to eliminate learning-rate bias: (i) both using ArcGD's effective learning rate and (ii) both using Adam's default learning rate. ArcGD consistently outperformed Adam under the first setting and, although slower under the second, achieved superior final solutions in most cases. In the second evaluation, ArcGD is evaluated against state-of-the-art optimizers (Adam, AdamW, Lion, SGD) on the CIFAR-10 image classification dataset across 8 diverse MLP architectures ranging from 1 to 5 hidden layers. ArcGD achieved the highest average test accuracy (50.7%) at 20,000 iterations, outperforming AdamW (46.6%), Adam (46.8%), SGD (49.6%), and Lion (43.4%), winning or tying on 6 of 8 architectures. Notably, Adam and AdamW showed strong early convergence at 5,000 iterations but regressed with extended training, whereas ArcGD continued improving, demonstrating generalization and resistance to overfitting without requiring early stopping tuning. Strong performance on geometric stress tests and standard deep-learning benchmarks indicates broad applicability, highlighting the need for further exploration. Moreover, both a limiting variant of ArcGD and a momentum-augmented ArcGD are shown to recover sign-based momentum updates, revealing a clear conceptual link between ArcGD's phase structure and the core mechanism of the Lion Optimiser.

90 pages, 6 appendices, proof-of-concept

None
Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity 2026-03-23
Show

Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses SOTA hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.

Code Link
SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection 2026-03-23
Show

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.

This paper was submitted to CVPR 2026. A revised version will be updated soon

Code Link
Interpretable Deep Learning Framework for Improved Disease Classification in Medical Imaging 2026-03-23
Show

Deep learning models have gained increasing adoption in medical image analysis. However, these models often produce overconfident predictions, which can compromise clinical accuracy and reliability. Bridging the gap between high performance and awareness of uncertainty remains a crucial challenge in biomedical imaging applications. This study focuses on developing a unified deep learning framework for enhancing feature integration, interpretability, and reliability in prediction. We introduce a cross-guided channel spatial attention architecture that fuses feature representations extracted from EfficientNetB4 and ResNet34. A bidirectional attention approach enables the exchange of information across networks with differing receptive fields, enhancing discriminative and contextual feature learning. For quantitative predictive uncertainty assessment, Monte Carlo (MC)-Dropout is integrated with conformal prediction. This provides statistically valid prediction sets with entropy-based uncertainty visualization. The framework is evaluated on four medical imaging benchmark datasets: chest X-rays of COVID-19, Tuberculosis, Pneumonia, and retinal Optical Coherence Tomography (OCT) images. The proposed framework achieved strong classification performance with an AUC of 99.75% for COVID-19, 100% for Tuberculosis, 99.3% for Pneumonia chest X-rays, and 98.69% for retinal OCT images. Uncertainty-aware inference yields calibrated prediction sets with interpretable examples of uncertainty, showing transparency. The results demonstrate that bidirectional cross-attention with uncertainty quantification can improve performance and transparency in medical image classification.

18 pages, 8 figures, 5 tables

None
Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification 2026-03-23
Show

Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter-efficient prompt tuning method by scaling and shifting features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.

Accepted for publication at CVPR 2026 Workshop on Medical Reasoning with Vision Language Foundation Models (Med-Reasoner)

Code Link
Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization 2026-03-23
Show

Single Domain Generalization (SDG) aims to train models that maintain consistent performance across diverse scenarios using data from a single source. While latent diffusion models (LDMs) show promise for augmenting limited source data, our analysis reveals that directly employing synthetic data may not only fail to provide benefits but can actually compromise performance due to substantial feature distribution discrepancies between synthetic and real target domains. To address this issue, we propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo-target domain samples and introduce two key modules to handle distribution bias. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy-guided attention to recalibrate channel-level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi-pseudo-domain Soft Fusion (MDSF) module uses adversarial training with latent-space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on image classification, object detection, and semantic segmentation demonstrate that DRSF delivers substantial performance gains with only marginal computational overhead. Notably, DRSF's plug-and-play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability to diverse, real-world domain challenges.

26 pages, 10 figures (Accepted by IJCV)

None
Mitigating Objectness Bias and Region-to-Text Misalignment for Open-Vocabulary Panoptic Segmentation 2026-03-22
Show

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT

Code Link
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning 2026-03-22
Show

The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis: (1) All Patches Matter: Unlike conventional image classification where discriminative features concentrate on object-centric regions, each patch in AIGIs inherently contains synthetic artifacts due to the uniform generation process, suggesting that every patch serves as an important artifact source for detection. (2) More Patches Better: Leveraging distributed artifacts across more patches improves detection robustness by capturing complementary forensic evidence and reducing over-reliance on specific patches, thereby enhancing robustness and generalization. However, our counterfactual analysis reveals an undesirable phenomenon: naively trained detectors often exhibit a Few-Patch Bias, discriminating between real and synthetic images based on minority patches. We identify Lazy Learner as the root cause: detectors preferentially learn conspicuous artifacts in limited patches while neglecting broader artifact distributions. To address this bias, we propose the Panoptic Patch Learning (PPL) framework, involving: (1) Random Patch Replacement that randomly substitutes synthetic patches with real counterparts to compel models to identify artifacts in underutilized regions, encouraging the broader use of more patches; (2) Patch-wise Contrastive Learning that enforces consistent discriminative capability across all patches, ensuring uniform utilization of all patches. Extensive experiments across two different settings on several benchmarks verify the effectiveness of our approach.

None
Natural Gradient Descent for Online Continual Learning 2026-03-21
Show

Online Continual Learning (OCL) for image classification represents a challenging subset of Continual Learning, focusing on classifying images from a stream without assuming that the data are independent and identically distributed (i.i.d.). The primary challenge in this context is to prevent catastrophic forgetting, where the model's performance on previous tasks deteriorates as it learns new ones. Although various strategies have been proposed to address this issue, achieving rapid convergence remains a significant challenge in the online setting. In this work, we introduce a novel approach to training OCL models that utilizes the Natural Gradient Descent optimizer, incorporating an approximation of the Fisher Information Matrix (FIM) through Kronecker Factored Approximate Curvature (KFAC). This method demonstrates substantial improvements in performance across all OCL methods, particularly when combined with existing OCL tricks, on datasets such as Split CIFAR-100, CORE50, and Split miniImageNet.

13 pages, 2 figures

None
Restoring Neural Network Plasticity for Faster Transfer Learning 2026-03-21
Show

Transfer learning with models pretrained on ImageNet has become a standard practice in computer vision. Transfer learning refers to fine-tuning pretrained weights of a neural network on a downstream task, typically unrelated to ImageNet. However, pretrained weights can become saturated and may yield insignificant gradients, failing to adapt to the downstream task. This hinders the ability of the model to train effectively, and is commonly referred to as loss of neural plasticity. Loss of plasticity may prevent the model from fully adapting to the target domain, especially when the downstream dataset is atypical in nature. While this issue has been widely explored in continual learning, it remains relatively understudied in the context of transfer learning. In this work, we propose the use of a targeted weight re-initialization strategy to restore neural plasticity prior to fine-tuning. Our experiments show that both convolutional neural networks (CNNs) and vision transformers (ViTs) benefit from this approach, yielding higher test accuracy with faster convergence on several image classification benchmarks. Our method introduces negligible computational overhead and is compatible with common transfer learning pipelines.

11 pages, 1 figure, 6 tables and 2 formulas

None
Less is More in Semantic Space: Intrinsic Decoupling via Clifford-M for Fundus Image Classification 2026-03-21
Show

Multi-label fundus diagnosis requires features that capture both fine-grained lesions and large-scale retinal structure. Many multi-scale medical vision models address this challenge through explicit frequency decomposition, but our ablation studies show that such heuristics provide limited benefit in this setting: replacing the proposed simple dual-resolution stem with Octave Convolution increased parameters by 35% and computation 2.23-fold without improving mean accuracy, while a fixed wavelet-based variant performed substantially worse. Motivated by these findings, we propose Clifford-M, a lightweight backbone that replaces both feed-forward expansion and frequency-splitting modules with sparse geometric interaction. The model is built on a Clifford-style rolling product that jointly captures alignment and structural variation with linear complexity, enabling efficient cross-scale fusion and self-refinement in a compact dual-resolution architecture. Without pre-training, Clifford-M achieves a mean AUC-ROC of 0.8142 and a mean macro-F1 (optimal threshold) of 0.5481 on ODIR-5K using only 0.85M parameters, outperforming substantially larger mid-scale CNN baselines under the same training protocol. When evaluated on RFMiD without fine-tuning, it attains 0.7425 +/- 0.0198 macro AUC and 0.7610 +/- 0.0344 micro AUC, indicating reasonable robustness to cross-dataset shift. These results suggest that competitive and efficient fundus diagnosis can be achieved without explicit frequency engineering, provided that the core feature interaction is designed to capture multi-scale structure directly.

29 pages, 3 figures, 8 tables

None
Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision 2026-03-21
Show

A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.

This manuscript has been withdrawn by the authors following further review. The authors concluded that the current version requires substantial revision to ensure the robustness and reproducibility of the reported results. To avoid possible misunderstanding or overinterpretation, the authors have chosen to withdraw the present version

None
Modeling and benchmarking quantum optical neurons for efficient neural computation 2026-03-20
Show

Quantum optical neurons (QONs) are emerging as promising computational units that leverage photonic interference to perform neural operations in an energy-efficient and physically grounded manner. Building on recent theoretical proposals, we introduce a family of QON architectures based on Hong-Ou-Mandel (HOM) and Mach-Zehnder (MZ) interferometers, incorporating different photon modulation strategies -- phase, amplitude, and intensity. These physical setups yield distinct pre-activation functions, which we implement as fully differentiable software modules. We evaluate these QONs both in isolation and as building blocks of multilayer networks, training them on binary and multiclass image classification tasks using the MNIST and FashionMNIST datasets. Each experiment is repeated over five independent runs and assessed under both ideal and non-ideal conditions to measure accuracy, convergence, and robustness. Across settings, MZ-based neurons exhibit consistently stable behavior -- including under noise -- while HOM amplitude modulation performs competitively in deeper architectures, in several cases approaching classical performance. In contrast, phase- and intensity-modulated HOM-based variants show reduced stability and greater sensitivity to perturbations. These results highlight the potential of QONs as efficient and scalable components for future quantum-inspired neural architectures and hybrid photonic-electronic systems. The code is publicly available at https://github.com/gvessio/quantum-optical-neurons.

Code Link
Mixture of Experts with Soft Nearest Neighbor Loss: Resolving Expert Collapse via Representation Disentanglement 2026-03-20
Show

The Mixture-of-Experts (MoE) model uses a set of expert networks that specialize on subsets of a dataset under the supervision of a gating network. A common issue in MoE architectures is "expert collapse", where overlapping class boundaries in the raw input feature space cause multiple experts to learn redundant representations, thus forcing the gating network into rigid routing to compensate. We propose an enhanced MoE architecture that utilizes a feature extractor network optimized using Soft Nearest Neighbor Loss (SNNL) prior to feeding input features to the gating and expert networks. By pre-conditioning the latent space to minimize distances among class-similar data points, we resolve structural expert collapse, which results in experts learning highly orthogonal weights. We employ Expert Specialization Entropy and Pairwise Embedding Similarity to quantify this dynamic. We evaluate our experimental approach across four benchmark image classification datasets (MNIST, FashionMNIST, CIFAR10, and CIFAR100), and we show that our SNNL-augmented MoE models demonstrate structurally diverse experts, which allow the gating network to adopt a more flexible routing strategy. This paradigm significantly improves classification accuracy on the FashionMNIST, CIFAR10, and CIFAR100 datasets.

7 pages, 7 figures, accepted for oral presentation at the Philippine Computing Science Congress 2026

None
Jigsaw Regularization in Whole-Slide Image Classification 2026-03-20
Show

Computational pathology involves the digitization of stained tissues into whole-slide images (WSIs) that contain billions of pixels arranged as contiguous patches. Statistical analysis of WSIs largely focuses on classification via multiple instance learning (MIL), in which slide-level labels are inferred from unlabeled patches. Most MIL methods treat patches as exchangeable, overlooking the rich spatial and topological structure that underlies tissue images. This work builds on recent graph-based methods that aim to incorporate spatial awareness into MIL. Our approach is new in two regards: (1) we deploy vision *foundation-model embeddings* to incorporate local spatial structure within each patch, and (2) we achieve across-patch spatial awareness using graph neural networks together with a novel *jigsaw regularization*. We find that a combination of these two features markedly improves classification over state-of-the-art attention-based MIL approaches on benchmark datasets in breast, head-and-neck, and colon cancer.

None
TinyML Enhances CubeSat Mission Capabilities 2026-03-20
Show

Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Networks (ConvNets) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.

Accepted at the 17th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS) 2026

None
MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models 2026-03-20
Show

State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, in addition to architectural enhancements, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.

Code Link
Growing Networks with Autonomous Pruning 2026-03-20
Show

This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they are using, during training, in order to best fit the data while trying to use as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.

None
VeloxNet: Efficient Spatial Gating for Lightweight Embedded Image Classification 2026-03-19
Show

Deploying deep learning models on embedded devices for tasks such as aerial disaster monitoring and infrastructure inspection requires architectures that balance accuracy with strict constraints on model size, memory, and latency. This paper introduces VeloxNet, a lightweight CNN architecture that replaces SqueezeNet's fire modules with gated multi-layer perceptron (gMLP) blocks for embedded image classification. Each gMLP block uses a spatial gating unit (SGU) that applies learned spatial projections and multiplicative gating, enabling the network to capture spatial dependencies across the full feature map in a single layer. Unlike fire modules, which are limited to local receptive fields defined by small convolutional kernels, the SGU provides global spatial modeling at each layer with fewer parameters. We evaluate VeloxNet on three aerial image datasets: the Aerial Image Database for Emergency Response (AIDER), the Comprehensive Disaster Dataset (CDD), and the Levee Defect Dataset (LDD), comparing against eleven baselines including MobileNet variants, ShuffleNet, EfficientNet, and recent vision transformers. VeloxNet reduces the parameter count by 46.1% relative to SqueezeNet (from 740,970 to 399,366) while improving weighted F1 scores by 6.32% on AIDER, 30.83% on CDD, and 2.51% on LDD. These results demonstrate that substituting local convolutional modules with spatial gating blocks can improve both classification accuracy and parameter efficiency for resource-constrained deployment. The source code will be made publicly available upon acceptance of the paper.

This work has been submitted to the IEEE for possible publication

None
Revisiting Autoregressive Models for Generative Image Classification 2026-03-19
Show

Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
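
The order-marginalization idea in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's estimator: it assumes a uniform class prior, a simple Monte Carlo average of class-conditional likelihoods over a few sampled token orders, and made-up log-likelihood values. The function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def order_marginalized_scores(log_liks):
    """Average the likelihood over token orders, per class.

    log_liks[c][k] is assumed to hold log p(x | class c, token order k).
    Returns log of the mean likelihood over the sampled orders.
    """
    return {
        label: logsumexp(per_order) - math.log(len(per_order))
        for label, per_order in log_liks.items()
    }


def classify(log_liks):
    """Pick the class with the highest order-marginalized score (uniform prior assumed)."""
    scores = order_marginalized_scores(log_liks)
    return max(scores, key=scores.get)


# Made-up values: "dog" looks best under the first order only.
toy = {
    "cat": [-10.0, -10.0, -10.0],
    "dog": [-9.5, -14.0, -13.5],
}
single_order_pred = max(toy, key=lambda c: toy[c][0])  # uses order 0 only
marginal_pred = classify(toy)                          # averages over all orders
```

In this toy case the single-order prediction and the order-averaged prediction disagree, which is the kind of effect the abstract attributes to relying on a fixed token order.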

Tech report None
Page image classification for content-specific data processing 2026-03-19
Show

Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics).

69 pages, 68 figures, 30 tables

None
HSI Image Enhancement Classification Based on Knowledge Distillation: A Study on Forgetting 2026-03-19
Show

In incremental classification tasks for hyperspectral images, catastrophic forgetting is an unavoidable challenge. While memory recall methods can mitigate this issue, they heavily rely on samples from old categories. This paper proposes a teacher-based knowledge retention method for incremental image classification. It alleviates model forgetting of old category samples by utilizing incremental category samples, without depending on old category samples. Additionally, this paper introduces a mask-based partial category knowledge distillation algorithm. By decoupling knowledge distillation, this approach filters out potentially misleading information that could misguide the student model, thereby enhancing overall accuracy. Comparative and ablation experiments demonstrate the proposed method's robust performance.

18 pages, 7 figures None
Procedural Generation of Algorithm Discovery Tasks in Machine Learning 2026-03-18
Show

Automating the development of machine learning algorithms has the potential to unlock new breakthroughs. However, our ability to improve and evaluate algorithm discovery systems has thus far been limited by existing task suites. They suffer from many issues, such as: poor evaluation methodologies; data contamination; and containing saturated or very similar problems. Here, we introduce DiscoGen, a procedural generator of algorithm discovery tasks for machine learning, such as developing optimisers for reinforcement learning or loss functions for image classification. Motivated by the success of procedural generation in reinforcement learning, DiscoGen spans millions of tasks of varying difficulty and complexity from a range of machine learning fields. These tasks are specified by a small number of configuration parameters and can be used to optimise algorithm discovery agents (ADAs). We present DiscoBench, a benchmark consisting of a fixed, small subset of DiscoGen tasks for principled evaluation of ADAs. Finally, we propose a number of ambitious, impactful research directions enabled by DiscoGen, in addition to experiments demonstrating its use for prompt optimisation of an ADA. DiscoGen is released open-source at https://github.com/AlexGoldie/discogen.

Code Link
Exploring parameter-efficient fine-tuning (PEFT) of billion-parameter vision models with QLoRA and DoRA: insights into generalization for limited-data image classification under a 98:1 test-to-train regime 2026-03-18
Show

Automated behavior classification is essential for precision livestock farming but faces challenges of high computational costs and limited labeled data. This study systematically compared three approaches: training from scratch (ResNet-18, ViT-Small), frozen feature extraction, and parameter-efficient fine-tuning (PEFT) of the DINOv3 foundation model (6.7 billion parameters). We evaluated QLoRA and DoRA across multiple configurations varying rank (8, 16, 64) and target modules (q_proj versus all-linear layers). With 2,160 verified training images, we assessed generalization of our model on 211,800 test samples, essentially a 98:1 test-to-train ratio. Results demonstrated that PEFT substantially outperformed alternatives: the best QLoRA configuration (all-linear layers and rank=64) achieved 83.16% test accuracy with only 2.72% of parameters (183.0M) in 5.8 hours, compared to 72.87% for ResNet-18 (16.8 hours), 61.91% for ViT-Small (18.7 hours), and 76.56% for frozen DINOv3 (17.5 hours). DoRA achieved comparable accuracy (83.14%) but with longer training time (11.0 hours). Notably, increasing adapter capacity consistently improved generalization without causing overfitting: reducing rank from 16 to 8 decreased test accuracy from 78.38% to 77.17%, while expanding from q_proj-only to all-linear layers with rank=64 improved accuracy from 78.38% to 83.16%. This suggests that underfitting, rather than overfitting, is the primary challenge when adapting foundation models to agricultural imagery. Our findings provide guidelines for deploying billion-parameter vision models with PEFT in agricultural livestock applications.
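
The headline ratios in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the 6.7 billion total is the rounded value from the abstract, so the computed fraction may differ slightly from the reported 2.72%.

```python
# Figures quoted in the abstract.
train_images = 2_160
test_images = 211_800

# Test-to-train ratio: roughly 98 test samples per training image.
test_to_train = test_images / train_images

# Share of parameters reported for the best QLoRA configuration.
adapter_params = 183.0e6   # 183.0M
total_params = 6.7e9       # rounded total from the abstract
adapter_fraction_pct = 100 * adapter_params / total_params

print(f"test:train ratio  = {test_to_train:.2f}:1")
print(f"adapter share (%) = {adapter_fraction_pct:.2f}")
```

The small gap between the computed share and the reported 2.72% is consistent with the total parameter count having been rounded to 6.7 billion.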

None
Impact of Data Duplication on Deep Neural Network-Based Image Classifiers: Robust vs. Standard Models 2026-03-18
Show

The accuracy and robustness of machine learning models against adversarial attacks are significantly influenced by factors such as training data quality, model architecture, the training process, and the deployment environment. In recent years, duplicated data in training sets, especially in language models, has attracted considerable attention. It has been shown that deduplication enhances both training performance and model accuracy in language models. While the importance of data quality in training image classifier Deep Neural Networks (DNNs) is widely recognized, the impact of duplicated images in the training set on model generalization and performance has received little attention. In this paper, we address this gap and provide a comprehensive study on the effect of duplicates in image classification. Our analysis indicates that the presence of duplicated images in the training set not only negatively affects the efficiency of model training but also may result in lower accuracy of the image classifier. This negative impact of duplication on accuracy is particularly evident when duplicated data is non-uniform across classes or when duplication, whether uniform or non-uniform, occurs in the training set of an adversarially trained model. Even when duplicated samples are selected in a uniform way, increasing the amount of duplication does not lead to a significant improvement in accuracy.

None
rSDNet: Unified Robust Neural Learning against Label Noise and Adversarial Attacks 2026-03-18
Show

Neural networks are central to modern artificial intelligence, yet their training remains highly sensitive to data contamination. Standard neural classifiers are trained by minimizing the categorical cross-entropy loss, corresponding to maximum likelihood estimation under a multinomial model. While statistically efficient under ideal conditions, this approach is highly vulnerable to contaminated observations, including label noise corrupting supervision in the output space and adversarial perturbations inducing worst-case deviations in the input space. In this paper, we propose a unified and statistically grounded framework for robust neural classification that addresses both forms of contamination within a single learning objective. We formulate neural network training as a minimum-divergence estimation problem and introduce rSDNet, a robust learning algorithm based on the general class of $S$-divergences. The resulting training objective inherits robustness properties from classical statistical estimation, automatically down-weighting aberrant observations through model probabilities. We establish essential population-level properties of rSDNet, including Fisher consistency, classification calibration implying Bayes optimality, and robustness guarantees under uniform label noise and infinitesimal feature contamination. Experiments on three benchmark image classification datasets show that rSDNet improves robustness to label corruption and adversarial attacks while maintaining competitive accuracy on clean data. Our results highlight minimum-divergence learning as a principled and effective framework for robust neural classification under heterogeneous data contamination.

Pre-print; under review

None
Trust the Unreliability: Inward Backward Dynamic Unreliability Driven Coreset Selection for Medical Image Classification 2026-03-18
Show

Efficiently managing and utilizing large-scale medical imaging datasets with limited resources presents significant challenges. While coreset selection helps reduce computational costs, its effectiveness in medical data remains limited due to inherent complexity, such as large intra-class variation and high inter-class similarity. To address this, we revisit the training process and observe that neural networks consistently produce stable confidence predictions and better remember samples near class centers in training. However, concentrating on these samples may complicate the modeling of decision boundaries. Hence, we argue that the more unreliable samples are, in fact, the more informative in helping build the decision boundary. Based on this, we propose the Dynamic Unreliability-Driven Coreset Selection (DUCS) strategy. Specifically, we introduce an inward-backward unreliability assessment perspective: 1) Inward Self-Awareness: The model introspects its behavior by analyzing the evolution of confidence during training, thereby quantifying the uncertainty of each sample. 2) Backward Memory Tracking: The model reflects on its training by tracking how often each sample is forgotten, thus evaluating its retention ability for each sample. Next, we select unreliable samples that exhibit substantial confidence fluctuations and are repeatedly forgotten during training. This selection process ensures that the chosen samples are near the decision boundary, thereby aiding the model in refining the boundary. Extensive experiments on public medical datasets demonstrate our superior performance compared to state-of-the-art (SOTA) methods, particularly at high compression rates.
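
The two signals described in this abstract (confidence fluctuation and forgetting frequency) can be sketched on toy training logs. This is an illustration only, not the DUCS implementation: the variance-based fluctuation measure, the min-max normalization, the equal-weight sum, and all names and values are assumptions made for the example.

```python
from statistics import pvariance


def confidence_fluctuation(confidences):
    """Assumed proxy for 'inward' unreliability: variance of per-epoch confidence."""
    return pvariance(confidences)


def forgetting_count(correct_flags):
    """'Backward' signal: number of correct -> incorrect transitions across epochs."""
    return sum(
        1 for prev, cur in zip(correct_flags, correct_flags[1:]) if prev and not cur
    )


def min_max(values):
    """Scale values to [0, 1]; all zeros if every value is identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_coreset(history, budget):
    """Rank samples by a hypothetical equal-weight unreliability score, keep the top `budget`."""
    ids = list(history)
    fluct = min_max([confidence_fluctuation(history[i]["conf"]) for i in ids])
    forget = min_max([forgetting_count(history[i]["correct"]) for i in ids])
    scores = {i: f + g for i, f, g in zip(ids, fluct, forget)}
    return sorted(ids, key=lambda i: scores[i], reverse=True)[:budget]


# Made-up per-epoch logs for three samples.
history = {
    "easy":  {"conf": [0.90, 0.95, 0.97, 0.98], "correct": [True, True, True, True]},
    "noisy": {"conf": [0.40, 0.80, 0.30, 0.70], "correct": [False, True, False, True]},
    "hard":  {"conf": [0.20, 0.60, 0.50, 0.90], "correct": [False, True, True, True]},
}
selected = select_coreset(history, budget=2)
```

With these toy logs the stable, always-correct sample is dropped and the two samples with fluctuating confidence are kept, matching the selection principle stated in the abstract.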

None
Harnessing the Power of Foundation Models for Accurate Material Classification 2026-03-18
Show

Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties to a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, limiting the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still exhibit unsatisfying results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.

None
Hybrid Classical-Quantum Transfer Learning with Noisy Quantum Circuits 2026-03-17
Show

Quantum transfer learning combines pretrained classical deep learning models with quantum circuits to reuse expressive feature representations while limiting the number of trainable parameters. In this work, we introduce a family of compact quantum transfer learning architectures that attach variational quantum classifiers to frozen convolutional backbones for image classification. We instantiate and evaluate several classical-quantum hybrid models implemented in PennyLane and Qiskit, and systematically compare them with a classical transfer-learning baseline across heterogeneous image datasets. To ensure a realistic assessment, we evaluate all approaches under both ideal simulation and noisy emulation using noise models calibrated from IBM quantum hardware specifications, as well as on real IBM quantum hardware. Experimental results show that the proposed quantum transfer learning architectures achieve competitive and, in several cases, superior accuracy while consistently reducing training time and energy consumption relative to the classical baseline. Among the evaluated approaches, PennyLane-based implementations provide the most favorable trade-off between accuracy and computational efficiency, suggesting that hybrid quantum transfer learning can offer practical benefits in realistic NISQ era settings when feature extraction remains classical.

None
CFM: Language-aligned Concept Foundation Model for Vision 2026-03-17
Show

Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making their decision-making difficult to interpret. Recent work decomposes these representations into human-interpretable concepts, but provides poor spatial grounding and is limited to image classification tasks. In this work, we propose CFM, a language-aligned concept foundation model for vision that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. When paired with a foundation model with strong semantic representations, we get explanations for any of its downstream tasks. Examining local co-occurrence dependencies of concepts allows us to define concept relationships through which we improve concept naming and obtain richer explanations. On benchmark data, we show that CFM provides performance on classification, segmentation, and captioning that is competitive with opaque foundation models while providing fine-grained, high-quality concept-based explanations. Code at https://github.com/kawi19/CFM.

53 pages, 29 figures, 4 tables

Code Link
Confusion-Aware Spectral Regularizer for Long-Tailed Recognition 2026-03-17
Show

Long-tailed image classification remains a long-standing challenge, as real-world data typically follow highly imbalanced distributions where a few head classes dominate and many tail classes contain only limited samples. This imbalance biases feature learning toward head categories and leads to significant degradation on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, the improvement on the most underrepresented classes still remains marginal compared with overall accuracy. In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization. We first establish a new theoretical framework of class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded by the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR) that minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and enhance tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator to maintain smooth and low-variance estimates across mini-batches. Extensive experiments across multiple long-tailed benchmarks demonstrate that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses existing state-of-the-art long-tailed learning methods under both the training-from-scratch setting (by 2.37% ~ 4.83%) and the fine-tuning-from-pretrained setting (by 2.42% ~ 4.17%) across ImageNet-LT, CIFAR100-LT, and iNaturalist datasets.
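
The quantity at the center of this abstract, the spectral norm of a confusion matrix, can be computed with plain power iteration. The sketch below only evaluates that norm on two made-up off-diagonal confusion matrices; it does not reproduce CAR's differentiable surrogate, EMA estimator, or frequency weighting, and the helper names are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def spectral_norm(matrix, iterations=200):
    """Largest singular value via power iteration on M^T M.

    Assumes the all-ones start vector is not orthogonal to the top
    right-singular vector, which holds for the examples below.
    """
    mt = transpose(matrix)
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = matvec(mt, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    mv = matvec(matrix, v)
    return math.sqrt(sum(x * x for x in mv))


# Made-up misclassification rates (rows: true class, columns: predicted class,
# diagonal zeroed so that only confusion between classes is measured).
mild_confusion = [[0.0, 0.1],
                  [0.1, 0.0]]
tail_confusion = [[0.0, 0.05],   # head class is rarely confused
                  [0.6, 0.0]]    # tail class is often predicted as the head class

mild_norm = spectral_norm(mild_confusion)
tail_norm = spectral_norm(tail_confusion)
```

In this toy setting the matrix with a heavily confused tail class has the larger spectral norm, which is the intuition behind penalizing that norm to target worst-class error.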

None
vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots 2026-03-17
Show

Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.

None
MASS: MoErging through Adaptive Subspace Selection 2026-03-17
Show

Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet, existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a ~2x storage factor compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively, establishing a new state-of-the-art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.

None
Dynamic Memory Transformer for Hyperspectral Image Classification 2026-03-17
Show

Hyperspectral image (HSI) classification (HSIC) requires effective modeling of complex spatial-spectral dependencies under limited labeled data and high dimensionality. While transformer-based models have shown strong capability in capturing long-range contextual information, they often introduce redundant attention patterns, which limits their effectiveness for fine-grained HSI analysis. To address these challenges, this paper proposes MemFormer, a lightweight transformer architecture for HSIC that incorporates a dynamic memory-enhanced attention mechanism. The proposed design augments multi-head self-attention with a compact global memory module that progressively aggregates contextual information across layers, enabling efficient modeling of long-range dependencies while reducing attention redundancy. In addition, a Spatial-Spectral Positional Embedding (SSPE) is used to jointly encode spatial continuity and spectral ordering, providing structurally consistent representations without relying on convolution-based positional encodings. Extensive experiments conducted on three benchmark hyperspectral datasets, including Indian Pines, WHU-Hi-HanChuan, and WHU-Hi-HongHu, demonstrate that MemFormer achieves superior classification performance compared to representative convolutional, hybrid, and transformer-based methods. On the Indian Pines dataset, MemFormer attains an overall accuracy of up to 99.55%, average accuracy of 99.38%, and a $κ$ coefficient of 99.49%, highlighting its effectiveness and efficiency for HSIC.

None
3D Fourier-based Global Feature Extraction for Hyperspectral Image Classification 2026-03-17
Show

Hyperspectral image classification (HSIC) has been significantly advanced by deep learning methods that exploit rich spatial-spectral correlations. However, existing approaches still face fundamental limitations: transformer-based models suffer from poor scalability due to the quadratic complexity of self-attention, while recent Fourier transform-based methods typically rely on 2D spatial FFTs and largely ignore critical inter-band spectral dependencies inherent to hyperspectral data. To address these challenges, we propose Hybrid GFNet (HGFNet), a novel architecture that integrates localized 3D convolutional feature extraction with frequency-domain global filtering via GFNet-style blocks for efficient and robust spatial-spectral representation learning. HGFNet introduces three complementary frequency transforms tailored to hyperspectral imagery: Spectral Fourier Transform (a 1D FFT along the spectral axis), Spatial Fourier Transform (a 2D FFT over spatial dimensions), and Spatial-Spectral Fourier Transform (a 3D FFT jointly over spectral and spatial dimensions), enabling comprehensive and high-dimensional frequency modeling. The 3D convolutional layers capture fine-grained local spatial-spectral structures, while the Fourier-based global filtering modules efficiently model long-range dependencies and suppress noise. To further mitigate the severe class imbalance commonly observed in HSIC, HGFNet incorporates an Adaptive Focal Loss (AFL) that dynamically adjusts class-wise focusing and weighting, improving discrimination for underrepresented classes.

None
SF-Mamba: Rethinking State Space Model for Vision 2026-03-17
Show

Mamba-based vision models have advanced in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under a unidirectional scan, and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

21 pages None
Poisoning the Pixels: Revisiting Backdoor Attacks on Semantic Segmentation 2026-03-17
Show

Semantic segmentation models are widely deployed in safety-critical applications such as autonomous driving, yet their vulnerability to backdoor attacks remains largely underexplored. Prior segmentation backdoor studies transfer threat settings from existing image classification tasks, focusing primarily on object-to-background mis-segmentation. In this work, we revisit the threats by systematically examining backdoor attacks tailored to semantic segmentation. We identify four coarse-grained attack vectors (Object-to-Object, Object-to-Background, Background-to-Object, and Background-to-Background attacks), as well as two fine-grained vectors (Instance-Level and Conditional attacks). To formalize these attacks, we introduce BADSEG, a unified framework that optimizes trigger designs and applies label manipulation strategies to maximize attack performance while preserving victim model utility. Extensive experiments across diverse segmentation architectures on benchmark datasets demonstrate that BADSEG achieves high attack effectiveness with minimal impact on clean samples. We further evaluate six representative defenses and find that they fail to reliably mitigate our attacks, revealing critical gaps in current defenses. Finally, we demonstrate that these vulnerabilities persist in recent emerging architectures, including transformer-based networks and the Segment Anything Model (SAM), thereby compromising their security. Our work reveals previously overlooked security vulnerabilities in semantic segmentation, and motivates the development of defenses tailored to segmentation-specific threat models.

None
DermaFlux: Synthetic Skin Lesion Generation with Rectified Flows for Enhanced Image Classification 2026-03-17
Show

Despite recent advances in deep generative modeling, skin lesion classification systems remain constrained by the limited availability of large, diverse, and well-annotated clinical datasets, resulting in class imbalance between benign and malignant lesions and consequently reduced generalization performance. We introduce DermaFlux, a rectified flow-based text-to-image generative framework that synthesizes clinically grounded skin lesion images from natural language descriptions of dermatological attributes. Built upon Flux.1, DermaFlux is fine-tuned using parameter-efficient Low-Rank Adaptation (LoRA) on a large curated collection of publicly available clinical image datasets. We construct image-text pairs using synthetic textual captions generated by Llama 3.2, following established dermatological criteria including lesion asymmetry, border irregularity, and color variation. Extensive experiments demonstrate that DermaFlux generates diverse and clinically meaningful dermatology images that improve binary classification performance by up to 6% when augmenting small real-world datasets, and by up to 9% when classifiers are trained on DermaFlux-generated synthetic images rather than diffusion-based synthetic images. Our ImageNet-pretrained ViT fine-tuned with only 2,500 real images and 4,375 DermaFlux-generated samples achieves 78.04% binary classification accuracy and an AUC of 0.859, surpassing the next best dermatology model by 8%.

None
AI Application Benchmarking: Power-Aware Performance Analysis for Vision and Language Models 2026-03-17
Show

Artificial Intelligence (AI) workloads drive a rapid expansion of high-performance computing (HPC) infrastructures and increase their power and energy demands towards a critical level. AI benchmarks representing state-of-the-art workloads, and an understanding of them in the context of performance-energy trade-offs, are critical to deploying efficient infrastructures and can guide energy efficiency measures, such as power capping. We introduce a benchmarking framework with popular deep learning applications from computer vision (image classification and generation) and large language models (continued pre-training and inference) implementing modern methods. Our performance analysis focuses on throughput rather than time to "completion", which is the standard metric in HPC. We analyse performance and energy efficiency under various power capping scenarios on NVIDIA H100, NVIDIA H200, and AMD MI300X GPUs. Our results reveal that no universal optimal power cap exists, as the efficiency peak varies across application types and GPU architectures. Interestingly, the two NVIDIA GPUs, which mainly differ in their HBM configuration, show qualitatively different performance-energy trade-offs. The developed benchmarking framework will be released as a public tool.

None
Deformation-Invariant Neural Network and Its Applications in Distorted Image Restoration and Analysis 2026-03-17
Show

Images degraded by geometric distortions pose a significant challenge to imaging and computer vision tasks such as object recognition. Deep learning-based imaging models usually fail to give accurate performance for geometrically distorted images. In this paper, we propose the deformation-invariant neural network (DINN), a framework to address the problem of imaging tasks for geometrically distorted images. The DINN outputs consistent latent features for images that are geometrically distorted but represent the same underlying object or scene. The idea of DINN is to incorporate a simple component, called the quasiconformal transformer network (QCTN), into other existing deep networks for imaging tasks. The QCTN is a deep neural network that outputs a quasiconformal map, which can be used to transform a geometrically distorted image into an improved version that is closer to the distribution of natural or good images. It first outputs a Beltrami coefficient, which measures the quasiconformality of the output deformation map. By controlling the Beltrami coefficient, the local geometric distortion under the quasiconformal mapping can be controlled. The QCTN is lightweight and simple, which can be readily integrated into other existing deep neural networks to enhance their performance. Leveraging our framework, we have developed an image classification network that achieves accurate classification of distorted images. Our proposed framework has been applied to restore geometrically distorted images by atmospheric turbulence and water turbulence. DINN outperforms existing GAN-based restoration methods under these scenarios, demonstrating the effectiveness of the proposed framework. Additionally, we apply our proposed framework to the 1-1 verification of human face images under atmospheric turbulence and achieve satisfactory performance, further demonstrating the efficacy of our approach.

None
Vision-Language Model Based Multi-Expert Fusion for CT Image Classification 2026-03-16
Show

Robust detection of COVID-19 from chest CT remains challenging in multi-institutional settings due to substantial source shift, source imbalance, and hidden test-source identities. In this work, we propose a three-stage source-aware multi-expert framework for multi-source COVID-19 CT classification. First, we build a lung-aware 3D expert by combining original CT volumes and lung-extracted CT volumes for volumetric classification. Second, we develop two MedSigLIP-based experts: a slice-wise representation and probability learning module, and a Transformer-based inter-slice context modeling module for capturing cross-slice dependency. Third, we train a source classifier to predict the latent source identity of each test scan. By leveraging the predicted source information, we perform model fusion and voting based on different experts. On the validation set covering all four sources, the Stage 1 model achieves the best macro-F1 of 0.9711, ACC of 0.9712, and AUC of 0.9791. Stage 2a and Stage 2b achieve the best AUC scores of 0.9864 and 0.9854, respectively. The Stage 3 source classifier reaches 0.9107 ACC and 0.9114 F1. These results demonstrate that source-aware expert modeling and hierarchical voting provide an effective solution for robust COVID-19 CT classification under heterogeneous multi-source conditions.

None
PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units 2026-03-16
Show

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that does not merely cut out smaller DNNs from a single large architecture, but instead combines the structural optimization of multiple architecture types with the optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of Hypervolume subset selection to distill DNN architectures from the Pareto front of the multi-objective optimization that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.

16 pages, 6 figures, 4 tables

None
WaRA: Wavelet Low Rank Adaptation 2026-03-16
Show

Adapting large pretrained vision models to medical image classification is often limited by memory, computation, and task-specific specializations. Parameter-efficient fine-tuning (PEFT) methods like LoRA reduce this cost by learning low-rank updates, but operating directly in feature space can struggle to capture the localized, multi-scale features common in medical imaging. We propose WaRA, a wavelet-structured adaptation module that performs low-rank adaptation in a wavelet domain. WaRA reshapes patch tokens into a spatial grid, applies a fixed discrete wavelet transform, updates subband coefficients using a shared low-rank adapter, and reconstructs the additive update through an inverse wavelet transform. This design provides a compact trainable interface while biasing the update toward both coarse structure and fine detail. For extremely low-resource settings, we introduce Tiny-WaRA, which further reduces trainable parameters by learning only a small set of coefficients in a fixed basis derived from the pretrained weights through a truncated SVD. Experiments on medical image classification across four modalities and datasets demonstrate that WaRA consistently improves performance over strong PEFT baselines, while retaining a favorable efficiency profile. Our code is publicly available at https://github.com/moeinheidari7829/WaRA.

Code Link
TopoCL: Topological Contrastive Learning for Medical Imaging 2026-03-15
Show

Contrastive learning (CL) has become a powerful approach for learning representations from unlabeled images. However, existing CL methods focus predominantly on visual appearance features while neglecting topological characteristics (e.g., connectivity patterns, boundary configurations, cavity formations) that provide valuable cues for medical image analysis. To address this limitation, we propose a new topological CL framework (TopoCL) that explicitly exploits topological structures during contrastive learning for medical imaging. Specifically, we first introduce topology-aware augmentations that control topological perturbations using a relative bottleneck distance between persistence diagrams, preserving medically relevant topological properties while enabling controlled structural variations. We then design a Hierarchical Topology Encoder that captures topological features through self-attention and cross-attention mechanisms. Finally, we develop an adaptive mixture-of-experts (MoE) module to dynamically integrate visual and topological representations. TopoCL can be seamlessly integrated with existing CL methods. We evaluate TopoCL on five representative CL methods (SimCLR, MoCo-v3, BYOL, DINO, and Barlow Twins) and five diverse medical image classification datasets. The experimental results show that TopoCL achieves consistent improvements: an average gain of +3.26% in linear probe classification accuracy with strong statistical significance, verifying its effectiveness.

None
Quantifying task-relevant representational similarity using decision variable correlation 2026-03-15
Show

Previous studies have compared neural activities in the visual cortex to representations in deep neural networks trained on image classification. Interestingly, while some suggest that their representations are highly similar, others argued the opposite. Here, we propose a new approach to characterize the similarity of the decision strategies of two observers (models or brains) using decision variable correlation (DVC). DVC quantifies the image-by-image correlation between the decoded decisions based on the internal neural representations in a classification task. Thus, it can capture task-relevant information rather than general representational alignment. We evaluate DVC using monkey V4/IT recordings and network models trained on image classification tasks. We find that model-model similarity is comparable to monkey-monkey similarity, whereas model-monkey similarity is consistently lower. Strikingly, DVC decreases with increasing network performance on ImageNet-1k. Adversarial training does not improve model-monkey similarity in task-relevant dimensions assessed using DVC, although it markedly increases the model-model similarity. Similarly, pre-training on larger datasets does not improve model-monkey similarity. These results suggest a divergence between the task-relevant representations in monkey V4/IT and those learned by models trained on image classification tasks.

Camera-ready version; accepted at NeurIPS 2025

None
A Heterogeneous Ensemble for Multi-Center COVID-19 Classification from Chest CT Scans 2026-03-15
Show

The COVID-19 pandemic exposed critical limitations in diagnostic workflows: RT-PCR tests suffer from slow turnaround times and high false-negative rates, while CT-based screening offers faster complementary diagnosis but requires expert radiological interpretation. Deploying automated CT analysis across multiple hospital centres introduces further challenges, as differences in scanner hardware, acquisition protocols, and patient populations cause substantial domain shift that degrades single-model performance. To address these challenges, we present a heterogeneous ensemble of nine models spanning three inference paradigms: (1) a self-supervised DINOv2 Vision Transformer with slice-level sigmoid aggregation, (2) a RadImageNet-pretrained DenseNet-121 with slice-level sigmoid averaging, and (3) seven Gated Attention Multiple Instance Learning models using EfficientNet-B3, ConvNeXt-Tiny, and EfficientNetV2-S backbones with scan-level softmax classification. Ensemble diversity is further enhanced through random-seed variation and Stochastic Weight Averaging. We address severe overfitting, reducing the validation-to-training loss ratio from 35x to less than 3x, through a combination of Focal Loss, embedding-level Mixup, and domain-aware augmentation. Model outputs are fused via score-weighted probability averaging and calibrated with per-source threshold optimization. The final ensemble achieves an average macro F1 of 0.9280 across four hospital centres, outperforming the best single model (F1=0.8969) by +0.031, demonstrating that heterogeneous architectures combined with source-aware calibration are essential for robust multi-site medical image classification.

None
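As a rough illustration of the fusion step described in the abstract above, the sketch below averages per-model probabilities with weights proportional to a validation score and then applies a per-source decision threshold. The function names, example scores, source names, and thresholds are all hypothetical; the paper's actual weighting and calibration procedure may differ.

```python
def fuse_probabilities(probs, scores):
    """Score-weighted average of per-model positive-class probabilities."""
    if not probs or len(probs) != len(scores):
        raise ValueError("probs and scores must be non-empty and of equal length")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must sum to a positive value")
    return sum(p * s for p, s in zip(probs, scores)) / total


def classify(probs, scores, source, thresholds, default_threshold=0.5):
    """Fuse model outputs, then apply the threshold tuned for this source."""
    fused = fuse_probabilities(probs, scores)
    threshold = thresholds.get(source, default_threshold)
    return fused, fused >= threshold


# Hypothetical inputs: two models, each weighted by its validation score.
model_probs = [0.9, 0.6]
model_scores = [0.9, 0.6]
# Hypothetical per-source thresholds (one per hospital centre).
source_thresholds = {"center_a": 0.5, "center_b": 0.8}

fused_a, label_a = classify(model_probs, model_scores, "center_a", source_thresholds)
fused_b, label_b = classify(model_probs, model_scores, "center_b", source_thresholds)
```

The same fused probability can yield different labels at different centres, which is the point of calibrating thresholds per source.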
Protecting Deep Neural Network Intellectual Property with Chaos-Based White-Box Watermarking 2026-03-15
Show

The rapid proliferation of deep neural networks (DNNs) across several domains has led to increasing concerns regarding intellectual property (IP) protection and model misuse. Trained DNNs represent valuable assets, often developed through significant investments. However, the ease with which models can be copied, redistributed, or repurposed highlights the urgent need for effective mechanisms to assert and verify model ownership. In this work, we propose an efficient and resilient white-box watermarking framework that embeds ownership information into the internal parameters of a DNN using chaotic sequences. The watermark is generated using a logistic map, a well-known chaotic function, producing a sequence that is sensitive to its initialization parameters. This sequence is injected into the weights of a chosen intermediate layer without requiring structural modifications to the model or degradation in predictive performance. To validate ownership, we introduce a verification process based on a genetic algorithm that recovers the original chaotic parameters by optimizing the similarity between the extracted and regenerated sequences. The effectiveness of the proposed approach is demonstrated through extensive experiments on image classification tasks using MNIST and CIFAR-10 datasets. The results show that the embedded watermark remains detectable after fine-tuning, with negligible loss in model accuracy. In addition to numerical recovery of the watermark, we perform visual analyses using weight density plots and construct activation-based classifiers to distinguish between original, watermarked, and tampered models. Overall, the proposed method offers a flexible and scalable solution for embedding and verifying model ownership in white-box settings well-suited for real-world scenarios where IP protection is critical.

None
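The logistic map mentioned in the abstract above is the recurrence x_{n+1} = r * x_n * (1 - x_n). The sketch below only generates such a sequence and shows its sensitivity to the initial value; the parameter values are illustrative assumptions, and how the paper embeds the sequence into layer weights is not reproduced here.

```python
def logistic_map_sequence(x0, r, n):
    """Return n iterates of the logistic map x <- r * x * (1 - x)."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    if not 0.0 < r <= 4.0:
        raise ValueError("r must lie in (0, 4] so iterates stay in [0, 1]")
    seq = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq


# Illustrative parameters (assumptions): r = 4.0 is in the chaotic regime.
seq_a = logistic_map_sequence(0.3, 4.0, 200)
# Perturb the initial value by 1e-10 to show sensitivity to initialization.
seq_b = logistic_map_sequence(0.3 + 1e-10, 4.0, 200)

# Largest pointwise gap between the two trajectories.
max_gap = max(abs(a - b) for a, b in zip(seq_a, seq_b))
```

Because a tiny change in the initial value produces a very different sequence, the pair (x0, r) can act as a compact secret from which the sequence is regenerated.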
Histo-MExNet: A Unified Framework for Real-World, Cross-Magnification, and Trustworthy Breast Cancer Histopathology 2026-03-15
Show

Accurate and reliable histopathological image classification is essential for breast cancer diagnosis. However, many deep learning models remain sensitive to magnification variability and lack interpretability. To address these challenges, we propose Histo-MExNet, a unified framework designed for scale-invariant and uncertainty-aware classification. The model integrates DenseNet, ConvNeXt, and EfficientNet backbones within a gated multi-expert architecture, incorporates a prototype learning module for example-driven interpretability, and applies physics-informed regularization to enforce morphology preservation and spatial coherence during feature learning. Monte Carlo Dropout is used to quantify predictive uncertainty. On the BreaKHis dataset, Histo-MExNet achieves 96.97% accuracy under multi-magnification training and demonstrates improved generalization to unseen magnification levels compared to single-expert models, while uncertainty estimation helps identify out-of-distribution samples and reduce overconfident errors, supporting a balanced combination of accuracy, robustness, and interpretability for clinical decision support.

34, 6 figures None
Facial beauty prediction fusing transfer learning and broad learning system 2026-03-14
Show

Facial beauty prediction (FBP) is an important and challenging problem in computer vision and machine learning. It is prone to overfitting because large-scale, effective data are lacking, and robust facial beauty evaluation models are difficult to build quickly because of the variability of facial appearance and the complexity of human perception. Transfer learning can reduce the dependence on large amounts of data and help avoid overfitting. The broad learning system (BLS) can build and train models quickly. For this reason, this paper fuses transfer learning with BLS for FBP. First, a CNN-based feature extractor is constructed via transfer learning, using EfficientNets, and the fused facial beauty features it extracts are passed to a BLS for FBP; this model is called E-BLS. Second, building on E-BLS, a connection layer is designed to link the feature extractor and the BLS; this model is called ER-BLS. Finally, experimental results show that E-BLS and ER-BLS improve FBP accuracy over previous BLS and CNN methods, demonstrating the effectiveness and superiority of the proposed method, which can also be widely applied to pattern recognition, object detection, and image classification.

None
Discriminative Flow Matching Via Local Generative Predictors 2026-03-14
Show

Traditional discriminative computer vision relies predominantly on static projections, mapping input features to outputs in a single computational step. Although efficient, this paradigm lacks the iterative refinement and robustness inherent in biological vision and modern generative modelling. In this paper, we propose Discriminative Flow Matching, a framework that reformulates classification and object detection as a conditional transport process. By learning a vector field that continuously transports samples from a simple noise distribution toward a task-aligned target manifold -- such as class embeddings or bounding box coordinates -- we are at the interface between generative and discriminative learning. Our method attaches multiple independent flow predictors to a shared backbone. These predictors are trained using local flow matching objectives, where gradients are computed independently for each block. We formulate this approach for standard image classification and extend it to the complex task of object detection, where targets are high-dimensional and spatially distributed. This architecture provides the flexibility to update blocks either sequentially to minimise activation memory or in parallel to suit different hardware constraints. By aggregating the predictions from these independent flow predictors, our framework enables robust, generative-inspired inference across diverse architectures, including CNNs and vision transformers.

None
Diffusion-Based Feature Denoising and Using NNMF for Robust Brain Tumor Classification 2026-03-13
Show

Brain tumor classification from magnetic resonance imaging (MRI) plays an important role in computer-assisted diagnosis systems. In recent years, deep learning models have achieved high classification accuracy. However, their sensitivity to adversarial perturbations has become an important reliability concern in medical applications. This study proposes a robust brain tumor classification framework that combines Non-Negative Matrix Factorization (NNMF or NMF), lightweight convolutional neural networks (CNNs), and diffusion-based feature purification. Initially, MRI images are preprocessed and converted into a non-negative data matrix, from which compact and interpretable NNMF feature representations are extracted. Statistical metrics, including AUC, Cohen's d, and p-values, are used to rank and choose the most discriminative components. Then, a lightweight CNN classifier is trained directly on the selected feature groups. To improve adversarial robustness, a diffusion-based feature-space purification module is introduced: a forward noising step followed by a learned denoiser network is applied before classification. System performance is evaluated using both clean accuracy and robust accuracy under powerful adversarial attacks created by AutoAttack. The experimental results show that the proposed framework achieves competitive classification performance while significantly enhancing robustness against adversarial perturbations. The findings suggest that combining interpretable NNMF-based representations with a lightweight deep approach and a diffusion-based defense technique provides an effective and reliable solution for medical image classification under adversarial conditions.

30 pages, 29 figures None
A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks 2026-03-13
Show

While Vision-Language Models (VLMs) have achieved remarkable performance across diverse downstream tasks, recent studies have shown that they can inherit social biases from the training data and further propagate them into downstream applications. To address this issue, various debiasing approaches have been proposed, yet most of them aim to improve fairness without having a theoretical guarantee that the utility of the model is preserved. In this paper, we introduce a debiasing method that yields a closed-form solution in the cross-modal space, achieving Pareto-optimal fairness with bounded utility losses. Our method is training-free, requires no annotated data, and can jointly debias both visual and textual modalities across downstream tasks. Extensive experiments show that our method outperforms existing methods in debiasing VLMs across diverse fairness metrics and datasets for both group and intersectional fairness in downstream tasks such as zero-shot image classification, text-to-image retrieval, and text-to-image generation while preserving task performance.

None
Stake the Points: Structure-Faithful Instance Unlearning 2026-03-13
Show

Machine unlearning (MU) addresses privacy risks in pretrained models. The main goal of MU is to remove the influence of designated data while preserving the utility of retained knowledge. Achieving this goal requires preserving semantic relations among retained instances, which existing studies often overlook. We observe that without such preservation, models suffer from progressive structural collapse, undermining the deletion-retention balance. In this work, we propose a novel structure-faithful framework that introduces stakes, i.e., semantic anchors that serve as reference points to maintain the knowledge structure. By leveraging these anchors, our framework captures and stabilizes the semantic organization of knowledge. Specifically, we instantiate the anchors from language-driven attribute descriptions encoded by a semantic encoder (e.g., CLIP). We enforce preservation of the knowledge structure via structure-aware alignment and regularization: the former aligns the organization of retained knowledge before and after unlearning around anchors, while the latter regulates updates to structure-critical parameters. Results from image classification, retrieval, and face recognition show average gains of 32.9%, 22.5%, and 19.3% in performance, balancing the deletion-retention trade-off and enhancing generalization.

Accepted by CVPR 2026

None
On Linear Separability of the MNIST Handwritten Digits Dataset 2026-03-13
Show

The MNIST dataset containing thousands of handwritten digit images is still a fundamental benchmark for evaluating various pattern-recognition and image-classification models. Linear separability is a key concept in many statistical and machine-learning techniques. Despite the long history of the MNIST dataset and its relative simplicity in size and resolution, the question of whether the dataset is linearly separable has never been fully answered -- scientific and informal sources share conflicting claims. This paper aims to provide a comprehensive empirical investigation to address this question, distinguishing pairwise and one-vs-rest separation of the training, the test and the combined sets, respectively. It reviews the theoretical approaches to assessing linear separability, alongside state-of-the-art methods and tools, then systematically examines all relevant assemblies, and reports the findings.

8 pages, 1 figure None
AI Model Modulation with Logits Redistribution 2026-03-13
Show

Large-scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of the model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet the specific end requirements. AIM enables two key modulation modes: utility and focus modulations. The former provides model owners with dynamic control over output quality to deliver varying utility levels, and the latter offers users precise control to shift the model's focused input features. AIM introduces a logits redistribution strategy that operates in a training data-agnostic and retraining-free manner. We establish a formal foundation to ensure AIM's regulation capability, based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM's practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation and text generation, and prevalent architectures including ResNet, SegFormer and Llama.

The 2025 ACM Web Conference

None
AIMC-Spec: A Benchmark Dataset for Automatic Intrapulse Modulation Classification under Variable Noise Conditions 2026-03-12
Show

A lack of standardized datasets has long hindered progress in automatic intrapulse modulation classification (AIMC), a critical task in radar signal analysis for electronic support systems, particularly under noisy or degraded conditions. AIMC seeks to identify the modulation type embedded within a single radar pulse from its complex in-phase and quadrature (I/Q) representation, enabling automated interpretation of intrapulse structure. This paper introduces AIMC-Spec, a comprehensive synthetic dataset for spectrogram-based image classification, encompassing 30 modulation types across 5 signal-to-noise ratio (SNR) levels. To benchmark AIMC-Spec, five representative deep learning algorithms ranging from lightweight CNNs and denoising architectures to transformer-based networks were re-implemented and evaluated under a unified input format. The results reveal significant performance variation, with frequency-modulated (FM) signals classified more reliably than phase-modulated (PM) types, particularly at low SNRs. A focused FM-only test further highlights how modulation type and network architecture influence classifier robustness. AIMC-Spec establishes a reproducible baseline and provides a foundation for future research and standardization in the AIMC domain.

This version updates the previously released dataset by reducing storage requirements, revising the SNR calculation procedure, and restructuring the dataset format. The first version of this work was published in IEEE Access, DOI: 10.1109/ACCESS.2025.3645091

None
DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture 2026-03-12
Show

Recent advances in self-supervised visual representation learning have demonstrated the effectiveness of predictive latent-space objectives for learning transferable features. In particular, Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns representations by predicting latent embeddings of masked target regions from visible context. However, it predicts target regions in parallel and all at once, lacking the ability to order predictions meaningfully. Inspired by human visual perception, which attends selectively and progressively from primary to secondary cues, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges latent predictive and autoregressive self-supervised learning. Specifically, DSeq-JEPA integrates a discriminatively ordered sequential process with a JEPA-style learning objective. This is achieved by (i) identifying primary discriminative regions using an attention-derived saliency map that serves as a proxy for visual importance, and (ii) predicting subsequent regions in discriminative order, inducing a curriculum-like semantic progression from primary to secondary cues in pre-training. Extensive experiments across tasks -- image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB, Stanford Cars), detection/segmentation (MS-COCO, ADE20K), and low-level reasoning (CLEVR) -- show that DSeq-JEPA consistently learns more discriminative and generalizable representations compared to I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.

Project page: https://github.com/SkyShunsuke/DSeq-JEPA

Code Link
Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization 2026-03-12
Show

Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. When domains differ in unknown causal factors, achieving cross-domain generalization is difficult, and there is no established methodology to objectively assess such differences without direct metadata or protocol-level information from data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework to evaluate whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal vision-language model (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.

None
FedSKD: Aggregation-free Model-heterogeneous Federated Learning via Multi-dimensional Similarity Knowledge Distillation for Medical Image Classification 2026-03-12
Show

Federated learning (FL) enables privacy-preserving collaborative model training without direct data sharing. Model-heterogeneous FL (MHFL) extends this paradigm by allowing clients to train personalized models with heterogeneous architectures tailored to their computational resources and application-specific needs. However, existing MHFL methods predominantly rely on centralized aggregation, which introduces scalability and efficiency bottlenecks, or impose restrictions requiring partially identical model architectures across clients. While peer-to-peer (P2P) FL removes server dependence, it suffers from model drift and knowledge dilution, limiting its effectiveness in heterogeneous settings. To address these challenges, we propose FedSKD, a novel MHFL framework that facilitates direct knowledge exchange through round-robin model circulation, eliminating the need for centralized aggregation while allowing fully heterogeneous model architectures across clients. FedSKD's key innovation lies in multi-dimensional similarity knowledge distillation, which enables bidirectional cross-client knowledge transfer at batch, pixel/voxel, and region levels for heterogeneous models in FL. This approach mitigates catastrophic forgetting and model drift through progressive reinforcement and distribution alignment while preserving model heterogeneity. Extensive evaluations on fMRI-based autism spectrum disorder diagnosis and skin lesion classification demonstrate that FedSKD outperforms state-of-the-art heterogeneous and homogeneous FL baselines, achieving superior personalization (client-specific accuracy) and generalization (cross-institutional adaptability). These findings underscore FedSKD's potential as a scalable and robust solution for real-world medical federated learning applications.

Accepted at IEEE-TNNLS, 17 pages

None
Resource-Efficient Iterative LLM-Based NAS with Feedback Memory 2026-03-12
Show

Neural Architecture Search (NAS) automates network design, but conventional methods demand substantial computational resources. We propose a closed-loop pipeline leveraging large language models (LLMs) to iteratively generate, evaluate, and refine convolutional neural network architectures for image classification on a single consumer-grade GPU without LLM fine-tuning. Central to our approach is a historical feedback memory inspired by Markov chains: a sliding window of $K=5$ recent improvement attempts keeps context size constant while providing sufficient signal for iterative learning. Unlike prior LLM optimizers that discard failure trajectories, each history entry is a structured diagnostic triple -- recording the identified problem, suggested modification, and resulting outcome -- treating code execution failures as first-class learning signals. A dual-LLM specialization reduces per-call cognitive load: a Code Generator produces executable PyTorch architectures while a Prompt Improver handles diagnostic reasoning. Since both the LLM and architecture training share limited VRAM, the search implicitly favors compact, hardware-efficient models suited to edge deployment. We evaluate three frozen instruction-tuned LLMs ($\leq 7$B parameters) across up to 2000 iterations in an unconstrained open code space, using one-epoch proxy accuracy on CIFAR-10, CIFAR-100, and ImageNette as a fast ranking signal. On CIFAR-10, DeepSeek-Coder-6.7B improves from 28.2% to 69.2%, Qwen2.5-7B from 50.0% to 71.5%, and GLM-5 from 43.2% to 62.0%. A full 2000-iteration search completes in $\approx 18$ GPU hours on a single RTX 4090, establishing a low-budget, reproducible, and hardware-aware paradigm for LLM-driven NAS without cloud infrastructure.

None
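The sliding-window feedback memory described in the NAS abstract above can be pictured with a short sketch. This is only an illustration under stated assumptions: the class names `Attempt` and `FeedbackMemory`, the field names, and the prompt rendering are hypothetical and are not taken from the paper; only the window size of 5 and the problem / modification / outcome triple come from the abstract.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One diagnostic triple (hypothetical field names)."""
    problem: str       # what was identified as wrong
    modification: str  # what change was suggested
    outcome: str       # resulting accuracy, or an execution error message


class FeedbackMemory:
    """Keeps only the K most recent attempts so the context size stays constant."""

    def __init__(self, k: int = 5) -> None:
        # A bounded deque silently drops the oldest entry once it is full.
        self._window = deque(maxlen=k)

    def record(self, attempt: Attempt) -> None:
        self._window.append(attempt)

    def as_prompt(self) -> str:
        """Render the window as numbered lines (the format is an assumption)."""
        return "\n".join(
            f"{i}. problem={a.problem} | change={a.modification} | result={a.outcome}"
            for i, a in enumerate(self._window, start=1)
        )

    def __len__(self) -> int:
        return len(self._window)


memory = FeedbackMemory(k=5)
for step in range(8):
    # Failures would be recorded the same way, with the error text as outcome.
    memory.record(Attempt(f"issue {step}", f"fix {step}", f"acc=0.{50 + step}"))
```

After eight recorded attempts the memory holds only the last five, so the text handed to the LLM does not grow with the number of iterations.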
Single Pixel Image Classification using an Ultrafast Digital Light Projector 2026-03-12
Show

Pattern recognition and image classification are essential tasks in machine vision. Autonomous vehicles, for example, require being able to collect the complex information contained in a changing environment and classify it in real time. Here, we experimentally demonstrate image classification at multi-kHz frame rates combining the technique of single pixel imaging (SPI) with a low complexity machine learning model. The use of a microLED-on-CMOS digital light projector for SPI enables ultrafast pattern generation for sub-ms image encoding. We investigate the classification accuracy of our experimental system against the broadly accepted benchmarking task of the MNIST digits classification. We compare the classification performance of two machine learning models: An extreme learning machine (ELM) and a backpropagation trained deep neural network. The complexity of both models is kept low so the overhead added to the inference time is comparable to the image generation time. Crucially, our single pixel image classification approach is based on a spatiotemporal transformation of the information, entirely bypassing the need for image reconstruction. By exploring the performance of our SPI based ELM as binary classifier we demonstrate its potential for efficient anomaly detection in ultrafast imaging scenarios.

None
HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification 2026-03-12
Show

Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (*Hierarchical and Explicit Label Modeling*), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.

Accepted and presented at REO workshop at EurIPS 2025

None
Quantum mechanical framework for quantization-based optimization: from Gradient flow to Schroedinger equation 2026-03-12
Show

This work presents a quantum mechanical framework for analyzing quantization-based optimization algorithms. The sampling process of the quantization-based search is modeled as a gradient-flow dissipative system, leading to a Hamilton-Jacobi-Bellman (HJB) representation. Through a suitable transformation of the objective function, this formulation yields the Schroedinger equation, which reveals that quantum tunneling enables escape from local minima and guarantees access to the global optimum. By establishing the connection to the Fokker-Planck equation, the framework provides a thermodynamic interpretation of global convergence. Such an analysis between the thermodynamic and the quantum dynamic methodology unifies combinatorial and continuous optimization, and extends naturally to machine learning tasks such as image classification. Numerical experiments demonstrate that quantization-based optimization consistently outperforms conventional algorithms across both combinatorial problems and nonconvex continuous functions.

preprint, 41 pages None
Adaptive aggregation of Monte Carlo augmented decomposed filters for efficient group-equivariant convolutional neural network 2026-03-12
Show

Group-equivariant convolutional neural networks (G-CNN) heavily rely on parameter sharing to increase CNN's data efficiency and performance. However, the parameter-sharing strategy greatly increases the computational burden for each added parameter, which hampers its application to deep neural network models. In this paper, we address these problems by proposing a non-parameter-sharing approach for group equivariant neural networks. The proposed methods adaptively aggregate a diverse range of filters by a weighted sum of stochastically augmented decomposed filters. We give theoretical proof about how the group equivariance can be achieved by our methods. Our method applies to both continuous and discrete groups, where the augmentation is implemented using Monte Carlo sampling and bootstrap resampling, respectively. Our methods also serve as an efficient extension of standard CNN. The experiments show that our method outperforms parameter-sharing group equivariant networks and enhances the performance of standard CNNs in image classification and denoising tasks, by using suitable filter bases to build efficient lightweight networks. The code is available at https://github.com/ZhaoWenzhao/MCG_CNN.

Code Link
Beyond Barren Plateaus: A Scalable Quantum Convolutional Architecture for High-Fidelity Image Classification 2026-03-11
Show

While Quantum Convolutional Neural Networks (QCNNs) offer a theoretical paradigm for quantum machine learning, their practical implementation is severely bottlenecked by barren plateaus -- the exponential vanishing of gradients -- and poor empirical accuracy compared to classical counterparts. In this work, we propose a novel QCNN architecture utilizing localized cost functions and a hardware-efficient tensor-network initialization strategy to provably mitigate barren plateaus. We evaluate our scalable QCNN on the MNIST dataset, demonstrating a significant performance leap. By resolving the gradient vanishing issue, our optimized QCNN achieves a classification accuracy of 98.7%, a substantial improvement over the baseline QCNN accuracy of 52.32% found in unmitigated models. Furthermore, we provide empirical evidence of a parameter-efficiency advantage, requiring $\mathcal{O}(\log N)$ fewer trainable parameters than equivalent classical CNNs to achieve $>95%$ convergence. This work bridges the gap between theoretical quantum utility and practical application, providing a scalable framework for quantum computer vision tasks without succumbing to loss landscape concentration.

None
A Saccade-inspired Approach to Image Classification using Vision Transformer Attention Maps 2026-03-11
Show

Human vision achieves remarkable perceptual performance while operating under strict metabolic constraints. A key ingredient is the selective attention mechanism, driven by rapid saccadic eye movements that constantly reposition the high-resolution fovea onto task-relevant locations, unlike conventional AI systems that process entire images with equal emphasis. Our work aims to draw inspiration from the human visual system to create smarter, more efficient image processing models. Using DINO, a self-supervised Vision Transformer that produces attention maps strikingly similar to human gaze patterns, we explore a saccade inspired method to focus the processing of information on key regions in visual space. To do so, we use the ImageNet dataset in a standard classification task and measure how each successive saccade affects the model's class scores. This selective-processing strategy preserves most of the full-image classification performance and can even outperform it in certain cases. By benchmarking against established saliency models built for human gaze prediction, we demonstrate that DINO provides superior fixation guidance for selecting informative regions. These findings highlight Vision Transformer attention as a promising basis for biologically inspired active vision and open new directions for efficient, neuromorphic visual processing.

16 page, 11 figure main paper + 3 pages, 6 appendix

None
REMSA: Foundation Model Selection for Remote Sensing via a Constraint-Aware Agent 2026-03-11
Show

Foundation Models (FMs) are increasingly integrated into remote sensing (RS) pipelines. These models include unimodal vision encoders and multimodal architectures. FMs are adapted to diverse perception tasks, such as image classification, change detection, and visual question answering. However, selecting the most suitable remote sensing foundation model (RSFM) for a specific task remains challenging due to scattered documentation, heterogeneous formats, and complex deployment constraints. To address this, we first introduce the RSFM Database (RS-FMD), the first structured and schema-guided resource covering over 160 RSFMs trained on various data modalities, spanning different spatial, spectral, and temporal resolutions, considering different learning paradigms. Built upon RS-FMD, we further present REMSA, a constraint-aware agent that enables automated RSFM selection from natural language queries. REMSA combines structured FM metadata retrieval with a task-driven decision workflow. In detail, it interprets user input, clarifies missing constraints, ranks models via in-context learning, and provides transparent justifications. Our system supports various RS tasks and data modalities, enabling personalized, reproducible, and efficient FM selection. To evaluate REMSA, we construct a benchmark of 100 expert-verified RS query scenarios. Each query is evaluated across 4 systems and 3 LLM backbones, with the top-3 selected models manually assessed by domain experts. This results in 3,000 expert-scored task--system--model configurations under our novel expert-centered evaluation protocol. REMSA outperforms multiple baselines, showing its practical utility in real decision-making applications. REMSA operates entirely on publicly available metadata of open source RSFMs, without accessing private or sensitive data.

Code and data available at https://github.com/be-chen/REMSA

Code Link
Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution 2026-03-11
Show

With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at https://github.com/hongsong-wang/LIDA

To appear in CVPR 2026, Code is at https://github.com/hongsong-wang/LIDA

Code Link
A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification 2026-03-11
Show

Out-of-distribution (OOD) detection is critical in safety-sensitive applications. While this challenge has been addressed from various perspectives, the influence of training objectives on OOD behavior remains comparatively underexplored. In this paper, we present a systematic comparison of four widely used training objectives: Cross-Entropy Loss, Prototype Loss, Triplet Loss, and Average Precision (AP) Loss, spanning probabilistic, prototype-based, metric-learning, and ranking-based supervision, for OOD detection in image classification under standardized OpenOOD protocols. Across CIFAR-10/100 and ImageNet-200, we find that Cross-Entropy Loss, Prototype Loss, and AP Loss achieve comparable in-distribution accuracy, while Cross-Entropy Loss provides the most consistent near- and far-OOD performance overall; the other objectives can be competitive in specific settings.

None
Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection 2026-03-11
Show

This study focuses on incremental learning for image classification, exploring how to reduce catastrophic forgetting of all learned knowledge when access to old data is restricted. The challenge lies in balancing plasticity (learning new knowledge) and stability (retaining old knowledge). Based on whether the task identifier (task-ID) is available during testing, incremental learning is divided into task incremental learning (TIL) and class incremental learning (CIL). The TIL paradigm often uses multiple classifier heads, selecting the corresponding head based on the task-ID. Since the CIL paradigm cannot access task-ID, methods originally developed for TIL require explicit task-ID prediction to bridge this gap and enable their adaptation to the CIL paradigm. In this study, a novel continual learning framework extends the TIL method for CIL by introducing out-of-distribution detection for task-ID prediction. Our framework utilizes task-specific Batch Normalization (BN) and task-specific classification heads to effectively adjust feature map distributions for each task, enhancing plasticity. With far fewer parameters than convolutional kernels, task-specific BN helps minimize parameter growth, preserving stability. Based on multiple task-specific classification heads, we introduce an "unknown" class for each head. During training, data from other tasks are mapped to this unknown class. During inference, the task-ID is predicted by selecting the classification head with the lowest probability assigned to the unknown class. Our method achieves state-of-the-art performance on two medical image datasets and two natural image datasets. The source code is available at https://github.com/z1968357787/mbn_ood_git_main.

accepted by Neurocomputing Journal, camera ready version

Code Link
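The inference rule stated in the class-incremental abstract above (pick the head that assigns the lowest probability to its "unknown" class) can be illustrated with a minimal sketch. The function name and the convention that the unknown class is the last entry of each head's output are assumptions made for this example, not details confirmed by the paper or its repository.

```python
from typing import Sequence, Tuple


def predict_with_unknown_heads(head_probs: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (task_id, class index within that task).

    head_probs[t] is the softmax output of task head t. By assumption in this
    sketch, the LAST entry of every head is that head's 'unknown' class.
    """
    if not head_probs:
        raise ValueError("at least one head is required")
    unknown = [probs[-1] for probs in head_probs]
    # Task-ID prediction: the head least convinced that the input is unknown.
    task_id = min(range(len(head_probs)), key=lambda t: unknown[t])
    # Class prediction: argmax over the known classes of the selected head.
    known = list(head_probs[task_id][:-1])
    local_class = max(range(len(known)), key=lambda c: known[c])
    return task_id, local_class


heads = [
    [0.05, 0.05, 0.90],  # head 0 mostly votes 'unknown'
    [0.10, 0.80, 0.10],  # head 1 is confident in its own class 1
    [0.30, 0.10, 0.60],  # head 2 leans 'unknown'
]
task_id, local_class = predict_with_unknown_heads(heads)
```

In this toy input, head 1 has the smallest unknown probability, so the sample is routed to task 1 and classified by that head alone.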
What do near-optimal learning rate schedules look like? 2026-03-11
Show

A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure indeed generally found near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the outputs of our shape search depend on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To the best of our knowledge, our results represent the most comprehensive results on near-optimal schedule shapes for deep neural network training, to date.

None
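The idea in the learning-rate abstract above of factoring the schedule shape out from the base learning rate can be made concrete with a small sketch. The warmup-plus-cosine shape used here is only one familiar family chosen for illustration; the function names and the `warmup_frac` parameter are assumptions, and the abstract does not say which families or parameterizations the authors searched.

```python
import math
from typing import Callable


def warmup_cosine_shape(progress: float, warmup_frac: float = 0.1) -> float:
    """Unitless schedule shape in [0, 1] over training progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must lie in [0, 1)")
    if progress < warmup_frac:
        return progress / warmup_frac  # linear warmup from 0 up to 1
    decay_progress = (progress - warmup_frac) / (1.0 - warmup_frac)
    return 0.5 * (1.0 + math.cos(math.pi * decay_progress))  # cosine decay to 0


def learning_rate(
    step: int,
    total_steps: int,
    base_lr: float,
    shape: Callable[[float], float] = warmup_cosine_shape,
) -> float:
    """Learning rate = base learning rate (a scalar) times the shape value."""
    return base_lr * shape(step / total_steps)


peak = learning_rate(10, 100, base_lr=0.1)      # end of warmup
midpoint = learning_rate(55, 100, base_lr=0.1)  # halfway through the decay
```

Because the shape is a separate function, two shapes can be compared at their own best `base_lr` rather than at a shared value, which is the kind of comparison the abstract argues for.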
CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification 2026-03-10
Show

Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation. Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks. Our project can be accessed at https://github.com/ZJU4HealthCare/CLEAR-Mamba.

12 pages, 7 figures Code Link
Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification 2026-03-10
Show

Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at https://anonymous.4open.science/r/imaging-roi-contrastive-152C/.

None
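The bidirectional contrastive objective mentioned in the imaging-ROI abstract above is commonly realized as a symmetric InfoNCE loss. The sketch below shows that generic form on plain Python lists. It is an assumption that the paper uses this exact formulation: the function names, the cosine-similarity choice, and the temperature value are illustrative only. Embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))  # assumes a non-zero vector
    return [x / norm for x in v]


def _info_nce(anchors, targets, temperature: float) -> float:
    """Mean cross-entropy in which anchor i should match target i."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def bidirectional_contrastive_loss(imaging, roi, temperature: float = 0.1) -> float:
    """Average of the imaging-to-ROI and ROI-to-imaging InfoNCE terms.

    imaging[i] and roi[i] are the two embeddings of subject i.
    """
    img = [_normalize(v) for v in imaging]
    graph = [_normalize(v) for v in roi]
    return 0.5 * (_info_nce(img, graph, temperature) + _info_nce(graph, img, temperature))


subjects_imaging = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = bidirectional_contrastive_loss(subjects_imaging, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = bidirectional_contrastive_loss(subjects_imaging, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each subject's two views point the same way and large when they are paired with the wrong subject, which is the behavior the abstract describes as pulling same-subject representations together while separating different subjects.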
Why Does It Look There? Structured Explanations for Image Classification 2026-03-10
Show

Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of "why does it look there" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we can identify uncertain prototypes recognized by I2X and then use targeted perturbation of samples that allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.

None
From Semantics to Pixels: Coarse-to-Fine Masked Autoencoders for Hierarchical Visual Understanding 2026-03-10
Show

Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.

None
Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels 2026-03-10
Show

Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.

None
YOLO-NAS-Bench: A Surrogate Benchmark with Self-Evolving Predictors for YOLO Architecture Search 2026-03-10
Show

Neural Architecture Search (NAS) for object detection is severely bottlenecked by high evaluation cost, as fully training each candidate YOLO architecture on COCO demands days of GPU time. Meanwhile, existing NAS benchmarks largely target image classification, leaving the detection community without a comparable benchmark for NAS evaluation. To address this gap, we introduce YOLO-NAS-Bench, the first surrogate benchmark tailored to YOLO-style detectors. YOLO-NAS-Bench defines a search space spanning channel width, block depth, and operator type across both backbone and neck, covering the core modules of YOLOv8 through YOLO12. We sample 1,000 architectures via random, stratified, and Latin Hypercube strategies, train them on COCO-mini, and build a LightGBM surrogate predictor. To sharpen the predictor in the high-performance regime most relevant to NAS, we propose a Self-Evolving Mechanism that progressively aligns the predictor's training distribution with the high-performance frontier, by using the predictor itself to discover and evaluate informative architectures in each iteration. This method grows the pool to 1,500 architectures and raises the ensemble predictor's R2 from 0.770 to 0.815 and Sparse Kendall Tau from 0.694 to 0.752, demonstrating strong predictive accuracy and ranking consistency. Using the final predictor as the fitness function for evolutionary search, we discover architectures that surpass all official YOLOv8-YOLO12 baselines at comparable latency on COCO-mini, confirming the predictor's discriminative power for top-performing detection architectures.

None
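The YOLO-NAS-Bench abstract above reports ranking consistency through a Kendall-tau-style statistic. As a rough illustration of what such a statistic measures, the sketch below computes the plain Kendall tau (tau-a) between predicted and measured scores. This is not the "Sparse Kendall Tau" variant named in the abstract, whose exact definition is not given there; the function name and the toy numbers are assumptions.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Plain Kendall tau-a: (concordant - discordant) / number of pairs."""
    if len(predicted) != len(measured):
        raise ValueError("both score lists must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two architectures are needed")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (measured[i] - measured[j])
        if sign > 0:
            concordant += 1  # the predictor orders this pair correctly
        elif sign < 0:
            discordant += 1  # the predictor orders this pair the wrong way
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy example: surrogate scores versus measured accuracy for three architectures.
tau_perfect = kendall_tau([0.1, 0.2, 0.3], [10.0, 20.0, 30.0])
tau_one_swap = kendall_tau([0.1, 0.3, 0.2], [10.0, 20.0, 30.0])
```

A value near 1 means the surrogate ranks architectures almost exactly as real training would, which matters more for search than the absolute error of the predicted score.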
Rotation Equivariant Mamba for Vision Tasks 2026-03-10
Show

Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks - including high-level image classification task, mid-level semantic segmentation task, and low-level image super-resolution task - demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ-VMamba.

Code Link
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction 2026-03-10
Show

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.

Code Link
OptiRoulette Optimizer: A New Stochastic Meta-Optimizer for up to 5.3x Faster Convergence 2026-03-10
Show

This paper presents OptiRoulette, a stochastic meta-optimizer that selects update rules during training instead of fixing a single optimizer. The method combines warmup optimizer locking, random sampling from an active optimizer pool, compatibility-aware learning-rate scaling during optimizer transitions, and failure-aware pool replacement. OptiRoulette is implemented as a drop-in, "torch.optim.Optimizer-compatible" component and packaged for pip installation. We report completed 10-seed results on five image-classification suites: CIFAR-100, CIFAR-100-C, SVHN, Tiny ImageNet, and Caltech-256. Against a single-optimizer AdamW baseline, OptiRoulette improves mean test accuracy from 0.6734 to 0.7656 on CIFAR-100 (+9.22 percentage points), 0.2904 to 0.3355 on CIFAR-100-C (+4.52), 0.9667 to 0.9756 on SVHN (+0.89), 0.5669 to 0.6642 on Tiny ImageNet (+9.73), and 0.5946 to 0.6920 on Caltech-256 (+9.74). Its main advantage is convergence reliability at higher targets: it reaches CIFAR-100/CIFAR-100-C 0.75, SVHN 0.96, Tiny ImageNet 0.65, and Caltech-256 0.62 validation accuracy in 10/10 runs, while the AdamW baseline reaches none of these targets within budget. On shared targets, OptiRoulette also reduces time-to-target (e.g., Caltech-256 at 0.59: 25.7 vs 77.0 epochs). Paired-seed deltas are positive on all datasets; CIFAR-100-C test ROC-AUC is the only metric not statistically significant in the current 10-seed study.

23 pages, 10 figures, 7 tables

None
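The percentage-point gains quoted in the OptiRoulette abstract above can be recomputed from the reported mean accuracies. The sketch below only redoes that arithmetic on the rounded figures from the abstract; the dictionary and function names are illustrative, and nothing here reproduces the authors' experiments.

```python
# (AdamW baseline, OptiRoulette) mean test accuracy, as quoted in the abstract.
reported = {
    "CIFAR-100": (0.6734, 0.7656),
    "CIFAR-100-C": (0.2904, 0.3355),
    "SVHN": (0.9667, 0.9756),
    "Tiny ImageNet": (0.5669, 0.6642),
    "Caltech-256": (0.5946, 0.6920),
}


def gain_in_points(baseline: float, improved: float) -> float:
    """Absolute gain in percentage points, rounded to two decimals."""
    return round((improved - baseline) * 100, 2)


gains = {name: gain_in_points(b, o) for name, (b, o) in reported.items()}

# Epochs needed to reach 0.59 validation accuracy on Caltech-256 (from the abstract).
epoch_ratio = 77.0 / 25.7
```

Four of the five recomputed gains match the abstract exactly. For CIFAR-100-C the rounded means give 4.51 rather than the quoted +4.52, a difference consistent with the authors computing the gain before rounding. The Caltech-256 epoch counts correspond to a ratio of roughly 3.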
An accurate flatness measure to estimate the generalization performance of CNN models 2026-03-09
Show

Flatness measures based on the spectrum or the trace of the Hessian of the loss are widely used as proxies for the generalization ability of deep networks. However, most existing definitions are either tailored to fully connected architectures, relying on stochastic estimators of the Hessian trace, or ignore the specific geometric structure of modern Convolutional Neural Networks (CNNs). In this work, we develop a flatness measure that is both exact and architecturally faithful for a broad and practically relevant class of CNNs. We first derive a closed-form expression for the trace of the Hessian of the cross-entropy loss with respect to convolutional kernels in networks that use global average pooling followed by a linear classifier. Building on this result, we then specialize the notion of relative flatness to convolutional layers and obtain a parameterization-aware flatness measure that properly accounts for the scaling symmetries and filter interactions induced by convolution and pooling. Finally, we empirically investigate the proposed measure on families of CNNs trained on standard image-classification benchmarks. The results obtained suggest that the proposed measure can serve as a robust tool to assess and compare the generalization performance of CNN models, and to guide the design of architecture and training choices in practice.

None
Are vision-language models ready to zero-shot replace supervised classification models in agriculture? 2026-03-09
Show

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

None
Geometrically Constrained Outlier Synthesis 2026-03-09
Show

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.

18 pages, 6 figures None
Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks 2026-03-09
Show

Adversarial examples in neural networks have been extensively studied in Euclidean geometry, but recent advances in *hyperbolic networks* call for a reevaluation of attack strategies in non-Euclidean geometries. Existing methods such as FGSM and PGD apply perturbations without regard to the underlying hyperbolic structure, potentially leading to inefficient or geometrically inconsistent attacks. In this work, we propose a novel adversarial attack that explicitly leverages the geometric properties of hyperbolic space. Specifically, we compute the gradient of the loss function in the tangent space of hyperbolic space, decompose it into a radial (depth) component and an angular (semantic) component, and apply a perturbation derived solely from the angular direction. Our method generates adversarial examples by focusing perturbations in semantically sensitive directions encoded in angular movement within the hyperbolic geometry. Empirical results on image classification and cross-modal retrieval tasks, across several network architectures, demonstrate that our attack achieves higher fooling rates than conventional adversarial attacks, while producing high-impact perturbations with deeper insights into vulnerabilities of hyperbolic embeddings. This work highlights the importance of geometry-aware adversarial strategies in curved representation spaces and provides a principled framework for attacking hierarchical embeddings.

Accepted by AAAI 2026. Code available at: https://github.com/J-Minsoo/AGSM

Code Link
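The radial/angular split described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the embedding `x` is a nonzero point, uses the plain Euclidean inner product as a stand-in for the tangent-space geometry, and the function names `split_radial_angular` and `angular_sign_step` are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_radial_angular(grad, x):
    """Split grad into its component along x (radial) and the remainder (angular).

    Assumes x is nonzero; the Euclidean inner product is an illustrative stand-in.
    """
    scale = dot(grad, x) / dot(x, x)
    radial = [scale * xi for xi in x]
    angular = [g - r for g, r in zip(grad, radial)]
    return radial, angular


def angular_sign_step(x, grad, epsilon):
    """FGSM-style step that uses only the sign of the angular gradient component."""
    _, angular = split_radial_angular(grad, x)
    signs = [(a > 0) - (a < 0) for a in angular]
    return [xi + epsilon * s for xi, s in zip(x, signs)]


x = [0.3, 0.4]
grad = [1.0, 0.0]
radial, angular = split_radial_angular(grad, x)
x_adv = angular_sign_step(x, grad, epsilon=0.01)
```

Taking the sign of the angular component, as done here by analogy with FGSM, no longer yields a purely angular direction, and this sketch does not project the result back onto any hyperbolic model; both points would need care in a real attack.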
Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition 2026-03-09
Show

Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompts by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification. Our code is available at https://github.com/less-and-less-bugs/CGBC.

19 pages, Accepted by CVPR 2026

Code Link
UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction 2026-03-08
Show

Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks, and the resulting models incur substantial computational overhead. To address this, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and code are available at https://github.com/yyxl123/UltraUPConvNet

8 pages Code Link
Robustness Verification of Graph Neural Networks Via Lightweight Satisfiability Testing 2026-03-08
Show

Graph neural networks (GNNs) are the predominant architecture for learning over graphs. As with any machine learning model, an important issue is the detection of attacks, where an adversary can change the output with a small perturbation of the input. Techniques for solving the adversarial robustness problem - determining whether an attack exists - were originally developed for image classification. In the case of graph learning, the attack model usually considers changes to the graph structure in addition to or instead of the numerical features of the input, and the state of the art techniques proceed via reduction to constraint solving, working on top of powerful solvers, e.g. for mixed integer programming. We show that it is possible to improve on the state of the art in structural robustness by replacing the use of powerful solvers by calls to efficient partial solvers, which run in polynomial time but may be incomplete. We evaluate our tool RobLight on a diverse set of GNN variants and datasets.

None
Discovering the Hidden Role of Gini Index In Prompt-based Classification 2026-03-07
Show

In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of the Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in both text and image prompt-based classification results, regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top class relative dominance while elevating weakest classes.

None
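As a rough illustration of using a Gini score to quantify disparity in per-class accuracy, the sketch below computes the standard Gini coefficient (mean absolute pairwise difference, normalized by twice the mean). Whether the paper uses exactly this normalization is an assumption, and the function name `gini` and the sample accuracies are hypothetical.

```python
def gini(values):
    """Standard Gini coefficient of a list of non-negative numbers.

    Returns 0.0 for perfectly equal values and approaches 1.0 as a single
    entry dominates. Empty or all-zero input is treated as perfectly equal.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of |a - b| over all ordered pairs, divided by 2 * n^2 * mean = 2 * n * total.
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * n * total)


# Hypothetical per-class accuracies for a balanced and an imbalanced classifier.
balanced = [0.5, 0.5]
dominated = [1.0, 0.0, 0.0, 0.0]
balanced_score = gini(balanced)
dominated_score = gini(dominated)
```

A lower score indicates more evenly distributed class accuracies, so a debiasing procedure in this spirit would aim to reduce the score without lowering mean accuracy.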
Puppet-CNN: Continuous Parameter Dynamics for Input-Adaptive Convolutional Networks 2026-03-07
Show

Modern convolutional neural networks (CNNs) organize computation as a discrete stack of layers whose parameters are independently stored and learned, with the number of layers fixed as an architectural hyperparameter. In this work, we explore an alternative perspective: can network parameterization itself be modeled as a continuous dynamical system? We introduce Puppet-CNN, a framework that represents convolutional layer parameters as states evolving along a learned parameter flow governed by a neural ordinary differential equation (ODE). Under this formulation, layer parameters are generated through continuous evolution in parameter space, and the effective number of generated layers is determined by the integration horizon of the learned dynamics, which can be modulated by input complexity to enable input-adaptive computation. We validate this formulation on standard image classification benchmarks and demonstrate that continuous parameter dynamics can achieve competitive predictive performance while substantially reducing stored trainable parameters. These results suggest that viewing neural network parameterization through the lens of dynamical systems provides a structured and flexible design space for adaptive convolutional models.

12 pages, 4 figures. Updated version with revised title

None
Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers 2026-03-07
Show

In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for its defect in generalization is that it often tends to converge to sharp minima. To enhance its ability to find flat minima, we propose its new variant named inverse Adam (InvAdam). The key improvement of InvAdam lies in its parameter update mechanism, which is opposite to that of Adam. Specifically, it computes element-wise multiplication of the first-order and second-order moments, while Adam computes the element-wise division of these two moments. This modification aims to increase the step size of the parameter update when the elements in the second-order moments are large and vice versa, which helps the parameter escape sharp minima and stay at flat ones. However, InvAdam's update mechanism may face challenges in convergence. To address this challenge, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization performance. Additionally, we introduce the diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments are conducted on image classification tasks and large language model (LLM) fine-tuning. The results validate that DualAdam outperforms Adam and its state-of-the-art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin-lab/DualAdam.

Code Link
Causal Interpretation of Neural Network Computations with Contribution Decomposition 2026-03-06
Show

Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.

32 pages, 19 figures. ICLR 2026 poster

None
ButterflyViT: 354$\times$ Expert Compression for Edge Vision Transformers 2026-03-06
Show

Deploying sparse Mixture of Experts (MoE) Vision Transformers remains a challenge due to linear expert memory scaling. Linear memory scaling stores $N_E$ independent expert weight matrices requiring $\mathcal{O}(N_E \cdot d^2)$ memory, which exceeds the memory budget of edge devices. Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyViT, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified shared quantized substrate. Diversity among experts arises from viewing different angles of shared capacity, not from redundant storage. By applying learned rotations to a shared ternary prototype, the experts jointly require $\mathcal{O}(d_{\text{model}} \cdot d_{\text{ff}} + N_E \cdot n_\ell \cdot d)$ memory, which is sub-linear in the number of experts. To address the unique challenges of vision, a spatial smoothness regulariser is introduced that penalises routing irregularities between adjacent patch tokens, turning patch correlation into a training signal. Across image classification tasks on CIFAR-100, ButterflyViT achieves 354$\times$ memory reduction at 64 experts with negligible accuracy loss. ButterflyViT allows multiple experts to fit on edge-constrained devices, showing that geometric parameterization breaks linear scaling.

None
Demystifying KAN for Vision Tasks: The RepKAN Approach 2026-03-06
Show

Remote sensing image classification is essential for Earth observation, yet standard CNNs and Transformers often function as uninterpretable black-boxes. We propose RepKAN, a novel architecture that integrates the structural efficiency of CNNs with the non-linear representational power of KANs. By utilizing a dual-path design -- Spatial Linear and Spectral Non-linear -- RepKAN enables the autonomous discovery of class-specific spectral fingerprints and physical interaction manifolds. Experimental results on the EuroSAT and NWPU-RESISC45 datasets demonstrate that RepKAN provides explicit physically interpretable reasoning while outperforming state-of-the-art models. These findings indicate that RepKAN holds significant potential to serve as the backbone for future interpretable visual foundation models.

None
Mitigating Bias in Concept Bottleneck Models for Fair and Interpretable Image Classification 2026-03-06
Show

Ensuring fairness in image classification prevents models from perpetuating and amplifying bias. Concept bottleneck models (CBMs) map images to high-level, human-interpretable concepts before making predictions via a sparse, one-layer classifier. This structure enhances interpretability and, in theory, supports fairness by masking sensitive attribute proxies such as facial features. However, CBM concepts have been known to leak information unrelated to concept semantics and early results reveal only marginal reductions in gender bias on datasets like ImSitu. We propose three bias mitigation techniques to improve fairness in CBMs: 1. Decreasing information leakage using a top-k concept filter, 2. Removing biased concepts, and 3. Adversarial debiasing. Our results outperform prior work in terms of fairness-performance tradeoffs, indicating that our debiased CBM provides a significant step towards fair and interpretable image classification.

None
Remote Sensing Image Classification Using Deep Ensemble Learning 2026-03-06
Show

Remote sensing imagery plays a crucial role in many applications and requires accurate computerized classification techniques. Reliable classification is essential for transforming raw imagery into structured and usable information. Although Convolutional Neural Networks (CNNs), the models most commonly used for image classification, excel at local feature extraction, they struggle to capture global contextual information. Vision Transformers (ViTs) address this limitation through self-attention mechanisms that model long-range dependencies. Integrating CNNs and ViTs, therefore, leads to better performance than standalone architectures. However, the use of additional CNN and ViT components does not lead to further performance improvement and instead introduces a bottleneck caused by redundant feature representations. In this research, we propose a fusion model that combines the strengths of CNNs and ViTs for remote sensing image classification. To overcome the performance bottleneck, the proposed approach trains four independent fusion models that integrate CNN and ViT backbones and combine their outputs at the final prediction stage through ensembling. The proposed method achieves accuracy rates of 98.10 percent, 94.46 percent, and 95.45 percent on the UC Merced, RSSCN7, and MSRSI datasets, respectively. These results outperform competing architectures and highlight the effectiveness of the proposed solution, particularly due to its efficient use of computational resources during training.

None
Margin and Consistency Supervision for Calibrated and Robust Vision Models 2026-03-06
Show

Deep vision classifiers often achieve high accuracy while remaining poorly calibrated and fragile under small distribution shifts. We present Margin and Consistency Supervision (MaCS), a simple, architecture-agnostic regularization framework that jointly enforces logit-space separation and local prediction stability. MaCS augments cross-entropy with (i) a hinge-squared margin penalty that enforces a target logit gap between the correct class and the strongest competitor, and (ii) a consistency regularizer that minimizes the KL divergence between predictions on clean inputs and mildly perturbed views. We provide a unifying theoretical analysis showing that increasing classification margin while reducing local sensitivity formalized via a Lipschitz-type stability proxy yields improved generalization guarantees and a provable robustness radius bound scaling with the margin-to-sensitivity ratio. Across several image classification benchmarks and several backbones spanning CNNs and Vision Transformers, MaCS consistently improves calibration (lower ECE and NLL) and robustness to common corruptions while preserving or improving top-1 accuracy. Our approach requires no additional data, no architectural changes, and negligible inference overhead, making it an effective drop-in replacement for standard training objectives.

None
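The two regularizers named in the MaCS abstract above can be sketched for a single example as follows. This is a minimal illustration under stated assumptions, not the authors' code: the weights `lam_margin` and `lam_cons`, the default `target_gap`, the KL direction (clean relative to perturbed), and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hinge_squared_margin(logits, label, target_gap):
    """Squared hinge on the gap between the true-class logit and its strongest competitor."""
    competitor = max(z for j, z in enumerate(logits) if j != label)
    gap = logits[label] - competitor
    return max(0.0, target_gap - gap) ** 2


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def macs_style_loss(clean_logits, perturbed_logits, label,
                    target_gap=1.0, lam_margin=1.0, lam_cons=1.0):
    """Cross-entropy plus margin penalty plus clean-vs-perturbed consistency term."""
    p_clean = softmax(clean_logits)
    p_pert = softmax(perturbed_logits)
    cross_entropy = -math.log(p_clean[label])
    margin = hinge_squared_margin(clean_logits, label, target_gap)
    consistency = kl_divergence(p_clean, p_pert)
    return cross_entropy + lam_margin * margin + lam_cons * consistency


# Hypothetical logits for a 3-class example whose true class is 0.
loss = macs_style_loss([1.0, 0.5, 0.0], [0.9, 0.6, 0.0], label=0)
```

The margin term vanishes once the true-class logit leads its strongest competitor by at least `target_gap`, and the consistency term vanishes when the clean and perturbed predictions coincide, so both act only where the model is under-separated or locally unstable.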
Iterative Quantum Feature Maps 2026-03-06
Show

Quantum machine learning models that leverage quantum circuits as quantum feature maps (QFMs) are recognized for their enhanced expressive power in learning tasks. Such models have demonstrated rigorous end-to-end quantum speedups for specific families of classification problems. However, deploying deep QFMs on real quantum hardware remains challenging due to circuit noise and hardware constraints. Additionally, variational quantum algorithms often suffer from computational bottlenecks, particularly in accurate gradient estimation, which significantly increases quantum resource demands during training. We propose Iterative Quantum Feature Maps (IQFMs), a hybrid quantum-classical framework that constructs a deep architecture by iteratively connecting shallow QFMs with classically computed augmentation weights. By incorporating contrastive learning and a layer-wise training mechanism, the IQFMs framework effectively reduces quantum runtime and mitigates noise-induced degradation. In tasks involving noisy quantum data, numerical experiments show that the IQFMs framework outperforms quantum convolutional neural networks, without requiring the optimization of variational quantum parameters. Even for a typical classical image classification benchmark, a carefully designed IQFMs framework achieves performance comparable to that of classical neural networks. This framework presents a promising path to address current limitations and harness the full potential of quantum-enhanced machine learning.

19 pages, 13 figures; updated for refining results and discussion

None
Layer by layer, module by module: Choose both for optimal OOD probing of ViT 2026-03-05
Show

Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.

Accepted at ICLR 2026 CAO Workshop

None
Synchronization-based clustering on the unit hypersphere 2026-03-05
Show

Clustering on the unit hypersphere is a fundamental problem in various fields, with applications ranging from gene expression analysis to text and image classification. Traditional clustering methods are not always suitable for unit sphere data, as they do not account for the geometric structure of the sphere. We introduce a novel algorithm for clustering data represented as points on the unit sphere $\mathbf{S}^{d-1}$. Our method is based on the $d$-dimensional generalized Kuramoto model. The effectiveness of the introduced method is demonstrated on synthetic and real-world datasets. Results are compared with some of the traditional clustering methods, showing that our method achieves similar or better results in terms of clustering accuracy.

None
A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification 2026-03-05
Show

Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.

18 pages, 5 figures None
Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models 2026-03-04
Show

Synthetically augmenting training datasets with diffusion models has become an effective strategy for improving the generalization of image classifiers. However, existing approaches typically increase dataset size by 10-30x and struggle to ensure generation diversity, leading to substantial computational overhead. In this work, we introduce TADA (TArgeted Diffusion Augmentation), a principled framework that selectively augments examples that are not learned early in training using faithful synthetic images that preserve semantic features while varying noise. We show that augmenting only this targeted subset consistently outperforms augmenting the entire dataset. Through theoretical analysis on a two-layer CNN, we prove that TADA improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Extensive experiments demonstrate that by augmenting only 30-40% of the training data, TADA improves generalization by up to 2.8% across diverse architectures including ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, TinyImageNet, and ImageNet, using optimizers such as SGD and SAM. Notably, TADA combined with SGD outperforms the state-of-the-art optimizer SAM on CIFAR-100 and TinyImageNet. Furthermore, TADA shows promising improvements on object detection benchmarks, demonstrating its applicability beyond image classification. Our code is available at https://github.com/BigML-CS-UCLA/TADA.

Code Link
LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis 2026-03-04
Show

Driven by the continuous development of models such as the Multi-Layer Perceptron, Convolutional Neural Network (CNN), and Transformer, deep learning has made breakthrough progress in fields such as computer vision and natural language processing, and has been successfully applied in practical scenarios such as image classification and industrial fault diagnosis. However, existing models still have certain limitations in local feature modeling and global dependency capture. Specifically, CNNs are limited by local receptive fields, while Transformers have shortcomings in effectively modeling local structures, and both face challenges of high model complexity and insufficient interpretability. In response to these issues, we propose a sparse Transformer based on the Learnable Iterative Shrinkage Threshold Algorithm (LISTA-Transformer), which deeply integrates LISTA sparse encoding with a visual Transformer to construct a model architecture with an adaptive local and global feature collaboration mechanism. This method utilizes the continuous wavelet transform to convert vibration signals into time-frequency maps and inputs them into LISTA-Transformer for more effective feature extraction. On the CWRU dataset, the fault recognition rate of our method reached 98.5%, which is 3.3% higher than traditional methods and exhibits certain superiority over existing Transformer-based approaches.

14 pages, 14 figures, conference paper

None
GeoTop: Advancing Image Classification with Geometric-Topological Analysis 2026-03-04
Show

A fundamental challenge in diagnostic imaging is the phenomenon of topological equivalence, where benign and malignant structures share global topology but differ in critical geometric detail, leading to diagnostic errors in both conventional and deep learning models. We introduce GeoTop, a mathematically principled framework that unifies Topological Data Analysis (TDA) and Lipschitz-Killing Curvatures (LKCs) to resolve this ambiguity. Unlike hybrid deep learning approaches, GeoTop provides intrinsic interpretability by fusing the capacity of persistent homology to identify robust topological signatures with the precision of LKCs in quantifying local geometric features such as boundary complexity and surface regularity. The framework's clinical utility is demonstrated through its application to skin lesion classification, where it achieves a consistent accuracy improvement of 3.6% and reduces false positives and negatives by 15-18% compared to conventional single-modality methods. Crucially, GeoTop directly addresses the problem of topological equivalence by incorporating geometric differentiators, providing both theoretical guarantees (via a formal lemma) and empirical validation via controlled benchmarks. Beyond its predictive performance, GeoTop offers inherent mathematical interpretability through persistence diagrams and curvature-based descriptors, computational efficiency for large datasets (processing 224x224 pixel images in at most 0.5 s), and demonstrated generalisability to molecular-level data. By unifying topological invariance with geometric sensitivity, GeoTop provides a principled, interpretable solution for advanced shape discrimination in diagnostic imaging.

37 pages, 6 figures

None
Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models? 2026-03-04
Show

Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.

None
Specificity-aware reinforcement learning for fine-grained open-world classification 2026-03-04
Show

Classifying fine-grained visual concepts under open-world settings, i.e., without a predefined label set, demands models to be both accurate and specific. Recent reasoning Large Multimodal Models (LMMs) exhibit strong visual understanding capability but tend to produce overly generic predictions when performing fine-grained image classification. Our preliminary analysis reveals that models do possess the intrinsic fine-grained domain knowledge. However, promoting more specific predictions (specificity) without compromising correct ones (correctness) remains a non-trivial and understudied challenge. In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. We propose a novel specificity-aware reinforcement learning framework, SpeciaRL, to fine-tune reasoning LMMs on fine-grained image classification under the open-world setting. SpeciaRL introduces a dynamic, verifier-based reward signal anchored to the best predictions within online rollouts, promoting specificity while respecting the model's capabilities to prevent incorrect predictions. Our out-of-domain experiments show that SpeciaRL delivers the best trade-off between correctness and specificity across extensive fine-grained benchmarks, surpassing existing methods and advancing open-world fine-grained image classification. Code and model are publicly available at https://github.com/s-angheben/SpeciaRL.

Accepted at CVPR 2026

Code Link
Semantic Bridging Domains: Pseudo-Source as Test-Time Connector 2026-03-04
Show

Distribution shifts between training and testing data are a critical bottleneck limiting the practical utility of models, especially in real-world test-time scenarios. To adapt models when the source domain is unknown and the target domain is unlabeled, previous works constructed pseudo-source domains via data generation and translation, then aligned the target domain with them. However, significant discrepancies exist between the pseudo-source and the original source domain, leading to potential divergence when correcting the target directly. From this perspective, we propose a Stepwise Semantic Alignment (SSA) method, viewing the pseudo-source as a semantic bridge connecting the source and target, rather than a direct substitute for the source. Specifically, we leverage easily accessible universal semantics to rectify the semantic features of the pseudo-source, and then align the target domain using the corrected pseudo-source semantics. Additionally, we introduce a Hierarchical Feature Aggregation (HFA) module and a Confidence-Aware Complementary Learning (CACL) strategy to enhance the semantic quality of the SSA process in the absence of source and ground truth of target domains. We evaluated our approach on tasks like semantic segmentation and image classification, achieving a 5.2% performance boost on GTA2Cityscapes over the state-of-the-art.

25 pages

None
LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing 2026-03-04
Show

Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning by guaranteeing privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.

None
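For the LDP-Slicing entry above, the bit-plane idea can be illustrated with a short sketch: a pixel is split into binary planes, and each bit is perturbed independently. The sketch assumes classic binary randomized response as the per-bit mechanism and a hand-chosen budget per plane. The paper's actual mechanism, its perceptual obfuscation module, and its optimized budget allocation are not reproduced here, and all function names are hypothetical.

```python
import math
import random


def to_bit_planes(pixel, n_bits=8):
    """Split an integer pixel value into bits, most significant first."""
    return [(pixel >> k) & 1 for k in range(n_bits - 1, -1, -1)]


def from_bit_planes(bits):
    """Reassemble a pixel value from bits given most significant first."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def randomized_response(bit, eps, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep_prob = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < keep_prob else 1 - bit


def privatize_pixel(pixel, eps_per_plane, rng):
    """Perturb each bit plane with its own budget (assumed allocation).

    By sequential composition the per-pixel budget is at most sum(eps_per_plane).
    """
    bits = to_bit_planes(pixel, len(eps_per_plane))
    noisy = [randomized_response(b, e, rng) for b, e in zip(bits, eps_per_plane)]
    return from_bit_planes(noisy)


rng = random.Random(0)
# Illustrative allocation: more budget (less noise) on the significant planes.
noisy_pixel = privatize_pixel(200, [2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25, 0.25], rng)
```

Spending more budget on the high-order planes is one simple way to trade privacy against utility, since flips there change the pixel value the most.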
Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Neural Networks (Inference) 2026-03-04
Show

In this work, we present a novel matrix-encoding method that is particularly convenient for neural networks to make predictions in a privacy-preserving manner using homomorphic encryption. Based on this encoding method, we implement a convolutional neural network for handwritten image classification over encryption. For two matrices $A$ and $B$ to perform homomorphic multiplication, the main idea behind it, in a simple version, is to encrypt matrix $A$ and the transpose of matrix $B$ into two ciphertexts respectively. With additional operations, the homomorphic matrix multiplication can be calculated over encrypted matrices efficiently. For the convolution operation, we in advance span each convolution kernel to a matrix space of the same size as the input image so as to generate several ciphertexts, each of which is later used together with the ciphertext encrypting input images for calculating some of the final convolution results. We accumulate all these intermediate results and thus complete the convolution operation. In a public cloud with 40 vCPUs, our convolutional neural network implementation on the MNIST testing dataset takes $\sim$ 287 seconds to compute ten likelihoods of 32 encrypted images of size $28 \times 28$ simultaneously. The data owner only needs to upload one ciphertext ($\sim 19.8$ MB) encrypting these 32 images to the public cloud.

The encoding method we propose in this work, $\texttt{Volley Revolver}$, is particularly tailored for privacy-preserving neural networks. There is a good chance that it can also assist private neural network training, in which case, for the backpropagation algorithm of the fully-connected layer, the first matrix $A$ is revolved while the second matrix $B$ is held still.

None
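For the Volley Revolver entry above, the role of the transpose can be seen in plaintext: once $B$ is stored as $B^\top$, every entry of $AB$ is an inner product of two rows, which suits row-wise packed ciphertexts. The sketch below is only a plaintext analogue under that reading. The cyclic alignment loop is an assumption about how rows might be "revolved", not the paper's ciphertext procedure, and no encryption is involved.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul_via_transpose(A, B):
    """Compute A @ B using only row-by-row inner products of A and B^T."""
    Bt = transpose(B)
    m, p = len(A), len(Bt)
    C = [[0] * p for _ in range(m)]
    for k in range(p):              # k-th cyclic alignment of the rows of B^T
        for i in range(m):
            j = (i + k) % p         # row of B^T currently aligned with row i of A
            C[i][j] = sum(a * b for a, b in zip(A[i], Bt[j]))
    return C


product = matmul_via_transpose([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```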
Graph Recognition via Subgraph Prediction 2026-03-03
Show

Despite tremendous improvements in tasks such as image classification, object detection, and segmentation, the recognition of visual relationships, commonly modeled as the extraction of a graph from an image, remains a challenging task. We believe that this mainly stems from the fact that there is no canonical way to approach the visual graph recognition task. Most existing solutions are specific to a problem and cannot be transferred between different contexts out of the box, even though the conceptual problem remains the same. With broad applicability and simplicity in mind, in this paper we develop a method, **Gra**ph Recognition via **S**ubgraph **P**rediction (**GraSP**), for recognizing graphs in images. We show across several synthetic benchmarks and one real-world application that our method works with a set of diverse types of graphs and their drawings, and can be transferred between tasks without task-specific modifications, paving the way to a more unified framework for visual graph recognition.

None
Classification of Histopathology Slides with Persistent Homology Convolutions 2026-03-03
Show

Convolutional neural networks (CNNs) are a standard tool for computer vision tasks such as image classification. However, typical model architectures may result in the loss of topological information. In specific domains such as histopathology, topology is an important descriptor that can be used to distinguish between disease-indicating tissue by analyzing the shape characteristics of cells. Current literature suggests that reintroducing topological information using persistent homology can improve medical diagnostics; however, previous methods utilize global topological summaries which do not contain information about the locality of topological features. To address this gap, we present a novel method that generates local persistent homology-based data using a modified version of the convolution operator called *Persistent Homology Convolutions*. This method captures information about the locality and translation equivariance of topological features. We perform a comparative study using various representations of histopathology slides and find that models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters. These results indicate that persistent homology convolutions extract meaningful geometric information from the histopathology slides.

Reformatted citations and other minor adjustments

None
mHC-HSI: Clustering-Guided Hyper-Connection Mamba for Hyperspectral Image Classification 2026-03-03
Show

Recently, DeepSeek has invented the manifold-constrained hyper-connection (mHC) approach which has demonstrated significant improvements over the traditional residual connection in deep learning models \cite{xie2026mhc}. Nevertheless, this approach has not been tailor-designed for improving hyperspectral image (HSI) classification. This paper presents a clustering-guided mHC Mamba model (mHC-HSI) for enhanced HSI classification, with the following contributions. First, to improve spatial-spectral feature learning, we design a novel clustering-guided Mamba module, based on the mHC framework, that explicitly learns both spatial and spectral information in HSI. Second, to decompose the complex and heterogeneous HSI into smaller clusters, we design a new implementation of the residual matrix in mHC, which can be treated as soft cluster membership maps, leading to improved explainability of the mHC approach. Third, to leverage the physical spectral knowledge, we divide the spectral bands into physically-meaningful groups and use them as the "parallel streams" in mHC, leading to a physically-meaningful approach with enhanced interpretability. The proposed approach is tested on benchmark datasets in comparison with the state-of-the-art methods, and the results suggest that the proposed model not only improves the accuracy but also enhances the model explainability. Code is available here: https://github.com/GSIL-UCalgary/mHC_HyperSpectral

Code Link
IoUCert: Robustness Verification for Anchor-based Object Detectors 2026-03-03
Show

While formal robustness verification has seen significant success in image classification, scaling these guarantees to object detection remains notoriously difficult due to complex non-linear coordinate transformations and Intersection-over-Union (IoU) metrics. We introduce IoUCert, a novel formal verification framework designed specifically to overcome these bottlenecks in foundational anchor-based object detection architectures. Focusing on the object localisation component in single-object settings, we propose a coordinate transformation that enables our algorithm to circumvent precision-degrading relaxations of non-linear box prediction functions. This allows us to optimise bounds directly with respect to the anchor box offsets which enables a novel Interval Bound Propagation method that derives optimal IoU bounds. We demonstrate that our method enables, for the first time, the robustness verification of realistic, anchor-based models including SSD, YOLOv2, and YOLOv3 variants against various input perturbations.

None
Semi-Supervised Few-Shot Adaptation of Vision-Language Models 2026-03-03
Show

Vision-language models (VLMs) pre-trained on large, heterogeneous data sources are becoming increasingly popular, providing rich multi-modal embeddings that enable efficient transfer to new tasks. A particularly relevant application is few-shot adaptation, where only a handful of annotated examples are available to adapt the model through multi-modal linear probes. In medical imaging, specialized VLMs have shown promising performance in zero- and few-shot image classification, which is valuable for mitigating the high cost of expert annotations. However, challenges remain in extremely low-shot regimes: the inherent class imbalances in medical tasks often lead to underrepresented categories, penalizing overall model performance. To address this limitation, we propose leveraging unlabeled data by introducing an efficient semi-supervised solver that propagates text-informed pseudo-labels during few-shot adaptation. The proposed method enables lower-budget annotation pipelines for adapting VLMs, reducing labeling effort by >50% in low-shot regimes.

Code: https://github.com/jusiro/SS-Text-U

Code Link
Layer-wise QUBO-Based Training of CNN Classifiers for Quantum Annealing 2026-03-03
Show

Variational quantum circuits for image classification suffer from barren plateaus, while quantum kernel methods scale quadratically with dataset size. We propose an iterative framework based on Quadratic Unconstrained Binary Optimization (QUBO) for training the classifier head of convolutional neural networks (CNNs) via quantum annealing, entirely avoiding gradient-based circuit optimization. Following the Extreme Learning Machine paradigm, convolutional filters are randomly initialized and frozen, and only the fully connected layer is optimized. At each iteration, a convex quadratic surrogate derived from the feature Gram matrix replaces the non-quadratic cross-entropy loss, yielding an iteration-stable curvature proxy. A per-output decomposition splits the $C$-class problem into $C$ independent QUBOs, each with $(d+1)K$ binary variables, where $d$ is the feature dimension and $K$ is the bit precision, so that problem size depends on the image resolution and bit precision, not on the number of training samples. We evaluate the method on six image-classification benchmarks (sklearn digits, MNIST, Fashion-MNIST, CIFAR-10, EMNIST, KMNIST). A precision study shows that accuracy improves monotonically with bit resolution, with 10 bits representing a practical minimum for effective optimization; the 15-bit formulation remains within the qubit and coupler limits of current D-Wave Advantage hardware. The 20-bit formulation matches or exceeds classical stochastic gradient descent on MNIST, Fashion-MNIST, and EMNIST, while remaining competitive on CIFAR-10 and KMNIST. All experiments use simulated annealing, establishing a baseline for direct deployment on quantum annealing hardware.

28 pages, 5 figures, 9 tables. Submitted to Quantum Machine Intelligence

None
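For the QUBO-based training entry above, the stated problem size of $(d+1)K$ binary variables per output can be made concrete with a small sketch. The variable count follows the abstract directly. The fixed-point decoding, which maps $K$ bits to a weight on a uniform grid in an assumed range `[w_min, w_max]`, is only one plausible encoding and is not taken from the paper.

```python
def qubo_size(d, K):
    """Binary variables per output: d weights plus one bias, K bits each."""
    return (d + 1) * K


def decode_weight(bits, w_min=-1.0, w_max=1.0):
    """Map K bits (most significant first) to a weight on a uniform grid.

    The range [w_min, w_max] is an illustrative assumption.
    """
    K = len(bits)
    level = int("".join(str(b) for b in bits), 2)
    return w_min + (w_max - w_min) * level / (2 ** K - 1)


# A hypothetical 64-dimensional feature vector at 10-bit precision.
n_vars = qubo_size(64, 10)
w = decode_weight([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

Because the count depends only on `d` and `K`, adding training samples changes the QUBO coefficients but not the number of variables.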
Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment 2026-03-03
Show

While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose **G**reat L**o**R**A** Mixture-of-Exper**t** (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.

Accepted by ICML 2025

None
From Fewer Samples to Fewer Bits: Reframing Dataset Distillation as Joint Optimization of Precision and Compactness 2026-03-02
Show

Dataset Distillation (DD) compresses large datasets into compact synthetic ones that maintain training performance. However, current methods mainly target sample reduction, with limited consideration of data precision and its impact on efficiency. We propose Quantization-aware Dataset Distillation (QuADD), a unified framework that jointly optimizes dataset compactness and precision under fixed bit budgets. QuADD integrates a differentiable quantization module within the distillation loop, enabling end-to-end co-optimization of synthetic samples and quantization parameters. Guided by the rate-distortion perspective, we empirically analyze how bit allocation between sample count and precision influences learning performance. Our framework supports both uniform and adaptive non-uniform quantization, where the latter learns quantization levels from data to represent information-dense regions better. Experiments on image classification and 3GPP beam management tasks show that QuADD surpasses existing DD and post-quantized baselines in accuracy per bit, establishing a new standard for information-efficient dataset distillation.

Accepted to CVPR 2026 - Findings Workshop

None
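For the QuADD entry above, the trade-off between sample count and precision under a fixed bit budget can be sketched in a few lines. The example assumes plain uniform quantization over a fixed range `[lo, hi]` and a budget counted as samples × values per sample × bits per value. The paper's differentiable, learned quantizer is not reproduced, and the function names are hypothetical.

```python
def uniform_quantize(values, n_bits, lo=0.0, hi=1.0):
    """Snap each value to the nearest of 2**n_bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** n_bits - 1
    out = []
    for v in values:
        v = min(max(v, lo), hi)                      # clip to the representable range
        q = round((v - lo) / (hi - lo) * levels)     # integer code
        out.append(lo + q * (hi - lo) / levels)      # dequantized value
    return out


def samples_within_budget(total_bits, values_per_sample, n_bits):
    """How many synthetic samples fit into a fixed bit budget at a given precision."""
    return total_bits // (values_per_sample * n_bits)


# Budget equal to ten 32x32x3 images stored at 8 bits per value.
budget = 10 * 3072 * 8
at_8_bits = samples_within_budget(budget, 3072, 8)
at_4_bits = samples_within_budget(budget, 3072, 4)
```

Halving the precision doubles the number of samples that fit in the same budget, which is the allocation question the abstract describes.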
Diagnosing Generalization Failures from Representational Geometry Markers 2026-03-02
Show

Generalization, the ability to perform well beyond the training context, is a hallmark of biological and artificial intelligence, yet anticipating unseen failures remains a central challenge. Conventional approaches often take a "bottom-up" mechanistic route by reverse-engineering interpretable features or circuits to build explanatory models. While insightful, these methods often struggle to provide the high-level, predictive signals for anticipating failure in real-world deployment. Here, we propose using a "top-down" approach to studying generalization failures inspired by medical biomarkers: identifying system-level measurements that serve as robust indicators of a model's future performance. Rather than mapping out detailed internal mechanisms, we systematically design and test network markers to probe structure, function links, identify prognostic indicators, and validate predictions in real-world settings. In image classification, we find that task-relevant geometric properties of in-distribution (ID) object manifolds consistently forecast poor out-of-distribution (OOD) generalization. In particular, reductions in two geometric measures, effective manifold dimensionality and utility, predict weaker OOD performance across diverse architectures, optimizers, and datasets. We apply this finding to transfer learning with ImageNet-pretrained models. We consistently find that the same geometric patterns predict OOD transfer performance more reliably than ID accuracy. This work demonstrates that representational geometry can expose hidden vulnerabilities, offering more robust guidance for model selection and AI interpretability.

Published in the International Conference on Learning Representations (ICLR), 2026

None
Exploiting Low-Dimensional Manifold of Features for Few-Shot Whole Slide Image Classification 2026-03-02
Show

Few-shot Whole Slide Image (WSI) classification is severely hampered by overfitting. We argue that this is not merely a data-scarcity issue but a fundamentally geometric problem. Grounded in the manifold hypothesis, our analysis shows that features from pathology foundation models exhibit a low-dimensional manifold geometry that is easily perturbed by downstream models. This insight reveals a key potential issue in downstream multiple instance learning models: linear layers are geometry-agnostic and, as we show empirically, can distort the manifold geometry of the features. To address this, we propose the Manifold Residual (MR) block, a plug-and-play module that is explicitly geometry-aware. The MR block reframes the linear layer as residual learning and decouples it into two pathways: (1) a fixed, random matrix serving as a geometric anchor that approximately preserves topology while also acting as a spectral shaper to sharpen the feature spectrum; and (2) a trainable, low-rank residual pathway that acts as a residual learner for task-specific adaptation, with its structural bottleneck explicitly mirroring the low effective rank of the features. This decoupling imposes a structured inductive bias and reduces learning to a simpler residual fitting task. Through extensive experiments, we demonstrate that our approach achieves state-of-the-art results with significantly fewer parameters, offering a new paradigm for few-shot WSI classification. Code is available at https://github.com/BearCleverProud/MR-Block.

Accepted to ICLR 2026

Code Link
DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks 2026-03-02
Show

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism where the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification. For language modeling, optimal schedules vary by model size, descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.

None
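For the DynaMoE entry above, the contrast with fixed Top-K routing can be sketched as follows. The example assumes one simple rule for a variable expert count: activate experts in order of router probability until a cumulative probability mass is reached. This rule, the `mass` threshold, and the function names are illustrative assumptions, not the routing mechanism defined in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_route(logits, mass=0.8, max_experts=None):
    """Pick the fewest experts whose router probabilities sum to at least `mass`."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, total = [], 0.0
    for i in order:
        chosen.append(i)
        total += probs[i]
        if total >= mass or (max_experts and len(chosen) >= max_experts):
            break
    return chosen


confident_token = dynamic_route([10.0, 0.0, 0.0, 0.0])   # one expert suffices
ambiguous_token = dynamic_route([0.0, 0.0, 0.0, 0.0])    # needs all four experts
```

Under this rule a token with a peaked router distribution uses a single expert, while an ambiguous token spreads over several, so compute follows input difficulty.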
A Diagnostic Evaluation of Neural Networks Trained with the Error Diffusion Learning Algorithm 2026-03-02
Show

The Error Diffusion Learning Algorithm (EDLA) is a learning scheme that performs synaptically local weight updates driven by a single, globally defined error signal. Although originally proposed as an alternative to backpropagation, its behavior has not been systematically characterized. We provide a modern formulation and implementation of EDLA and evaluate multilayer perceptrons trained with EDLA on parity, regression, and image-classification benchmarks (Digits, MNIST, Fashion-MNIST, and CIFAR-10). Following the original formulation, multi-class classification is implemented by training independent single-output networks (one per class), which makes the computational cost scale linearly with the number of classes. Under comparable architectures and training protocols, EDLA consistently underperforms backpropagation-trained baselines on all benchmarks considered. Through an analysis of internal dynamics, we identify a depth-related failure mode in ReLU-based EDLA: activations can grow explosively, causing unstable training and degraded accuracy. To mitigate this instability, we incorporate root mean square normalization (RMSNorm) into EDLA training. RMSNorm substantially improves numerical stability and expands the depth range in which EDLA can be trained, but it does not close the accuracy gap and retains the overhead of the parallel-network implementation. Overall, we offer a diagnostic evaluation of where and why global error diffusion breaks down in deep networks, providing guidance for future development of local, biologically inspired learning rules.

None
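For the EDLA entry above, the stabilizing effect of RMSNorm comes from rescaling activations to unit root mean square, which removes any dependence on their overall magnitude. The sketch below shows the standard RMSNorm formula on a plain list. Where the paper places the normalization inside EDLA networks is not specified here, and the `gain` and `eps` defaults are assumptions.

```python
import math


def rms_norm(x, gain=None, eps=1e-8):
    """Divide x by its root mean square, then apply an optional per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]


small = rms_norm([3.0, 4.0])
large = rms_norm([300.0, 400.0])   # same direction, 100x larger activations
```

Both calls return nearly identical vectors, which is why exploding ReLU activations no longer propagate their scale to later layers.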
Polynomial, trigonometric, and tropical activations 2026-03-02
Show

Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library via: https://github.com/K-H-Ismail/torchortho.

Published at ICLR 2026

Code Link
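For the polynomial activations entry above, a Hermite-basis activation can be sketched with the three-term recurrence. The example assumes the probabilists' Hermite polynomials $He_n$, normalized by $\sqrt{n!}$ so that they are orthonormal under a standard Gaussian. Whether the paper uses this exact convention is an assumption, and the code below is independent of the torchortho library and does not use its API.

```python
import math


def hermite_orthonormal(x, degree):
    """Return [h_0(x), ..., h_degree(x)] with h_n = He_n / sqrt(n!)."""
    he = [1.0, x]
    for n in range(1, degree):
        # Recurrence: He_{n+1}(x) = x * He_n(x) - n * He_{n-1}(x)
        he.append(x * he[n] - n * he[n - 1])
    return [he[n] / math.sqrt(math.factorial(n)) for n in range(degree + 1)]


def hermite_activation(x, coeffs):
    """Activation defined as a weighted sum of orthonormal Hermite basis functions."""
    basis = hermite_orthonormal(x, len(coeffs) - 1)
    return sum(c * b for c, b in zip(coeffs, basis))


# Hypothetical coefficients; in a trained network they would be learnable parameters.
y = hermite_activation(0.5, [0.0, 1.0, 0.2, 0.05])
```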
VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification 2026-03-01
Show

Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.

None
Aligned explanations in neural networks 2026-02-28
Show

Feature attribution is the dominant paradigm for explaining the predictions of complex machine learning models like neural networks. However, most existing methods offer little guarantee of reflecting the model's prediction-making process. We define the notion of explanatory alignment and argue that it is central to trustworthy predictive modeling: in short, it requires that explanations directly underlie predictions rather than serve as rationalizations. We present model readability as a design principle enabling alignment, and Pointwise-interpretable Networks (PiNets) as a modeling framework to pursue it in a deep learning context. PiNets combine statistical intelligence with a pseudo-linear structure that yields instance-wise linear predictions in an arbitrary feature space. We illustrate their use on image classification and segmentation tasks, demonstrating that PiNets produce explanations that are not only aligned by design but also faithful across other dimensions: meaningfulness, robustness, and sufficiency.

None
GradPCA: Leveraging NTK Alignment for Reliable Out-of-Distribution Detection 2026-02-28
Show

We introduce GradPCA, an Out-of-Distribution (OOD) detection method that exploits the low-rank structure of neural network gradients induced by Neural Tangent Kernel (NTK) alignment. GradPCA applies Principal Component Analysis (PCA) to gradient class-means, achieving more consistent performance than existing methods across standard image classification benchmarks. We provide a theoretical perspective on spectral OOD detection in neural networks to support GradPCA, highlighting feature-space properties that enable effective detection and naturally emerge from NTK alignment. Our analysis further reveals that feature quality -- particularly the use of pretrained versus non-pretrained representations -- plays a crucial role in determining which detectors will succeed. Extensive experiments validate the strong performance of GradPCA, and our theoretical framework offers guidance for designing more principled spectral OOD detectors.

None
TP-Spikformer: Token Pruned Spiking Transformer 2026-02-28
Show

Spiking neural networks (SNNs) offer an energy-efficient alternative to traditional neural networks due to their event-driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large-scale architectures, which require significant computational resources and limit deployment on resource-constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP-Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information-retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information-retaining token pruning framework that employs a block-level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP-Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike-driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event-based object tracking. Particularly, TP-Spikformer performs well in a training-free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real-world applications with limited computational resources.

24 pages, 7 figures None
Margin-Consistent Deep Subtyping of Invasive Lung Adenocarcinoma via Perturbation Fidelity in Whole-Slide Image Analysis 2026-02-27
Show

Whole-slide image classification for invasive lung adenocarcinoma subtyping remains vulnerable to real-world imaging perturbations that undermine model reliability at the decision boundary. We propose a margin consistency framework evaluated on 203,226 patches from 143 whole-slide images spanning five adenocarcinoma subtypes in the BMIRDS-LUAD dataset. By combining attention-weighted patch aggregation with margin-aware training, our approach achieves robust feature-logit space alignment measured by Kendall correlations of 0.88 during training and 0.64 during validation. Contrastive regularization, while effective at improving class separation, tends to over-cluster features and suppress fine-grained morphological variation; to counteract this, we introduce Perturbation Fidelity (PF) scoring, which imposes structured perturbations through Bayesian-optimized parameters. Vision Transformer-Large achieves 95.20 +/- 4.65% accuracy, representing a 40% error reduction from the 92.00 +/- 5.36% baseline, while ResNet101 with an attention mechanism reaches 95.89 +/- 5.37% from 91.73 +/- 9.23%, a 50% error reduction. All five subtypes exceed an area under the receiver operating characteristic curve (AUC) of 0.99. On the WSSS4LUAD external benchmark, ResNet50 with an attention mechanism attains 80.1% accuracy, demonstrating cross-institutional generalizability despite approximately 15-20% domain-shift-related degradation and identifying opportunities for future adaptation research.

This document is the author's accepted manuscript (author version). The final published version is available online in the Journal of Imaging Informatics in Medicine at DOI: 10.1007/s10278-026-01875-6

None
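The error-reduction figures quoted in the abstract above follow from the reported mean accuracies. A minimal arithmetic check, assuming "error reduction" means the relative drop in error rate (100% minus accuracy); the function name is illustrative and not from the paper:

```python
def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Relative drop in error rate (percent), given accuracies in percent."""
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Mean accuracies as reported in the abstract (standard deviations ignored).
vit_large = relative_error_reduction(92.00, 95.20)  # error 8.00% -> 4.80%
resnet101 = relative_error_reduction(91.73, 95.89)  # error 8.27% -> 4.11%
print(f"ViT-L: {vit_large:.1f}%  ResNet101: {resnet101:.1f}%")
```

Under this reading the two values come out at roughly 40% and 50%, matching the abstract.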
Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways 2026-02-27
Show

Radiology reports are invaluable for clinical decision-making and hold great potential for automated analysis when structured into machine-readable formats. These reports often contain uncertainty, which we categorize into two distinct types: (i) Explicit uncertainty reflects doubt about the presence or absence of findings, conveyed through hedging phrases. These vary in meaning depending on the context, making rule-based systems insufficient to quantify the level of uncertainty for specific findings; (ii) Implicit uncertainty arises when radiologists omit parts of their reasoning, recording only key findings or diagnoses. Here, it is often unclear whether omitted findings are truly absent or simply unmentioned for brevity. We address these challenges with a two-part framework. We quantify explicit uncertainty by creating an expert-validated, LLM-based reference ranking of common hedging phrases, and mapping each finding to a probability value based on this reference. In addition, we model implicit uncertainty through an expansion framework that systematically adds characteristic sub-findings derived from expert-defined diagnostic pathways for 14 common diagnoses. Using these methods, we release Lunguage++, an expanded, uncertainty-aware version of the Lunguage benchmark of fine-grained structured radiology reports. This enriched resource enables uncertainty-aware image classification, faithful diagnostic reasoning, and new investigations into the clinical impact of diagnostic uncertainty.

None
A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification 2026-02-27
Show

Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework's strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.

None
RAViT: Resolution-Adaptive Vision Transformer 2026-02-27
Show

Vision transformers have recently made a breakthrough in computer vision, showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT, based on a multi-branch network that operates on several copies of the same image at different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early-exit mechanism that makes the model adaptive and allows the trade-off between accuracy and computational cost to be chosen at run-time. For example, in a two-branch architecture, the original image is first resized to reduce its resolution, and a first transformer makes a prediction on it; the resulting prediction is then reused together with the original-size image to make a final prediction with a second transformer, using less computation than a classical Vision Transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet, and obtained accuracy equivalent to the classical Vision Transformer model with only around 70% of the FLOPs.

None
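The two-branch, early-exit control flow described in the RAViT abstract can be sketched roughly as follows. This is not the authors' implementation: the confidence-threshold exit rule, the naive `downscale` helper, and the toy stand-in models are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]


def downscale(image: Image, factor: int = 2) -> Image:
    """Naive resize (assumption): keep every `factor`-th pixel in each direction."""
    return [row[::factor] for row in image[::factor]]


def two_branch_predict(
    image: Image,
    low_res_model: Callable[[Image], Sequence[float]],
    full_res_model: Callable[[Image, Sequence[float]], Sequence[float]],
    threshold: float = 0.9,
) -> Tuple[List[float], str]:
    """Run the cheap branch first; exit early if it is confident enough."""
    low_probs = low_res_model(downscale(image))
    if max(low_probs) >= threshold:  # hypothetical exit criterion
        return list(low_probs), "low-res exit"
    # Otherwise reuse the first prediction alongside the original-size image.
    return list(full_res_model(image, low_probs)), "full-res"


def _mean(image: Image) -> float:
    return sum(map(sum, image)) / (len(image) * len(image[0]))


def toy_low(image: Image) -> List[float]:
    """Stand-in for the first transformer: brightness as a 'class probability'."""
    return [_mean(image), 1.0 - _mean(image)]


def toy_full(image: Image, prior: Sequence[float]) -> List[float]:
    """Stand-in for the second transformer: blends its own score with the prior."""
    p = 0.5 * (_mean(image) + prior[0])
    return [p, 1.0 - p]


bright = [[1.0] * 4 for _ in range(4)]
half = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
print(two_branch_predict(bright, toy_low, toy_full)[1])
print(two_branch_predict(half, toy_low, toy_full)[1])
```

The threshold is the knob that trades accuracy for computation at run-time: a lower value lets more inputs exit at the cheap branch.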
Altitude-Aware Visual Place Recognition in Top-Down View 2026-02-27
Show

To address the challenge of aerial visual place recognition (VPR) under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates an airborne platform's relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas. Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional Monocular Metric Depth Estimation (MMDE) methods, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platform localization under large altitude variations and limited sensor availability.

None
Quantum Deep Learning: A Comprehensive Review 2026-02-26
Show

Quantum deep learning (QDL) explores the use of both quantum and quantum-inspired resources to determine when deep learning's core capabilities, such as expressivity, generalization, and scalability, can be enhanced based on specific resource constraints. Distinct from broader quantum machine learning, QDL emphasizes compositional depth at the pipeline level and the integration of quantum or quantum-inspired components within end-to-end workflows. This review provides an operational definition of QDL and introduces a taxonomy comprising four primary paradigms: hybrid quantum-classical models, quantum deep neural networks, quantum algorithms for deep learning primitives, and quantum-inspired classical algorithms. Theoretical principles are connected to advanced architectures, software toolchains, and experimental demonstrations across superconducting, trapped-ion, photonic, semiconductor spin, and neutral-atom systems, as well as quantum annealers. Claims of quantum advantage are critically assessed by distinguishing provable complexity-theoretic separations from empirical observations. The analysis characterizes trade-offs between model expressivity, trainability, and classical simulability, while systematically detailing the bottlenecks imposed by optimization landscapes, input-output access models, and hardware constraints. Applications are surveyed in domains encompassing image classification, natural language processing, scientific discovery, quantum data processing, and quantum optimal control, underscoring fair benchmarking against optimized classical counterparts and a comprehensive assessment of resource requirements. This review serves as a tutorial entry point for graduate students while guiding readers to specialized literature. It concludes with a verification-aware roadmap to transition QDL from near-term demonstrations to scalable and fault-tolerant implementations.

None
A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning 2026-02-26
Show

Most pseudo-label selection strategies in semi-supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high-confidence predictions can still be wrong, while informative low-confidence samples near decision boundaries are discarded. This paper introduces a Confidence-Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo-label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual-class variance (RCV), which characterizes how probability mass is distributed over non-maximum classes. The derivation shows that reliable pseudo-labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo-label selection as a spectral relaxation problem that maximizes separability in a confidence-variance feature space, and design a threshold-free selection mechanism to distinguish high- from low-reliability predictions. We integrate CoVar as a plug-in module into representative semi-supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR-10, and Mini-ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual-class variance provides a more reliable basis for pseudo-label selection than fixed confidence thresholds. (Code: https://github.com/ljs11528/CoVar_Pseudo_Label_Selection.git)

Code Link
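To make the two quantities in the CoVar abstract concrete, the sketch below computes maximum confidence (MC) and a residual-class variance (RCV) from a probability vector. The exact RCV definition used in the paper is not given in the abstract; here it is assumed to be the population variance of the non-maximum class probabilities, and the function name is hypothetical.

```python
from statistics import pvariance
from typing import Sequence, Tuple


def confidence_and_residual_variance(probs: Sequence[float]) -> Tuple[float, float]:
    """Return (MC, RCV) for one softmax output.

    MC is the largest class probability. RCV is assumed here to be the
    population variance of the remaining (non-maximum) probabilities.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    residual = [p for i, p in enumerate(probs) if i != top]
    rcv = pvariance(residual) if residual else 0.0
    return probs[top], rcv


# Same confidence, different spread of the leftover probability mass.
even = confidence_and_residual_variance([0.7, 0.1, 0.1, 0.1])
peaked = confidence_and_residual_variance([0.7, 0.3, 0.0, 0.0])
print(even, peaked)
```

Both predictions have MC = 0.7, but the second concentrates its residual mass on a single competing class and so has a higher RCV, which is the kind of case a confidence-only threshold cannot distinguish.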
Image-Based Classification of Olive Species Specific to Turkiye with Deep Neural Networks 2026-02-26
Show

In this study, image processing and deep learning methodologies were employed to automatically classify local olive species cultivated in Turkiye. A stereo camera was utilized to capture images of five distinct olive species, which were then preprocessed to ensure their suitability for analysis. Convolutional Neural Network (CNN) architectures, specifically MobileNetV2 and EfficientNetB0, were employed for image classification. These models were optimized through a transfer learning approach. The training and testing results indicated that the EfficientNetB0 model exhibited the optimal performance, with an accuracy of 94.5%. The findings demonstrate that deep learning-based systems offer an effective solution for classifying olive species with high accuracy. The developed method has significant potential for application in areas such as automatic identification and quality control of agricultural products.

None
FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification 2026-02-26
Show

Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full-precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.

Source code available at https://github.com/saintslab/FairQuant

Code Link
Devling into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation 2026-02-26
Show

Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.

Code is available at https://github.com/Trustworthy-AI-Group/TransferAttack

Code Link
GmNet: Revisiting Gating Mechanisms From A Frequency View 2026-02-26
Show

Gating mechanisms have emerged as an effective strategy integrated into model designs beyond recurrent neural networks for addressing long-range dependency problems. Broadly, they provide adaptive control over the information flow while maintaining computational efficiency. However, there is a lack of theoretical analysis on how the gating mechanism works in neural networks. In this paper, inspired by the convolution theorem, we systematically explore the effect of gating mechanisms on the training dynamics of neural networks from a frequency perspective. We investigate the interaction between the element-wise product and activation functions in managing the responses to different frequency components. Leveraging these insights, we propose a Gating Mechanism Network (GmNet), a lightweight model designed to efficiently utilize the information of various frequency components. It minimizes the low-frequency bias present in existing lightweight models. GmNet achieves impressive performance in terms of both effectiveness and efficiency in the image classification task.

None
Using the Path of Least Resistance to Explain Deep Networks 2026-02-26
Show

Integrated Gradients (IG), a widely used axiomatic path-based attribution method, assigns importance scores to input features by integrating model gradients along a straight path from a baseline to the input. While effective in some cases, we show that straight paths can lead to flawed attributions. In this paper, we identify the cause of these misattributions and propose an alternative approach that equips the input space with a model-induced Riemannian metric (derived from the explained model's Jacobian) and computes attributions by integrating gradients along geodesics under this metric. We call this method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we introduce two techniques: a k-Nearest Neighbours-based approach for smaller models and a Stochastic Variational Inference-based method for larger ones. Additionally, we propose a new axiom, No-Cancellation Completeness (NCC), which strengthens completeness by ruling out feature-wise cancellation. We prove that, for path-based attributions under the model-induced metric, NCC holds if and only if the integration path is a geodesic. Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG, on the benchmarks considered.

None
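For reference, the straight-path baseline that the abstract above builds on, standard Integrated Gradients, can be sketched in a few lines. This is not the proposed geodesic variant (GIG); the midpoint Riemann sum, the toy function `f`, and its hand-written gradient are illustrative choices.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG along the straight path from `baseline` to `x`.

    IG_i = (x_i - b_i) * integral over alpha in [0, 1] of dF/dx_i at
    b + alpha * (x - b), estimated with a midpoint Riemann sum.
    """
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def f(p: Sequence[float]) -> float:
    """Toy model: product of the two inputs."""
    return p[0] * p[1]


def grad_f(p: Sequence[float]) -> List[float]:
    """Analytic gradient of the toy model."""
    return [p[1], p[0]]


x, baseline = [3.0, 2.0], [0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
print(attr, sum(attr), f(x) - f(baseline))
```

The attributions sum to `f(x) - f(baseline)`, which is the completeness axiom that the abstract's No-Cancellation Completeness (NCC) is said to strengthen.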
RECAP: Local Hebbian Prototype Learning as a Self-Organizing Readout for Reservoir Dynamics 2026-02-25
Show

Robust perception in brains is often attributed to high-dimensional population activity together with local plasticity mechanisms that reinforce recurring structure. In contrast, most modern image recognition systems are trained by error backpropagation and end-to-end gradient optimization, which are not naturally aligned with local computation and local plasticity. We introduce RECAP (Reservoir Computing with Hebbian Co-Activation Prototypes), a bio-inspired learning strategy for robust image classification that couples untrained reservoir dynamics with a self-organizing Hebbian prototype readout. RECAP discretizes time-averaged reservoir responses into activation levels, constructs a co-activation mask over reservoir unit pairs, and incrementally updates class-wise prototype matrices via a Hebbian-like potentiation-decay rule. Inference is performed by overlap-based prototype matching. The method avoids error backpropagation and is naturally compatible with online prototype updates. We illustrate the resulting robustness behavior on MNIST-C, where RECAP remains robust under diverse corruptions without exposure to corrupted training samples.

20 pages, 6 figures None
Cross-Task Benchmarking of CNN Architectures 2026-02-25
Show

This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.

None
Robustness in sparse artificial neural networks trained with adaptive topology 2026-02-25
Show

We investigate the robustness of sparse artificial neural networks trained with adaptive topology. We focus on a simple yet effective architecture consisting of three sparse layers with 99% sparsity followed by a dense layer, applied to image classification tasks such as MNIST and Fashion MNIST. By updating the topology of the sparse layers between each epoch, we achieve competitive accuracy despite the significantly reduced number of weights. Our primary contribution is a detailed analysis of the robustness of these networks, exploring their performance under various perturbations including random link removal, adversarial attack, and link weight shuffling. Through extensive experiments, we demonstrate that adaptive topology not only enhances efficiency but also maintains robustness. This work highlights the potential of adaptive sparse networks as a promising direction for developing efficient and reliable deep learning models.

None
Resilient Federated Chain: Transforming Blockchain Consensus into an Active Defense Layer for Federated Learning 2026-02-25
Show

Federated Learning (FL) has emerged as a key paradigm for building Trustworthy AI systems by enabling privacy-preserving, decentralized model training. However, FL is highly susceptible to adversarial attacks that compromise model integrity and data confidentiality, a vulnerability exacerbated by the fact that conventional data inspection methods are incompatible with its decentralized design. While integrating FL with Blockchain technology has been proposed to address some limitations, its potential for mitigating adversarial attacks remains largely unexplored. This paper introduces Resilient Federated Chain (RFC), a novel blockchain-enabled FL framework designed specifically to enhance resilience against such threats. RFC builds upon the existing Proof of Federated Learning architecture by repurposing the redundancy of its Pooled Mining mechanism as an active defense layer that can be combined with robust aggregation rules. Furthermore, the framework introduces a flexible evaluation function in its consensus mechanism, allowing for adaptive defense against different attack strategies. Extensive experimental evaluation on image classification tasks under various adversarial scenarios demonstrates that RFC significantly improves robustness compared to baseline methods, providing a viable solution for securing decentralized learning environments.

This work has been submitted to the IEEE for possible publication

None
MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification 2026-02-25
Show

Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance of EfficientNetV2-S (AUROC 0.907 vs. 0.908) while improving interpretability, with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. The code is available at https://github.com/TruhnLab/MedicalPatchNet

28 pages, 12 figures Code Link
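The split, classify, and aggregate pipeline described in the MedicalPatchNet abstract can be illustrated with a small sketch. The aggregation rule is not specified in the abstract, so a plain mean of per-patch scores is assumed; `toy_scorer` is a hypothetical stand-in for the per-patch classifier.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]
Position = Tuple[int, int]


def split_into_patches(image: Image, patch: int) -> Dict[Position, Image]:
    """Cut the image into non-overlapping `patch` x `patch` tiles keyed by (top, left)."""
    height, width = len(image), len(image[0])
    return {
        (top, left): [row[left:left + patch] for row in image[top:top + patch]]
        for top in range(0, height - patch + 1, patch)
        for left in range(0, width - patch + 1, patch)
    }


def classify_by_patches(
    image: Image, patch: int, patch_scorer: Callable[[Image], float]
) -> Tuple[float, Dict[Position, float]]:
    """Score each patch independently, then average (assumed aggregation)."""
    scores = {pos: patch_scorer(tile) for pos, tile in split_into_patches(image, patch).items()}
    return sum(scores.values()) / len(scores), scores


def toy_scorer(tile: Image) -> float:
    """Hypothetical per-patch classifier: mean intensity as the score."""
    return sum(map(sum, tile)) / (len(tile) * len(tile[0]))


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
overall, per_patch = classify_by_patches(image, 2, toy_scorer)
print(overall, per_patch)
```

Because the image-level score is a simple function of the per-patch scores, the `per_patch` map doubles as the explanation: each tile's contribution can be read off directly, with no post-hoc attribution step.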
Axial-Centric Cross-Plane Attention for 3D Medical Image Classification 2026-02-25
Show

Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.

Submitted to MICCAI 2026

None
The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging 2026-02-24
Show

Model merging under unseen test-time distribution shifts often renders naive strategies, such as mean averaging, unreliable. This challenge is especially acute in medical imaging, where models are fine-tuned locally at clinics on private data, producing domain-specific models that differ by scanner, protocol, and population. When deployed at an unseen clinical site, test cases arrive in unlabeled, non-i.i.d. batches, and the model must adapt immediately without labels. In this work, we introduce an entropy-adaptive, fully online model-merging method that yields a batch-specific merged model via only forward passes, effectively leveraging target information. We further demonstrate why mean merging is prone to failure and misaligned under heterogeneous domain shifts. Next, we mitigate encoder-classifier mismatch by decoupling the encoder and classification head and merging them with separate coefficients. We extensively evaluate our method against state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging scenarios. These performance gains are achieved while retaining single-model inference at test-time, thereby demonstrating the effectiveness of our method.

None
Motivation is Something You Need 2026-02-24
Show

This work introduces a novel training paradigm that draws from affective neuroscience. Inspired by the interplay of emotions and cognition in the human brain, and more specifically by the SEEKING motivational state, we design a dual-model framework where a smaller base model is trained continuously, while a larger motivated model is activated intermittently during predefined "motivation conditions". The framework mimics the emotional state of high curiosity and anticipation of reward in which broader brain regions are recruited to enhance cognitive performance. Exploiting scalable architectures where larger models extend smaller ones, our method enables shared weight updates and selective expansion of network capacity during noteworthy training steps. Empirical evaluation on the image classification task demonstrates that the alternating training scheme not only enhances the base model efficiently and effectively compared to a traditional scheme but, in some cases, also lets the motivated model surpass its standalone counterpart despite seeing less data per epoch. This opens the possibility of simultaneously training two models tailored to different deployment constraints with competitive or superior performance while keeping training cost lower than when training the larger model.

None
Pychop: Emulating Low-Precision Arithmetic in Numerical Methods and Neural Networks 2026-02-24
Show

Motivated by the growing demand for low-precision arithmetic in computational science, we exploit lower-precision emulation in Python -- widely regarded as the dominant programming language for numerical analysis and machine learning. Low-precision training has revolutionized deep learning by enabling more efficient computation and reduced memory and energy consumption while maintaining model fidelity. To better enable numerical experimentation with and exploration of low precision computation, we developed the Pychop library, which supports customizable floating-point formats and a comprehensive set of rounding modes in Python, allowing users to benefit from fast, low-precision emulation in numerous applications. Pychop also introduces interfaces for both PyTorch and JAX, enabling efficient low-precision emulation on GPUs for neural network training and inference with unparalleled flexibility. In this paper, we offer a comprehensive exposition of the design, implementation, validation, and practical application of Pychop, establishing it as a foundational tool for advancing efficient mixed-precision algorithms. Furthermore, we present empirical results on low-precision emulation for image classification and object detection using published datasets, illustrating the sensitivity of the use of low precision and offering valuable insights into its impact. Pychop enables in-depth investigations into the effects of numerical precision, facilitates the development of novel hardware accelerators, and integrates seamlessly into existing deep learning workflows. Software and experimental code are publicly available at https://github.com/inEXASCALE/pychop.

Code Link
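The idea of emulating a lower-precision format in software, as described in the Pychop abstract, can be illustrated with the standard library alone. The sketch below is not Pychop's API: `round_to_precision` is a hypothetical helper that rounds a Python float to `t` significand bits with round-to-nearest-even, and it deliberately ignores exponent range (overflow and subnormals).

```python
import math


def round_to_precision(x: float, t: int) -> float:
    """Round `x` to `t` significand bits (round-to-nearest, ties to even).

    Exponent range is not limited, so overflow and subnormal behaviour of a
    real low-precision format are not emulated here.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        return x
    mantissa, exponent = math.frexp(x)   # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scaled = round(mantissa * (1 << t))  # Python's round() sends ties to the even integer
    return math.ldexp(scaled, exponent - t)


# 11 significand bits corresponds to IEEE half precision (10 stored + 1 implicit).
half_like = round_to_precision(0.1, 11)
print(half_like, abs(half_like - 0.1))
```

Applying such a rounding step after each arithmetic operation is one simple way to study how sensitive a numerical method is to reduced precision before committing to dedicated hardware or a full emulation library.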
Towards Attributions of Input Variables in a Coalition 2026-02-24
Show

This paper focuses on the fundamental challenge of partitioning input variables in attribution methods for Explainable AI, particularly in Shapley value-based approaches. Previous methods always compute attributions given a predefined partition but lack theoretical guidance on how to form meaningful variable partitions. We identify that attribution conflicts arise when the attribution of a coalition differs from the sum of its individual variables' attributions. To address this, we analyze the numerical effects of AND-OR interactions in AI models and extend the Shapley value to a new attribution metric for variable coalitions. Our theoretical findings reveal that specific interactions cause attribution conflicts, and we propose three metrics to evaluate coalition faithfulness. Experiments on synthetic data, NLP, image classification, and the game of Go validate our approach, demonstrating consistency with human intuition and practical applicability.

Accepted to the 2025 International Conference on Machine Learning (ICML 2025)

None
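The attribution conflict described in the abstract above (a coalition's attribution differing from the sum of its members' attributions) can be reproduced with the plain Shapley value on a toy game. This is a hedged sketch, not the paper's proposed metric: the three-variable AND game and the names `shapley` and `and_game` are illustrative assumptions.

```python
from itertools import combinations
from math import factorial

def shapley(players, value):
    """Exact Shapley values; `players` are hashable, `value` maps a set of players to a float."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                # Marginal contribution of p when joining subset s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

def and_game(present_players):
    """Toy AND interaction: the output is 1 only when variables a, b and c are all present."""
    variables = set().union(*present_players)
    return 1.0 if {"a", "b", "c"} <= variables else 0.0

# Each player is a frozenset of variables, so a coalition can act as one player.
individual = shapley([frozenset("a"), frozenset("b"), frozenset("c")], and_game)
merged = shapley([frozenset("ab"), frozenset("c")], and_game)

sum_of_members = individual[frozenset("a")] + individual[frozenset("b")]
coalition_value = merged[frozenset("ab")]
print(sum_of_members, coalition_value)
```

In this toy game the members' attributions sum to 2/3 while the merged coalition receives 1/2, so the two ways of attributing to {a, b} disagree.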
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification 2026-02-24
Show

In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.

Accepted by CVPR 2026

Code Link
Knee or ROC 2026-02-23
Show

Self-attention transformers have demonstrated accuracy for image classification with smaller data sets. However, a limitation is that tests to date are based on single-class image detection with known representation of image populations. Where the number of input image classes may exceed one and test sets lack full information on the representation of image populations, accuracy calculations must adapt. The Receiver Operating Characteristic (ROC) accuracy threshold can address multiclass input images. However, this approach is unsuitable when image population representation is unknown. We therefore consider calculating accuracy using the knee method to determine threshold values on an ad hoc basis. Results of ROC curve and knee thresholds for a multiclass data set, created from CIFAR-10 images, are discussed for multiclass image detection.

9 pages

None
Conformal Risk Control for Non-Monotonic Losses 2026-02-23
Show

Conformal risk control is an extension of conformal prediction for controlling risk functions beyond miscoverage. The original algorithm controls the expected value of a loss that is monotonic in a one-dimensional parameter. Here, we present risk control guarantees for generic algorithms applied to possibly non-monotonic losses with multidimensional parameters. The guarantees depend on the stability of the algorithm -- unstable algorithms have looser guarantees. We give applications of this technique to selective image classification, FDR and IOU control of tumor segmentations, and multigroup debiasing of recidivism predictions across overlapping race and sex groups using empirical risk minimization.

None
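For context on the abstract above, the original (monotone, one-dimensional) conformal risk control rule it refers to picks the smallest parameter λ whose inflated empirical risk, n/(n+1) · R̂(λ) + B/(n+1), does not exceed the target level α, where B bounds the loss. The sketch below illustrates only that baseline rule, not the paper's non-monotonic extension; the function name `crc_threshold` and the toy integer scores are assumptions made for the example.

```python
import numpy as np

def crc_threshold(losses, lambdas, alpha, loss_bound=1.0):
    """Smallest lambda whose inflated empirical risk is at most alpha.

    losses[i, j] is the loss of calibration point i at lambdas[j].
    Assumptions: lambdas is sorted ascending and each row of losses is
    non-increasing in lambda (the monotone setting).
    """
    n = losses.shape[0]
    inflated_risk = (n / (n + 1)) * losses.mean(axis=0) + loss_bound / (n + 1)
    valid = np.nonzero(inflated_risk <= alpha)[0]
    # Fall back to the largest lambda if no candidate meets the level.
    return lambdas[valid[0]] if valid.size else lambdas[-1]

# Toy calibration set: the loss is 1 when a point's score exceeds lambda.
scores = np.arange(1, 10)            # nine hypothetical scores 1..9
lambdas = np.arange(0, 11)           # candidate thresholds 0..10
losses = (scores[:, None] > lambdas[None, :]).astype(float)

lambda_hat = crc_threshold(losses, lambdas, alpha=0.25)
print(lambda_hat)
```

With these numbers the inflated risk is roughly 0.3 at λ = 7 and 0.2 at λ = 8, so the rule selects 8.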
Explainability Methods for Hardware Trojan Detection: A Systematic Comparison 2026-02-22
Show

Hardware trojans are malicious circuits which compromise the functionality and security of an integrated circuit (IC). These circuits are manufactured directly into the silicon and, unlike software, cannot be fixed by security patches. The remedy is a costly product recall to replace the IC; hence, early detection in the design process is essential. Hardware detection at best provides statistically based solutions with many false positives and false negatives. These detection methods require more thorough explainable analysis to filter out false indicators. Existing explainability methods developed for general domains like image classification may not provide the actionable insights that hardware engineers need. A question remains: How do domain-aware property analysis, model-agnostic case-based reasoning, and model-agnostic feature attribution techniques compare for hardware security applications? This work compares three categories of explainability for gate-level hardware trojan detection on the Trust-Hub benchmark dataset: (1) domain-aware property-based analysis of 31 circuit-specific features derived from gate fanin patterns, flip-flop distances, and primary Input/Output (I/O) connectivity; (2) model-agnostic case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution methods (Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), gradient) that provide generic importance scores without circuit-level context.

None
RetinaVision: XAI-Driven Augmented Regulation for Precise Retinal Disease Classification using deep learning framework 2026-02-22
Show

Early and accurate classification of retinal diseases is critical for countering vision loss and guiding clinical management. In this study, we propose a deep learning method for retinal disease classification utilizing optical coherence tomography (OCT) images from the Retinal OCT Image Classification - C8 dataset (comprising 24,000 labeled images spanning eight conditions). Images were resized to 224x224 px and used to evaluate two convolutional neural network (CNN) architectures: Xception and InceptionV3. Data augmentation techniques (CutMix, MixUp) were employed to enhance model generalization. Additionally, we applied GradCAM and LIME for interpretability evaluation. We deployed the method in a real-world scenario via our web application named RetinaVision. This study found that Xception was the most accurate network (95.25%), followed closely by InceptionV3 (94.82%). These results suggest that deep learning methods allow effective OCT retinal disease classification and highlight the importance of combining accuracy and interpretability for clinical applications.

6 pages, 15 figures

None