Academic Intelligence · Curated Daily

Explore the Frontiers of Global Academic Research

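AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.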

01.
arXiv (CS.AI) 2026-06-18

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

arXiv:2411.16206v3 Announce Type: replace-cross Abstract: Extending Bayesian optimization to batch evaluation can enable the designer to make the most use of parallel computing technology. However, most current batch approaches do not scale well with the batch size; that is, their optimization efficiency often deteriorates as the batch size increases. To address this issue, we propose a simple and efficient approach that extends Bayesian optimization to large-scale batch evaluation. Unlike existing batch approaches, the new approach draws a batch of axis-aligned subspaces of the original problem and selects one point from each subspace using existing acquisition functions. Numerical experiments show that our proposed approach speeds up convergence significantly when compared with the sequential Bayesian optimization algorithm, and performs very competitively when compared with ten batch Bayesian optimization algorithms. The implementation of our proposed approach is available at https://github.com/zhandawei/SubSpace_Acquisition_Functions.
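
The batch-selection idea described above (draw several axis-aligned subspaces, pick one point in each) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_acquisition` is a hypothetical stand-in for a real acquisition function, the search domain is assumed to be the unit cube, the non-selected coordinates are assumed to stay fixed at an incumbent point, and random search replaces a proper inner optimizer.

```python
import random

def toy_acquisition(x):
    """Hypothetical stand-in for an acquisition function (higher is better)."""
    return -sum((xi - 0.3) ** 2 for xi in x)

def select_batch(incumbent, batch_size, subspace_dim, n_candidates=200, seed=0):
    """Pick one point per randomly drawn axis-aligned subspace."""
    rng = random.Random(seed)
    dim = len(incumbent)
    batch = []
    for _ in range(batch_size):
        # Axis-aligned subspace: a random subset of coordinates is free,
        # every other coordinate stays fixed at the incumbent (assumption).
        free = rng.sample(range(dim), subspace_dim)
        best_x, best_val = None, float("-inf")
        for _ in range(n_candidates):
            x = list(incumbent)
            for j in free:
                x[j] = rng.random()  # domain assumed to be [0, 1]^dim
            val = toy_acquisition(x)
            if val > best_val:
                best_x, best_val = x, val
        batch.append((sorted(free), best_x))
    return batch

incumbent = [0.9] * 6
batch = select_batch(incumbent, batch_size=4, subspace_dim=2)
for free, x in batch:
    print(free, [round(v, 3) for v in x])
```

Each batch member differs from the incumbent only along its own subspace, so the inner search stays low-dimensional regardless of the batch size.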

02.
arXiv (CS.LG) 2026-06-17

From Compression to Deployment: Real-Time and Energy-Efficient FastGRNN on Ultra-Constrained Microcontrollers

arXiv:2606.17249v1 Announce Type: cross Abstract: The dominant trajectory of modern machine learning has been to scale up: larger models, larger accelerators, larger memory budgets. Yet a multi-year global semiconductor supply constraint and the growing energy and carbon cost of always-online inference expose the fragility of this trajectory and motivate the opposite direction: refactoring AI and ML algorithms to fit the small, ubiquitous microcontrollers already in mass production in wearables, sensors, and edge appliances. We present an end-to-end open-source reproduction of FastGRNN, a compact gated recurrent cell, deployed on two bare-metal targets: the 8-bit Arduino (ATmega328P) and the 16-bit MSP430 (no hardware multiplier; 16 KB Flash; 512 B SRAM). Our compression pipeline combines low-rank weight factorization, iterative hard-thresholding sparsity, and per-tensor Q15 post-training quantization with explicit activation calibration. The deployed model occupies 566 bytes of weights and achieves macro F1 = 0.918 (seed 0; five-seed Q15 mean 0.853 ± 0.107) on the HAPT test set. It matches a PyTorch reference at 100% prediction agreement across 3,399 test windows (MCU seed 0; 99.91-100% C-equivalent across five seeds). Both platforms sustain real-time 50 Hz streaming inference (9.21 ms per sample on Arduino; 13 ms on MSP430), where a 256-entry sigmoid/tanh look-up table delivers a 30.5x speedup on the multiplier-less MSP430. Four contributions extend the original FastGRNN paper: (i) cross-platform bit-equivalent deterministic inference; (ii) characterization of recurrent warm-up latency (median 74 samples, 1.48 s; worst-case 125 samples, 2.50 s over 100 test windows); (iii) a deployable look-up-table recipe for multiplier-less embedded targets; and (iv) hardware energy characterization showing 17.7 mW active inference power,
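
As a rough illustration of two ingredients named in this abstract, the sketch below shows plain Q15 fixed-point quantization and a 256-entry sigmoid look-up table, plus the warm-up latency arithmetic at 50 Hz. It is a simplified sketch rather than the paper's pipeline: the per-tensor scale and activation calibration are omitted, and the table's input range of [-8, 8) is an assumption.

```python
import math

Q15_SCALE = 1 << 15  # 32768 steps per unit

def to_q15(x):
    """Quantize a float in [-1, 1) to a signed 16-bit Q15 integer, saturating."""
    q = int(round(x * Q15_SCALE))
    return max(-32768, min(32767, q))

def from_q15(q):
    """Convert a Q15 integer back to a float."""
    return q / Q15_SCALE

# 256-entry sigmoid table over an assumed input range of [-8, 8).
LUT_SIZE = 256
LUT_MIN, LUT_MAX = -8.0, 8.0
STEP = (LUT_MAX - LUT_MIN) / LUT_SIZE
SIGMOID_LUT = [
    to_q15(1.0 / (1.0 + math.exp(-(LUT_MIN + i * STEP))))
    for i in range(LUT_SIZE)
]

def sigmoid_lut(x):
    """Table lookup without interpolation; out-of-range inputs are clamped."""
    idx = int((x - LUT_MIN) / STEP)
    idx = max(0, min(LUT_SIZE - 1, idx))
    return from_q15(SIGMOID_LUT[idx])

# Warm-up latency from the abstract: samples divided by the 50 Hz rate.
median_warmup_s = 74 / 50   # 1.48 s
worst_warmup_s = 125 / 50   # 2.50 s
```

On a multiplier-less target the lookup replaces an exponential and a division with one index computation and one memory read, which is the kind of saving the reported speedup refers to.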

03.
arXiv (CS.LG) 2026-06-15

FedSPC: Shared Parameter Correction for Personalized Federated Learning

arXiv:2606.13748v1 Announce Type: new Abstract: Personalized federated learning (PFL) is one of the important approaches in federated learning for addressing statistical heterogeneity while enabling client-specific adaptation. Many PFL methods split the model into shared and personalized parameters, which are jointly trained on each client. However, this creates an optimization issue: shared parameters are updated by clients optimizing different local objectives, which can lead to inconsistent shared updates and weaken the shared representation. To address this problem, we propose Federated Shared Parameter Correction (FedSPC), a modular correction method for PFL. FedSPC applies control-variate correction only to the shared parameters of a given PFL method, while leaving personalized parameters unchanged. It can be integrated into three common PFL settings: shared feature extractors, shared classifiers, and fully shared models with local regularization. Experiments on CIFAR-100 and Tiny-ImageNet with ViT, ResNet-34, and VGG-11 show that FedSPC improves performance across representative PFL methods, including FedPer, FedRep, FedBABU, LG-FedAvg, and Ditto.
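
To make "control-variate correction only on the shared parameters" concrete, here is a minimal sketch of one local update step. The correction term follows the SCAFFOLD-style form `g - c_local + c_global`, which is an assumption: the abstract does not spell out FedSPC's exact update, and all names below are hypothetical.

```python
def local_step(shared, personal, grad_shared, grad_personal,
               c_local, c_global, lr=0.1):
    """One local SGD step; only the shared weights get the correction."""
    # Shared parameters: gradient corrected by the difference between the
    # global and the client-specific control variate (assumed form).
    new_shared = [
        w - lr * (g - cl + cg)
        for w, g, cl, cg in zip(shared, grad_shared, c_local, c_global)
    ]
    # Personalized parameters: plain SGD, left untouched by the correction.
    new_personal = [w - lr * g for w, g in zip(personal, grad_personal)]
    return new_shared, new_personal

shared, personal = local_step(
    shared=[1.0], personal=[2.0],
    grad_shared=[0.5], grad_personal=[0.5],
    c_local=[0.2], c_global=[0.1],
)
print(shared, personal)
```

When the local and global control variates coincide, the shared update reduces to ordinary SGD, so the correction only acts when a client's gradient direction drifts from the population average.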

04.
arXiv (CS.AI) 2026-06-18

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

arXiv:2606.18256v1 Announce Type: cross Abstract: LLM-based chatbots are increasingly applied in interpersonal domains such as counseling and peer support, where establishing human-AI rapport is crucial yet remains challenging. In this work, we introduce a novel approach for conditioning LLMs with in-group personas, which (i) first identifies a user's primary concern and brief personal context (e.g., a computer science undergraduate worried about future career prospects), and (ii) generates a synthetic in-group persona that shares a similar primary concern while differing in background and narrative details, such as age or profession (e.g., a junior researcher at an AI startup). Furthermore, we conduct a human-subject study to systematically evaluate the effectiveness of in-group persona agents in enhancing human-AI rapport. We compare our approach against two baseline conditions: a conventional agent without persona conditioning and an agent exhibiting minimal self-disclosure (e.g., "I've felt that too"). Results from post-task questionnaires assessing rapport and user experience indicate that the in-group persona agent significantly improves perceived rapport and personal relevance compared to the baselines, and also yields more positive user experience-most notably higher engagement.


05.
medRxiv (Medicine) 2026-06-16

Investigating naming error patterns after non-invasive brain stimulation and language treatment in persons with aphasia

Abstract Background: Transcranial direct current stimulation (tDCS) paired with behavioral language therapy can improve naming in persons with aphasia (PWA), yet naming errors persist. Little is known about how naming error patterns change after non-invasive brain stimulation is combined with language treatment. Aims: To examine whether right cerebellar tDCS plus computerized aphasia therapy changes the types of naming errors in people with chronic aphasia across timepoints, and to determine whether effects differ by cerebellar tDCS polarity (anode vs. cathode). Methods and Procedures: In a randomized, double-blind, sham-controlled, within-subject crossover study, we retrospectively analyzed behavioral data from 24 individuals with post-stroke aphasia. Each participant completed two 15-session intervention periods (3-5 sessions/week) with active cerebellar tDCS + computerized aphasia therapy and sham + computerized aphasia therapy, separated by a two-month washout. General linear models (GLMs) assessed longitudinal changes in six error types (semantic, phonological real word, phonological nonword, no response, mixed, unrelated) on an untrained picture naming task (Philadelphia Naming Test; PNT) and a trained task (Naming 80; N80). Additional GLMs evaluated polarity effects with 2 (Group: anode vs. cathode) x 2 (Treatment) interactions, and treatment-order effects with 2 (Group: tDCS-first vs. sham-first) x 2 (Treatment) interactions. Outcomes and Results: Active cerebellar tDCS did not significantly change error types for trained items (N80). For untrained items (PNT), active tDCS reduced several error types relative to sham, with the clearest and most durable reduction in phonological nonword errors; more moderate reductions occurred for phonological real word and unrelated errors. Mixed errors showed a marginally opposite pattern, tending to increase after tDCS and decrease after sham. 
Polarity analyses indicated broadly similar effects across anodal and cathodal stimulation overall, but only the anode group showed a reliable treatment effect for phonological nonword errors on the PNT. Treatment-order analyses revealed no significant order effects. Conclusions: Our results indicate a shift in naming error types, particularly after tDCS treatment for the untrained naming task (PNT). These findings may help guide the course of treatment approaches of those with aphasia and what error naming pattern types may show changes post stroke when combining non-invasive brain stimulation and computerized aphasia therapy. Clinical Trial Registration: Cerebellar Transcranial Direct Current Stimulation and Aphasia Treatment [NCT02901574] Keywords: aphasia, naming errors, non-invasive brain stimulation, cerebellar tDCS, computerized aphasia treatment

06.
arXiv (CS.CV) 2026-06-16

The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results

This paper reports on the NTIRE 2026 Challenge on Image Denoising, specifically focusing on the high-noise regime ($\sigma = 50$). The competition investigates advanced neural architectures designed to restore high-fidelity details from images corrupted by additive white Gaussian noise (AWGN). Unlike constrained benchmarks, this track emphasizes peak quantitative performance, measured by Peak Signal-to-Noise Ratio (PSNR), without limitations on parameter count or computational overhead. By synthesizing contributions from 20 finalist teams out of 116 registrants, this report benchmarks the latest technical innovations and provides a comprehensive snapshot of the current state-of-the-art in unconstrained image restoration.
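
For orientation, the PSNR metric mentioned above can be computed as follows. The sketch assumes 8-bit images (peak value 255) and that $\sigma = 50$ is expressed on the same 0-255 scale; both are assumptions, since the summary does not state them. Under those assumptions the noisy input itself sits at roughly 14 dB, ignoring clipping.

```python
import math

def psnr(mse, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

sigma = 50.0
# For zero-mean AWGN the expected MSE of the noisy input equals sigma^2
# (clipping to the valid pixel range is ignored here).
noisy_psnr = psnr(sigma ** 2)
print(f"PSNR of the unprocessed noisy image: {noisy_psnr:.2f} dB")
```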

07.
arXiv (CS.CV) 2026-06-16

Training-free sparse attention based on cumulative energy filtering

Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42\% to 82\% with a VBench metric drop of less than 5\%. This results in an approximate 15\% in attention computation and a $1.61\times$ increase in computational efficiency, which is 1.18x higher than that of BLASST.
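
The contrast between the two baseline selection rules named above can be sketched in a few lines. This only illustrates generic Top-k and Top-p (cumulative-mass) selection over a normalized score vector; it is not the paper's dynamic thresholding scheme, and the example scores are made up.

```python
def top_k_select(scores, k):
    """Keep the k highest-scoring tokens: fixed budget, variable covered mass."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def top_p_select(scores, p):
    """Keep the smallest set of tokens whose cumulative mass reaches p."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i] / total
        if mass >= p:
            break
    return sorted(kept)

peaked = [0.70, 0.20, 0.05, 0.03, 0.02]  # attention concentrated on few tokens
flat = [0.22, 0.21, 0.20, 0.19, 0.18]    # attention spread almost evenly

print(top_k_select(peaked, 2), top_k_select(flat, 2))
print(top_p_select(peaked, 0.85), top_p_select(flat, 0.85))
```

With a peaked distribution both rules keep two tokens, but on the flat one Top-k still keeps two (covering well under half the mass) while Top-p keeps all five, which is the budget-versus-accuracy tension the abstract describes.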

08.
arXiv (math.PR) 2026-06-11

Marked random graphs with given degree sequence: large deviations on the local topology

arXiv:2401.00351v2 Announce Type: replace Abstract: We investigate the behavior of the empirical neighborhood distribution of marked graphs in the framework of local weak convergence. Here we extend known results by considering uniform random graphs with given degree sequences and i.i.d. marks on half-edges and vertices. We establish a large deviation principle for such families of empirical measures. The proof builds on Bordenave and Caputo's seminal 2015 paper, and Delgosha and Anantharam's 2019 introduction of BC entropy, relying on combinatorial lemmas that allow one to construct suitable approximations of measures supported on marked trees. Possible applications of these results are in the study of interacting diffusions on top of random graphs.

09.
PLOS Computational Biology 2026-06-01

Challenges and progress in RNA velocity: Comparative analysis across multiple biological contexts

by Sarah Ancheta, Leah Dorman, Guillaume Le Treut, Abel Gurung, Greg Huber, Loïc A. Royer, Alejandro Granados, Merlin Lange Single-cell RNA sequencing is revolutionizing our understanding of cell state dynamics, allowing researchers to capture and quantify the transcriptomic profile of a single cell at a specific timepoint. Among the computational techniques used to predict cellular trajectories, RNA velocity has emerged as a predominant tool for modeling transcriptional dynamics. RNA velocity leverages the mRNA maturation process to generate velocity vectors that predict the likely future state of a cell, offering insights into cellular differentiation, aging, and disease progression. Although this technique has shown promise across biological fields, the performance accuracy varies depending on the RNA velocity method and dataset. We established a comparative pipeline and analyzed the performance of five RNA velocity methods on three datasets based on local consistency, method agreement, identification of driver genes, and robustness to sequencing depth. This benchmark provides a resource for scientists to understand the strengths and limitations of different RNA velocity methods.

10.
medRxiv (Medicine) 2026-06-15

Epileptogenicity alters intrahippocampal ripple propagation

Objective: Tracing the propagation of high-frequency oscillations (HFOs) aids in localizing epileptogenic regions and improving surgical outcomes. We examined how hippocampal epileptogenicity influences the propagation properties of the HFOs it generates. Methods: We analyzed non-REM sleep stereo-EEG from 49 patients (68 hemispheres) with verified hippocampal contacts. Hippocampi were stratified by excitability: 28 seizure onset zone (SOZ), 22 more-irritative non-SOZ (>6 interictal epileptiform discharges [IED]/min), and 18 less-irritative non-SOZ (

11.
arXiv (CS.LG) 2026-06-12

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

arXiv:2508.12681v3 Announce Type: replace-cross Abstract: Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3\% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s$^2$.

12.
arXiv (math.PR) 2026-06-16

The distribution of the de Moivre experiment

arXiv:2606.15178v1 Announce Type: new Abstract: In this paper, we focus on the de Moivre random experiment, which allows us to introduce the $s$-Bernoulli distribution and the bi$^s$nomial distribution. We present some probabilistic properties such as the expectation, the variance, the skewness and kurtosis coefficients, the moments, and the generating functions. We then establish that, for $s\in\mathbb{N}$, the bi$^s$nomial distribution converges to limiting Poisson and normal distributions as $n\rightarrow\infty$.

13.
arXiv (CS.CV) 2026-06-17

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $\nu$, $\rho$) for input 3D objects across representations, improving the resolution, accuracy, and memory efficiency over the state of the art. The foundation of our technique is a sparse and adaptive voxel structure (SAV) that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

14.
arXiv (CS.CV) 2026-06-18

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

Multimodal semantic segmentation with heterogeneous sensors must reconcile complementary information across modalities that differ in spatial resolution and channel dimensionality. In particular, high-resolution RGB imaging provides detailed spatial structure but often fails to distinguish visually similar materials, whereas hyperspectral imaging (HSI) provides discriminative spectral signatures but at lower spatial resolution. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF delivers strong performance, achieving 75.4% at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.). These results show that preserving native-grid spatial detail and spectral structure improves multimodal segmentation under real-time constraints. Code and model checkpoints are publicly available at https://github.com/jonasvilhofunk/BCAF_2026.

15.
arXiv (CS.CV) 2026-06-12

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structure information of different tumor regions, which could assist translation models in enhancing the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions to improve the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed so that the generated image contains the same tumor region structure as the ground truth image, via a patch classification loss and a tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

16.
arXiv (CS.LG) 2026-06-19

Evaluating Universal Machine Learning Force Fields Against Experimental Measurements

arXiv:2508.05762v2 Announce Type: replace-cross Abstract: Universal machine learning force fields (UMLFFs) promise to revolutionize materials science by enabling rapid atomistic simulations across the periodic table. However, their evaluation has been limited to computational benchmarks that may not reflect real-world performance. We introduce UniFFBench, a comprehensive evaluation framework featuring the MinX dataset – a diverse collection of 1,500+ mineral systems spanning 85 elements, extreme thermodynamic conditions (0–5000 K, 0–1000 GPa), and structural complexity, including partial occupancy and disorder. This diversity, combined with experimental reference values for validation, enables assessment of UMLFF generalization across chemical space and conditions substantially beyond typical training scenarios. Our systematic evaluation of six state-of-the-art UMLFFs reveals a substantial "reality gap": models achieving impressive performance on computational benchmarks often fail when confronted with experimental complexity. Even the best-performing models exhibit higher density prediction error than the threshold required for practical applications. We observe disconnects between simulation stability and mechanical property accuracy, with prediction errors correlating with training data representation rather than the modeling method.

17.
arXiv (math.PR) 2026-06-17

Analysis of the asymmetric shelf shuffle

arXiv:2606.18047v1 Announce Type: new Abstract: In an asymmetric shelf shuffle, a deck of $n$ cards is dealt sequentially from the bottom and assigned one of the $m$ shelves uniformly at random. The card is placed at the top of the assigned shelf with probability $p$, and at the bottom of the assigned shelf with probability $(1-p)$. Analysis of the shelf shuffle has gained much attention recently, and the case $p=1/2$ was first treated by Diaconis–Fulman–Holmes [Ann. Appl. Prob. 23 (2013), no. 4, 1692–1720]. In this paper, we extend the analysis of the shelf shuffle to general $p\in (0, 1)$. In particular, we study the distribution of cycles, cycle lengths, number of descents, number of valleys, number of inversions, and the RSK shape of a permutation obtained from an asymmetric shelf shuffle. Our results extend the analysis of Diaconis–Fulman–Holmes to arbitrary $p$. Furthermore, our analysis of the distribution of descents and inversions is new even for $p=1/2$.
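
The shuffle described above is easy to simulate. The sketch below follows the dealing rule in the abstract; how the shelves are recombined into one deck is not stated there, so stacking the piles in shelf order with the first shelf on top is an assumption, as are the function and variable names.

```python
import random
from collections import deque

def asymmetric_shelf_shuffle(n, m, p, seed=None):
    """Simulate one asymmetric shelf shuffle of n cards onto m shelves."""
    rng = random.Random(seed)
    shelves = [deque() for _ in range(m)]
    # The deck is 0..n-1 from top to bottom; cards are dealt from the bottom.
    for card in range(n - 1, -1, -1):
        shelf = shelves[rng.randrange(m)]  # shelf chosen uniformly at random
        if rng.random() < p:
            shelf.appendleft(card)  # placed on top of the shelf's pile
        else:
            shelf.append(card)      # placed at the bottom of the pile
    # Assumed reassembly: stack the piles in shelf order, first shelf on top.
    return [card for shelf in shelves for card in shelf]

print(asymmetric_shelf_shuffle(10, 3, 0.5, seed=42))
```

With a single shelf the two extreme values of `p` give deterministic results: always placing on top restores the original order, while always placing at the bottom reverses the deck.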

18.
arXiv (CS.AI) 2026-06-19

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

arXiv:2606.19568v1 Announce Type: cross Abstract: Acoustic gunshot detection is a problem with applications across civilian public safety, military operations, and wildlife conservation, yet the field lacks a rigorous exploration of feature extraction techniques with a focus on generalization to realistic data. The mixed effectiveness of commercial gunshot detection and classification systems indicates an open problem that is not adequately addressed by the current literature. In this paper, we present a systematic investigation of common feature extraction techniques using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers. We benchmark three feature extraction techniques with 12 total unique parameter sets using ResNet-18. Our results demonstrate that using the correct feature extraction technique can improve top-1 accuracy by up to 20%, and utilizing the correct parameters for a given feature extraction technique can improve that value by up to 4.7%.

19.
arXiv (CS.CL) 2026-06-11

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.

20.
arXiv (CS.CV) 2026-06-17

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers under limited data conditions, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

21.
arXiv (quant-ph) 2026-06-16

Worst-case depth hierarchy for shallow quantum circuits

arXiv:2606.16425v1 Announce Type: new Abstract: Circuit depth is a central resource in complexity theory. While bounded-depth classical circuits admit well-understood hierarchy theorems, the internal structure of constant-depth quantum computation remains comparatively unexplored. We prove an explicit depth hierarchy theorem for $\mathsf{QNC}^0$. For each $d\ge 12$, we construct a family of two-round interactive problems on which no depth-$(d-1)$ quantum circuit can achieve near-perfect success, regardless of gate set, circuit size, or ancillary qubits. In contrast, we prove that our construction admits realizations by simple bounded fan-in quantum circuits of depth larger than $d$ by a small constant factor. Moreover, all bounded fan-in classical circuits of sublogarithmic depth (in the input size) fail to achieve perfect success on these tasks for every $d$, yielding a hierarchy of problems that show unconditional quantum advantage of $\mathsf{QNC}^0$ over $\mathsf{NC}^0$. A key obstacle is the scarcity of lower bound techniques for quantum circuits. To address this, we develop methods to analyze how depth affects a circuit's ability to realize nonlocal correlations amongst its output qubits in a fine-grained manner. Our approach exploits the correspondence between constraint systems and nonlocal games, translating group-theoretic constructions into rigid operator-valued constraint systems and then into non-local games. In particular, we construct constraint systems whose unique faithful operator-valued solutions require every perfect strategy, and every near-perfect strategy to a fixed precision, to implement multi-controlled phase operations. This reduces to a nonlocal unitary-synthesis problem, yielding depth lower bounds for both shallow quantum and classical circuits. These results show that increasing depth strictly increases computational power within $\mathsf{QNC}^0$, establishing a genuinely quantum hierarchy.

22.
arXiv (CS.CV) 2026-06-15

MMRINet: Efficient Mamba-Based Segmentation with Dual-Path Refinement for Low-Resource MRI Analysis

Automated brain tumor segmentation in multi-parametric MRI remains a critical yet underserved challenge in resource-constrained clinical settings, where deep 3D networks requiring high-end GPUs are not viable. This is particularly acute across sub-Saharan Africa (SSA), where low-field scanners, heterogeneous patient demographics, and severe data scarcity compound the difficulty of applying standard deep learning pipelines. We present MMRINet, a lightweight segmentation architecture purpose-built for these constraints. At its core, MMRINet replaces quadratic-complexity self-attention with linear-complexity Mamba state-space models, enabling efficient long-range volumetric context modeling without the computational overhead of Transformer-based approaches. We combine two lightweight refinement components: Dual-Path Feature Refinement (DPFR), which extracts complementary detail and contextual representations to improve feature diversity under limited data, and Progressive Feature Aggregation (PFA), which hierarchically fuses multi-scale decoder outputs for sharper segmentation boundaries. Evaluated on the BraTS-Lighthouse SSA 2025 challenge dataset, comprising 3D MRI scans from Nigerian clinical sites, MMRINet achieves an average Dice score of 0.752 and an average HD95 of 12.23 mm with only ~2.5M parameters, outperforming all evaluated baselines, including UNETR, Swin-UNETR, SegMamba, and SegResNet3D. These results indicate that strong validation-set segmentation performance can be achieved with substantially reduced computation, offering a practical step toward AI-assisted neuro-oncology in low-resource clinical environments. Our GitHub repository can be accessed here: BioMedIA-MBZUAI/MMRINet.

23.
arXiv (CS.CV) 2026-06-16

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts both at inference time and during manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

24.
arXiv (CS.CV) 2026-06-16

Learned Image Compression for Vision-Language-Action Models

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

25.
arXiv (CS.CV) 2026-06-16

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.