Academic Intelligence · Curated Daily

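Explore the Frontiers of Global Academic Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefings across interdisciplinary fields.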

01.
arXiv (CS.CV) 2026-06-11

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

02.
arXiv (CS.AI) 2026-06-11

On the Geometry of On-Policy Distillation

arXiv:2606.07082v2 Announce Type: replace-cross Abstract: On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

03.
arXiv (CS.AI) 2026-06-18

Information-Theoretic Measures in AI: A Practical Decision Guide

arXiv:2604.23716v2 Announce Type: replace Abstract: Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
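To make the first group of measures named in the abstract concrete, here is a minimal sketch of entropy, cross-entropy, and mutual information for small discrete distributions. It assumes the probabilities are known exactly; it is not one of the estimators the paper discusses, and the function names and toy distributions are illustrative only.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; q must be positive wherever p is."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Toy distributions (hypothetical, for illustration only).
fair = [0.5, 0.5]
biased = [0.9, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y independent
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is a copy of X

print(entropy(fair))
print(cross_entropy(fair, biased))
print(mutual_information(independent))
print(mutual_information(copied))
```

With samples instead of known probabilities, each quantity has to be estimated, and the estimator choice is where the abstract's warnings about assumptions and failure modes apply.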

04.
arXiv (CS.CV) 2026-06-11

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
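The final aggregation step can be sketched as a score-weighted centroid. The abstract does not give the exact formula, so this assumes each keyframe detection is reduced to a box centre in normalized image coordinates plus a confidence score; the function name and sample detections are hypothetical.

```python
def score_weighted_centroid(detections):
    """Average detection centres, weighting each by its confidence score.

    detections: list of (x_center, y_center, score) tuples gathered
    across keyframes (assumed format, not taken from the paper).
    """
    total = sum(score for _, _, score in detections)
    if total <= 0:
        raise ValueError("need at least one detection with a positive score")
    x = sum(cx * s for cx, _, s in detections) / total
    y = sum(cy * s for _, cy, s in detections) / total
    return x, y

# Two confident detections near the impact and one weak outlier.
dets = [(0.40, 0.50, 0.9), (0.44, 0.54, 0.6), (0.90, 0.10, 0.1)]
cx, cy = score_weighted_centroid(dets)
print(cx, cy)
```

Weighting by score keeps the low-confidence outlier from pulling the estimate far from the two agreeing detections.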

05.
arXiv (CS.LG) 2026-06-19

SMT-AD: a scalable quantum-inspired anomaly detection approach

arXiv:2604.06265v2 Announce Type: replace Abstract: Quantum-inspired tensor network algorithms have been shown to be effective and efficient models for machine learning tasks, including anomaly detection. Here, we propose a highly parallelizable quantum-inspired approach, which we call SMT-AD, from Superposition of Multiresolution Tensors for Anomaly Detection. It is based upon the superposition of bond-dimension-1 matrix product operators to transform the input data with Fourier-assisted feature embedding, where the number of learnable parameters grows linearly with feature size, embedding resolutions, and the number of additional components in the matrix product operator structure. We demonstrate successful anomaly detection when applied to standard datasets, including credit card transactions, and find that, even with minimal configurations, it achieves competitive performance against established anomaly detection baselines. Furthermore, it provides a straightforward way to reduce the weight of the model and even improve the performance by highlighting the most relevant input features.

06.
medRxiv (Medicine) 2026-06-11

Assessment of occupational aerosol exposure for laboratory technicians: A quantitative study using ΦX174 phage as a substitute virus

This study aimed to clarify aerosol exposure risks throughout the workflow of a Biosafety Level 2 (BSL-2) polymerase chain reaction (PCR) laboratory, validate the suitability of the ΦX174 bacteriophage as an indicator virus, and provide evidence for biosafety control measures. The ΦX174 bacteriophage was used to simulate viral samples, and a concentration-bacteriophage plaque standard curve was constructed (R² = 0.998). Five operational steps in a simulated PCR laboratory were quantitatively monitored for aerosol concentration using double-layer agar plates, with blank controls used to eliminate interference. Statistical analysis was employed to identify risk differences. Sample homogenization ((5.67 ± 1.23) × 10⁴ plaque-forming units (PFU)/m³) and nucleic acid extraction ((3.45 ± 0.89) × 10⁴ PFU/m³) were identified as high-/very high-risk steps. The viral load in the samples was strongly positively correlated with the aerosol concentration (r = 0.926, P

07.
medRxiv (Medicine) 2026-06-16

Cross-sectional study of the association between depressive symptoms and attentional bias to emotional stimuli in patients with acute stroke: Study protocol

Post-stroke depression affects approximately 30% of patients after stroke and is associated with delayed recovery in activities of daily living, reduced rehabilitation effectiveness, and poorer quality of life. Attentional bias modification may provide a low-burden, nonpharmacological approach for patients in the acute phase of stroke. However, before such an intervention can be implemented in clinical practice, it is necessary to clarify whether attentional bias is present in patients with acute stroke and depressive symptoms, whether cognitive function influences the manifestation of this bias, and which task and stimulus formats are most appropriate for assessment. This multicenter, cross-sectional observational study will enroll patients with acute stroke between 7 and 30 days after stroke onset. Depressive symptoms will be assessed using the depression subscale of the Hospital Anxiety and Depression Scale. Attentional bias will be measured under four task conditions based on the dot-probe task and the cue-target task, using face and word stimuli. Secondary assessments will include cognitive function, anxiety symptoms, activities of daily living, health-related quality of life, and clinical background variables. The aims of this study are to investigate the association between depressive symptoms and attentional bias in patients with acute stroke, compare attentional bias characteristics across task and stimulus types, and examine the potential influence of cognitive function on this association. The findings are expected to provide an empirical basis for designing future attentional bias modification protocols targeting post-stroke depression in the acute phase. This study has been registered with the UMIN Clinical Trials Registry (UMIN000059166).

08.
arXiv (CS.AI) 2026-06-16

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

arXiv:2606.15186v1 Announce Type: cross Abstract: Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving the original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/

09.
arXiv (quant-ph) 2026-06-12

Approximate quantum error correction theory of non-isometric codes

arXiv:2606.13559v1 Announce Type: new Abstract: Non-isometric encoding arises in various important contexts in quantum error correction, most notably in the finite-energy, non-ideal codewords inevitable in experimental realizations of continuous-variable codes, and in holographic quantum gravity. In this work, we present a general and systematic theory of non-isometric quantum error-correcting codes. In particular, we employ the approximate quantum error correction framework to quantitatively study the fundamental limitations imposed by non-isometric encodings on the accuracy of quantum error correction and the implementation of logical operations. We apply our theory to analyze GKP and tiger codes under energy constraints, and discuss the implications for holography.

10.
arXiv (CS.LG) 2026-06-18

Exponentially many initializations to avoid barren plateaus

arXiv:2606.18515v1 Announce Type: cross Abstract: Barren plateaus are stated as an average-case phenomenon: pick an ansatz, initialize it naively, and concentration follows. This has led to the common view that a potential cure for barren plateaus is simply to initialize the parameters more carefully. Here we show that the situation is subtler. We introduce a first-moment framework that gives a simple operator-level diagnostic for when an initialization may escape the fully concentrated barren-plateau fixed point, and for comparing the biases induced by different initialization strategies. Our framework recovers several known initialization schemes such as identity and Gaussian initialization, but also shows that barren-plateau avoidance is highly non-unique. Indeed, many shifted, biased, and non-symmetric parameter distributions can avoid concentration, and these choices need not be equivalent. In fact, our results show that one can generate exponentially many families of inequivalent initialization strategies. Then, our numerics indicate that different first-moment-distinct initializations can lead to different attained minima, suggesting that avoiding barren plateaus via smart initializations can trade the exponential concentration problem for the challenge of selecting the right trainable pocket amongst many options.

11.
arXiv (CS.AI) 2026-06-18

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

arXiv:2605.11287v2 Announce Type: replace-cross Abstract: A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
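The convex-combination constraint described in the abstract can be illustrated on a scalar sequence. This is a minimal sketch, not the TOA method: the scores, values, and the fixed difference operator are made-up examples, and in a real Transformer the value projection can still change signs per feature even though the mixing across time stays on the simplex.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mix(weights, values):
    """Mix a sequence of scalar values with one row of mixing weights."""
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 3.0, 2.0]            # toy time series x[0], x[1], x[2]

# Softmax attention row: nonnegative weights that sum to one, so the
# output is confined to the interval [min(values), max(values)].
attn = softmax([0.2, 1.5, -0.7])
out = mix(attn, values)

# A signed sequence-space operator row, here the difference x[2] - x[1].
# Its weights leave the simplex, so the output can leave that interval.
diff_operator = [0.0, -1.0, 1.0]
signed_out = mix(diff_operator, values)

print(attn, out, signed_out)
```

No choice of attention scores can reproduce `signed_out` here, because it lies below the smallest input value.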

12.
Nature (Science) 2026-06-17

Optical metasurfaces for general vision processing on the edge

Large-scale artificial intelligence (AI) models achieve notable performance in computer vision but require substantial computational resources, limiting their deployment on edge devices [1,2]. Optical neural networks (ONNs) promise reduced latency and energy consumption by making use of the inherent parallelism of light [3]. However, present ONNs struggle to scale and are confined to simple tasks, owing to the challenges of replicating exact algebraic operations of digital models using physical (analogue) systems. This work introduces a new paradigm that directly embeds core computer vision principles, including similarity-based recognition, attention-guided perception and detail–context fusion, into a large-scale optical metasurface. By unifying optical physics with these computer vision fundamentals, we develop a photonic–electronic engine that overcomes scalability and generality barriers, enabling high-accuracy, general-purpose computer vision at the edge. The resulting system combines a 41-million-parameter optical metasurface front end with a co-designed, ultraefficient 87,000-parameter digital back end, outperforming many digital models with tens of millions of parameters across object detection, segmentation, 3D reconstruction and video understanding. We build a deployable prototype and demonstrate real-time edge visual processing in natural scenes. This work represents a path towards practical optical computing for general vision tasks in complex natural environments, enabling a new paradigm for low-energy, low-latency, real-time on-device vision intelligence. By embedding core computer vision principles into a large-scale optical metasurface, an efficient vision processing system using far fewer parameters is demonstrated to outperform many digital models and enables deployment on edge devices.

13.
arXiv (quant-ph) 2026-06-16

Cosmological Pseudo-Entropy

arXiv:2606.15227v1 Announce Type: cross Abstract: We study pseudo entropy $\mathcal{S}$, a recent generalization of entanglement entropy, for scalar cosmological perturbations in de Sitter space with sound speed $0.024 \leq c_s \leq 1$, and in expanding and contracting FLRW backgrounds with varying equation-of-state parameter $w$. In de Sitter space, $\mathrm{Re}(\mathcal{S})$ grows after horizon exit while $c_s$ controls its onset and saturates at late times. A similar saturation occurs in expanding-accelerating and contracting-decelerating backgrounds. In contrast, expanding-decelerating and contracting-accelerating backgrounds show large early-time $\mathrm{Re}(\mathcal{S})$ followed by oscillations after horizon re-entry. This happens because while the squeezing freezes, the squeezing angle doesn't. Unlike entanglement entropy, pseudo entropy possesses an imaginary part, $\mathrm{Im}(\mathcal{S})$, as well, which can encode the relative phase. $\mathrm{Im}(\mathcal{S})$ decays to zero in de Sitter and expanding-accelerating cases, but forms dense sub-Hubble oscillation bands in expanding-decelerating and contracting-accelerating backgrounds. Compared with entanglement entropy, Krylov complexity, and Nielsen circuit complexity, pseudo entropy captures otherwise hidden phase information; in the unsaturated regime, its slope is $\sqrt{2}$ times that of Nielsen complexity. Unlike circuit complexity, whose saturation bound is $w$-independent, pseudo entropy is sensitive to $w$ during the transition regime, making it a finer information theoretic diagnostic of cosmological dynamics.

14.
arXiv (CS.LG) 2026-06-17

The Implicit Bias of Steepest Descent with Mini-batch Stochastic Gradient

arXiv:2602.11557v2 Announce Type: replace Abstract: A variety of widely used optimization methods like SignSGD and Muon can be interpreted as instances of steepest descent under different norm-induced geometries. In this work, we study the implicit bias of mini-batch stochastic steepest descent in multi-class classification, characterizing how batch size, momentum, and variance reduction shape the limiting max-margin behavior and convergence rates under general entry-wise and Schatten-$p$ norms. We show that, without momentum, worst-case convergence and successful classification can only be guaranteed with full-batch gradient. In contrast, momentum enables small-batch convergence to an approximate max-margin solution through a batch-momentum trade-off, though it slows convergence. This approach provides fully explicit, dimension-free rates that improve upon prior results. Moreover, we prove that variance reduction can recover the exact full-batch implicit bias for any batch size, albeit at a slower convergence rate. Finally, we further investigate batch-size-one steepest descent without momentum and show, via a concrete data example, that it converges to a fundamentally different bias, which reveals a key limitation of purely stochastic updates. Overall, our unified analysis clarifies when stochastic optimization aligns with full-batch behavior, and paves the way for deeper explorations of the training behavior of stochastic gradient steepest descent algorithms.
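As a rough illustration of "steepest descent under different norm-induced geometries", the sketch below computes the unit-norm direction d that maximizes the inner product with a gradient g for three entry-wise norms. Under the l-infinity norm this is the sign of the gradient, which is the SignSGD direction. The function name and the sample gradient are illustrative; the matrix (Schatten-norm) case that covers Muon, as well as mini-batching and momentum, are left out.

```python
import math

def sign(x):
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))

def steepest_direction(grad, norm):
    """Direction d with ||d|| <= 1 that maximizes <grad, d>."""
    if norm == "linf":
        # l-infinity ball: move every coordinate by its sign (SignSGD).
        return [sign(g) for g in grad]
    if norm == "l2":
        # l2 ball: the normalized gradient.
        n = math.sqrt(sum(g * g for g in grad))
        return [g / n for g in grad] if n > 0 else [0.0] * len(grad)
    if norm == "l1":
        # l1 ball: move only the coordinate with the largest magnitude.
        i = max(range(len(grad)), key=lambda k: abs(grad[k]))
        d = [0.0] * len(grad)
        d[i] = sign(grad[i])
        return d
    raise ValueError(f"unknown norm: {norm}")

grad = [3.0, -4.0, 0.0]
d_linf = steepest_direction(grad, "linf")
d_l2 = steepest_direction(grad, "l2")
d_l1 = steepest_direction(grad, "l1")
print(d_linf, d_l2, d_l1)
```

A descent step then subtracts a step size times the chosen direction from the parameters; the three geometries produce visibly different updates from the same gradient.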

15.
arXiv (CS.AI) 2026-06-11

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

arXiv:2606.11245v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs toward AGI. The key reason is that the underlying learning mechanism of LLMs is highly analogous to human implicit memory. However, higher-order cognitive functions necessary for AGI, such as long-term strategic planning, metacognition, and symbolic reasoning, heavily rely on hippocampal explicit memory and cannot arise solely from implicit statistical learning. Drawing on findings from neuroscience, I advance this perspective and complement it with computational requirements for artificial explicit memory systems, hoping to foster further research and lay the groundwork for explicit memory integration.

16.
arXiv (CS.CL) 2026-06-16

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach, producing diverse orchestration trajectories, extracting per-node single-turn samples, and optimizing the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.

17.
arXiv (CS.AI) 2026-06-15

VHDLSuite: Unified Pipeline for LLM VHDL Generation with Data Synthesis and Evaluation

arXiv:2606.13735v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown impressive capabilities in Register Transfer Level (RTL) code generation, particularly for Verilog. However, evaluation of their performance on other Hardware Description Languages (HDLs), especially VHDL, remains limited, even though VHDL's distinct language characteristics, such as stricter semantic rules, introduce evaluation considerations that differ from Verilog. This lack of coverage restricts a full understanding of how well current models generalize across hardware design languages with differing structures and semantics. To address this gap, we introduce VHDLSuite, a benchmark-centered infrastructure for scalable VHDL generation evaluation, integrating automated benchmark synthesis, executable validation, and multi-model diagnostic analysis. First, we propose a data pipeline that automatically converts Verilog designs and their accompanying testbenches into executable VHDL benchmark instances, followed by VUnit/GHDL-based validation to ensure each released task is compilable, runnable, and consistently checkable in the VHDL environment. Second, we introduce VHDLBench, a benchmark with over 200 VHDL problems with complete and validated testbenches across a wide range of complexity levels. Third, we extensively evaluate cutting-edge LLMs and uncover key challenges specific to LLM-aided VHDL generation. Our findings provide important insights and support future work in multi-language hardware design automation. Our data pipeline, benchmark, and evaluation framework will be open-sourced.

18.
arXiv (quant-ph) 2026-06-19

Simulation of Non-Markovian Quantum Accelerated Dynamics via Time-Fractional Schrödinger Equation

arXiv:2606.20024v1 Announce Type: new Abstract: The Time-Fractional Schrödinger Equation (TFSE) is an effective tool for simulating the dynamics of non-Markovian quantum systems. The Quantum Speed Limit (QSL) time characterizes the minimum time required for the evolution of a non-Markovian quantum system. In this paper, Wei's TFSE is employed to simulate the non-Markovian quantum accelerated evolution process in the Resonant Dissipative Jaynes-Cummings (RDJC) model. By solving the QSL time of a time-fractional single-qubit open system, the enhancement mechanism of the system evolution speed induced by the non-Markovian memory effects of the environment is revealed. Further studies show that the optimized acceleration of the system evolution can be achieved by jointly regulating the fractional order, coupling strength, and photon number. Comparative analyses indicate that Wei's TFSE can accurately capture the non-Markovian accelerated dynamical features of the system over the entire fractional order range, whereas Naber's TFSE is applicable only within a limited fractional order interval. In addition, the comparisons of the average simulation time for calculating the dynamical trajectory of the excited-state probability demonstrate that Wei's TFSE has a significant simulation advantage in computational efficiency. Therefore, Wei's TFSE is more accurate and efficient for simulating the accelerated dynamics of non-Markovian quantum systems.

19.
arXiv (CS.AI) 2026-06-12

Towards Personalized Federated Learning for Dysarthric Speech Recognition

arXiv:2606.13253v1 Announce Type: cross Abstract: Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.
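The abstract reports each improvement both as an absolute and as a relative WER reduction. Since relative reduction equals absolute reduction divided by the baseline WER, the two figures together imply an approximate baseline. The sketch below does that arithmetic; the resulting baselines are derived from rounded numbers and are not values stated in the abstract, and the function name is illustrative.

```python
def implied_baseline_wer(absolute_reduction, relative_reduction_pct):
    """Baseline WER (in %) implied by an absolute/relative reduction pair.

    relative = absolute / baseline, so baseline = absolute / relative.
    """
    return absolute_reduction / (relative_reduction_pct / 100.0)

# Reduction pairs quoted in the abstract (absolute %, relative %).
uaspeech = implied_baseline_wer(0.99, 3.15)
torgo = implied_baseline_wer(0.56, 4.73)

print(round(uaspeech, 2), round(torgo, 2))
```

Read this way, the smaller absolute gain on TORGO corresponds to a larger relative gain because the implied baseline WER there is lower.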

20.
arXiv (CS.LG) 2026-06-17

Meta-classification of one-class classification models using ranking correlation and nearest neighbor

arXiv:2606.17858v1 Announce Type: new Abstract: Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository https://github.com/ToshiHayashi/ClassOCC.
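The idea of representing models as normality rankings and classifying them by nearest neighbour can be sketched as follows. This is an assumption-laden toy, not the authors' implementation: it assumes every model scores the same shared probe samples, uses Spearman correlation as the ranking-correlation metric (the abstract does not name one), and the scores and labels are invented.

```python
def ranks(scores):
    """Rank position of each score (0 = lowest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman rank correlation for tie-free score lists of equal length."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

def nearest_label(query, labelled):
    """1-nearest-neighbour: label of the most rank-correlated model."""
    return max(labelled, key=lambda item: spearman(query, item[0]))[1]

# Normality scores that each model assigns to five shared probe samples
# (hypothetical numbers), labelled by the dataset the model was trained on.
labelled_models = [
    ([0.9, 0.8, 0.2, 0.1, 0.5], "dataset_A"),
    ([0.1, 0.3, 0.9, 0.7, 0.4], "dataset_B"),
]
query_model = [0.85, 0.7, 0.3, 0.05, 0.6]

predicted = nearest_label(query_model, labelled_models)
print(predicted)
```

Only the ordering of the scores matters here, so models with different score scales can still be compared.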

21.
arXiv (CS.AI) 2026-06-16

Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems

arXiv:2606.15242v1 Announce Type: cross Abstract: Skills are becoming the capability layer through which LLM agents turn plans into actions, but their use introduces security risks such as data leakage, unauthorized operations, and tool misuse. Existing vetting usually evaluates each skill in isolation, while real agent tasks often invoke multiple skills in a shared execution context. This creates Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path. We introduce SCR-Bench to evaluate this risk in controlled, sandboxed skill environments. Rather than relying only on textual intent or surface behavior, SCR-Bench records downstream state changes and path-level outcomes across composed skill executions. It contains three sub-benchmarks: SCR-CapFlow for capability-flow composition, SCR-TrustLift for trust-transfer composition, and SCR-AuthBlur for authorization-confusion composition. Across SCR-Bench, composed paths expose risks that are largely absent under isolated evaluation. In SCR-CapFlow, attack success rate reaches 33.6 percent under composition, compared with near-zero isolated baselines. In SCR-TrustLift, attack success rate exceeds 96.5 percent on four of five backends. In SCR-AuthBlur, the risky-approval rate increases by 71.8 percent relative to the L0 isolated baseline under the L1 context setting. These results show that agent skill security should be assessed at the level of activated paths rather than isolated artifacts. SCR and SCR-Bench provide a foundation for path-aware risk evaluation and defense in LLM agent skill ecosystems. Benchmark: https://github.com/saint-viperx/SCR_Bench.

22.
arXiv (CS.CL) 2026-06-17

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

23.
arXiv (CS.LG) 2026-06-18

Optimal scenario design for climate emulation

arXiv:2606.19302v1. As deep learning for physical systems continues to grow in popularity, efforts to improve generalizability have primarily focused on designing architectures that embed physical constraints. However, for machine-learning surrogate climate models (emulators), we show that the low structural diversity in existing scenarios commonly used to generate training data places a ceiling on predictive skill. Here, we examine whether training datasets themselves can be optimized to improve generalization. We introduce a method to create datasets that produce emulators capable of generalizing to new, structurally different scenarios absent from the training data. We use a differentiable Simple Climate Model (SCM) to calculate the sensitivity of emulator loss to perturbations in the training data, iteratively updating the training data to maximize emulator skill. For an SCM, training on one scenario optimized in this fashion outperforms an emulator trained on six standard ScenarioMIP pathways. We achieve this higher predictive skill despite training on a smaller dataset, finding that our emulator successfully isolates distinct physical behaviors of different climate forcing agents (e.g., greenhouse gases vs. aerosols) without single-forcing runs. We then demonstrate that scenarios optimized using an SCM, when used to drive an intermediate-complexity climate model, produce a training dataset that yields a more skillful emulator than training on ScenarioMIP outputs. Our results suggest that, in the compute-constrained environment of running full-scale climate models, generating a small number of dynamically rich scenarios provides greater marginal value for emulation and characterizing system responses than expanding the suite of traditional emissions pathways.

24.
arXiv (CS.CV) 2026-06-16

An Adaptive Data Cleaning Framework for Noisy Label Detection

Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
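
The 2D variant described above (local KNN disagreement plus distance to a global centroid, followed by clustering in that feature space) can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes a fixed `k` rather than a class-adaptive one, uses the mean of each labeled class as the centroid instead of a k-means centroid, and swaps in a plain 2-means split for the paper's multi-metric clustering. All function names are hypothetical.

```python
import math


def knn_disagreement(X, y, k=3):
    """Fraction of each sample's k nearest neighbors that carry a different label."""
    scores = []
    for i, xi in enumerate(X):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(xi, X[j]))
        scores.append(sum(y[j] != y[i] for j in order[:k]) / k)
    return scores


def centroid_distance(X, y):
    """Distance from each sample to the mean of the class it is labeled with."""
    cents = {}
    for c in set(y):
        members = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [math.dist(x, cents[lab]) for x, lab in zip(X, y)]


def minmax(v):
    """Rescale a metric to [0, 1] so both metrics contribute comparably."""
    lo, hi = min(v), max(v)
    return [(a - lo) / (hi - lo) if hi > lo else 0.0 for a in v]


def two_means(F, iters=20):
    """Deterministic 2-means; cluster 1 starts at the most suspicious point."""
    centers = [min(F, key=sum), max(F, key=sum)]
    assign = []
    for _ in range(iters):
        assign = [0 if math.dist(f, centers[0]) <= math.dist(f, centers[1]) else 1
                  for f in F]
        for c in (0, 1):
            pts = [f for f, a in zip(F, assign) if a == c]
            if pts:
                centers[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return assign


def detect_noisy(X, y, k=3):
    """Return indices assigned to the noise-dominant cluster (no manual threshold)."""
    F = list(zip(minmax(knn_disagreement(X, y, k)), minmax(centroid_distance(X, y))))
    return [i for i, a in enumerate(two_means(F)) if a == 1]


# Toy data: two well-separated groups; sample 4 sits in the first group
# but carries the second group's label.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
flagged = detect_noisy(X, y)  # -> [4]
```

On this toy set the mislabeled sample scores high on both metrics, so it separates from the rest without a threshold or a noise-ratio prior. Real image data would use learned feature embeddings in place of the raw 2D points.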

25.
arXiv (CS.CV) 2026-06-17

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.