Academic Intelligence · Curated Daily

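Explore the Frontiers of Global Academic Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.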

01.
arXiv (CS.CV) 2026-06-12

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance–covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.

02.
arXiv (math.PR) 2026-06-17

Periodicity, type $II_1$ factors and free Poisson laws in interacting Fock spaces

arXiv:2606.18162v1. We show that the von Neumann algebra generated by position operators in a 2-periodic interacting Fock space is a type $II_1$ factor. On the probabilistic side, we prove that the squared position operators have a Marchenko-Pastur distribution with respect to the vacuum state, yielding a natural realization of free Poisson laws within this framework.

03.
arXiv (CS.LG) 2026-06-16

Towards CONUS-Wide ML-Augmented Conceptually-Interpretable Modeling of Catchment-Scale Precipitation-Storage-Runoff Dynamics

arXiv:2510.02605v2. While many modern studies are dedicated to ML-based large-sample hydrologic modeling, these efforts have not necessarily translated into predictive improvements that are grounded in enhanced physical-conceptual understanding. Here, we report on a CONUS-wide large-sample study (spanning diverse hydro-geo-climatic conditions) using ML-augmented physically-interpretable catchment-scale models of varying complexity based on the Mass-Conserving Perceptron (MCP). Results were evaluated using attribute masks such as snow regime, forest cover, and climate zone. Our results indicate the importance of selecting model architectures of appropriate complexity based on how process dominance varies with hydrological regime. Benchmark comparisons show that physically-interpretable mass-conserving MCP-based models can achieve performance comparable to data-based models built on the Long Short-Term Memory network (LSTM) architecture. Overall, this study highlights the potential of a theory-informed, physically grounded approach to large-sample hydrology, with emphasis on mechanistic understanding and the development of parsimonious and interpretable model architectures, thereby laying the foundation for future models of everywhere that architecturally encode information about spatially- and temporally-varying process dominance.

04.
arXiv (CS.AI) 2026-06-16

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

arXiv:2606.15308v1. While multimodal large language models (MLLMs) have shown strong visual reasoning abilities, serving a large model for every query is computationally expensive. MLLM cascades mitigate this cost by first querying a weak but cheaper model and deferring to a strong model when the weak model's output is unconfident. However, since the weak model's confidence directly controls compute allocation, these systems expose a new attack surface: an adversary can manipulate confidence so that their queries are consistently deferred to the strong model. Motivated by this vulnerability, we introduce the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model's confidence and causes cascades to route queries to the strong model. FDA learns a universal border trigger by optimizing a temperature-flattened objective. This objective pushes the weak model's token distribution on triggered inputs toward less concentrated targets constructed from its clean responses. Across datasets, model families, and deferral metrics, FDA consistently increases strong-model routing while outperforming image-perturbation and prompt-injection baselines. These results show that MLLM cascades are vulnerable to attacks that manipulate compute allocation, forcing unintended strong-model usage without directly targeting answer correctness.
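
The deferral rule that this attack targets can be sketched in a few lines. This is a minimal illustration rather than the paper's setup: the confidence measure (mean per-token probability), the `threshold` value, the function names, and the example log-probabilities are all assumptions chosen for the example.

```python
import math


def mean_token_confidence(token_logprobs):
    """Average per-token probability of the weak model's answer (assumed metric)."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def route(token_logprobs, threshold=0.7):
    """Keep the weak model's answer if it is confident enough, otherwise defer."""
    if mean_token_confidence(token_logprobs) >= threshold:
        return "weak"
    return "strong"


# Hypothetical log-probabilities for the same query, without and with a trigger.
clean_logprobs = [-0.05, -0.10, -0.02]      # peaked token distribution
triggered_logprobs = [-1.2, -0.9, -1.5]     # flattened token distribution

print(route(clean_logprobs), route(triggered_logprobs))
```

Under this rule, anything that flattens the weak model's token distribution pushes the query to the strong model without having to change the weak model's answer, which is the compute-allocation lever the abstract describes.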

05.
arXiv (CS.AI) 2026-06-11

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

arXiv:2602.19502v2. Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, because no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

06.
arXiv (CS.AI) 2026-06-17

LLM-Aided Joint Secrecy Precoding and Trajectory for RSMA-Based Heterogeneous UAV Networks

arXiv:2507.17188v3. This paper investigates secure communications in rate-splitting multiple access (RSMA) enabled heterogeneous UAV networks, where multiple UAVs collaboratively serve ground terminals in the presence of eavesdroppers. By jointly considering secrecy rate maximization and propulsion energy consumption minimization, we formulate a multi-objective optimization problem involving UAV trajectory design, service association, power allocation, and secrecy precoding under mobility, collision-avoidance, service-capacity, and communication constraints. The formulated problem is highly non-convex due to the coupling among UAV trajectories, RSMA transmission variables, and secrecy constraints. To address it, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)-based S2DC algorithm combining penalty functions and difference-of-convex (D.C.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)-guided heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for trajectory optimization. LLM-HeMARL efficiently incorporates an LLM-generated expert heuristic policy, enabling UAVs to learn energy-aware, security-driven trajectories without the inference overhead of real-time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.

07.
arXiv (CS.LG) 2026-06-16

Scalar-pathway fidelity improves physical accuracy in short-range equivariant interatomic potentials

arXiv:2606.15892v1. Accurate interatomic potentials enable molecular dynamics of materials, molecules, and interfaces beyond density-functional-theory length and time scales. Equivariant neural network potentials have improved the representation of local geometry. However, their deployable energy surfaces ultimately manifest through invariant scalar channels, whose aggregation and spectral resolution remain comparatively underexamined. Here we use Physics-Aware Neighborhood (PAN) pooling and Physics-Guided Spectral (PGS) mixers as controlled scalar-pathway probes: lightweight, symmetry-preserving modifications that act only on $\ell=0$ channels while leaving the equivariant tensor backbone unchanged. Using MACE as a high-body-order mechanistic scaffold, PAN adds coordination-sensitive amplitude modulation, whereas PGS augments edge and readout scalar features with radial and tapered spectral bases. Across metallic Ag, covalent Si, a short-range ionic LiF/Li–F subset, and MD17/rMD17 molecules, this scalar-pathway correction reduces MACE force errors by 22–27% and energy errors by 19–22%; on systems with stress labels, stress errors decrease by 27–28%, at approximately 5% additional inference-FLOPs cost. Directionally consistent gains in Allegro and NequIP further indicate that the correction is portable across distinct short-range equivariant backbones, although effect sizes remain architecture-dependent. These results identify scalar-pathway fidelity as a practical design dimension for short-range equivariant interatomic potentials.

08.
arXiv (CS.CL) 2026-06-16

Not All Skills Help: Measuring and Repairing Agent Knowledge

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
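
The randomized-masking measurement can be illustrated with a small difference-in-means estimator. This is a sketch under stated assumptions, not ASSAY's actual procedure: the abstract does not give the estimator, and `estimate_skill_effects`, `toy_run_task`, the skill names, and the 50% masking rate are hypothetical.

```python
import random


def estimate_skill_effects(skills, tasks, run_task, trials=400, seed=0):
    """Estimate each skill's effect per task type by randomly masking skills.

    Effect = mean score when the skill is kept minus mean score when it is masked.
    """
    rng = random.Random(seed)
    stats = {}  # (skill, task_type) -> {kept?: [score_sum, count]}
    for _ in range(trials):
        task = rng.choice(tasks)
        kept = {s for s in skills if rng.random() < 0.5}  # mask each skill w.p. 0.5
        score = run_task(task, kept)
        for s in skills:
            bucket = stats.setdefault(
                (s, task["type"]), {True: [0.0, 0], False: [0.0, 0]}
            )
            cell = bucket[s in kept]
            cell[0] += score
            cell[1] += 1
    effects = {}
    for key, bucket in stats.items():
        (sum_kept, n_kept), (sum_masked, n_masked) = bucket[True], bucket[False]
        if n_kept and n_masked:
            effects[key] = sum_kept / n_kept - sum_masked / n_masked
    return effects


def toy_run_task(task, kept_skills):
    """Hypothetical environment: one skill helps debugging but hurts shopping."""
    effect = {"debug": 1.0, "shopping": -1.0}[task["type"]]
    return effect if "retry_on_error" in kept_skills else 0.0


skills = ["retry_on_error", "summarize_first"]
tasks = [{"type": "debug"}, {"type": "shopping"}]
effects = estimate_skill_effects(skills, tasks, toy_run_task)
print(effects[("retry_on_error", "debug")], effects[("retry_on_error", "shopping")])
```

In this toy setting the skill's effects on the two task types have opposite signs, so a score averaged over all tasks would show little net change; conditioning on task type is what exposes the heterogeneity.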

09.
arXiv (CS.AI) 2026-06-12

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

arXiv:2606.12969v1. The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions, such as querying knowledge bases or generating work orders, to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

10.
Nature (Science) 2026-06-09

Good recycling starts at home — and benefits the world

New research supports the value of household-level waste separation. But policies must also carefully consider consumer behaviours to maximize the quality of material collected.

11.
arXiv (CS.LG) 2026-06-16

Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks

arXiv:2606.14954v1. We develop a general framework for analyzing representation costs of parametric data-fitting methods through their parameter-space regularizers. From this abstract perspective, we define representation costs for arbitrary parametric models and reveal their induced (native) function spaces. This unifies recent function-space views of data-fitting methods. We also prove that many natural results hold in this abstract setting, including representer theorems for parametric methods on their native spaces. The framework also rigorously connects parametric methods with their equivalent nonparametric descriptions under sufficient overparameterization. Classical methods and their native spaces, such as kernel methods / reproducing kernel Hilbert spaces, wavelets / Besov spaces, and shallow neural networks / variation spaces emerge as special cases of our abstract framework. A byproduct of "axiomatizing" the study of representation costs is that we also immediately obtain new results for deep neural networks: For depth-$L$ feedforward ReLU networks, their induced native spaces are $p$-normable quasi-Banach spaces with $p = 2/L$. This reveals that the inductive bias of deep neural networks (as given by the representation cost) cannot be captured by norms for depths $L > 2$.
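
To make the closing claim concrete, recall the standard definition of a $p$-norm, which is a general fact rather than something taken from this abstract: for $0 < p \le 1$, a quasi-norm is a $p$-norm if it satisfies the relaxed triangle inequality below. Substituting $p = 2/L$ for depths $L \ge 2$ gives a short sketch of why only $L = 2$ lands in the normed case.

```latex
\[
  \|f + g\|^{p} \;\le\; \|f\|^{p} + \|g\|^{p}, \qquad 0 < p \le 1 .
\]
\[
  p = \frac{2}{L}: \qquad
  L = 2 \;\Rightarrow\; p = 1 \ \text{(ordinary triangle inequality)}, \qquad
  L = 3 \;\Rightarrow\; p = \tfrac{2}{3}, \qquad
  L = 4 \;\Rightarrow\; p = \tfrac{1}{2}.
\]
```

For $p < 1$ the inequality is strictly weaker than the triangle inequality, which is consistent with the statement that norms cannot capture the representation cost once $L > 2$.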

12.
arXiv (CS.CV) 2026-06-18

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. Modelling correlation and intent from the pedestrian's historical and future trajectories usually results in a multimodal (i.e., multi-mode) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions into semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head and hand gestures, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module that models the future trajectory distributions separately for two modes, crossing and not crossing the road. A query-based decoder further enforces mode consistency during decoding. Experiments on PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.

13.
arXiv (CS.AI) 2026-06-19

SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

arXiv:2602.07628v2. While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of more than 20,000 PSG recordings (158K hours), SleepMaMi outperforms or matches existing state-of-the-art foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

14.
arXiv (quant-ph) 2026-06-17

Quantum Information Processing: A brief overview on Quantum Teleportation

arXiv:1604.00852v3. Quantum Information Processing (QIP) exploits the principles of quantum mechanics to perform information storage, communication, and computation in ways that are fundamentally impossible within classical frameworks. This article presents a pedagogical overview of the mathematical foundations of quantum information theory, including qubits, Hilbert spaces, linear operators, quantum measurements, tensor products, density operators, and quantum entanglement. Building upon these concepts, we provide a detailed introduction to quantum teleportation, one of the most remarkable protocols in quantum communication. The discussion covers the no-cloning theorem, the original teleportation protocol by Bennett et al., experimental realisations of quantum teleportation, and extensions involving probabilistic and multiqubit teleportation schemes. Particular emphasis is placed on the role of entanglement as a communication resource, together with the study of teleportation channels based on bipartite and multipartite quantum states. Various quantitative measures of entanglement, including concurrence, negativity, entanglement of formation, and relative entropy of entanglement, are reviewed alongside teleportation fidelity as a performance metric. Furthermore, the interplay between Bell nonlocality, mixed state entanglement, and teleportation efficiency is examined, followed by a survey of advanced developments such as controlled teleportation, bidirectional teleportation, cluster state teleportation, and recent advances in the Quantum 2.0 era. This review aims to provide students, researchers, and engineers with a coherent introduction to the theoretical foundations and practical significance of quantum teleportation in emerging quantum technologies.

15.
arXiv (CS.LG) 2026-06-11

Finding Multiple Interpretations in Datasets

arXiv:2606.12277v1. In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models whose gene expressions differ substantially from those found by the control methodology, without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

16.
arXiv (CS.CV) 2026-06-12

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

17.
arXiv (quant-ph) 2026-06-11

Honest-binding quantum bit commitment from separable operations

arXiv:2501.07351v3. Bit commitment is a fundamental cryptographic primitive and a cornerstone for numerous two-party cryptographic protocols, including zero-knowledge proofs. However, it has been proven that unconditionally secure bit commitment, both classical and quantum, is impossible. In this work, we demonstrate that imposing a restriction on the committing party to perform only separable operations enables secure quantum bit commitment schemes. Specifically, we prove that in any perfectly hiding bit commitment protocol, an honestly-committing party limited to separable operations will be detected with high probability if they attempt to alter their commitment. To illustrate our findings, we present an example protocol.

18.
PLOS Medicine 2026-05-06

Point-of-care early infant HIV diagnosis at birth in a pragmatic cluster-randomized trial in Mozambique and Tanzania: A comparative cost and cost-effectiveness study

by Kira Elsbernd, Issa Sabi, Ilesh V. Jani, Chishamiso Mudenyanga, Siriel Boniface, Arlete Mahumane, Joaquim Lequechane, Falume Chale, Bindiya Meggi, Kassia Pereira, Raphael Edom, Anange F. Lwilla, W. Chris Buck, Nyanda Elias Ntinyinya, Michael Hoelscher, Till Baernighausen, Arne Kroidl, Stefan Kohler, the LIFE Study Consortium

Background: Timely access to early infant diagnosis (EID) is crucial for newborns with HIV, as late diagnosis can delay lifesaving antiretroviral treatment (ART). We assessed the comparative cost and cost-effectiveness of integrating point-of-care EID at birth into routine care in primary healthcare settings. Methods and findings: This pre-specified secondary analysis was nested in the cluster-randomized LIFE study conducted at 28 primary healthcare facilities in Mozambique and Tanzania from October 2019 to September 2021. We estimated the health system cost of point-of-care birth plus 4–8-week HIV testing (very early infant diagnosis; VEID) compared to standard-of-care (SoC) testing at 4–8 weeks only, both with immediate ART initiation. We assessed the cost-effectiveness of VEID relative to SoC with respect to ART initiation within one week of life using Bayesian hierarchical models. As this is an intermediate outcome, incremental cost-effectiveness ratios (ICERs) cannot be directly compared to available life-year-based cost-effectiveness thresholds. To contextualize results, we derived the minimum life-years gained per early ART initiation required for VEID to meet standard thresholds in a break-even analysis. VEID was associated with a higher cost and resulted in earlier ART initiation than SoC in both countries. In Mozambique, VEID increased the proportion of infants initiating ART within one week of life by 90.0 (95% CrI [67.5, 98.5]) percentage points at an incremental cost of $2,632 (95% CrI [$2,249, $3,062]) per infant with HIV. In Tanzania, VEID increased early ART initiation by 59.9 (95% CrI [20.9, 89.5]) percentage points at an incremental cost of $6,263 (95% CrI [$5,394, $7,243]) per infant with HIV. The ICERs were $2,924 and $10,458 in Mozambique and Tanzania, respectively, and were sensitive to the intrauterine transmission rate. These findings were limited by the lack of long-term health outcome data and reliance on an intermediate outcome. Based on the break-even analysis, we estimated that VEID would need to yield 6–32 life-years gained per additional early ART initiation to meet standard thresholds. Conclusions: Adding birth testing improved early ART initiation but was unlikely to be cost-effective relative to standard thresholds given current prices, vertical transmission rates, and knowledge of long-term health benefits. Cost-effectiveness could be achieved at current costs if early ART translates to substantial long-term health benefits or if targeted to infants at high risk of vertical transmission.
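
As a rough check on the reported figures, an ICER is the incremental cost divided by the incremental effect. The arithmetic below uses only the point estimates quoted in the abstract, with percentage points converted to proportions. It is a back-of-the-envelope sketch: the study's ICERs come from Bayesian hierarchical models, so they need not equal the ratio of point estimates exactly.

```latex
\[
  \mathrm{ICER} \;=\; \frac{\Delta C}{\Delta E}
\]
\[
  \text{Mozambique: } \frac{\$2{,}632}{0.900} \approx \$2{,}924
  \qquad\qquad
  \text{Tanzania: } \frac{\$6{,}263}{0.599} \approx \$10{,}456
  \quad (\text{reported: } \$10{,}458)
\]
```

Here $\Delta C$ is the incremental cost per infant with HIV and $\Delta E$ is the increase in the proportion of infants initiating ART within one week of life, so each ICER reads as the cost per additional early ART initiation.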

19.
arXiv (CS.LG) 2026-06-18

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

arXiv:2606.19212v1. Recent empirical work shows that semantically equivalent paraphrases can fool financial sentiment classifiers: although a paraphrase remains close to the original under a strong reference embedding, it may shift the target model's representation enough to change the predicted class. Existing robustness theory either assumes a single-model threat model or focuses mainly on empirical attack algorithms. We develop a continuous local model of semantic paraphrase perturbations that captures this two-model structure. We show that the worst-case local displacement of the target representation, subject to a proxy-model budget, is governed by the largest generalised eigenvalue of a matrix pencil $(A,B)$ constructed from the Jacobians of the two embedding maps. The resulting attackability index $\lambda^*(x)$ is intrinsic to the local paraphrase geometry and the chosen embedders, yields a closed-form prediction-flip condition for affine readouts, and supports conservative population and finite-sample attackability certificates. For uniform control over classes of affine readouts, we derive a distribution-free VC bound for binary attackability indicators and a scale-sensitive margin bound based on an attackability-adjusted margin that subtracts a local geometric penalty from the standard classifier margin. We also connect the continuous theory to discrete paraphrase search, identify an asymmetry between successful and unsuccessful finite searches, and give a covering condition under which the discrete and continuous settings agree. Finally, we propose an empirical verification framework using soft-token relaxations and generated paraphrase sets to assess the local eigenvalue geometry, prediction-flip condition, and finite-search approximation on a deployed financial-text classifier.
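
A sketch of why a generalised eigenvalue appears, under assumptions the abstract does not spell out: write $J_T$ and $J_P$ for the Jacobians of the target and proxy embedding maps at $x$, take $A = J_T^\top J_T$ and $B = J_P^\top J_P$ with $B$ positive definite, and let $\varepsilon$ be the proxy budget. The symbols $J_T$, $J_P$, $\delta$, $\varepsilon$ and this particular construction of $(A, B)$ are illustrative choices, not the paper's stated definitions.

```latex
\[
  \max_{\delta \,:\, \|J_P\,\delta\|_2^2 \le \varepsilon^2} \|J_T\,\delta\|_2^2
  \;=\;
  \varepsilon^2 \max_{\delta \neq 0}
  \frac{\delta^\top A\,\delta}{\delta^\top B\,\delta}
  \;=\;
  \varepsilon^2\, \lambda^*(x),
\]
\[
  \text{where } \lambda^*(x) \text{ is the largest } \lambda
  \text{ satisfying } A v = \lambda B v \text{ for some } v \neq 0 .
\]
```

The middle expression is a generalised Rayleigh quotient, whose maximum over nonzero $\delta$ is the largest generalised eigenvalue of the pencil when $B$ is positive definite.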

20.
arXiv (CS.CL) 2026-06-19

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.

21.
arXiv (CS.CL) 2026-06-12

Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

Apple's Metal 4.1 exposes a tensor compute path: the Metal Performance Primitives (MPP) matmul2d operation over cooperative_tensor fragments, whose interface is documented but whose hardware behavior is deliberately hidden. The specification states which data-type rows are supported, never whether they are hardware-accelerated, where the operation physically executes, what its accumulator width is, or how it partitions matrix fragments across threads. We present Rigel, an empirical characterization of this path on a single Apple M4 Max (a pre-neural-accelerator generation). Using a checksum-gated, provenance-tracked microbenchmark harness, Rigel recovers eleven facts the v4.1 specification hides or contradicts. The headline finding: the Metal 4.1 fp8 (E4M3) matmul2d is emulated, not accelerated: it sustains 0.94x the throughput of fp16 despite reading half the operand bytes, so on M4 it is a memory-footprint feature, not a performance feature. We further show, via a three-signal triangulation (throughput ceiling, comparison against simdgroup_matrix, and per-rail power attribution), that matmul2d executes entirely on the GPU shader cores with no dedicated matrix datapath and no evidence of Apple Neural Engine routing; that it accumulates in >=fp32; and we reconstruct the opaque 8x8 cooperative_tensor fragment layout Apple documents nowhere. Acting on the characterization, a hand-fused GEMM + bias + GELU kernel beats the decomposed path by +6.5-12.9% in the cache-resident regime. All findings are reproducible from committed MIT-licensed code and per-cell CSVs.

22.
arXiv (CS.CL) 2026-06-19

NEST: Narrative Event Structures in Time for Long Video Understanding

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback, such as a job loss, to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST links these events through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

23.
arXiv (quant-ph) 2026-06-19

Exact Markovian Dissipation Requires Singular Energy Resources

arXiv:2606.19510v1

The Gorini–Kossakowski–Lindblad–Sudarshan (GKLS) equation describes irreversible quantum dynamical semigroups. We show that this description cannot be exact under physically regular energy conditions. We prove that the open-system survival probability under physically regular energy conditions has sublinear decay, whereas any dissipative GKLS semigroup has a linear short-time decay. Hence exact Markovian dissipation requires singular energy resources: an unbounded-below total Hamiltonian or infinite initial energy, and a divergent interaction-energy moment. Therefore, a dissipative time-independent GKLS equation should be regarded as an effective description rather than the exact reduced dynamics of a Hamiltonian dilation satisfying physically regular energy conditions.
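
The contrast between linear and sublinear short-time decay can be illustrated with a standard textbook calculation. This is not the paper's proof: it assumes a pure initial state, bounded operators $H$ and $L_k$ in the GKLS generator, and units with $\hbar = 1$. The symbols $p(t)$, $\gamma_\psi$, and $\Delta H$ are introduced here for illustration. The closed-system line is included only as the familiar finite-variance example of decay slower than linear; the paper's open-system regularity conditions are its own.

```latex
% Survival probability of a pure initial state \rho(0) = |\psi\rangle\langle\psi|
p(t) = \langle \psi | \rho(t) | \psi \rangle , \qquad p(0) = 1 .

% GKLS generator (bounded H and L_k assumed)
\dot\rho = -i[H,\rho] + \sum_k \Big( L_k \rho L_k^\dagger
           - \tfrac12 \{ L_k^\dagger L_k , \rho \} \Big) .

% Initial slope: the commutator term vanishes in the state \psi
\dot p(0) = -\sum_k \Big( \| L_k \psi \|^2
            - | \langle \psi | L_k \psi \rangle |^2 \Big)
          =: -\gamma_\psi \le 0
\quad\Longrightarrow\quad
p(t) = 1 - \gamma_\psi\, t + O(t^2) .

% Closed-system comparison: unitary evolution with finite energy variance
p(t) = \big| \langle \psi | e^{-iHt} | \psi \rangle \big|^2
     = 1 - (\Delta H)^2 t^2 + o(t^2) ,
\qquad (\Delta H)^2 = \langle H^2 \rangle_\psi - \langle H \rangle_\psi^2 .
```

Here $\gamma_\psi \ge 0$ follows from the Cauchy–Schwarz inequality, and a dissipative semigroup has $\gamma_\psi > 0$ for some state, giving a decay that is linear in $t$ at short times.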

24.
arXiv (CS.CL) 2026-06-16

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

Vision-Language-Action (VLA) models have recently demonstrated promising capabilities in learning generalist robot policies from large-scale multimodal data. However, most existing VLA systems are trained and evaluated primarily with English instructions, leaving their ability to understand and execute instructions in other languages largely unexplored. While the underlying large language models often possess multilingual capabilities, it remains unclear whether these capabilities transfer to VLAs during training. In this work, we present the first systematic study of multilingual instruction following in VLA models. We first construct multilingual instructions by extending existing benchmarks with translations of their instructions. Using these instructions, we evaluate several representative VLA models across a range of tasks in simulation settings. Our experiments reveal a significant multilingual gap: models trained primarily on English instructions exhibit substantial performance degradation when evaluated on other languages, even when the underlying language backbone is multilingual. We provide several findings and analyses to understand the multilingual gap. Cross-lingual transfer behavior analysis shows that performance drops correlate with both instruction understanding and action execution. Representation analyses suggest that representation shifts induced by multilingual instructions may contribute to the multilingual gap. Motivated by these findings, we further explore strategies to improve multilingual performance in VLAs. We propose a simple yet effective multilingual fine-tuning approach, Multilingual Principal Component Alignment, which uses Principal Component Analysis to obtain the principal component subspace and aligns the projected multilingual representations, effectively reducing the multilingual performance gap.

25.
arXiv (CS.CL) 2026-06-16

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant token outputs and degraded accuracy. Current methods to mitigate this issue remain limited: training-based approaches require substantial computational resources, while training-free methods rely on well-crafted prompts or unreliable confidence signals. In this work, we investigate early stopping from the perspective of attention distributions and propose a simple method, ASAG, which infers the model's reasoning state and adaptively adjusts the generation strategy. The proposed framework is training-free and plug-and-play, enabling seamless integration into existing LRMs. Extensive experiments on nine benchmarks demonstrate consistent improvements across mainstream LRMs with varying parameter scales, including the DeepSeek-R1-Distill and Qwen3 series. Specifically, ASAG improves average accuracy by 3.2% while reducing the number of generated tokens by nearly 40% across all reasoning tasks on Qwen3-8B.