Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-16

Stochastic trace estimation with tensor train random vectors

arXiv:2606.15679v1 Announce Type: cross Abstract: Stochastic trace estimation is a standard tool for approximating the trace of a large-scale matrix available only through matrix-vector products. However, in tensor-structured settings, unstructured Gaussian or Rademacher test vectors may be prohibitively expensive to store and compute with, while cheaper rank-one tensor-product vectors can require sample complexities that grow exponentially with the tensor order. This work studies Gaussian random tensor train vectors as a structured alternative for stochastic trace estimation. We show that, with a suitable choice of the tensor train rank, random tensor train vectors recover dimension-independent guarantees for the Girard–Hutchinson estimator. In particular, a median-of-means variant with tensor train rank $r \geq d-1$ achieves the same dependence on the accuracy $\varepsilon$ and failure probability $\delta$ as the classical estimator based on unstructured Gaussian vectors. We further prove an oblivious subspace injection result for sketches formed from independent Gaussian random tensor train vectors: tensor train rank $r\geq d-1$ and $\mathcal{O}(\varepsilon^{-2}(k+\log(1/\delta)))$ samples suffice for a $k$-dimensional target subspace. Finally, we investigate the use of such sketches within the Nystr\"{o}m++ framework. We show that the resulting estimator can achieve the desired $\mathcal{O}(\varepsilon^{-1})$ sample complexity under an additional spectral-tail condition. These results provide clarififcation on both the potential and the limitations of random tensor train vectors in stochastic trace estimation.

02.
arXiv (CS.AI) 2026-06-11

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

arXiv:2606.10046v2 Announce Type: replace-cross Abstract: Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about ~25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.

03.
arXiv (CS.CL) 2026-06-16

Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

04.
arXiv (CS.AI) 2026-06-15

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

arXiv:2606.07157v2 Announce Type: replace Abstract: Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.

05.
arXiv (CS.AI) 2026-06-11

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

arXiv:2603.09555v2 Announce Type: replace-cross Abstract: High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x–36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within +/-0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at https://github.com/CosmoNaught/mamba2-jax.

06.
arXiv (CS.CL) 2026-06-19

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

07.
arXiv (CS.LG) 2026-06-11

Mechanisms of Introspective Awareness

arXiv:2603.21396v5 Announce Type: replace Abstract: Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept – a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream "gate" features that implement a default negative response. This circuit is absent in base models and robust to refusal ablation. Identification of injected concepts relies on largely distinct later-layer mechanisms that only weakly overlap with those involved in detection. Finally, we show that introspective capability is substantially underelicited: ablating refusal directions improves detection by +53%, and a trained bias vector improves it by +75% on held-out concepts, both without meaningfully increasing false positives. Our results suggest that this introspective awareness of injected concepts is robust and mechanistically nontrivial, and could be substantially amplified in future models. Code: https://github.com/safety-research/introspection-mechanisms.

08.
arXiv (CS.CV) 2026-06-12

Reinforcement Learning for Neural Model Editing

作者:

Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.

09.
arXiv (CS.CL) 2026-06-16

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

10.
arXiv (math.PR) 2026-06-19

Optimal Sparsification of Gaussian Processes

arXiv:2606.19763v1 Announce Type: new Abstract: We prove an optimal dimension-free sparsification theorem for suprema of centered Gaussian processes. Given a bounded set $T\subseteq\mathbb{R}^n$, we show that the supremum of the canonical Gaussian process on $T$ can be $L^2$-approximated by the supremum of a shifted subprocess indexed by only $\exp(O(1/\varepsilon^2))$ points, with error at most $\varepsilon$ times the Gaussian width of $T$. In particular, the size of the approximating process is independent of both the ambient dimension and the cardinality of the original index set. This improves a recent sparsification theorem of De, Nadimpalli, O'Donnell, and Servedio (2026) by an exponential factor, and we show that the dependence on $\varepsilon$ is tight up to constants in the exponent. As consequences, we obtain an exponentially improved junta theorem for norms over Gaussian space and sharpen results on learning, property testing, and polyhedral approximation of convex sets under the Gaussian measure. The proof is based on an interpolation argument that combines Sudakov's minoration with the Brascamp–Lieb inequality.

11.
arXiv (CS.CV) 2026-06-17

Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.

12.
arXiv (CS.CV) 2026-06-17

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.

13.
bioRxiv (Bioinfo) 2026-06-21

OracleScreen-LILRB4: Machine Learning-Guided Discovery of Myeloid Immune Checkpoint Binders Validated in Patient-Derived Cells

The identification of small molecule modulators of immune checkpoint proteins remains a significant challenge in drug discovery due to the flat, featureless nature of protein-protein interaction interfaces and the characteristically low hit rates observed in conventional high-throughput screening campaigns. Here we report OracleScreen-LILRB4, an ensemble machine learning framework trained on quantitative biophysical screening data from two structurally diverse compound libraries (19,800 compounds total) screened against the myeloid immune checkpoint leukocyte immunoglobulin-like receptor B4 (LILRB4/ILT3). By formulating binding prediction as a regression task targeting continuous {Delta}Fnorm values rather than binary hit classifications, OracleScreen-LILRB4 achieved a mean Spearman R of 0.61 and ROC-AUC of 0.86 under scaffold-aware cross-validation. Prospective virtual screening of a 45,760-member compound library and experimental validation of the top 200 predictions yielded a 28.5% hit rate, representing a 15.0-fold enrichment over baseline, with 16 compounds demonstrating nanomolar-affinity LILRB4 (ILT3) engagement. Lead compounds ORS-22 and ORS-14 restored anti-tumor immune activity across patient-derived colorectal cancer and acute myeloid leukemia co-culture systems, reversing SCG2-mediated immunosuppression and recovering cytotoxic T-cell function. These findings establish OracleScreen-LILRB4 as an effective computational framework for accelerating small molecule discovery against non-enzymatic immune checkpoint targets.

14.
arXiv (CS.CL) 2026-06-17

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

The rapid rise in popularity of large language models (LLMs) for translation calls for a thorough study of the reliability of their confidence in their own outputs. Unlike many generation tasks, translation errors and confidence levels can be useful at different levels of granularity (tokens, words, or spans). Unsupervised approaches based on internal signals like predicted probabilities can be misleading because they reflect certainty among alternatives rather than correctness. In addition, they require access to such internal signals. Here, we devise five verbalized methods of extracting an LLM's per-token confidence without those shortcomings and compare their reliability with that of the model's internal signals of certainty. We evaluate reliability using two forms of alignment: fine-grained error detection and calibration. For both, internal and verbalized methods perform similarly, although results vary by model. Interestingly, we find little to no correlation between internal and verbalized methods.

15.
arXiv (CS.CL) 2026-06-16

StagePilot: Stage-Level Planning for Long-Horizon Dialogue Simulation in Cybergrooming

Cybergrooming is an evolving threat to youth, requiring proactive educational interventions. We address this by modeling dialogue progression as a structured planning problem over stage-wise interactions. We propose StagePilot, a dialogue framework that separates stage-level planning from response generation, in which the model selects the next stage under constrained transitions and generates responses conditioned on it, enabling coherent and realistic progression. Reinforcement learning is used to learn stage-level policies from offline data, optimizing for both emotional alignment and goal-consistent progression. Our empirical experiments show that StagePilot generates more structured, coherent dialogue trajectories and reduces conversational stagnation compared to baselines; notably, the IQL+AWAC variant reaches the final stage more often while maintaining over 70% positive or neutral responses, yielding a 43% relative improvement.

16.
arXiv (CS.AI) 2026-06-16

Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

作者:

arXiv:2606.14923v1 Announce Type: new Abstract: As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

17.
arXiv (quant-ph) 2026-06-16

Quantum Algorithm for Open-System Battery Cathodes by Modeling Multiple Strongly Coupled Holstein Polarons with Chain-Mapped Caldeira-Leggett Dynamics

arXiv:2606.16017v1 Announce Type: new Abstract: Cathode lithiation occupies a chemical regime of tightly localized orbitals, narrow bandwidths, and strong electron-lattice coupling. The defining electrochemical observables (open-circuit voltage and differential capacity) are open-system, reservoir-equilibration quantities that closed-Hamiltonian quantum simulation cannot produce, set by exchange with electron, Li$^+$, and phonon baths. We present a fault-tolerant quantum algorithm that recovers them through a unitary chain-mapped Caldeira-Leggett embedding, rendering the baths Trotterizable. The resulting fourth-order Trotter step has a T-gate count polynomial in system size, validating its open-system dynamics against hierarchical equations of motion (HEOM) at strong coupling and the Lindblad limit at weak coupling. For single-carrier olivine LiFePO$_4$, a single voltage anchor on an otherwise DFT-fixed Hamiltonian places the differential-capacity peak within the $\pm5$ mV reproducibility of the experimental plateau. For multi-carrier spinel LiMn$_2$O$_4$, whose $1{:}1$ Mn$^{3+}$/Mn$^{4+}$ filling makes the inter-site Coulomb repulsion dynamically active, the same kernel yields a two-plateau voltage curve with a $125$ mV split, within $17\%$ of the observed $150$ mV. We deliver an end-to-end fault-tolerant resource estimate for such a multi-carrier, three-reservoir observable: $368$ logical qubits and $\sim3\times10^5$ T-gates per step, or $\sim1.7\times10^{12}$ T-gates for a full voltage curve (parallelizable over $\sim10^3$ trajectories), leaving the production-scale dynamical run as a milestone for future hardware. The same kernel reproduces macroscopic quantum coherence, two-band superconductivity, and the Mikheyev-Smirnov-Wolfenstein resonance without modification, placing dynamical battery chemistry and similar Hamiltonians within scope for fault-tolerant quantum simulation.

18.
bioRxiv (Bioinfo) 2026-06-21

Antibody-Antigen Affinity Prediction with Chain-Aware Protein Language Modeling

Motivation: Antibody-antigen affinity determines which antibodies advance in therapeutic discovery, repertoire analysis and affinity maturation, but experimental measurements are sparse relative to the scale of sequence libraries. Structure-based predictors can exploit interface geometry when reliable complexes are available, yet early discovery often requires ranking many heavy-light chain pairs against antigens for which no complex structure exists. Existing sequence-based models are scalable, but frequently compress heavy and light chains into a single antibody representation or concatenate antibody and antigen features obscuring the chain-specific and epitope-specific signals that drive binding. Results: We present AbAffinity, a sequence-only chain-aware three-stream architecture that maintains heavy chain, light chain and antigen as distinct streams. It integrates frozen ESM-2 embeddings with heavy-chain CDR-focused pooling, heavy-light self-attention, adaptive fusion gating and gated cross-attention, training only a compact interaction module. On the SAAINT-DB benchmark, AbAffinity achieves strong predictive performance under ten-fold cross-validation and maintains robust accuracy on novel antigens. It consistently outperforms recent sequence-based models across external benchmarks including SAbDab, AB-Bind and SKEMPI 2.0. Ablation studies highlight the contributions of chain-specific representations, CDR-focused pooling and the gated interaction pathway. Integrated Gradients attributions recover known paratope and epitope residues at structurally validated interfaces. AbAffinity provides a lightweight, explainable sequence-first framework for antibody triage and prioritisation when structural information is limited or unavailable.

19.
arXiv (CS.AI) 2026-06-12

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

arXiv:2602.07294v4 Announce Type: replace-cross Abstract: With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations, temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency–limitations that existing benchmarks have yet to formally categorize or quantify.

20.
arXiv (CS.AI) 2026-06-17

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

arXiv:2606.17513v1 Announce Type: cross Abstract: Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.

21.
arXiv (CS.LG) 2026-06-12

Hierarchical Successor Representation for Robust Transfer

arXiv:2602.12753v2 Announce Type: replace Abstract: The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.

22.
arXiv (math.PR) 2026-06-12

A mathematical study of the excess growth rate

arXiv:2510.25740v2 Announce Type: replace-cross Abstract: The excess growth rate, defined as the gap in Jensen's inequality for the logarithm, is a fundamental functional in portfolio theory. In this paper, we present a mathematical study motivated by information theory. We begin by establishing its properties and showing that it has rich connections with information theoretic concepts such as the Helmholtz free energy, L. Campbell's measure of average code length and large deviations. Our main results consist of three axiomatic characterization theorems of the excess growth rate, in terms of (i) the relative entropy, (ii) the gap in Jensen's inequality, and (iii) the logarithmic divergence that generalizes the Bregman divergence. Furthermore, we study maximization of the excess growth rate and compare it with the growth optimal portfolio. Our results not only provide theoretical justifications of the significance of the excess growth rate, but also establish new connections between information theory and quantitative finance.

24.
arXiv (CS.LG) 2026-06-11

MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

arXiv:2606.11868v1 Announce Type: new Abstract: De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.

25.
arXiv (CS.CV) 2026-06-11

Brain-IT-VQA: From Brain Signals to Answers

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.