Academic Intelligence · Curated Daily

Explore the Frontiers of Global Scholarship

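AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.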

01.
medRxiv (Medicine) 2026-06-12

Estimating the effectiveness of syndromic screening at airports for Bundibugyo ebolavirus disease

We used a stochastic simulation model to estimate the effectiveness of combined exit and entry airport screening for Bundibugyo ebolavirus disease (BVD), using natural-history parameters from a Bayesian re-analysis of the 2012 Isiro outbreak. For a 12-hour international flight from DRC or Uganda at 86% screening sensitivity, we estimate 65% of infected travellers would arrive undetected (95% CrI: 38 - 76%). The main driver of this outcome is the relative duration of the incubation period (approximately 7.7 days) and the onset-to-severe-disease interval (approximately 4 days): most infected travellers board before symptom onset and are undetectable by any syndromic screen, whilst those who are symptomatic progress rapidly to illness severe enough to preclude travel. This is compounded during active epidemic growth, when recently exposed (and therefore pre-symptomatic) cases are overrepresented among travellers. Syndromic airport screening offers limited protection against BVD spread via air travel, and should be complemented by outbreak control at source and strengthened clinical surveillance in receiving countries with high travel connectivity to affected areas.

02.
arXiv (CS.AI) 2026-06-16

Green AI Carbon Optimizer: Carbon-Efficient Training Location Recommendation and Global AI Energy Demand Forecasting

arXiv:2606.14707v1 Announce Type: cross Abstract: AI training and deployment consume substantial electricity, but carbon outcomes remain weakly integrated into routine model development decisions. This paper presents Green AI Carbon Optimizer with two primary contributions: (i) a carbon aware cloud region recommendation method for training workloads, and (ii) a power law forecasting pipeline for global AI energy demand. For location recommendation, we combine regional grid carbon intensity, renewable share, and data center Power Usage Effectiveness (PUE) into a unified scoring model across 100+ regions from major cloud providers. For a reference workload (8*A100, 100h), estimated emissions in our sampled regions range from 7.74kg to 272.00kg CO2. Selecting the best region instead of the worst corresponds to a 97.2% reduction relative to the worst case. Ablation shows that ranking by renewable share alone can select regions with higher CO2 emissions than rankings that include grid carbon intensity. For forecasting, we fit a power law relation between parameter count and training energy using 26 anchor models. We combine this fit with scenario assumptions on model growth, hardware efficiency, and training frequency, and evaluate sensitivity to inference ratio and ecosystem scaling. Across scenarios, projected 2030 demand ranges from 7TWh to 1,436TWh under the stated assumptions, highlighting the importance of deployment choices, model scaling discipline, and transparent energy reporting.
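The 97.2% figure follows directly from the two endpoint emissions quoted in the abstract. The sketch below reproduces that arithmetic and adds a back-of-envelope emissions estimate. The helper `estimate_emissions_kg` and its inputs (per-GPU power draw, PUE, grid intensity) are hypothetical illustrations, not the paper's scoring model.

```python
def relative_reduction(best_kg: float, worst_kg: float) -> float:
    """Percentage reduction from choosing the best region instead of the worst."""
    return (worst_kg - best_kg) / worst_kg * 100.0


def estimate_emissions_kg(gpus: int, gpu_kw: float, hours: float,
                          pue: float, grid_kg_per_kwh: float) -> float:
    """Hypothetical estimate: IT energy scaled by PUE, times grid carbon intensity."""
    energy_kwh = gpus * gpu_kw * hours * pue
    return energy_kwh * grid_kg_per_kwh


# Endpoints reported for the reference workload (8*A100, 100h).
reduction = relative_reduction(best_kg=7.74, worst_kg=272.00)
print(f"{reduction:.1f}% reduction")
```

Because PUE and grid intensity multiply, a region with a high renewable share can still score badly if either factor is high, which is consistent with the ablation result in the abstract.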

03.
arXiv (CS.AI) 2026-06-25

BCoughBench: Benchmarking Respiratory Acoustic Foundation Models Under Body-Coupled Wearable Sensor Conditions

arXiv:2606.25116v1 Announce Type: cross Abstract: Respiratory acoustic foundation models (FMs) are benchmarked exclusively on smartphone recordings, yet clinical deployment increasingly targets body-coupled (BC) wearables whose sensors attenuate high-frequency content through tissue and bone, leaving FM reliability uncharacterised. We introduce BCoughBench, evaluating five FMs (OPERA-CT/CE/GT, HeAR, M2D+Resp) on nine classification tasks (AUROC, sensitivity at 95% specificity, Expected Calibration Error) and three age regression tasks (MAE vs. a mean-predictor baseline) across five EBEN-simulated BC sensor conditions on five labeled cough datasets. Mean AUROC declines from 0.785 (smartphone) to 0.689-0.723, degrading most under temple vibration pickup ($\Delta$ = -0.096) and least under the soft in-ear ($\Delta$ = -0.062). No FM meets the clinical sensitivity threshold (Se@Sp95 $\geq$ 0.20) on most disease tasks under any BC sensor. Sex classification on the CIDRZ cohort collapses (AUROC 0.954 to 0.596-0.628, $\Delta$ = -0.341) while COVID detection is nearly unaffected ($\Delta$ = -0.004). Age regression is robust, improving under the forehead accelerometer on CoughVID (MAE 9.61 to 8.97 yr); HeAR leads on regression and demographic tasks, M2D+Resp on disease and characteristic tasks. BCoughBench provides a reproducible framework for FM evaluation under wearable conditions.
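Sensitivity at 95% specificity (Se@Sp95) can be read as follows: pick the lowest score threshold that classifies at least 95% of negatives correctly, then report the fraction of positives above it. The sketch below is a plain-Python illustration under that reading. The function name and the tie-handling rule (positive when score is strictly above the threshold) are assumptions, not taken from the benchmark's code.

```python
def sensitivity_at_specificity(neg_scores, pos_scores, specificity=0.95):
    """Sensitivity at the lowest threshold reaching the target specificity.

    A sample is predicted positive when its score is strictly above the threshold.
    """
    neg = sorted(neg_scores)
    n = len(neg)
    # Candidate thresholds are the negative scores themselves, in ascending order.
    for thr in neg:
        true_negatives = sum(1 for s in neg if s <= thr)
        if true_negatives / n >= specificity:
            hits = sum(1 for s in pos_scores if s > thr)
            return hits / len(pos_scores)
    raise ValueError("neg_scores must not be empty")


# Hypothetical scores: 100 negatives spread over [0, 0.99], four positives.
negatives = [i / 100 for i in range(100)]
positives = [0.5, 0.945, 0.96, 0.99]
print(sensitivity_at_specificity(negatives, positives))
```

Under this definition, the clinical bar quoted in the abstract (Se@Sp95 of at least 0.20) means that at least one in five true cases must score above 95% of the non-cases.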

04.
arXiv (CS.CL) 2026-06-25

Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Long-term human-agent dialogues are organized by topic continuity: adjacent turns often develop the same goal, plan, problem, or event, while related activities may recur across distant sessions. Yet many LLM agent memory systems first decompose histories into isolated turns or fixed-size chunks, then compensate through enrichment, consolidation, or retrieval mechanisms still tied to semantic proximity or fragment-level records. This weakens temporal and causal organization and biases memory access toward semantic proximity rather than task- or topic-level continuity. We introduce Membox, a hierarchical memory architecture that instantiates topic continuity as an explicit organization layer for agent memory. Its Topic Loom incrementally organizes dialogue streams into boxes whose internal turns follow the same local topic, while its Trace Weaver links extracted events across boxes into macro-topic traces that recover recurring activities, goals, and factual developments across distant sessions. On LoCoMo, Topic-Loom-only retrieval improves over the best Mem0/A-MEM retrieval-depth setting by 13.00 F1 points (53.95 vs. 40.95), and trace-expanded retrieval further raises F1 to 55.28; with GPT-4o, trace-expanded retrieval reaches 59.71 F1. Additional DialSim results show the same gain from adding cross-box traces in multi-party dialogue. These results show that local topic-continuity organization and macro-topic trace expansion improve long-range memory beyond semantic retrieval over fragmented records.

05.
arXiv (CS.LG) 2026-06-17

Learning Survival Models with Right-Censored Reporting Delays

arXiv:2510.04421v3 Announce Type: replace-cross Abstract: Survival analysis provides statistical methods to model the time until an event occurs. Reporting delays arise when event times are not observed at their occurrence but are only revealed upon reporting. This issue is particularly critical for timely risk evaluation when the observation window is short due to administrative censoring. In this study, we incorporate right-censored reporting delays by jointly modeling parametric hazards for the event and reporting processes. We then construct a consistent estimator for the model parameters and develop a Monte Carlo expectation-maximization algorithm to compute it. To address the challenges posed by administrative censoring, we leverage these findings and propose a transfer-learning procedure. Experimental results demonstrate that our method improves the accuracy of timely risk evaluation under administrative censoring.

06.
arXiv (CS.CL) 2026-06-24

SHERLOC: Structured Diagnostic Localization for Code Repair Agents

LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedicated localization frameworks have emerged, yet are still evaluated as file retrieval rather than actionable diagnosis, producing locations without the diagnostic context a repair agent needs. We introduce SHERLOC (Structured Hypothesis-driven Exploration and Reasoning for Localization), a training-free framework pairing a reasoning LLM with compact repository tools and self-recovery, without fine-tuning or multi-agent orchestration. SHERLOC reaches state-of-the-art localization across model scales: 84.33% accuracy@1 on SWE-Bench Lite and 81.27% recall@1 on SWE-Bench Verified; at ~30B parameters, it matches or outperforms other agentic methods. Injecting our locations and diagnostic findings into repair agents yields, on average, +5.95 pp resolve rate on SWE-Bench Verified while cutting localization and total tokens by 36.7% and 23.1%.

07.
arXiv (CS.CV) 2026-06-25

Invoice Haystack: Benchmarking Document Retrieval and Visual Question Answering Under Strong Visual Homogeneity

Vision Language Models have achieved near-human performance on single-document Visual Question Answering, yet their effectiveness degrades significantly when retrieving information from large collections of visually homogeneous documents. Existing multi-document benchmarks aggregate diverse document types, creating artificial separation in embedding space that does not reflect enterprise document repositories where thousands of records share identical visual templates. We identify this as embedding collapse and introduce Invoice Haystack, a benchmark with 1,500 anonymized invoice images paired with 200 discriminative question-answer pairs, specifically designed to stress-test retrieval under strong visual homogeneity. Invoice Haystack exhibits a mean pairwise cosine similarity of 0.73, compared to 0.38 (DocHaystack) and 0.31 (InfoHaystack) in existing benchmarks, posing a fundamentally more challenging retrieval problem. Addressing the identified challenge, we propose VL-RAG, a hybrid retrieval-augmented generation framework that jointly leverages text and visual embeddings to harness the complementary strengths of both modalities, followed by a VLM-based verification filter for precise document identification. VL-RAG achieves 60.0% Recall@1 on Invoice Haystack-500, outperforming the existing state-of-the-art method by up to 13.5 absolute percentage points. It further improves retrieval considerably on DocHaystack-1000 (77.1% vs. 75.2%) and InfoHaystack-1000 (84.5% vs. 80.0%), establishing the proposed dual-stream fusion as a consistently superior retrieval strategy across both homogeneous and heterogeneous document collections.
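The homogeneity statistic quoted above, mean pairwise cosine similarity, is the average cosine similarity over all unordered pairs of document embeddings. The sketch below shows the computation in plain Python. The toy vectors are hypothetical, and the abstract does not say which embedding model produced the reported values.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical 2-D embeddings: two identical documents and one orthogonal one.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_pairwise_cosine(docs))
```

A value near 1 means the documents are almost indistinguishable in embedding space, which is why a collection at 0.73 is harder to search than one at 0.31.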

08.
arXiv (CS.LG) 2026-06-11

Impact of Connectivity on Laplacian Representations in Reinforcement Learning

arXiv:2603.08558v3 Announce Type: replace Abstract: Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.

09.
arXiv (quant-ph) 2026-06-11

An Introduction to the Foundations and Interpretations of Quantum Mechanics

arXiv:2603.09818v2 Announce Type: replace Abstract: This article surveys a selection of key conceptual and interpretational developments in quantum mechanics, tracing the theory from its foundational postulates to contemporary discussions of measurement, nonlocality, and the emergence of classicality. Beginning with the structure of Hilbert space and the postulates governing state evolution and measurement, the epistemic stance of the Copenhagen interpretation and its modern reformulations are examined. The Einstein-Podolsky-Rosen argument, Bell's theorem, and Hardy's paradox are then discussed as probes of locality and realism, alongside the deterministic but explicitly nonlocal de Broglie-Bohm theory. The measurement problem and the implications of contextuality are analyzed in relation to objective collapse models, which introduce new physical dynamics to account for definite outcomes. Finally, the role of decoherence in the suppression of interference and the emergence of classical behavior is explored, together with the interpretational frameworks of many-worlds and consistent histories. This material aims to provide a coherent introductory overview of how several of the most prominent interpretations address the central concern of what quantum mechanics tells us about the nature of physical reality.

10.
arXiv (quant-ph) 2026-06-11

Entanglement generation between field modes mediated by a fluctuating conducting wall

arXiv:2606.12338v1 Announce Type: cross Abstract: We consider a movable conducting plate of finite mass, between two fixed ones, whose mechanical degrees of freedom are treated quantum-mechanically and bound to its equilibrium position by a harmonic potential. The movable wall is thus subjected to quantum fluctuations of its position. This creates a system of two sub-cavities separated by the movable fluctuating plate, and two massless one-dimensional scalar fields, one in each sub-cavity. This system is described by an appropriate generalization of the Law Hamiltonian. The presence of the movable wall yields an effective plate-fields interaction, as well as an effective interaction between the field modes. We obtain, at the second order in perturbation theory, the ground state of the interacting system and the reduced density operator of the fields in each sub-cavity by tracing out the wall's degrees of freedom. We calculate the entanglement between two field modes, one in each cavity, by evaluating analytically the negativity; we then evaluate numerically also the total multimode negativity. Our results show that in both cases the fields in the two sub-cavities are entangled, in contrast to the case in which the wall is fixed in space. We discuss the amount of the field entanglement present as a function of relevant physical parameters of the system such as the mass and oscillation frequency of the movable wall, its distance from the fixed walls and the frequencies of the field modes considered.

11.
arXiv (math.PR) 2026-06-18

The FBSDE approach to sine-Gordon up to $6\pi$

arXiv:2401.13648v3 Announce Type: replace-cross Abstract: We develop a stochastic analysis of the sine-Gordon Euclidean quantum field $(\cos (\beta \varphi))_2$ on the full space up to the second threshold, i.e. for $\beta^2 < 6 \pi$. The basis of our method is a forward-backward stochastic differential equation (FBSDE) for a decomposition $(X_t)_{t \geqslant 0}$ of the interacting Euclidean field $X_{\infty}$ along a scale parameter $t \geqslant 0$. This FBSDE describes the optimiser of the stochastic control representation of the Euclidean QFT introduced by Barashkov and one of the authors. We show that the FBSDE provides a description of the interacting field without cut-offs and that it can be used effectively to study the sine-Gordon measure to obtain results about large deviations, integrability, decay of correlations for local observables, singularity with respect to the free field, Osterwalder-Schrader axioms and other properties.

12.
arXiv (CS.CV) 2026-06-24

Boosting Text-Driven Video Segmentation via Geometry-Aware Distillation

Text-driven Referring Video Object Segmentation (RVOS) aims to locate and segment target objects in videos given natural language. However, existing models are typically trained on 2D image or video datasets with naive segmentation losses, which overlooks the geometric consistency across frames and leads to weak spatial understanding. In this paper, we propose Geometry-enhanced Language-guided Video segmentation (GeoLaV), a two-stage framework that distills 3D geometric knowledge from images to enhance text-driven video segmentation. In the first stage, we perform monocular geometry pretraining with monocular novel-view synthesis, enabling the model to acquire geometry-consistent visual representations via spatial alignment on large-scale single-image datasets. In the second stage, we introduce geometry-aware distillation and fine-tune the model on video segmentation datasets, transferring 3D structural knowledge from a general 3D prior model. This process reinforces 3D awareness and improves both spatiotemporal coherence and language grounding in segmentation. Extensive experiments show that our method using only image segmentation data already provides notable zero-shot generalization in RVOS. When combined with geometry-aware distillation for fine-tuning on videos, our method achieves state-of-the-art performance across multiple RVOS benchmarks. The code is available at https://github.com/Tony1882880/GeoLaV.

13.
arXiv (quant-ph) 2026-06-11

Holographic Complexity, Extremality, and Cosmic Censorship

arXiv:2604.20170v2 Announce Type: replace-cross Abstract: We propose a holographic complexity origin for the third law of black-hole mechanics and weak cosmic censorship. In both complexity equals action and complexity equals volume prescriptions, the relative complexity between subextremal and extremal AdS black holes diverges logarithmically. For overcharged RN-AdS, explicit calculations in both prescriptions show that the near-singularity action terms are power-law divergent or finite, while the maximal-volume contribution is finite. Thus, the extremal-to-naked relative complexity also diverges, obstructing finite-time transitions.

14.
arXiv (CS.LG) 2026-06-24

One Ruler: A Same-Hands Re-Evaluation of Bivariate Causal Direction on Tuebingen, with a Parameter-Free Compression Baseline

arXiv:2606.23767v1 Announce Type: new Abstract: Headline accuracies on the Tuebingen cause-effect pairs are routinely compared across papers even though each is measured under its authors' own protocol – different pair subsets, weightings, model-selection, and decision rates. We argue this is the wrong comparison and run the right one: a same-hands re-evaluation in which every method is run by us on the identical 102 pairs, with one strict rule – no tuning and a decision forced on every pair. As a clean reference point we introduce a deliberately minimal baseline: sorted-conditional compression, which feeds quantized, sorted, first-differenced data to an off-the-shelf compressor (bz2) and has zero fitted parameters. Under the common ruler the ranking differs sharply from the literature. Our baseline reaches 74.7% weighted accuracy (p = 3.7e-7); on the same 100 pairs that SLOPE is evaluated on it scores 76.0%, a 1.2-point gap below the authors' own forced-decision SLOPE (77.2%) that is well inside noise (McNemar p = 0.39). A faithful re-run of RECI lands at 70.7% – inside the original authors' reported error bar, not the 77.5% often quoted (which we trace to a mis-copied cell). SLOPE's published 82.4% is a decided-subset figure: scoring the authors' own stored output only on the pairs its significance test chose to answer reproduces 81.7%. Under the common ruler the methods cluster in the low-to-mid 70s and the zero-parameter compressor ties the strongest of them. We document the mechanisms that inflate published figures (test-set model selection, significance-gated abstention) and contribute two further results: compression score magnitude is a model-free confounding flag (p = 2.8e-68), and a pre-registered falsification test fails in an instructive way that bounds the method's theoretical interpretation. Code, pre-registrations, and per-pair outputs are released.
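One possible reading of the sorted-conditional compression baseline is sketched below: order one variable by the other, quantize it, take first differences, and measure the bz2-compressed size. Everything beyond the use of `bz2` is an assumption made for illustration, including the 256-level quantization, the byte encoding, and the decision rule that picks the direction with the smaller compressed size. The paper's released code should be treated as the reference.

```python
import bz2


def quantize(values, levels=256):
    """Map values linearly onto integer bins 0..levels-1 (fixed, not fitted)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [int((v - lo) / span * (levels - 1)) for v in values]


def conditional_cost(cond, target, levels=256):
    """Compressed size in bytes of target after sorting by cond,
    quantizing, and first-differencing."""
    order = sorted(range(len(cond)), key=lambda i: cond[i])
    q = quantize([target[i] for i in order], levels)
    diffs = [b - a for a, b in zip(q, q[1:])]
    payload = bytes(d % 256 for d in diffs)  # wrap each difference into one byte
    return len(bz2.compress(payload))


def guess_direction(x, y):
    """Assumed rule: prefer the direction whose conditional cost is smaller.
    Ties go to x->y so that a decision is forced on every pair."""
    return "x->y" if conditional_cost(x, y) <= conditional_cost(y, x) else "y->x"


# Hypothetical data: y is a deterministic function of x.
x = [i / 50 for i in range(200)]
y = [v ** 3 for v in x]
print(conditional_cost(x, y), conditional_cost(y, x), guess_direction(x, y))
```

The sketch has no fitted parameters and never abstains, which matches the two constraints stated in the abstract. Whether its output agrees with the authors' baseline on any given pair is not something this sketch can claim.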

15.
arXiv (CS.CL) 2026-06-25

Spam and Sentiment Detection in Arabic Tweets Using MARBERT Model

Saudi Telecom Company (STC) is among the most popular companies in Saudi Arabia, with many customers. Yet there is still considerable room for improvement in user satisfaction. Social media is a robust platform for gauging user satisfaction and determining users' sentiments and criticisms. Twitter is among the most popular social media platforms in this regard. STC customers prefer to use Twitter to write their feedback because the STC customer service account makes it a fast way to get responses. One way to meet customer demands and improve customer service is to use sentiment analysis. Sentiment analysis on Twitter is widely used because of the significant number of tweets and the diversity of opinions. Likewise, deep learning is the best existing sentiment analysis method, and it has diverse models. The Bidirectional Encoder Representations from Transformers (BERT) model is one of the deep learning models that have achieved excellent results in sentiment analysis for Natural Language Processing (NLP). NLP has mainly been investigated for the English language. However, for Arabic, there is a significant gap to be filled. This study trained the proposed model using MARBERT and measured performance using F1-score, precision, and recall. We trained the model with an Arabic dataset of 24,513 tweets, including 1,437 positive, 13,828 negative, 5,694 neutral, 1,221 sarcasm, and 2,297 indeterminate tweets. The main goal is to analyze the tweets and extract their sentiment to improve STC customer service. The proposed scheme is promising in terms of accuracy compared with existing techniques in the literature.

16.
arXiv (CS.AI) 2026-06-24

Audio-visual Contrastive Alignment for Diffusion-based Visual-conditioned Speech Enhancement

arXiv:2606.23712v1 Announce Type: cross Abstract: Audio-visual speech enhancement (AVSE) exploits visual cues such as lip movements to recover speech in noisy environments. Recent work introduced diffusion-based unsupervised AVSE, where a speech diffusion model conditioned on visual features via cross-attention is trained and used as a data-driven prior for posterior sampling-based speech enhancement. Despite promising performance over its audio-only counterpart, the impact of explicitly enforcing cross-modal alignment in the fusion remains unclear. In this work, we propose to augment the diffusion training objective with a contrastive audio-visual loss to encourage stronger use of visual information while keeping the posterior sampling framework unchanged. Experiments across matched and mismatched test data show consistent improvements in interference suppression, signal reconstruction, and perceptual quality, with the largest gains at low SNRs. Code is available at https://github.com/cexauce/AV-CA-DiffUSE

17.
arXiv (CS.CV) 2026-06-11

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

18.
medRxiv (Medicine) 2026-06-19

Reassessing Instrument Strength in Two-Sample Mendelian Randomization Analysis

Mendelian randomization (MR) analysis is widely used to estimate causal relationships between risk factors and outcomes of interest. Two-sample MR approaches have gained increasing attention in genetic epidemiology due to the growing availability of Genome-Wide Association Study (GWAS) summary statistics from public databases. A critical step in two-sample MR is the selection of genetic variants as instrumental variables (IVs). Although genome-wide significant variants are typically preferred, the inclusion of variants with weaker association p-values is considered, as they may potentially improve power through an increased number of instruments, while they may introduce weak instrument bias and attenuate effect estimates towards the null. Our simulation results show that even modest levels of pleiotropy substantially increase the variability of causal effect estimates, while the inclusion of weak IVs does not substantially affect the direction and variability of causal effect estimates in most cases. In real data analyses, we used two released versions of FinnGen GWAS summary statistics with different sample sizes as exposure GWASs to assess the influence of weak IVs. Here, the inclusion of IVs with higher exposure-association p-values resulted in weakened estimated effect sizes, particularly when the exposure GWAS sample size was small. These findings suggest that incorporating weak IVs is reasonable when the exposure GWAS sample size is large, but it poses a risk of falsely concluding null associations when the exposure GWAS sample size is small.
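For readers unfamiliar with the quantities involved, the sketch below computes two standard summary-statistic measures used in two-sample MR: the per-variant F-statistic approximation (squared ratio of the exposure effect to its standard error) as a gauge of instrument strength, and the inverse-variance weighted (IVW) causal estimate. The abstract does not state which estimator the authors used, so IVW is shown only as a common choice, and the numbers are hypothetical.

```python
def f_statistic(beta_x, se_x):
    """Approximate per-variant instrument strength from exposure summary statistics."""
    return (beta_x / se_x) ** 2


def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted estimate: weighted regression of outcome
    effects on exposure effects through the origin, weights 1 / se_y**2."""
    num = sum(bx * by / s ** 2 for bx, by, s in zip(beta_x, beta_y, se_y))
    den = sum(bx ** 2 / s ** 2 for bx, s in zip(beta_x, se_y))
    return num / den


# Hypothetical summary statistics for three variants.
beta_x = [0.10, 0.20, 0.05]   # variant-exposure effects
se_x = [0.02, 0.03, 0.03]     # their standard errors
beta_y = [0.05, 0.10, 0.02]   # variant-outcome effects
se_y = [0.01, 0.02, 0.02]     # their standard errors

strengths = [f_statistic(b, s) for b, s in zip(beta_x, se_x)]
print(strengths, ivw_estimate(beta_x, beta_y, se_y))
```

A variant whose F-statistic falls below the conventional threshold of 10 is usually called a weak instrument. In this toy data the third variant falls below it.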

19.
arXiv (CS.LG) 2026-06-12

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

arXiv:2606.12917v1 Announce Type: new Abstract: We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in context learning mechanism, which encodes task structure through context dependent attention rather than the stable parametric directions that make steering tractable in language models.

20.
arXiv (CS.CL) 2026-06-17

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs – LLaMA, GPT-4o-mini, and MedGemma – we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

21.
arXiv (CS.AI) 2026-06-25

Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces

arXiv:2605.10913v3 Announce Type: replace Abstract: As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that create, operate on and manage other agents. Meta-agent operations such as coordinating agents, halting risky actions before execution, or repairing failed runs, require runtime manipulation of agentic execution. Yet existing agentic substrates make this difficult: they expose only transcripts and environment snapshots, forcing meta-agents to build ad hoc tooling to reconstruct and operate over full execution state. Therefore, we introduce Shepherd, a Python substrate grounded in functional programming principles, where an agent's execution is itself a first-class object that a meta-agent can easily inspect and transform. Every model action, tool call, and environment change becomes a structured event in a reversible, Git-like execution trace, where any past state can be reverted 5x faster than docker commit and fork. Three example use cases show Shepherd's versatility: (1) a supervisor meta-agent prevents conflicts among parallel coding agents, lifting pair-coding pass rate from 28.8% to 54.7% on CooperBench; (2) a counterfactual optimization meta-agent repairs agent workflows by proposing edits and replaying runs from the point of changed behavior, outperforming MetaHarness on Terminal-Bench 2.0 by 12.8% with 58% lower wall-clock; (3) a training meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's uplift on Terminal-Bench 2.0. We open-source Shepherd to enable principled and efficient operations over agentic execution for both users and meta-agents.
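The idea of treating an execution as a reversible, forkable object can be illustrated with an immutable event trace. The sketch below is not Shepherd's API: the `Event` and `Trace` classes and their methods are hypothetical names chosen only to show why immutability makes reverting and forking cheap.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "model_action", "tool_call", "env_change"
    payload: str


@dataclass(frozen=True)
class Trace:
    events: Tuple[Event, ...] = ()

    def record(self, kind: str, payload: str) -> "Trace":
        """Return a new trace with one more event; the original is untouched."""
        return Trace(self.events + (Event(kind, payload),))

    def revert(self, n_events: int) -> "Trace":
        """Return the trace as it stood after its first n_events events."""
        return Trace(self.events[:n_events])


# A run takes two steps; a supervisor forks from the state after step one.
run = Trace().record("model_action", "plan edit").record("tool_call", "write file A")
fork = run.revert(1).record("tool_call", "write file B")

print(len(run.events), len(fork.events), run.events[0] == fork.events[0])
```

Because no trace is ever mutated, the original run and the fork share their common prefix and can be inspected or replayed independently.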

22.
arXiv (CS.LG) 2026-06-16

Near-Optimal Stochastic Linear Bandits with Delay

arXiv:2606.16656v1 Announce Type: new Abstract: We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but potentially depends on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension. (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.

23.
arXiv (CS.AI) 2026-06-15

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

arXiv:2605.07984v2 Announce Type: replace-cross Abstract: We study planning site formation in language models – where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Through two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~90% of the rhyme-routing capacity at the newline.

24.
arXiv (CS.LG) 2026-06-18

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

arXiv:2606.18650v1 Announce Type: new Abstract: As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.
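
The excess-loss form that the abstract refers to can be sketched as a simple scoring rule: rank examples by how much the proxy model's loss exceeds a reference model's loss and keep the top ones. The sketch below is an assumption-laden illustration, not BLADE itself: `select_by_excess_loss` is a hypothetical helper, the loss values are made up, and neither the dynamic reference update nor the randomized block-coordinate Frank-Wolfe selection is reproduced.

```python
def select_by_excess_loss(proxy_loss, reference_loss, k):
    """Return indices of the k examples with the largest excess loss.

    Excess loss is proxy_loss[i] - reference_loss[i]. In a static scheme the
    reference losses come from a fixed checkpoint; per the abstract, BLADE
    instead uses a reference that stays synchronized with training, which in
    this sketch would mean recomputing reference_loss at every step.
    """
    excess = [p - r for p, r in zip(proxy_loss, reference_loss)]
    # Sort example indices by excess loss, largest first.
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return ranked[:k]


# Illustrative per-example losses for one candidate batch (made-up numbers).
proxy_loss = [2.0, 0.5, 1.5, 3.0]
reference_loss = [1.9, 0.4, 0.5, 1.0]

chosen = select_by_excess_loss(proxy_loss, reference_loss, k=2)
print(chosen)  # [3, 2]
```

Examples 3 and 2 are selected because their excess losses (2.0 and 1.0) are far larger than those of examples 0 and 1 (about 0.1 each).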

25.
bioRxiv (Bioinfo) 2026-06-23

FateLimit quantifies the prediction horizon of cell fate

Single-cell technologies have enabled increasingly detailed reconstruction of developmental trajectories, yet a fundamental question remains unresolved: when does future cellular identity become predictable from a cell's current molecular state? Existing approaches infer lineage relationships, transition probabilities or future transcriptional dynamics, but do not directly quantify the emergence of fate predictability during cellular state transitions. Here we present FateLimit, an information-theoretic framework for measuring the temporal dynamics of cell-fate predictability from single-cell omics data. FateLimit combines probabilistic fate assignment, fate entropy and mutual information to quantify how information about future cellular outcomes is encoded in present molecular states. We introduce two quantitative descriptors: the Fate Information Half-Life (FIHL), which measures the characteristic timescale of fate-information dynamics, and the Prediction Horizon (PH), defined as the earliest developmental stage at which observed fate predictability exceeds the 95th percentile of a permutation-derived null distribution. We applied FateLimit across developmental, lineage-tracing and reprogramming systems, including pancreatic endocrinogenesis, CellTag reprogramming, human hematopoiesis and zebrafish embryogenesis. Across all datasets, FateLimit identified significant fate information and reproducible prediction horizons that were robust to cell-state representation, lineage structure and biological context. Comparative analysis revealed that prediction horizons differ substantially among cellular lineages, indicating that distinct developmental programs acquire predictive information at different rates. FateLimit establishes a general framework for quantifying the predictability of future cellular identity from present molecular states.
By transforming developmental trajectories into predictability landscapes, FateLimit enables systematic comparison of commitment dynamics across biological systems and establishes prediction horizons as a quantitative measure of cell-fate determination.
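
The Prediction Horizon definition above (the earliest stage at which fate predictability exceeds the 95th percentile of a permutation null) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the FateLimit implementation: cell states are assumed to be discrete labels such as cluster assignments, predictability is measured by a plug-in mutual information estimate, and the function names, stage names and data are all hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(states, fates):
    """Plug-in estimate of I(state; fate) in bits for two label sequences."""
    n = len(states)
    joint = Counter(zip(states, fates))
    n_state = Counter(states)
    n_fate = Counter(fates)
    mi = 0.0
    for (s, f), c in joint.items():
        mi += (c / n) * math.log2(c * n / (n_state[s] * n_fate[f]))
    return mi


def prediction_horizon(stages, n_perm=200, seed=0):
    """Earliest stage whose MI exceeds the 95th percentile of a permutation null."""
    rng = random.Random(seed)
    for name, states, fates in stages:
        observed = mutual_information(states, fates)
        null = []
        for _ in range(n_perm):
            shuffled = list(fates)
            rng.shuffle(shuffled)  # break any state-fate association
            null.append(mutual_information(states, shuffled))
        null.sort()
        # Nearest-rank 95th percentile of the null distribution.
        threshold = null[(95 * n_perm + 99) // 100 - 1]
        if observed > threshold:
            return name
    return None


# Toy data: at the early stage, state carries no information about fate;
# at the late stage, state determines fate exactly.
early_states = ["a", "b"] * 20
early_fates = ["x", "x", "y", "y"] * 10
late_states = ["a"] * 20 + ["b"] * 20
late_fates = ["x"] * 20 + ["y"] * 20

horizon = prediction_horizon([
    ("early", early_states, early_fates),
    ("late", late_states, late_fates),
])
print(horizon)  # late
```

In the toy example the early stage has exactly zero mutual information, so it can never exceed the null threshold, while the late stage carries one full bit and is reported as the horizon.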