Academic Intelligence · Curated Daily

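Explore the Frontiers of Global Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.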

01.
arXiv (CS.CV) 2026-06-16

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.

02.
arXiv (CS.AI) 2026-06-19

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

arXiv:2606.19771v1 · Announce Type: new

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
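
For readers unfamiliar with the quantities named in this abstract, the sketch below computes the standard textbook definitions of Shannon entropy, second-order Rényi entropy, and Jensen-Shannon divergence for next-token distributions obtained from logits. It is not the ICT method: the abstract does not say which pairs of distributions are compared or how tokens are ranked, so the example logits and all function names here are illustrative assumptions.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi2_entropy(p):
    """Second-order Renyi entropy: H_2(p) = -log sum_i p_i^2, in nats."""
    return -math.log(sum(x * x for x in p))

def kl_divergence(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    """JS(p, q) = 0.5 KL(p || m) + 0.5 KL(q || m), with m the midpoint mixture."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

if __name__ == "__main__":
    # Hypothetical logits for two token positions over a 4-word vocabulary.
    peaked = softmax([6.0, 1.0, 0.5, 0.2])
    flat = softmax([1.0, 1.1, 0.9, 1.0])
    print("Shannon:", shannon_entropy(peaked), shannon_entropy(flat))
    print("Renyi-2:", renyi2_entropy(peaked), renyi2_entropy(flat))
    print("JS divergence:", js_divergence(peaked, flat))
```

The JS divergence is symmetric and bounded above by ln 2 in nats, which makes it a convenient score for comparing distributions of very different sharpness.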

03.
arXiv (CS.AI) 2026-06-12

Parthenon Law: A Self-Evolving Legal-Agent Framework

arXiv:2606.04602v3 · Announce Type: replace

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products – yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB – 12,510 agent trajectories – shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience – as a firm refines its checklists and playbooks after each matter – without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

04.
arXiv (quant-ph) 2026-06-17

Pulse-optimised circuit elements for scalable and noise-resilient quantum chemistry

arXiv:2606.17357v1 · Announce Type: new

Useful chemistry calculations on near-term quantum processors are hindered by current algorithmic runtimes. We develop a methodology to significantly reduce these runtimes. Typically, variational quantum eigensolver (VQE) algorithms are implemented as sequences of primitive gates. Our methodology instead relies on gradient-ascent pulse engineering to construct hardware-tailored pulses for the direct implementation of VQEs. As problem sizes increase, it quickly becomes intractable to optimise a pulse that implements an entire VQE ansatz circuit. However, leading VQEs are constructed in a modular fashion. A problem-tailored VQE is assembled from parameterised circuit elements that simulate hopping between two or four electronic spin orbitals. We show that these circuit elements can be implemented more efficiently using hardware-tailored pulses. We numerically demonstrate our methodology on a silicon spin-qubit quantum processor. We find that common circuit elements, known as single- and double-qubit excitations, can be implemented in less than 289 ns and 927 ns, respectively. Compared with conventional gate-based implementations, our pulse-accelerated qubit excitations provide a scalable approach for faster and therefore more noise-robust quantum chemistry simulations by reducing VQE runtimes by up to a factor of 15.3.

05.
medRxiv (Medicine) 2026-06-17

Characterisation of disease progression in hantavirus haemorrhagic fever with renal syndrome

Hantaviruses can cause haemorrhagic fever with renal syndrome (HFRS). This is a clinically variable disease in which severe outcomes are hypothesized to arise from dysregulated host responses. To characterise this, longitudinal, label-free plasma proteomics was used to compare disease progression in a unique well-defined cohort of patients infected with either Dobrava virus (DOBV) or Puumala virus (PUUV) hantaviruses. Patients were stratified by clinical severity. The average viral load in the first available sample from hospitalized patients was higher in those who went on to have severe infection, and higher in patients infected with DOBV. There was marked separation of infected patients from controls across early, mid and late disease, including after viral RNA clearance, suggesting a sustained systemic host-response signature. Proteomic signatures were consistent with a strong acute-phase response in both mild and severe disease. There was evidence of activation of the adaptive humoral response at later stages. Hierarchical clustering identified severity-associated pathways linked to endothelial dysfunction, thrombocytopenia, vascular leakage and renal injury. These findings define a durable plasma proteomic signature of hantavirus disease and support a model in which severe HFRS is driven by persistent inflammatory, complement and platelet/coagulation pathway activation rather than viral burden alone.

06.
arXiv (CS.AI) 2026-06-19

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

arXiv:2606.20041v1 · Announce Type: cross

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

08.
arXiv (CS.LG) 2026-06-16

Probabilistic Signature Inversion: Learning Conditional Distributions from Truncated Signatures

arXiv:2606.15332v1 · Announce Type: new

The signature transform is a principled feature map for continuous-time paths, valued for its uniqueness and universality. Recovering a path from its truncated signature is, however, structurally ill-posed because the truncated signature map is not injective. We therefore reframe truncated signature inversion as a probabilistic problem – learning the conditional distribution of a path given its truncated signature – and adopt a signature-conditioned flow matching model as a practical estimator. This probabilistic formulation elucidates the fundamental difficulty of inversion: Bayes reconstruction error quantifies the irreducible uncertainty remaining after conditioning on a statistic. We derive the Bayes-optimal error under linear statistics, obtaining a closed form for log-GBM and numerically tractable formulas for log-fBM and OU, yielding a concrete theoretical baseline for model validation. This baseline upper-bounds the Bayes error under truncated-signature conditioning, since truncated signatures provide richer information than linear statistics. Experiments show that empirical reconstruction errors under linear-statistics conditioning faithfully align with the theory-derived baseline, while errors decrease when the statistic is replaced with truncated signatures. Moreover, generated paths faithfully recover the conditioning signature while preserving key distributional and temporal structures, indicating that the estimator is well-calibrated to the target conditional distribution. Together, these results establish a well-posed probabilistic framework for truncated-signature inversion, with applicability demonstrated on real financial data beyond the parametric process families covered by theory.
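
The non-injectivity claim in this abstract can be illustrated with a small sketch. The code below computes the signature of a piecewise-linear path truncated at level 2, using Chen's identity to concatenate linear segments, and then shows two visibly different 2-D paths that share the same level-1 and level-2 terms. The function name and both example paths are assumptions chosen for illustration; nothing here reproduces the paper's flow matching estimator.

```python
def truncated_signature_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    `path` is a list of d-dimensional points. Segments are concatenated
    with Chen's identity: for a linear segment with increment D, the
    level-2 term is 0.5 * D (x) D, and joining a path A with a segment B
    adds the cross term A1 (x) B1.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2

if __name__ == "__main__":
    # Path A: an L-shape from (0, 0) to (1, 1).
    path_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
    # Path B: a different route to (1, 1) that dips below the x-axis.
    path_b = [(0.0, 0.0), (0.5, 0.0), (0.5, -1.0),
              (0.75, -1.0), (0.75, 1.0), (1.0, 1.0)]
    print(truncated_signature_level2(path_a))
    print(truncated_signature_level2(path_b))
```

Both calls return the same pair of terms, so a level-2 truncation cannot tell these two paths apart. That ambiguity is the reason a conditional distribution over paths is a more natural target than a single reconstructed path.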

09.
arXiv (CS.CL) 2026-06-16

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

While LLMs have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, documents, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, geometry, data semantics, editability, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, execute, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move multimodal code generation from single-output imitation toward evidence-grounded executable systems.

10.
arXiv (CS.AI) 2026-06-18

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

arXiv:2606.18271v1 · Announce Type: new

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

11.
arXiv (CS.CL) 2026-06-11

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.

12.
medRxiv (Medicine) 2026-06-15

The clinical utility of functional testing in fibroblasts to diagnose primary mitochondrial disease

Genome sequencing of the heterogeneous primary mitochondrial disorders (PMD) frequently reveals variants of uncertain significance that require functional tests for diagnosis, and does not identify variants in all patients. We analyzed mitochondrial enzyme assays, blue native polyacrylamide gel electrophoresis (BN-PAGE) with in-gel activity staining, complex I assembly blot, and select protein abundances in fibroblasts of a case series of 204 PMD patients divided into functional classes, in comparison to 51 controls and 53 differential diagnostic conditions. Overall, sensitivity and specificity for respiratory chain enzyme assays were 46% and 93% respectively, for BN-PAGE 40% and 98%, for complex I assembly assay 49% and 99%. The overall sensitivity of all tests was 76%, specificity 93%, with positive predictive value 96% and negative predictive value 67%. Categories with high sensitivity were isolated complex deficiencies, nuclear DNA-encoded mitochondrial protein synthesis defects, co-factor defects, and mitochondrial amino-acyl-tRNA synthetase conditions when aided by protein abundance. Mitochondrial DNA mutations and maintenance disorders showed poor sensitivities. Secondary dysfunctions were rare. A complete battery of functional tests showed strong diagnostic clinical utility in fibroblasts.
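
The four diagnostic metrics quoted in this abstract all derive from one 2×2 confusion table. The sketch below shows the arithmetic with hypothetical counts (100 affected and 100 unaffected individuals), chosen only so that sensitivity and specificity match the reported 76% and 93%. The abstract does not give the study's per-cell counts, so these numbers are an assumption and the resulting predictive values are not expected to match the reported 96% and 67%.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic accuracy measures from a 2x2 confusion table.

    tp: affected, test positive     fn: affected, test negative
    tn: unaffected, test negative   fp: unaffected, test positive
    """
    return {
        "sensitivity": tp / (tp + fn),  # share of affected who test positive
        "specificity": tn / (tn + fp),  # share of unaffected who test negative
        "ppv": tp / (tp + fp),          # share of positives truly affected
        "npv": tn / (tn + fn),          # share of negatives truly unaffected
    }

if __name__ == "__main__":
    # Hypothetical counts, NOT taken from the study.
    metrics = diagnostic_metrics(tp=76, fn=24, tn=93, fp=7)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are properties of the test, whereas the predictive values also depend on the ratio of affected to unaffected individuals in the cohort. That is why the same 76% and 93% can yield different PPV and NPV when the group sizes differ.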

13.
arXiv (CS.AI) 2026-06-18

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

arXiv:2606.19025v1 · Announce Type: cross

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoEs) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

14.
arXiv (CS.CV) 2026-06-16

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
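
The dilated causal convolution mentioned for the Causal Anomaly Classifier can be sketched in a few lines. The code below is a generic one-dimensional version over a sequence of scalar features, assuming implicit zero padding on the left; the function name, kernel, and inputs are illustrative and are not taken from VigilFormer.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution with dilation over a list of floats.

    Output at step t combines x[t], x[t - dilation], x[t - 2*dilation], ...
    weighted by kernel[0], kernel[1], kernel[2], ... Positions before the
    start of the sequence count as zero, so no output depends on the future.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * x[idx]
        out.append(acc)
    return out

if __name__ == "__main__":
    # Hypothetical snippet-level scores for five consecutive snippets.
    scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(dilated_causal_conv1d(scores, kernel=[1.0, 1.0], dilation=2))
```

Stacking such layers with growing dilation widens the temporal receptive field without looking ahead, which suits streaming inference on live video.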

15.
arXiv (CS.AI) 2026-06-11

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

arXiv:2601.21898v2 · Announce Type: replace

The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap$^2$, an architecture-agnostic protection framework that encodes protection into updates during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap$^2$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized recomposition.

16.
arXiv (CS.LG) 2026-06-16

A Fully First-Order Layer for Differentiable Optimization

arXiv:2512.02494v2 · Announce Type: replace

Differentiable optimization layers enable learning systems to make decisions by solving embedded optimization problems. However, computing gradients via implicit differentiation requires solving a linear system with Hessian terms, which is both compute- and memory-intensive. To address this challenge, we propose a novel algorithm that computes the gradient using only first-order information. The key insight is to rewrite the differentiable optimization as a bilevel optimization problem and leverage recent advances in bilevel methods. Specifically, we introduce an active-set Lagrangian hypergradient oracle that avoids Hessian evaluations and provides finite-time, non-asymptotic approximation guarantees. We show that an approximate hypergradient can be computed using only first-order information in $\tilde{O}(1)$ time, leading to an overall complexity of $\tilde{O}(\delta^{-1}\epsilon^{-3})$ for constrained bilevel optimization, which matches the best known rate for non-smooth non-convex optimization. Furthermore, we release an open-source Python library that can be easily adapted from existing solvers. The source code is available at https://github.com/guaguakai/FFOLayer.

17.
arXiv (CS.CV) 2026-06-17

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for label-free rare-attribute discovery in diffusion models, requiring no predefined minority categories. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/ .

18.
arXiv (CS.CL) 2026-06-11

Beyond representational alignment with brain-guided language models for robust reasoning

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

19.
bioRxiv (Bioinfo) 2026-06-15

SMS: Symmetric Mediation Statistics for Powerful High-Dimensional Mediation Analysis

Background: Mediation analysis of high-dimensional features, particularly molecular-level omics features, provides important opportunities to uncover biological mechanisms underlying human health and disease. However, two central statistical challenges remain: testing the composite-null hypothesis and maintaining power when the exposure-mediator and mediator-outcome associations differ substantially in statistical significance. Existing methods typically rely on accurate estimation of the proportions of the three null types or on the maximum of the two association p-values, and may not always control the FDR well and may have limited power under imbalanced significance. Methods: We propose SMS, a new statistical framework based on symmetric mediation statistics. By exploiting symmetry, SMS calibrates the composite null distribution as a whole for FDR control. It also allows flexible combinations of the two association p-values, including the maximum, and then enables construction of an omnibus test. Moreover, it permits direct use of effect-size estimates, bypassing the need to compute p-values. Results: SMS controlled the FDR across a wide range of simulation scenarios while achieving a substantial sensitivity gain, often around 20 percentage points, over existing methods including HDMT, DACT, and DEI-B. Applications to a metabolomics dataset and a DNA methylation dataset further corroborated these findings. Notably, SMS discovered five plausible mediators in the metabolomics dataset that were missed by all existing methods considered.

20.
arXiv (CS.CL) 2026-06-12

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.

21.
arXiv (CS.AI) 2026-06-15

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

arXiv:2606.14608v1 · Announce Type: cross

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

22.
arXiv (CS.LG) 2026-06-19

Environment-Adaptive Covariate Selection: Learning When to Use Spurious Correlations for Out-of-Distribution Prediction

arXiv:2601.02322v2 · Announce Type: replace-cross

A common approach to out-of-distribution prediction restricts models to causal or invariant covariates to avoid spurious associations that may change across environments. Despite its theoretical appeal, this strategy can underperform empirical risk minimization when only a subset of the causal parents of the outcome is observed. In such settings, non-causal covariates can serve as proxies for unobserved causal parents and improve prediction when the proxy relationship is stable, but they can hurt when shifts disrupt that relationship. Thus, the optimal covariate set can depend on the specific shift encountered. Because different shifts leave signatures in the unlabeled covariate distribution, we propose an environment-adaptive covariate selection algorithm that maps environment-level summaries to environment-specific covariate sets. These summaries may be hand-crafted or learned from multi-environment data, and prior causal knowledge can be incorporated as constraints. Across simulations and applied datasets, the proposed method improves over static causal, invariant, and other non-adaptive rules under diverse shifts.

23.
arXiv (CS.CL) 2026-06-11

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at https://github.com/RUCAIBox/FORT-Searcher.

24.
arXiv (CS.AI) 2026-06-17

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

arXiv:2606.17165v1 (cross-list). Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes recovers the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified, and we provide diagnostics that can falsify surrogacy on historical experiments together with a bound on the worst-case bias from limited overlap. We further show that the stochasticity inherent to LLMs introduces both bias and variance, but using an average of multiple draws as the surrogate mitigates both. We illustrate the methods and theory in simulations and an application to A/B tests on Upworthy headlines. A central takeaway from our work is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and never verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.
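
The point about averaging multiple draws can be illustrated with a small simulation. This sketch covers only the variance side and assumes that LLM stochasticity acts as independent Gaussian noise added to a fixed underlying score. That noise model and the names `averaged_surrogate` and `surrogate_variance` are assumptions for illustration, not the paper's setup. The bias effect mentioned in the abstract is not modeled here.

```python
import random
from statistics import mean, pvariance


def averaged_surrogate(true_score, n_draws, noise_sd, rng):
    """Average n_draws noisy LLM outputs for a single unit.

    Assumption: each draw equals the true score plus independent Gaussian noise.
    """
    draws = [true_score + rng.gauss(0.0, noise_sd) for _ in range(n_draws)]
    return mean(draws)


def surrogate_variance(n_draws, n_reps=2000, noise_sd=1.0, seed=0):
    """Empirical variance of the averaged surrogate across n_reps repetitions."""
    rng = random.Random(seed)
    values = [averaged_surrogate(0.0, n_draws, noise_sd, rng) for _ in range(n_reps)]
    return pvariance(values)
```

Under this noise model the variance of the averaged surrogate scales as `noise_sd ** 2 / n_draws`, so comparing `surrogate_variance(1)` with `surrogate_variance(20)` should show roughly a twentyfold reduction.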

25.
arXiv (CS.AI) 2026-06-16

AgentLeak: A Benchmark for Internal-Channel Privacy Leakage in Multi-Agent LLM Systems

arXiv:2602.11510v3 (replacement). Multi-agent Large Language Model (LLM) systems create privacy risks that current output-only benchmarks cannot measure. When agents coordinate on tasks, sensitive data may pass through inter-agent messages, shared memory, and tool arguments, all pathways that final-output audits typically do not inspect. We introduce AgentLeak, a benchmark for evaluating internal-channel privacy leakage in multi-agent LLM systems. AgentLeak instruments seven privacy-relevant communication pathways and provides a large-scale empirical evaluation focused on final outputs, inter-agent messages, and shared memory. Across 1,000 scenarios spanning healthcare, finance, legal, and corporate domains, five production LLMs (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B), and 4,979 validated execution traces, we find that multi-agent configurations reduce final-output leakage relative to single-agent mode (C1: 27.2% vs 43.2%) but introduce internal channels that raise total system exposure to 68.9% (aggregated across C1, C2, C5). Inter-agent messages (C2) leak at 68.8%, compared with 27.2% for final outputs (C1), meaning that output-only audits miss 41.7% of violations. Across all five models and four domains, the pattern C2 ≥ C1 holds consistently. These results suggest, within the evaluated coordinator-worker setting, that privacy risk in multi-agent systems is strongly shaped by architectural coordination channels rather than final-output behavior alone: it arises from internal channels that remain invisible to standard output-level defenses.
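
To make the per-channel and aggregated figures concrete, the sketch below computes leak rates from execution traces. It assumes each trace is a dict of per-channel boolean leak flags and that total exposure counts a trace once if any listed channel leaks. The trace format, the function names, and the aggregation rule are assumptions for illustration, not the benchmark's actual code or definitions.

```python
def leak_rate(traces, channel):
    """Fraction of traces in which the given channel leaked."""
    return sum(1 for t in traces if t[channel]) / len(traces)


def total_exposure(traces, channels):
    """Fraction of traces in which at least one of the channels leaked."""
    return sum(1 for t in traces if any(t[c] for c in channels)) / len(traces)


def missed_by_output_audit(traces, output_channel, internal_channels):
    """Fraction of traces with a clean final output but an internal leak."""
    missed = sum(
        1
        for t in traces
        if not t[output_channel] and any(t[c] for c in internal_channels)
    )
    return missed / len(traces)


# Toy traces, invented for illustration only (not benchmark data).
toy_traces = [
    {"C1": True, "C2": True},
    {"C1": False, "C2": True},
    {"C1": False, "C2": False},
    {"C1": False, "C2": True},
]
```

On the toy traces, C1 leaks in one of four runs and C2 in three of four, so an audit that inspects only final outputs would overlook the two runs where the leak stays in inter-agent messages.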