Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

02.
arXiv (CS.LG) 2026-06-11

Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

arXiv:2512.11081v2 Announce Type: replace-cross Abstract: Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are often required, rather than global scores summarizing overall feature importance. Random Forests (RFs) are widely used in these settings, and existing interpretability methods typically exploit tree structures and split statistics to provide model-specific insights. However, theoretical understanding of local FII methods for RF remains limited, making it unclear how to interpret high importance scores for individual predictions. We propose a novel, local, model-specific FII method that identifies frequent co-occurrences of features along decision paths, combining global patterns with those observed on paths specific to a given test point. We prove that our method consistently recovers the true local signal features and their interactions under a Locally Spike Sparse (LSS) model and also identifies whether large or small feature values drive a prediction. We illustrate the usefulness of our method and theoretical results through simulation studies and a real-world data example.

03.
arXiv (CS.CL) 2026-06-11

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.

04.
arXiv (CS.CV) 2026-06-16

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Vision-language models (VLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. } We develop FUSE, a probabilistic framework for capturing two complementary sources of uncertainty in vision-language modeling: (i) aleatoric embedding-level uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of VLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure can be used to reliably predict the model's output correctness for downstream applications. We demonstrate that our method outperforms baselines and achieves SOTA uncertainty calibration.

05.
arXiv (CS.LG) 2026-06-11

Least-Action-Guided Diffusion for Physical Extrapolation

arXiv:2606.11277v1 Announce Type: new Abstract: Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce a least-action-principle-guided diffusion, LAPG, a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.

06.
arXiv (CS.AI) 2026-06-16

PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

arXiv:2606.15654v1 Announce Type: cross Abstract: Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.

07.
arXiv (math.PR) 2026-06-11

Hierarchical Random Measures without Tables

arXiv:2505.02653v2 Announce Type: replace-cross Abstract: The hierarchical Dirichlet process is the cornerstone of Bayesian nonparametric multilevel models. Its generative model can be described through a set of latent variables, commonly referred to as tables within the popular restaurant franchise metaphor. The latent tables simplify the expression of the posterior and allow for the implementation of Gibbs sampling algorithms to approximately draw posterior samples. However, managing their assignments can become computationally expensive, especially as the size of the dataset and the number of levels increase. In this work, we identify a prior for the concentration parameter of the hierarchical Dirichlet process that (i) induces a quasi-conjugate posterior distribution, and (ii) removes the need for tables, leading to more interpretable expressions for the posterior, with both a scalable and an exact algorithm to sample from it. Remarkably, this construction extends beyond the Dirichlet process, leading to a new framework for defining normalized hierarchical random measures and a new class of algorithms to sample from their posteriors. The key analytical tool is the independence of multivariate increments, that is, their representation as completely random vectors.

08.
arXiv (CS.CL) 2026-06-16

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

09.
arXiv (CS.CL) 2026-06-19

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple : the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction – ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction – over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

10.
arXiv (CS.AI) 2026-06-17

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

arXiv:2606.18122v1 Announce Type: cross Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.

11.
arXiv (CS.LG) 2026-06-17

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models

arXiv:2602.11590v3 Announce Type: replace Abstract: Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that \method~yields better quality-efficiency trade-offs (up to ~4x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.2x improvement on benchmarks).

12.
arXiv (CS.LG) 2026-06-15

Recovery thresholds for hidden weighted sparse graphs

arXiv:2606.14335v1 Announce Type: cross Abstract: Recovering structural information from noisy high-dimensional data is a fundamental task in statistical inference. We investigate the recovery thresholds for a graph hidden in a randomly weighted complete graph. Specifically, an unknown graph $H^* \in H_n$ is chosen uniformly at random, and hidden in a complete graph of $n$ vertices as follows: the weight of an edge $e \in H$ is distributed independently according to $P_n$; otherwise the weight is distributed independently according to $Q_n$. The goal is to recover almost all of $H$ from these edge weights. Assuming a local Lipschitzness of the Rényi divergence between distributions $P_n$ and $Q_n$, and a mild density condition for the graphs $H_n$, we give a unified characterization of the information-theoretic limit for recovering almost all of $H$ (also known as almost exact recovery). Our characterization connects the KL divergence between $P_n$ and $Q_n$ to the logarithm of the first moment threshold of $H$ in the Erdős-Rényi random graph model $G(n,p)$. Our lower bound also extends to the task of partial recovery, in which only a constant $\lambda$-fraction of $H$ needs to be recovered. Last but not least, for certain Bernoulli and Exponential regimes, and for Gaussian distributions, we are able to show an All-or-Nothing (AoN) threshold phenomenon at the exponential scale.

13.
arXiv (CS.AI) 2026-06-17

Querying an astronomical database using large language models: the ALeRCE text-to-SQL system

arXiv:2606.18108v1 Announce Type: cross Abstract: We develop a text-to-SQL (structured query language) system based on large language models (LLMs) using in-context learning and apply it to the Automatic Learning for the Rapid Classification of Events (ALeRCE) astronomical database. ALeRCE is a community broker for the Zwicky Transient Facility and the Vera C. Rubin Observatory. The system enables users to query the database in natural language (NL) and generates executable SQL queries. To develop and evaluate the system, we constructed a dataset of 110 NL/SQL pairs. We propose a step-by-step generation framework comprising four modules: schema linking, query classification, prompt decomposition, and self-correction. The performance of thirteen LLMs is evaluated using in-context learning and prompt engineering techniques. Text-to-SQL performance is assessed using the perfect-match (PM) rate for row identifiers (e.g., object identifiers) and column identifiers (i.e., column names). The proposed step-by-step framework consistently outperforms a direct-inference baseline, while the self-correction module consistently reduces execution errors. For Claude Opus 4.6, PM performance on row (column) identifiers is high for simple queries, reaching 0.97 (0.94), and decreases with query complexity to 0.44 (0.72) for medium queries and 0.59 (0.49) for hard queries. Among the thirteen evaluated models, the best-performing LLMs for the text-to-SQL task are Claude Opus 4.6, Gemini 2.5 Pro, Gemini 3 Flash, and GPT-5.2-Codex.

14.
arXiv (CS.LG) 2026-06-16

ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning

arXiv:2606.17011v1 Announce Type: cross Abstract: Human interventions provide crucial corrective signals for post-training Vision-Language-Action (VLA) models. However, enabling seamless humanoid interventions is a formidable systems challenge due to complex whole-body kinematics and dexterous-hand control. Consequently, the collected intervention trajectories are often suboptimal, and methods that rely on human interventions as expert supervision can absorb hesitant, inefficient, or even erroneous behaviors. To address both the system and algorithmic challenges, we propose ROVE, a reinforcement learning framework for humanoid VLA post-training with imperfect human interventions. First, ROVE introduces a human-in-the-loop pipeline capable of collecting deployment and intervention data for humanoid manipulation. Second, it utilizes Optimistic Value Estimation (OVE) to prioritize high-value behaviors from mixed-quality trajectories. To further robustify value estimation, we incorporate cross-embodiment human experience videos to provide rich supervision for long-tailed failure and recovery modes. The resulting critic yields informative advantage signals, steering the VLA actor to focus on high-value behaviors rather than indiscriminately imitating all actions. On challenging real-world contact-rich and fine-grained humanoid manipulation tasks, ROVE outperforms experience-learning baselines and consistently improves across multiple rollout-intervention iterations.

15.
arXiv (CS.CL) 2026-06-18

UniECG: Understanding and Generating ECG in One Unified Model

Electrocardiogram (ECG) interpretation is a fundamental skill in medical education, yet students often need more than static examples to connect waveform evidence with diagnostic reasoning. This paper presents UniECG as a step toward interactive ECG education. UniECG supports two complementary learning interactions: given an ECG signal or image, it generates an evidence-based explanation; given a textual learning objective, it generates a corresponding ECG signal example for case-based learning. The model follows a two-stage design. First, it learns grounded ECG explanation from ECG signal–image–text data. Second, it introduces special ECG generation tokens and aligns their hidden representations with a pretrained text-conditioned ECG diffusion model, enabling controllable signal-level ECG generation. We evaluate UniECG through grounded ECG explanation and generation-oriented qualitative analysis, examining its potential to support explanation and case-based learning. UniECG is intended as an educational aid and a research step toward interactive AI-assisted ECG learning, rather than a clinically validated diagnostic system.

16.
arXiv (CS.CV) 2026-06-17

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLMbased evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

17.
arXiv (CS.CV) 2026-06-16

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K–600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR at up to 60x shorter per-subject schedule than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.

18.
arXiv (CS.LG) 2026-06-17

Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

arXiv:2606.17419v1 Announce Type: new Abstract: We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a \(\log\log/\log\)-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.

19.
arXiv (CS.CL) 2026-06-12

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

20.
arXiv (CS.AI) 2026-06-12

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

arXiv:2606.12485v1 Announce Type: cross Abstract: Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

21.
arXiv (CS.CV) 2026-06-16

Position: The Systemic Lack of Agency in Visual Reasoning

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs. More information can be found at https://haoychen.github.io/Implicit-Reasoning/

22.
arXiv (CS.CV) 2026-06-12

DuET: Dual Expert Trajectories for Diffusion Image Editing

Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

23.
arXiv (CS.LG) 2026-06-11

Understanding Sample Efficiency in Predictive Coding

arXiv:2605.11911v2 Announce Type: replace Abstract: Predictive Coding (PC) is an influential account of cortical learning. Much of recent work has focused on comparing PC to Backpropagation (BP) to find whether PC offers any advantages. Small scale experiments show that PC enables learning that is more sample efficient and effective in many contexts, though a thorough theoretical understanding of the phenomena remains elusive. To address this, we quantify the efficiency of learning in BP and PC through a metric called ``target alignment'', which measures how closely the change in the output of the network is aligned to the output prediction error. We then derive and empirically validate analytical expressions for target alignment in Deep Linear Networks. We show that learning in PC is more efficient than BP, which is especially pronounced in deep, narrow and pre-trained networks. We also derive exact conditions for guaranteed optimal target alignment in PC and validate our findings through experiments. We study full training trajectories of linear and non-linear models, and find the predicted benefits of PC persist in practice even when some assumptions are violated. Overall, this work provides a mechanistic understanding of the higher learning efficiency observed for PC over BP in previous works, and can guide how PC should be parametrised to learn most effectively.

24.
arXiv (CS.AI) 2026-06-17

SSIL: Self-Supervised Imitation Learning for End-to-End Driving

arXiv:2308.14329v4 Announce Type: replace-cross Abstract: In autonomous driving, the end-to-end (E2E) driving approach that predicts vehicle control signals directly from sensor data is rapidly gaining attention. To learn a safe E2E driving system, one needs an extensive amount of driving data and human intervention. Vehicle control data is constructed by many hours of human driving, and it is challenging to construct large vehicle control datasets. Often, publicly available driving datasets are collected with limited driving scenes, and collecting vehicle control data is only available by vehicle manufacturers. To address these challenges, this paper proposes the first self-supervised learning framework, Self-Supervised Imitation Learning (SSIL), for E2E driving. The proposed SSIL framework can learn vision-based E2E driving networks without using driving command data or a pre-trained model. To construct pseudo steering angle data, proposed SSIL predicts a pseudo target from the vehicle's poses at the current and previous time points that are estimated with light detection and ranging sensors. In addition, we propose a new cross-attention-based conditioning approach (CACA) for a vision encoder in E2E driving, where a high-level instruction serves as the conditioning signal for visual information. Our numerical experiments with three different benchmark datasets demonstrate that the proposed SSIL framework achieves very comparable E2E driving accuracy with the supervised learning counterpart. Furthermore, the proposed pseudo-label predictor outperformed an existing one using proportional integral derivative controller, and proposed CACA achieved superior performance over existing conditioning approaches.

25.
medRxiv (Medicine) 2026-06-16

Reporting patterns of adverse drug withdrawal events using individual case safety reports in United States and European databases

Introduction: Adverse drug withdrawal events (ADWEs) are a key safety concern with deprescribing but are infrequently reported in trials. Although pharmacovigilance systems have advanced our understanding of medication-related harms, it is unclear how extensively these systems have been used for ADWEs. Objectives: To examine the reporting patterns of ADWEs for all drugs recorded in United States and European pharmacovigilance databases between 2004 and 2023. Methods: A retrospective study was conducted using two pharmacovigilance databases, the publicly available FDA-FAERS dataset and EMA-EV Level 2A (individual-level) dataset. ADWE cases were identified using relevant MedDRA preferred terms. Data on patient characteristics, reporter type, drugs, indication, ADWE outcomes, dechallenge/rechallenge, seriousness criteria, time to onset, duration, and causality were summarised. Results: A total of 158,505 ADWE reports were analysed (FDA-FAERS: 145,514; EMA-EV: 12,987), with mean ages of 46.1 (FDA; 55.3% female) and 45.5 years (EMA; 57.1% female). The frequently reported drug classes were opioids (FDA: oxycodone, 29.8%; EMA: buprenorphine, 19%), antidepressants (FDA: duloxetine, 32%; EMA: venlafaxine, 25.9%) and gabapentinoids (FDA: pregabalin, 6.7%; EMA: pregabalin, 6.0%). The most common adverse outcomes were other serious medical conditions (FDA=63.9%; EMA=46.0%), hospitalisation (FDA=15.9%; EMA=28.3%), and disability (FDA=13.3%; EMA=6.2%) and these outcomes varied significantly based on sex and age group (p