Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

arXiv:2606.16236v1 Announce Type: new Abstract: Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.
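
The bilevel loop described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "policy" is a single scalar, the two reward functions are invented quadratics, the shaping term `w * a` is hypothetical, and a plain elitist evolution strategy stands in for CMA-ES. It only shows the division of labour: the lower level sees the shaped training reward, while the upper level sees nothing but a scalar validation score.

```python
import random

def train_reward(a):
    # Unshaped reward on the (toy) training environment; best action is a = 1.
    return -(a - 1.0) ** 2

def val_reward(a):
    # Unshaped reward on the (toy) validation environment; best action is a = 2.
    # The upper level only ever sees this scalar, never trajectories.
    return -(a - 2.0) ** 2

def lower_level(w, steps=200, lr=0.1):
    # "Train" the scalar policy by gradient ascent on the shaped reward
    # train_reward(a) + w * a, where w is the shaping parameter.
    a = 0.0
    for _ in range(steps):
        grad = -2.0 * (a - 1.0) + w
        a += lr * grad
    return a

def upper_level(generations=60, pop=8, seed=0):
    # Simplified evolution strategy (NOT CMA-ES): sample shaping parameters,
    # keep the best half by validation score, shrink the step size.
    rng = random.Random(seed)
    mean, sigma = 0.0, 1.0
    for _ in range(generations):
        candidates = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(pop)]
        ranked = sorted(candidates,
                        key=lambda w: val_reward(lower_level(w)),
                        reverse=True)
        elite = ranked[: pop // 2]
        mean = sum(elite) / len(elite)
        sigma *= 0.95
    return mean

best_w = upper_level()
policy = lower_level(best_w)
baseline = lower_level(0.0)  # no shaping: plain training on the training env
```

In this toy setup the unshaped baseline settles at the training optimum, while the shaped policy is pulled toward the validation optimum even though the lower level never observes validation data.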

03.
arXiv (quant-ph) 2026-06-24

Resource theory of interactive quantum instruments

arXiv:2603.27676v2 Announce Type: replace Abstract: Quantum instruments describe both the classical outcome and the updated quantum state in a measurement process. To do this in a non-trivial way, instruments must have the capability to interact coherently with the state that they measure. Here, we develop a resource theory for instruments. We consider a relevant quantifier of the separation between interactive and non-interactive instruments and show that it admits three distinct operational interpretations in terms of quantum information tasks. These concern (i) the preservation of maximally entangled states after a local measurement, (ii) the average ability to preserve random states after measurement, and (iii) the ability to recover the classical information generated from measuring half of a maximally entangled state. We also introduce a natural set of allowed operations and show that the third task fully characterises the resource content of instruments. Our general framework reproduces as special cases established resource theories for channels and measurements.

04.
arXiv (CS.CV) 2026-06-24

P-MTP: Efficient Document Parsing via Multi-Token Prediction with Progressive Depth Scaling

Vision-Language Models (VLMs) have revolutionized document parsing by enabling end-to-end mapping from images to structured text, but they impose a significant latency bottleneck, particularly for token-dense documents. While Multi-Token Prediction (MTP) has emerged as a promising approach for accelerating inference, its potential is constrained by optimization instability when scaling to deeper look-ahead depth. In this paper, we propose P-MTP, a framework that leverages Progressive Multi-Token Prediction with a lightweight MTP module to scale the look-ahead depth for high-throughput document parsing. Specifically, we introduce Progressive Curriculum Loss that adaptively re-weights different look-ahead depths using cumulative path reliability and retrospective target consistency. By effectively suppressing gradient noise in long-range predictions, P-MTP facilitates an automated easy-to-hard optimization transition, enabling the model to master increasingly distant look-ahead depths. Furthermore, we propose Confidence-Gated Dynamic Drafting to maximize the effective look-ahead depth and acceptance rate by adaptively calibrating speculative length during inference, thereby minimizing computational waste and further pushing the boundaries of inference speedup. Experimental results across multiple benchmarks and architectures demonstrate that P-MTP achieves up to a $5\times$ speedup with negligible loss in accuracy, providing the first successful validation of extensive look-ahead MTP in the document parsing domain.
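
The abstract does not spell out how the confidence gate works, so the following is only one plausible reading, sketched under stated assumptions: each drafted token comes with a confidence in [0, 1], the confidences are multiplied along the draft path, and drafting stops once that product drops below a threshold. The function name, the product rule, and the default threshold are all hypothetical.

```python
def gated_draft_length(confidences, threshold=0.5):
    """Return how many drafted tokens to keep.

    Assumption: the gate multiplies per-token confidences along the draft
    path and stops at the first position where the product falls below
    `threshold`. This is an illustrative rule, not the paper's.
    """
    path_confidence = 1.0
    kept = 0
    for c in confidences:
        path_confidence *= c
        if path_confidence < threshold:
            break
        kept += 1
    return kept

# A confident prefix followed by an uncertain token: drafting stops early.
short_draft = gated_draft_length([0.9, 0.9, 0.9, 0.5, 0.9])
# Uniformly confident tokens: the whole draft is kept.
full_draft = gated_draft_length([0.99, 0.99, 0.99, 0.99])
```

A gate of this kind trades a shorter speculative length for fewer rejected tokens, which is the kind of computational waste the abstract says the method aims to reduce.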

05.
arXiv (quant-ph) 2026-06-17

Intrinsic Pointer Basis and Irreversible Classicality from Coherence Contraction

arXiv:2604.23304v4 Announce Type: replace Abstract: This work analyzes an operational route to classical behavior for reduced quantum states using the intrinsic reference basis (IRB). Relative to a fixed physical conjugation, the IRB separates intrinsic populations from a real antisymmetric cohesion sector. A globally bounded cohesion index is defined and its exponential contraction is proved for phase-free dephasing dynamics aligned with the IRB; for general aligned dephasing, the corresponding modulus-based coherence functional contracts at the same computable rates. The results provide distance bounds to the IRB-diagonal description and a logarithmic upper bound on the time required to reach a prescribed experimental tolerance. The IRB projectors constitute state-derived candidate pointer sectors, and they become dynamically stable pointer sectors when the effective dephasing generator is aligned with them and damps the relevant inter-sector coherences. Degenerate population sectors lead naturally to block-classicality and protected intra-block coherence. In a two-level active sector, the cohesion index equals fringe visibility, giving a direct interferometric test of the contraction law. The construction is independent of any spacetime- or unification-emergence hypothesis and is intended as a channel-level complement to environment-induced einselection.

06.
arXiv (CS.CV) 2026-06-25

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and are increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations. It contains 812 samples across two complementary subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. To enable efficient yet informative evaluation, we conduct a pilot study over 18 candidate perturbations and select 5 representative types at 3 severity levels each based on their impact and cross-model discriminability. We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and OCR+LLM pipelines. Our results show that higher clean accuracy does not necessarily imply stronger robustness, that models can suffer pronounced worst-case degradation on structure-sensitive OCR tasks, and that charts and tables are substantially more fragile than document-like inputs under perturbation.
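
The abstract names the retention metrics without defining them. The sketch below assumes, purely for illustration, that RCR is the mean accuracy over corrupted conditions divided by clean accuracy and that WCR is the minimum corrupted accuracy divided by clean accuracy; the benchmark's actual definitions may differ, and CRI is omitted because its composition is not stated. The accuracy numbers are made up.

```python
def relative_corruption_retention(clean_acc, corrupted_accs):
    # Assumed definition: average corrupted accuracy relative to clean accuracy.
    return (sum(corrupted_accs) / len(corrupted_accs)) / clean_acc

def worst_case_retention(clean_acc, corrupted_accs):
    # Assumed definition: worst corrupted accuracy relative to clean accuracy.
    return min(corrupted_accs) / clean_acc

# Hypothetical models: A is stronger on clean inputs, B degrades less.
model_a = {"clean": 0.80, "corrupted": [0.60, 0.40, 0.20]}
model_b = {"clean": 0.60, "corrupted": [0.54, 0.48, 0.42]}

rcr_a = relative_corruption_retention(model_a["clean"], model_a["corrupted"])
rcr_b = relative_corruption_retention(model_b["clean"], model_b["corrupted"])
wcr_a = worst_case_retention(model_a["clean"], model_a["corrupted"])
wcr_b = worst_case_retention(model_b["clean"], model_b["corrupted"])
```

Under these assumed definitions, model A has the higher clean accuracy but the lower retention on both metrics, which is the pattern the abstract reports.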

07.
arXiv (CS.CL) 2026-06-25

PhoneBuddy: Training Open Models for Agentic Phone Use

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67% after supervised fine-tuning to 40.67% after real-app RL and 45.33% after mixed RL. On AndroidWorld, the same progression rises from 60.3% to 77.2% to 83.2%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizon cross-app workflows remain an important open challenge.
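
The success rates on the 150-task human evaluation can be converted back into task counts. This short sketch does that arithmetic; the only assumption is that each reported percentage is a whole number of successful tasks out of 150, rounded to two decimals.

```python
TASKS = 150

# Success rates (percent) reported for the 150-task human evaluation.
rates = {"SFT": 36.67, "real-app RL": 40.67, "mixed RL": 45.33}

# Recover the implied number of successful tasks at each stage.
successes = {stage: round(TASKS * pct / 100) for stage, pct in rates.items()}

# Absolute gains over the supervised fine-tuning stage, in tasks.
gain_real = successes["real-app RL"] - successes["SFT"]
gain_mixed = successes["mixed RL"] - successes["SFT"]
```

This gives 55, 61, and 68 successful tasks, so real-app RL adds 6 tasks over supervised fine-tuning and mixed RL adds 13.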

08.
arXiv (CS.AI) 2026-06-17

TRACE: Learning to Compute on Circuit Graphs

arXiv:2509.21886v3 Announce Type: replace Abstract: Learning to compute, the ability to model the functional behavior of a circuit graph, is a fundamental challenge for graph representation learning. Yet, the dominant paradigm is architecturally mismatched for this task because it relies on permutation-invariant aggregation. This flawed assumption, central to mainstream message passing neural networks (MPNNs) and their conventional Transformer-based counterparts, prevents models from capturing the position-aware, hierarchical nature of computation. To resolve this, we introduce TRACE, a new paradigm built on an architecturally sound backbone and a principled learning objective. First, TRACE employs a Hierarchical Transformer that mirrors the step-by-step flow of computation, providing a faithful architectural backbone that replaces the flawed permutation-invariant aggregation. Second, we introduce function shift learning, a novel objective that decouples the learning problem. Instead of predicting the complex global function directly, our model is trained to predict only the function shift, the discrepancy between the true global function and a simple local approximation that assumes input independence. We validate this paradigm on various circuit modalities, including Register Transfer Level graphs, And-Inverter Graphs, and post-mapping netlists. Across a comprehensive suite of benchmarks, TRACE substantially outperforms all prior architectures. These results demonstrate that our architecturally-aligned backbone and decoupled learning objective form a more robust paradigm for the fundamental challenge of learning the functional behavior of a circuit graph.
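
To make the idea of a function shift concrete, here is a small worked sketch. It assumes, for illustration only, that the "function" of a gate is its output signal probability under uniform random inputs; the paper may use a different functional target. The two-gate circuit is hypothetical and chosen because input `a` fans out and reconverges, which breaks the independence assumption.

```python
from itertools import product

def circuit(a, b):
    # Reconvergent fanout: `a` feeds both the OR gate and the final AND gate.
    return a & (a | b)

def true_probability():
    # Exact output probability by enumerating the full truth table.
    outputs = [circuit(a, b) for a, b in product((0, 1), repeat=2)]
    return sum(outputs) / len(outputs)

def local_approximation(p_a=0.5, p_b=0.5):
    # Gate-by-gate estimate that treats every gate's inputs as independent.
    p_or = 1 - (1 - p_a) * (1 - p_b)
    return p_a * p_or  # AND of `a` and (a | b), wrongly assumed independent

# The function shift is what the local approximation gets wrong.
shift = true_probability() - local_approximation()
```

Here the true probability is 0.5, the independence-based estimate is 0.375, and the shift is 0.125. A model trained on the shift only has to learn this residual rather than the whole function.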

09.
arXiv (math.PR) 2026-06-11

Integrated expectile-based measures of inequality

arXiv:2606.12333v1 Announce Type: cross Abstract: Expectiles provide a class of asymmetric location functionals that incorporate the magnitude of deviations and admit a natural geometric interpretation. Building on their structural consistency with the convex stochastic order, this paper introduces a family of integrated expectile functionals for measuring risk, dispersion, and inequality. The proposed functionals admit analytical representations as integrals of expectiles across asymmetry levels. For a distinguished subclass of these constructions, a geometric representation is available: the resulting quantities can be expressed as weighted areas of star-shaped sets encoding the distributional asymmetry of a random variable. This approach yields a new class of expectile-based inequality indices, constituting a natural counterpart to classical Gini-type measures while preserving desirable monotonicity and consistency properties. Empirical counterparts are derived in closed form and admit explicit decompositions over finite samples. The framework extends naturally to multivariate settings through directional expectile constructions, leading to measures capable of capturing genuinely joint forms of multivariate dispersion and inequality.
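
As background for this abstract, the sketch below computes sample expectiles and then integrates them over asymmetry levels. The expectile at level tau is the standard one: the minimizer of an asymmetrically weighted squared loss, found here by iteratively reweighted averaging. The integrated quantity `integrated_spread` is a hypothetical illustration of "an integral of expectiles across asymmetry levels" and is not claimed to be one of the paper's functionals.

```python
def expectile(xs, tau, iters=100):
    """Sample expectile at level tau in (0, 1).

    Points above the current estimate get weight tau, the rest get
    weight 1 - tau; the estimate is the weighted mean, iterated to a
    fixed point. At tau = 0.5 this is the ordinary mean.
    """
    e = sum(xs) / len(xs)
    for _ in range(iters):
        weights = [tau if x > e else 1.0 - tau for x in xs]
        e = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    return e

def integrated_spread(xs, grid=200):
    # Hypothetical dispersion measure: midpoint-rule integral over
    # tau in (0.5, 1) of the inter-expectile range e(tau) - e(1 - tau).
    step = 0.5 / grid
    total = 0.0
    for k in range(grid):
        tau = 0.5 + (k + 0.5) * step
        total += (expectile(xs, tau) - expectile(xs, 1.0 - tau)) * step
    return total

sample = [1.0, 2.0, 3.0, 10.0]
center = expectile(sample, 0.5)   # equals the sample mean
upper = expectile(sample, 0.9)
lower = expectile(sample, 0.1)
spread = integrated_spread(sample)
```

Because expectiles respond to the magnitude of deviations and not only their rank, the single large observation pulls the upper expectile far above the mean.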

10.
arXiv (quant-ph) 2026-06-12

Characterizing the functional role of quantum coherence in energy transfer

arXiv:2606.13404v1 Announce Type: new Abstract: Quantum coherence is understood to play a role in excitation energy transfer in open quantum systems, yet a quantitative approach to assessing its influence on the transfer process is still missing. Using Nakajima-Zwanzig projection operators, we derive a general memory kernel identity that enables us to characterize and quantify the impact of coherence in the eigenenergy basis on a generalized rate of energy transfer. Applying our approach to the electronic dynamics of a dimer coupled to a structured phonon bath, we demonstrate how quantum coherence acts to modulate energy transfer.

11.
arXiv (CS.LG) 2026-06-16

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

arXiv:2606.05878v2 Announce Type: replace Abstract: Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder–regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

12.
bioRxiv (Bioinfo) 2026-06-20

A network approach to DNA methylation clocks

Biological age predicts health and lifespan better than chronological age, but remains difficult to measure. One leading molecular proxy for biological age is DNA methylation, which underlies age predictors known as "clocks". These clocks use penalized linear regression to predict chronological age from methylation levels using selected cytosine–guanine pairs (CpGs) along DNA. Although they predict chronological age within a few years and track mortality risk, there are several issues. Different clocks share a vanishingly small number of CpG sites, many of which show weak associations with age. Also, the clocks often do not transfer across methylation array platforms. This paper takes a network approach to better understand these issues. By using 12 public datasets from human blood, we build a co-methylation network of the sites that show the strongest age correlation. After pruning weak links, we find that it has a small number of large modules of covarying CpGs surrounded by many small modules and singleton sites. These modules are biologically interpretable, as they are associated with CpG island contexts and enriched for distinct Gene Ontology functions. We also map five established clocks onto this network (Horvath, Hannum, AltumAge, Skin & Blood, and Han) and find that they select some CpGs from the same module. This suggests that they are more similar than they appear. The network structure also suggests new ways to build clocks. A simple clock that retains one CpG per module matches the performance of established clocks. A second one, built from module-level principal components, outperforms all five established clocks in three validation cohorts and is transferable across array platforms (Illumina Infinium Methylation 450K or EPIC arrays). Overall, the network perspective shifts attention from individual CpG sites to modules of covarying sites. 
This perspective helps explain why DNA methylation clocks perform so well despite their differences and provides a more systematic approach for developing the next generation of aging biomarkers.

13.
arXiv (CS.LG) 2026-06-25

Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning

arXiv:2606.25526v1 Announce Type: new Abstract: Cooperative multi-agent reinforcement learning assumes each agent shares the same reward function and can be trained effectively using the single-agent Trust Region framework. Instead of relying on other agents' actions, the independent actors setting considers each agent to act based only on its local information, thus having more flexible applications. However, the sequential update framework requires re-estimating the joint advantage function after each individual agent's policy step. Despite the practical success of importance sampling, the updated advantage function suffers from exponentially high variance, which is likely to result in unstable convergence. In this work, we first analyze the high-variance advantage both empirically and theoretically. To overcome this limitation, we introduce a clipping objective to control the upper bounds of the advantage fluctuation in sequential updates. With the proposed objective, we provide a monotonic bound with sub-linear convergence to $\epsilon$-Nash Equilibria. We further derive two new practical algorithms using our clipping objective. Experimental results on three popular multi-agent reinforcement learning benchmarks show that our proposed method outperforms the tested baselines in most environments. A careful analysis of different training settings shows that our proposed method combines stable convergence with the desired low-variance advantage estimation. For reproducibility, our source code is publicly available at https://github.com/giangbang/Low-Variance-Trust-Region-MARL.

14.
arXiv (CS.CL) 2026-06-15

Detecting undisclosed LLM-generated content in parliamentary texts

In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

15.
arXiv (CS.LG) 2026-06-11

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

arXiv:2606.11251v1 Announce Type: new Abstract: Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz–96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.

16.
arXiv (CS.LG) 2026-06-25

Flexible Gravitational-Wave Parameter Estimation with Transformers

arXiv:2512.02968v2 Announce Type: replace-cross Abstract: Gravitational-wave data analysis relies on accurate and efficient methods to extract physical information from noisy detector signals, yet the increasing rate and complexity of observations represent a growing challenge. Deep learning provides a powerful alternative to traditional inference, but existing neural models typically lack the flexibility to handle variations in data analysis settings. Such variations accommodate imperfect observations or are required for specialized tests, and could include changes in detector configurations, overall frequency ranges, or localized cuts. We introduce a flexible transformer-based architecture paired with a training strategy that enables adaptation to diverse analysis settings at inference time. Applied to parameter estimation, we demonstrate that a single flexible model, called Dingo-T1, can (i) analyze 48 gravitational-wave events from the third LIGO-Virgo-KAGRA Observing Run under a wide range of analysis configurations, (ii) enable systematic studies of how detector and frequency configurations impact inferred posteriors, and (iii) perform inspiral-merger-ringdown consistency tests probing general relativity. Dingo-T1 also improves median sample efficiency on real events from a baseline of 1.4% to 4.2%. Our approach thus demonstrates flexible and scalable inference with a principled framework for handling missing or incomplete data, key capabilities for current and next-generation observatories.
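
The sample efficiency figures in this abstract are easier to read with the usual importance-sampling definition in mind. The sketch below assumes that definition: effective sample size divided by the number of samples, with the effective sample size computed from the importance weights. The abstract does not state the formula, so treat it as an assumption; the weights are made up.

```python
def sample_efficiency(weights):
    """Effective sample size divided by the number of samples.

    Uses the standard estimate ESS = (sum w)^2 / sum(w^2), so the
    result lies between 1/n (one dominant weight) and 1 (equal weights).
    """
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (n * total_sq)

# Equal weights: every sample counts fully.
uniform = sample_efficiency([1.0, 1.0, 1.0, 1.0])
# One dominant weight: only one of four samples is effective.
degenerate = sample_efficiency([1.0, 0.0, 0.0, 0.0])
```

Under this definition, moving from 1.4% to 4.2% median efficiency means that 100,000 proposal samples yield roughly 4,200 effective samples instead of roughly 1,400.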

17.
PLOS Medicine 2026-05-27

Sequential chemo-immunotherapy followed by standard versus reduced thoracic radiotherapy for older and/or frail stage III non-small-cell lung cancer: A randomized open-label cohort trial

Authors:

by Wei-Xiang Qi, Shuyan Li, Mengdi Wang, Huan Li, Feifei Xu, Lei Yao, Biao Yu, Linlin Chen, Gang Cai, Cheng Xu, Xianwen Sun, Zhiyao Bao, Jiayi Chen, Yi Xiang, Shengguang Zhao

Background: The appropriateness of concurrent chemoradiotherapy (cCRT) for older or clinically vulnerable stage III unresectable non-small-cell lung cancer (NSCLC) patients remains contentious. Furthermore, the survival implications of de-escalating thoracic radiotherapy (RT) intensity in this population have not been conclusively elucidated.

Methods and findings: We conducted a phase II randomized, open-label, two-cohort (non-comparative) trial at a tertiary hospital in China (NCT05557552). Between September 30, 2022 and April 30, 2024, we enrolled 56 older and/or frail patients with stage III NSCLC who were ineligible for cCRT. The primary endpoint was the 1-year progression-free survival (PFS) rate estimated using the Kaplan–Meier method. Secondary endpoints included objective response rate (ORR), overall survival (OS), and safety. In the intention-to-treat (ITT) set, which included all 56 randomized patients who received at least one dose of study treatment, the 1-year PFS was 84.3% (95% confidence interval [CI] [70.3%, 98.3%]) in the standard RT group and 70.7% (95% CI [54.3%, 87.1%]) in the reduced RT group. In the per-protocol set (53 patients), the 1-year PFS was 82.9% (95% CI [68.9%, 98.8%]) in the standard RT group and 73.4% (95% CI [58.3%, 92.4%]) in the reduced RT group, with a median follow-up of 24 months. Among 56 patients in the safety analysis set, 71.4% of patients experienced grade 3/4 adverse events (AEs) in the standard RT group and 53.6% in the reduced RT group. One patient (3.6%) in the reduced RT group and three patients (10.7%) in the standard RT group experienced grade 5 AEs. The main limitations are the non-comparative design, small sample size, and lack of power to establish non-inferiority or superiority.

Conclusion: This study suggests that reduced RT combined with sequential chemo-immunotherapy might be feasible for older/frail patients intolerant to cCRT, showing numerically similar survival outcomes. These exploratory findings warrant confirmation in larger, adequately powered randomized trials.

Trial registration: The trial was registered on ClinicalTrials.gov on Sep 30, 2022 (ClinicalTrials.gov NCT05557552).
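
The safety percentages can be checked against patient counts with a few lines of arithmetic. The sketch assumes 28 patients per arm, which the abstract does not state directly but which follows from 56 patients in the safety set and from one patient corresponding to 3.6%. The grade 3/4 counts (20 and 15) are inferred from the reported percentages, not reported themselves.

```python
PER_ARM = 28  # assumption: 56 patients split evenly between the two arms

def pct(count, total=PER_ARM):
    # Percentage rounded to one decimal, as reported in the abstract.
    return round(100 * count / total, 1)

grade5_reduced = pct(1)     # one patient in the reduced RT group
grade5_standard = pct(3)    # three patients in the standard RT group
grade34_standard = pct(20)  # inferred count behind the reported 71.4%
grade34_reduced = pct(15)   # inferred count behind the reported 53.6%
```

All four values reproduce the reported percentages, so the figures are consistent with equal arms of 28 patients.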

18.
arXiv (CS.CL) 2026-06-19

Source-Grounded Data Generation for Text-to-JSON Learning

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
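
The two reported metrics can be sketched as follows. The definitions are assumptions made for illustration: exact match is taken to mean the predicted JSON equals the reference as a whole, and value accuracy is taken to be the fraction of reference leaf values reproduced at the same path. STAGE-Eval's actual scoring may differ, and the example record is invented.

```python
import json

_MISSING = object()  # sentinel so a missing path never equals JSON null

def flatten(obj, prefix=""):
    """Flatten nested JSON into a {path: leaf_value} mapping."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}/{key}"))
    return flat

def exact_match(pred, gold):
    # Assumed definition: the whole predicted object equals the reference.
    return pred == gold

def value_accuracy(pred, gold):
    # Assumed definition: share of reference leaves matched at the same path.
    gold_flat, pred_flat = flatten(gold), flatten(pred)
    if not gold_flat:
        return 1.0 if not pred_flat else 0.0
    hits = sum(1 for path, value in gold_flat.items()
               if pred_flat.get(path, _MISSING) == value)
    return hits / len(gold_flat)

gold = json.loads('{"company": "Acme", "revenue": {"2023": 120, "2024": 150}}')
pred = json.loads('{"company": "Acme", "revenue": {"2023": 120, "2024": 155}}')

em = exact_match(pred, gold)
va = value_accuracy(pred, gold)
```

In this example a single wrong figure makes the exact match fail while two of the three values are still correct, which is why the abstract's value accuracy sits above its exact match score.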

19.
arXiv (CS.CV) 2026-06-15

A Qualitative Review of GenAI-Based Methods for Data Generation and Augmentation in Industrial Computer Vision Applications

AI-driven computer vision applications require an extensive database to ensure predictable behavior and performance. Such predictable behavior is especially important for industrial applications in gaining trust from users. However, such a database is not readily available in industrial applications, and its acquisition is not trivial either. Active learning methods can be applied to ramp up data within a project deployment to iteratively increase the database, and thus the application predictability. Unfortunately, we observe that this often leads to a loss of user trust in the application, which is difficult to regain once lost. This leads to a "chicken-and-egg" dilemma in which neither the database nor the application is developed. In this work, we review state-of-the-art methods and approaches to further boost the database during the initial active data ramp-up phase. Here, we focus on recent advancements in GenAI-based data generation and augmentation methods and review their adaptability on an industrial computer vision classification use case. Although we observe a potential for automatic data ramp-up, we also see a domain mismatch between the source (training environment) and the target (industrial use case) regarding context defined in natural language and object characteristics.

20.
arXiv (quant-ph) 2026-06-25

Sp(2N, R) interferometry in multi-mode Gaussian bosonic systems for optimal metrology and quantum control

arXiv:2606.25768v1 Announce Type: new Abstract: Multi-mode interferometers for bosons in Gaussian states are important systems for quantum metrology with precision beyond the standard quantum limit and for bosonic quantum computing. However, there is a lack of theoretical foundation for generic $N$-mode Gaussian interferometry. In this work, we study quantum metrology and quantum control in multi-mode bosonic systems with quadratic Hamiltonians, exploiting the fundamental Sp$(2N,R)$ symmetry of such interferometers. We show that the optimal quantum control to maximize sensitivity requires aligning squeezing and displacement in the same direction. We propose Sp$(2N,R)$ echo, a multi-mode generalization of the SU$(1,1)$ interferometry, to achieve the sensitivity of phase estimation set by the quantum Fisher information. In addition, we introduce a geometrical means for reversing many-body dynamics with Sp$(2N,R)$ dynamical symmetry, such as dynamics of the bosonic Kitaev chain. Our schemes are readily realizable in optical, atomic, and mechanical platforms.

21.
arXiv (CS.CV) 2026-06-17

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

22.
arXiv (CS.AI) 2026-06-16

LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

arXiv:2606.16802v1 Announce Type: new Abstract: Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

23.
arXiv (CS.LG) 2026-06-25

Multifidelity-Augmented Gaussian Process Inputs for Surrogate Modeling from Scarce Data

arXiv:2603.22050v2 Announce Type: replace-cross Abstract: Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. In many engineering and scientific settings, cheaper low-fidelity models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of multifidelity machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. Similarly to cokriging estimators, the proposed approach conditions the high-fidelity surrogate model on the predictions of all available low-fidelity surrogate models, while benefiting from the computational efficiency of autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.
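
As a rough illustration of the input-augmentation idea described above (not the paper's implementation), the sketch below fits a low-fidelity Gaussian process surrogate and appends its prediction as an extra input feature for the high-fidelity surrogate. Everything specific is an assumption made for the example: the `SimpleGP` class (zero mean, RBF kernel, fixed hyperparameters), the one-dimensional toy functions `f_high` and `f_low`, the sample locations, and the length scales.

```python
import numpy as np

def rbf_kernel(A, B, length_scale):
    """Squared-exponential kernel with scalar or per-dimension length scales."""
    diff = (A[:, None, :] - B[None, :, :]) / np.asarray(length_scale, dtype=float)
    return np.exp(-0.5 * (diff ** 2).sum(axis=-1))

class SimpleGP:
    """Zero-mean GP regression with fixed hyperparameters (illustrative only)."""

    def __init__(self, length_scale, jitter=1e-6):
        self.length_scale = length_scale
        self.jitter = jitter

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        K = rbf_kernel(self.X, self.X, self.length_scale)
        K += self.jitter * np.eye(len(self.X))
        self.alpha = np.linalg.solve(K, np.asarray(y, dtype=float))
        return self

    def predict(self, X_new):
        """Return the posterior mean at X_new."""
        K_star = rbf_kernel(np.asarray(X_new, dtype=float), self.X, self.length_scale)
        return K_star @ self.alpha

# Assumed toy pair: an "expensive" function and a cheap, biased approximation.
def f_high(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def f_low(x):
    return 0.5 * f_high(x) + 10 * (x - 0.5) - 5

x_lo = np.linspace(0, 1, 11)[:, None]            # plentiful low-fidelity samples
x_hi = np.array([[0.0], [0.4], [0.6], [1.0]])    # scarce high-fidelity samples

# Step 1: low-fidelity surrogate trained on cheap data.
gp_lo = SimpleGP(length_scale=0.1).fit(x_lo, f_low(x_lo[:, 0]))

def augment(x):
    """Append the low-fidelity surrogate prediction as an extra input feature."""
    return np.hstack([x, gp_lo.predict(x)[:, None]])

# Step 2: high-fidelity surrogate trained on the augmented inputs.
gp_hi = SimpleGP(length_scale=[0.3, 5.0]).fit(augment(x_hi), f_high(x_hi[:, 0]))

x_test = np.linspace(0, 1, 50)[:, None]
y_pred = gp_hi.predict(augment(x_test))
```

With several low-fidelity models, `augment` would append one column per low-fidelity surrogate. In practice the hyperparameters would be fitted (for example by maximizing the marginal likelihood) rather than fixed by hand as they are here.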

24.
arXiv (CS.CL) 2026-06-12

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
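
The inference-time style switch can be pictured with a small sketch. It assumes a hypothetical per-style database of key embeddings paired with stored waypoint demonstrations, and it uses plain cosine similarity in place of the trained retrieval head; the names `style_db`, `top_k_cosine`, and `retrieve_context`, the array shapes, and the random data are all illustrative and not taken from the paper.

```python
import numpy as np

def top_k_cosine(query, keys, k):
    """Return indices of the k keys most similar to the query (cosine similarity)."""
    q = query / np.linalg.norm(query)
    keys_normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    scores = keys_normed @ q
    return np.argsort(-scores)[:k]

# Hypothetical per-style databases: 100 entries each, a 16-d key embedding
# paired with a demonstration of 4 future (x, y) waypoints.
rng = np.random.default_rng(0)
style_db = {
    style: {
        "keys": rng.normal(size=(100, 16)),
        "waypoints": rng.normal(size=(100, 4, 2)),
    }
    for style in ("aggressive", "neutral", "conservative")
}

def retrieve_context(query_embedding, style, k=3):
    """Fetch k demonstrations from the database of the requested style."""
    db = style_db[style]
    idx = top_k_cosine(query_embedding, db["keys"], k)
    return db["waypoints"][idx]

# The same query yields different in-context demonstrations per style;
# only the database being queried changes, not the model.
query = rng.normal(size=16)
context_aggressive = retrieve_context(query, "aggressive")
context_conservative = retrieve_context(query, "conservative")
```

In this picture the backbone would receive the retrieved demonstrations as additional context for waypoint prediction, so choosing a style amounts to choosing a dictionary key.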

25.
arXiv (CS.LG) 2026-06-16

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit workflows may inspect traces for misleading or biased input. In such settings, two responses can receive the same final-answer score while differing in whether the trace explicitly flags injected biasing content. Accuracy-only evaluation collapses these cases. We study this gap as a measurement blind spot for responsible evaluation and introduce a minimal trace-level diagnostic with two axes: susceptibility (whether the bias breaks a previously correct answer) and acknowledgment (whether the trace contains a rubric-defined surface reference to the injected content). Across thousands of biased GSM8K trials, GPT-4o and Claude Sonnet 4 have similar susceptibility rates (1.3% vs. 1.2%) but substantially different acknowledgment rates (13.0% vs. 75.0%) under the same rubric.
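
The two axes can be made concrete with a small sketch. The record fields, the toy trials, and the choice of denominators (susceptibility over trials that were correct without the bias, acknowledgment over all biased trials) are assumptions made for illustration; the abstract does not state exactly how the rates are computed.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    baseline_correct: bool  # answer was correct without the injected bias
    biased_correct: bool    # answer was correct with the injected bias
    acknowledged: bool      # rubric found a reference to the injected content

def susceptibility_rate(trials):
    """Share of baseline-correct trials whose answer breaks under bias."""
    eligible = [t for t in trials if t.baseline_correct]
    if not eligible:
        return 0.0
    return sum(not t.biased_correct for t in eligible) / len(eligible)

def acknowledgment_rate(trials):
    """Share of biased trials whose trace mentions the injected content."""
    if not trials:
        return 0.0
    return sum(t.acknowledged for t in trials) / len(trials)

# Toy data, not results from the paper.
trials = [
    Trial(baseline_correct=True, biased_correct=True, acknowledged=True),
    Trial(baseline_correct=True, biased_correct=False, acknowledged=False),
    Trial(baseline_correct=True, biased_correct=True, acknowledged=False),
    Trial(baseline_correct=False, biased_correct=False, acknowledged=True),
]

susceptibility = susceptibility_rate(trials)
acknowledgment = acknowledgment_rate(trials)
```

Because the two rates are computed from different fields, two models can match on susceptibility while differing widely on acknowledgment, which is the distinction an accuracy-only score cannot see.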