Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-12

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

02.
arXiv (CS.LG) 2026-06-12

How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

arXiv:2606.13443v1 Announce Type: new Abstract: Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers' equations, AMGFNO achieves 55-79% nRMSE reduction over at low resolution, with the learned gate value automatically decreasing from $\bar{g} \approx 0.7$ to near-zero as resolution increases.

03.
arXiv (math.PR) 2026-06-16

Malliavin Calculus for the stochastic Cahn-Hilliard equation driven by fractional noise

arXiv:2601.10490v2 Announce Type: replace Abstract: The stochastic partial differential equation analyzed in this work is the Cahn-Hilliard equation perturbed by an additive fractional white noise (fractional in time and white in space). We work in the case of one spatial dimension and apply Malliavin calculus to investigate the existence of a density for the stochastic solution $u$. In particular, we show that $u$ admits continuous paths almost surely and construct a localizing sequence through which we prove that its Malliavin derivative exists locally, and that its law is absolutely continuous with respect to the Lebesgue measure on $\bf R$, establishing thus that a density exists. A key contribution of this work is the analysis of the stochastic integral appearing in the mild formulation: we derive sharp estimates for the expectation of the $p$-th power ($p \geq 2$) of the $L^{\infty}(D)$-norm of this stochastic integral as well as for the integral involving the $L^{\infty}(D)$-norm of the operator associated with the kernel appearing in the integral representation of the fractional noise, all of which are essential for this study.

04.
arXiv (quant-ph) 2026-06-19

Mapping molecular polariton transport via pump-probe microscopy

arXiv:2504.15501v4 Announce Type: replace Abstract: We demonstrate how the transport properties of molecular polaritons in optical cavities can be extracted from a microscopic modeling of pump-probe spectroscopy. Our approach combines a mean-field treatment of the light-matter Hamiltonian with a perturbative expansion of both light and matter components, along with spatial coarse-graining. This approach extends semiclassical cavity spectroscopy to multimode light-matter interactions, providing full access to spatially resolved transient spectra. By simulating a microscopy experiment with counter-propagating pump and probe pulses, we compute the differential transmission and show how molecular dephasing and persistent dark exciton populations drive sub-group-velocity transport of the root-mean-square displacement. We analyze transport across the polariton dispersion, showing how velocity renormalization correlates with excitonic weight, consistent with experimental observations, and further its dependence on the rate of molecular dephasing. Our results highlight the need to consider measured spectroscopic observables when characterizing transport in polaritonic systems.

05.
arXiv (CS.AI) 2026-06-18

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

arXiv:2606.01139v3 Announce Type: replace Abstract: Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates, it retains the first verifier-passing skill within the revision budget and falls back to empirical utility only when no candidate succeeds. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills transfer across both executors and task environments, suggesting that SkillRevise captures reusable procedural knowledge beyond any single executor.

06.
arXiv (CS.AI) 2026-06-11

TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability

arXiv:2605.14738v3 Announce Type: replace-cross Abstract: Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.

07.
arXiv (CS.AI) 2026-06-17

Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

arXiv:2606.17099v1 Announce Type: cross Abstract: AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.

08.
medRxiv (Medicine) 2026-06-22

Study protocol: Feasibility and clinical implications of real-time cerebral autoregulation monitoring in major noncardiac surgery with the Medtronic Cotrending algorithm (AUTOREGULATE-NONCARDIAC-COTRENDING)

Background: Perioperative hypotension is associated with postoperative organ injury. However, trials of hypotension avoidance have not found meaningful improvements in postoperative cardiovascular, renal, neurological or functional outcomes. One possible explanation is that organ perfusion depends on patients individual autoregulatory ranges. Hence, technology enabling monitoring of the autoregulatory status of vital organs, e.g. the brain, could provide a physiologic basis for personalising of blood pressure targets. However, current established methodologies for monitoring cerebral autoregulation in noncardiac surgery, e.g. the cerebral oximetry index (COx), are limited by performance and usability. The Medtronic Cotrending algorithm has been developed to provide automated, near real-time assessment of cerebral autoregulation. While feasibility was demonstrated in cardiac surgery, its applicability in major noncardiac surgery remains unknown. This study aims to evaluate the technical feasibility and clinical implications of Cotrending-based cerebral autoregulation monitoring in major noncardiac surgery. Objectives: Primary objective: To evaluate the technical feasibility of using the Medtronic Cotrending algorithm to monitor intraoperative cerebral autoregulation in real-time during major noncardiac surgery, drawing comparisons to the COx algorithm. Secondary objectives: to investigate the potential clinical implications of Cotrending-based cerebral autoregulation monitoring. Design: Single-centre, prospective cohort study. Setting: Swiss tertiary care centre Patients: Patients enrolled in AUTOREGULATE-NONCARDIAC who were monitored intraoperatively with the Medtronic INVOS(TM) 5100 near-infrared spectroscopy (NIRS) system. Outcomes: Technical feasibility outcomes include success rate of determination of the lower limit of cerebral autoregulation, intraoperative uptime, time to first estimate of the lower limit of cerebral autoregulation, sensitivity to external factors and to data artefacts; agreement of Cotrending-derived lower limit of cerebral autoregulation with COx-derived lower limit of cerebral autoregulation. Conclusions: N/A Trial registration: Clinicaltrials.gov NCT07630129

09.
arXiv (CS.LG) 2026-06-17

Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using Fairlogue and the All of Us Research Program

arXiv:2604.16450v2 Announce Type: replace-cross Abstract: Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.

10.
arXiv (CS.LG) 2026-06-18

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

11.
medRxiv (Medicine) 2026-06-22

Maternal-Fetal immune networks and viral signatures in the healthy amniotic cavity

The intrauterine environment has traditionally been viewed as a privileged site protected by the placental barrier. However, emerging evidence suggests that early in utero microbial exposure may prime the developing fetal immune system. Here, using target-enriched metagenomics and high-dimensional proteomics, we characterized the intra-amniotic viral landscape and immune networks in 114 healthy pregnancies including both normal and anomalous fetuses. We identify a sparse yet heterogeneous human viral signature in 26% of samples, predominantly composed of Herpesviridae, Polyomaviridae, and Picornaviridae. Although viral reads abundance was associated with fetal abnormalities, viral detection generally did not induce overt inflammatory activation, supporting a state of immune homeostasis within the amniotic cavity. Instead, viral presence was associated with subtle and selective immune modulation, including altered inducible antimicrobial peptide expression (HBD-2 and HBD-3), coupled with an attenuation of regulatory cytokines. Our results further reveal that the amniotic immune environment is primarily governed by gestational age, transitioning from a Th1-predominant "alert" phase to innate-readiness preceding parturition. These findings suggest that fragments of viral genetic material within the amniotic cavity may contribute to fetal immune instruction without triggering overt inflammation, providing a foundational framework for understanding how "silent" viral-exposure during gestation influences the developmental origins of neonatal immunity.

12.
arXiv (CS.CL) 2026-06-16

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.

13.
arXiv (CS.CV) 2026-06-16

Physics-Driven Zero-Shot MRI Reconstruction with Non-local Image Priors

Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at https://github.com/Zolento/NS-SSL.

14.
Nature (Science) 2026-06-17

Molecular basis of polyadenylated RNA fate determination in the nucleus

作者:

Eukaryotic genomes generate a plethora of polyadenylated (pA+) RNAs1,2, which are packaged into ribonucleoprotein particles (RNPs). To ensure faithful gene expression, functional pA+ RNPs, including protein-coding RNPs, are exported to the cytoplasm, whereas transcripts within non-functional pA+ RNPs are degraded in the nucleus1–4. How cells distinguish these opposing fates remains unknown. The DExD-box ATPase UAP56 (also known as DDX39B) is a central component of functional pA+ RNPs, and promotes their docking to the nuclear pore complex-anchored TREX-25,6, which triggers transcript release from UAP56 to facilitate export7. Here we reveal that the poly(A) tail exosome targeting (PAXT) connection8 binds a TREX-2-like module, which releases pA+ RNAs from UAP56 for decay by the nuclear exosome. The core of this module consists of a LENG8–PCID2–SEM1 trimer, which we show is structurally and biochemically equivalent to the central GANP–PCID2–SEM1 trimer of TREX-2. Mutagenesis and transcriptomic data demonstrate that the nuclear fate of pA+ RNPs is governed by the contending actions of nucleoplasmic PAXT and nuclear pore complex-associated TREX-2, which interpret RNA-bound UAP56 as a signal for RNA decay or export, respectively. As RNA targets of PAXT are generally short and intron-poor, we propose an overall model for pA+ RNP fate determination whereby the distinct sub-nuclear localizations of PAXT and TREX-2 govern the degradation of short non-functional pA+ RNAs while allowing export of their longer and functional counterparts. Biochemical, structural&nbsp;and cell biological analyses reveal that UAP56 (DDX39B) assembles with a TREX-2–like module&nbsp;that redirects non-functional polyadenylated RNAs from export to degradation.

15.
arXiv (CS.CL) 2026-06-15

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.

16.
arXiv (CS.CL) 2026-06-12

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance(VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

17.
arXiv (CS.LG) 2026-06-17

A fairness-aware extension of Stochastic Multicriteria Acceptability Analysis for ranking

arXiv:2606.17756v1 Announce Type: new Abstract: Fairness has become a central concern in ranking problems involving individuals or social groups, particularly under the Responsible Artificial Intelligence agenda. In Multi-Criteria Decision Analysis, Stochastic Multicriteria Acceptability Analysis (SMAA) provides a robust framework for handling uncertainty and incomplete preference information, but it does not explicitly address fairness in the resulting rankings. This paper proposes SMAA-Fair, a fairness-aware extension of SMAA for ranking problems. The approach reweights the simulated rankings generated by SMAA according to their level of group fairness, so that fairer rankings contribute more strongly to the acceptability indices and central weights vector. The framework is independent of the aggregation model and can incorporate different fairness metrics. In this study, Statistical Parity, normalized discounted Kullback–Leibler divergence (rKL) and normalized discounted cumulative Kullback–Leibler divergence (nDKL) are adopted. Rankings are derived from the fairness-adjusted acceptability matrix using expected ranking and maximum acceptability ranking. We also derive the central weight according to the degree of fairness in the obtained rankings. Numerical experiments with synthetic and real data show that SMAA-Fair improves the representation of protected groups among favourable ranking positions, while preserving robustness to preference uncertainty.

18.
arXiv (CS.LG) 2026-06-11

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

arXiv:2606.12280v1 Announce Type: new Abstract: Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% CI $[+1.21,+2.64]$, excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8's weights match FP8's footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.

19.
arXiv (CS.CV) 2026-06-11

MedCTA: A Benchmark for Clinical Tool Agents

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/

20.
arXiv (CS.AI) 2026-06-19

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

arXiv:2606.19381v1 Announce Type: cross Abstract: Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.

21.
arXiv (CS.LG) 2026-06-16

Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators

arXiv:2606.15280v1 Announce Type: new Abstract: Most existing anomaly detection methods rely on estimating a probability density or learning an enclosing decision boundary, implicitly assuming that normal data occupies a region of non-zero volume in the ambient space. In contrast, structural anomaly detection considers data that lies near a low-dimensional manifold, creating a mismatch between the inductive bias of existing methods and the structure of the data, often resulting in degraded performance. To address this mismatch, we introduce a geometric perspective. Specifically, we learn a projection operator onto the manifold of normal samples and define a sample as anomalous if it is altered by this projection. This formulation naturally integrates the inductive bias of manifold-supported data and reframes anomaly detection in terms of a projection residual, thereby resolving issues arising from modeling degenerate distributions. Notably, it provides a unifying interpretation of reconstruction-based methods by explaining their success and failure in terms of projection quality. In particular, it explains the strong generalization ability of projection-aligned models as a consequence of contraction behavior toward the manifold. Moreover, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples, a widely recognized limitation of existing approaches. Empirically, we demonstrate that projection-aligned methods achieve strong performance, outperforming boundary-based methods while improving upon existing reconstruction-based approaches.

22.
arXiv (quant-ph) 2026-06-12

Multiple Topological Haldane Phases for Symmetry-Protected Quantum Information Processing

arXiv:2606.12685v1 Announce Type: new Abstract: Symmetry-protected topological phases have attracted significant interest at the fundamental level and as a potential platform for quantum information processing, owing to their protected edge states and resilience to perturbations. Applying these features for practical and efficient quantum computation is highly desirable, but remains an open challenge. Here, we demonstrate the partitioning into multiple independent Haldane phase subsystems of a single spin-1/2 ladder system and propose this as a scalable architecture for gate-based quantum computation, which takes advantage of the symmetry-protected topological order. We encode qubits in the two topological states of the $S^{z}=0$ sector of each subsystem. Finite-size effects, typically viewed as detrimental, instead provide a controllable energy splitting that enables single-qubit rotations using only local magnetic fields. An Ising-type interaction between neighboring subsystem edges generates entangling gates, enabling universal quantum computation driven by two control parameters that are easily accessible experimentally. Our results demonstrate how symmetry-protected topological phases can be directly harnessed for circuit-model quantum computation in realistic systems.

23.
arXiv (quant-ph) 2026-06-12

Candidate overtone shear horizontal SAW resonators in thin-film lithium niobate for intermodal acousto-optic modulation

arXiv:2606.12853v1 Announce Type: cross Abstract: The merits of thin-film surface acoustic wave (SAW) devices are pivotal to develop the high-performance intermodal acousto-optic modulators. In this work, we have proposed shear-horizontal (SH) SAW resonators for anticipated intermodal acousto-optic modulation on the thin-film lithium niobate platform. Through optimization of the cut angle of LN films, the SAW wavelength, and the thickness of interdigital transducer (IDT) electrodes, the calculated acousto-optic overlap factors utilizing SH0 modes are improved by more than an order of magnitude compared with those of Rayleigh modes. Furthermore, we have fabricated and characterized three kinds of proof-of-principle SH0 mode devices without/with grating reflectors. The electromechanical coupling coefficients (keff^2) and quality factors (Q) in the overtone resonators with grating reflectors are systematically evaluated, featuring the highest Q of 843 with the compromised keff^2 of 0.96%-4.72%. The results reveal that the temperature coefficients of frequency (TCF) of Rayleigh modes vary across various overtones, whereas the SH0 modes exhibit TCFs in the range of 32.3-68.9 ppm/C. Our fabricated SH0-mode overtone resonators demonstrate the capability of operating at power levels up to 29 dBm without electrode damage, offering a promising paradigm for robust and high-efficiency intermodal acousto-optic modulators with potential applications in integrated optical signal processing, microwave photonics,and quantum information technologies.

24.
medRxiv (Medicine) 2026-06-15

Wellbeing After Stroke-2 (WAterS-2): a feasibility study with process evaluation exploring inclusive, accessible, online psychological support after stroke

Objectives: Explore feasibility and acceptability of upskilling a workforce to deliver a co-developed intervention, based on Acceptance and Commitment Therapy (ACT), to support psychological adjustment post-stroke targeting underserved groups. Design: Multi-site, single-arm feasibility study with embedded mixed-methods process evaluation (ISRCTN17628580). Setting: Four NHS community stroke services across England. Participants: 1. Stroke survivors [&ge;]18 years of age, [&ge;]4 months post-stroke, reporting psychological difficulties adjusting to stroke, able to consent and access remote group sessions in English; 2. Group facilitators from NHS stroke services, not ACT specialists. Intervention: WAterS-2: an eight-session, remotely-delivered ACT-informed group intervention. Outcome measures: Recruitment, fidelity, safety, acceptability and perceived value were assessed using fidelity checklists, post-intervention surveys and semi-structured interviews with stroke survivors and facilitators. Clinical outcomes including mood (HADS), wellbeing (ONS4), psychological flexibility (AAQ-ABI), measured post-group and three-months later. Results: Nineteen stroke survivors recruited (mean 9.6 months post-stroke; n=5 (26%) minoritised ethnicities; n=10 (52%) with aphasia). Thirteen facilitators - including two peer support workers - delivered the intervention with fidelity following structured training across four services. Drop-out was low (2/19; 11%); with 15 (79%) attending [&ge;]5/8 sessions. Remote data collection was feasible (79% follow-up completion), with no adverse events recorded. Acceptability was high: survivors valued peer connection, grounding and mindfulness practices. ACT metaphors were helpful for some but challenging for others, including some with aphasia. Online delivery was suitable but limited informal connection. Facilitators reported increased capability, incorporating ACT skills into routine care. NHS workforce pressures and geographically-constrained referral pathways limited recruitment reach. Conclusions: WAterS-2 is feasible, safe, acceptable and inclusive. A mixed workforce, including NHS peer support workers, can be upskilled to deliver with fidelity. Inclusion of underserved groups is achievable but requires active strategies beyond standard NHS referral routes. Findings inform a provisional logic model and a future pragmatic trial.

25.
arXiv (CS.AI) 2026-06-15

No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements

arXiv:2606.14357v1 Announce Type: cross Abstract: Frontier coding models may spend substantial capacity learning not only program behavior, but also accidental entropy in human repositories. Such repositories contain valuable signals: tests, incidents, migrations, edge cases, product judgment, and operational history. These signals are entangled with framework churn, naming drift, generated-source ambiguity, dependency rituals, CI dialects, weak proof routes, and human-oriented review customs. We propose agent-first canonical code, a proof-carrying substrate that rewrites routine product software into canonical behavior profiles, typed change algebra, proof lanes, constrained edit grammars, semantic patch cells, runtime negative memory, and proof-carrying change objects. The core hypothesis is that quotienting software by behavior equivalence under a declared oracle can collapse equivalent encodings into governed representatives with explicit evidence and proof obligations. The endpoint is amortized cost per verified correct change, including source, context, reasoning, tools, verification, security, provenance, review, failed loops, defects, and foundry cost under a common oracle. Reported reduction bands are hypotheses, not measured frontier results. The proposed limit is a No-Accident Horizon: removable accident decreases until residual novelty, evidence, governance, risk, and future optionality dominate. For supported routine-product distributions, this gives a defensible planning target near 100-fold all-in cost reduction, not a guarantee for all software. Preliminary QLoRA experiments on Qwen2.5-Coder-14B show that 64,088 canonical trajectories are learnable and suppress tested forbidden-language markers, but do not establish behavior preservation, scaling economics, or verified-change cost. The contribution is a falsifiable program centered on minimum functional description length and verified-change cost.