Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-18

Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

arXiv:2606.18691v1 Announce Type: new Abstract: Pre-trained materials foundation models, or machine learning interatomic potentials, leverage general physicochemical knowledge to effectively approximate potential energy surfaces. However, they often require domain-specific calibration due to physicochemical diversity as well as mismatches between practical computational settings and those used in constructing the pre-training data. To address this, we propose a sparsity-promoting fine-tuning method that selectively updates model parameters by exploiting the structural properties of E(3)-equivariant materials foundation models. On energy and force prediction tasks across molecular and crystalline benchmarks, our method matches or surpasses full fine-tuning and equivariant low-rank adaptation while updating only $\sim$3~\% of parameters, and in some cases as little as $\sim$0.5~\%. Beyond energy and force calibration, we further demonstrate task generalizability by applying our method to magnetic moment prediction and magnetism-aware total energy modeling. Finally, analysis of sparsity patterns reveals physically interpretable signatures, such as enhanced $d$-orbital contributions in transition metal systems. Overall, our results establish sparsity-promoting fine-tuning as a flexible and interpretable method for domain specialization of equivariant materials foundation models.

02.
arXiv (CS.AI) 2026-06-16

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

arXiv:2606.16721v1 Announce Type: new Abstract: Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception–dynamics–planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

03.
arXiv (CS.CV) 2026-06-25

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Worldwide image geolocalization-the task of predicting GPS coordinates from images taken anywhere on Earth-poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.

04.
arXiv (quant-ph) 2026-06-17

A polynomial-time approximation scheme for minimum-weight decoding of topological codes

arXiv:2606.18145v1 Announce Type: new Abstract: Two-dimensional topological translationally invariant (2D TTI) stabilizer codes lie at the heart of fault-tolerant quantum computation, but using them requires solving the decoding problem. Minimum-weight decoding of these codes was recently shown to be NP-hard, even in basic settings, such as the color code with Pauli $Z$ errors and the toric code with Pauli $X$, $Y$ and $Z$ errors. Here, we prove that minimum-weight decoding of 2D TTI codes nonetheless admits a polynomial-time approximation scheme (PTAS), i.e., for any constant $\varepsilon>0$, a recovery operator of weight within a multiplicative factor of $1+\varepsilon$ of the minimum can be found in polynomial time. Our approach builds on Arora's PTAS for Euclidean problems, such as the traveling salesman problem, and applies when decoding can be cast in terms of point-like excitations connected by string-like errors. It therefore extends beyond two dimensions, covering certain higher-dimensional topological codes and quantum memories, including the toric code with phenomenological or circuit-level noise.

05.
arXiv (CS.CL) 2026-06-25

Unintended Negative Impacts of Promotional Language in Patent Evaluation

Promotional language has been increasingly used to aid the communication of innovative ideas in science. Yet, less is known about its role in the context of technological innovation. Here, we use a validated and domain-diagnosed lexicon of 135 promotional words to study the association between promotional language and patent evaluation outcomes among 2.7 million USPTO patent applications. Our large-scale study reveals three unexpected findings. First, in contrast to scientific evaluation, we find that a higher frequency of promotional words is negatively associated with the probability of an application being (i) granted a patent, (ii) transferred ownership, and (iii) successfully appealed. This promotional penalty holds even after accounting for a range of confounding factors and is largely robust across different technological areas. Among matched samples, the difference in the success rate between the lowest and highest promotional density quintile is 5.5, 5.9, and 5.3 percentage points for patentability, transferability, and rejection reversal. Second, contrary to institutional skepticism, we show that promotional language is not a mask of weak technology, but objectively reflects the degree of combinatorial novelty and future citation impact. Third, digging into the mechanisms, we find that the tolerance to promotional framing is strongly moderated by human factors, with men and experienced examiners showing a higher acceptance of promotional narratives than women and novice examiners. By revealing an emerging paradox in the patent system, our study offers theoretical and practical implications for improving patent evaluation through more objective scrutiny of linguistic patterns in patent filings.

06.
arXiv (quant-ph) 2026-06-24

Flexible Catalysis

arXiv:2510.01065v2 Announce Type: replace Abstract: In quantum information and computation, a central challenge is to determine which quantum states can be transformed into which others under restricted sets of free operations. While many transformations are impossible directly, catalytic processes can enable otherwise forbidden conversions: an auxiliary quantum state (the catalyst) facilitates the transformation while remaining unchanged. In this work, we introduce flexible catalysis, a generalization in which the catalyst is allowed to transform into a different auxiliary state, provided it remains a valid catalyst. We show that this framework subsumes both standard catalytic and multicopy transformations, and we analyse its advantages across several classes of free operations. In particular, we prove that when the free operations are local unitaries or permutation matrices, flexible catalysis enables state extractions that are unattainable with standard catalysis alone.

07.
arXiv (CS.CL) 2026-06-12

Unraveling Syntax: Language Modeling and the Substructure of Grammars

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest – such as natural language syntax, coding languages, arithmetic – are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

08.
arXiv (CS.CV) 2026-06-25

KidRisk: Benchmark Dataset for Children Dangerous Action Recognition

Children are naturally energetic, and during their spontaneous activities, they often encounter potentially dangerous situations, especially when lacking parental supervision. Identifying actions that pose risks plays a crucial role in ensuring their safety. This paper build a novel challenging dataset, namely KidRisk, including 2,500 short videos of children's actions and 10,000 images for dangerous action of children. We also introduce a benchmark on our newly constructs dataset and find that traditional deep learning models demonstrated limited effectiveness on these tasks. Therefore, we develop vision-language based baselines with exceptional context understanding of visual information. Our proposed methods achieved an accuracy of 83.53% in classifying children's actions and 96.14% in recognizing children's dangerous actions, significantly outperforming traditional approaches. These results confirm that vision-language models are not only feasible but also highly effective in detecting hazardous actions, contributing positively to safeguarding children's safety.

09.
arXiv (CS.AI) 2026-06-12

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

arXiv:2604.16689v2 Announce Type: replace Abstract: Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of exact recovery necessarily converges to one in error for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.

10.
arXiv (CS.LG) 2026-06-15

Gradient boosting for extremes: sampling theory and application to insurance

arXiv:2606.14268v1 Announce Type: cross Abstract: We develop a statistical learning theory for gradient boosting applied to the estimation of covariate-dependent Generalized Pareto (GP) distributions in the context of Peaks-over-Threshold modeling. After an orthogonal reparametrization of the GP likelihood that diagonalizes its Fisher information matrix, we cast the estimation problem within the Empirical Risk Minimization (ERM) framework and derive non-asymptotic error bounds for the boosting estimator. Our analysis accounts for three distinct sources of error in the process: statistical fluctuations, the approximation bias inherent to the asymptotic nature of the GP model-controlled under second-order regular variation-and the approximation error associated with the finite number of boosting iterates, making explicit the resulting bias-variance trade-off. We illustrate the practical benefits of the reparametrization through simulations, showing that it significantly reduces gradient correlation during training and improves convergence stability. The methodology is applied to a medical malpractice insurance dataset from the Texas Department of Insurance, comprising over 18 000 closed claims. The gradient boosting approach yields a good fit for the tail of settlement cost distributions and reveals that the number of days to settlement is the dominant predictor of tail heaviness, consistent with earlier findings in the reserving literature.

11.
arXiv (CS.CL) 2026-06-11

EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.

12.
arXiv (math.PR) 2026-06-11

Capital Asset Pricing Model with Size Factor and Normalizing by Volatility Index

arXiv:2411.19444v5 Announce Type: replace-cross Abstract: The Capital Asset Pricing Model (CAPM) relates a well-diversified stock portfolio to a benchmark portfolio. We insert size effect in CAPM, capturing the observation that small stocks have higher risk and return than large stocks, on average. For some size-based stock portfolios, dividing their returns by the Volatility Index makes them closer to independent and normal. In this article, we combine these ideas to create a new discrete-time model, which includes volatility, relative size, and CAPM. We fit this model using real-world data, prove the long-term stability, and connect this research to Stochastic Portfolio Theory. We fill important gaps in our previous article on CAPM with the size factor.

13.
arXiv (quant-ph) 2026-06-25

Collective rotational cat states of molecules in microwave cavities

arXiv:2606.25815v1 Announce Type: new Abstract: We show theoretically that an ensemble of polar molecules coupled to a microwave cavity supports hybrid rotational-photonic cat states. The cavity couples to a symmetric rotor in the bright manifold of $N$ molecules with $\sqrt{N}$-enhancement. In the dispersive limit of the collective strong coupling regime, virtual multilevel transitions induce an effective Kerr nonlinearity, as confirmed by Wigner tomography and a Schrieffer-Wolff analysis, leading to parity-locked cat structure in the cavity sectors. Collective molecular rotations thus provide a new route to hybrid light-matter cat states.

14.
arXiv (CS.LG) 2026-06-17

Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias

arXiv:2606.17886v1 Announce Type: new Abstract: Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov–Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with MKAN, a KAN with hard monotonicity guaranteed for all parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a representation-cost theorem: any $C^K, K >0$ feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at $N' = N^* + k \le 2N^*$, where $k$ is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the $2N^*$ prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.

15.
arXiv (CS.AI) 2026-06-15

CisTransCell: Single-Cell Perturbation Prediction via Gene Function, Regulatory Control, and Cellular Context

arXiv:2606.13713v1 Announce Type: cross Abstract: Predicting cellular transcriptional responses to genetic perturbations is a central problem in single-cell biology, especially in the zero-shot setting where the perturbed gene or gene combination is unseen during training. A major difficulty is that perturbation effects are not determined by expression state alone: they depend on how the perturbed gene product influences other genes and proteins, how those downstream factors act on cis-regulatory elements, and which regulatory programs are active in the current cell state. To better capture this biological complexity, we propose CisTransCell, a cell-conditioned multi-modal framework for single-cell perturbation prediction that augments each gene with two complementary priors: a regulatory-sequence prior that captures how the gene is controlled, and a coding-sequence prior that captures what the gene product does. By integrating these priors with cellular expression state, CisTransCell models perturbation response as a cascade from gene function to regulatory control to downstream transcriptional change. Experiments on benchmark single-cell perturbation datasets show that CisTransCell achieves strong performance in zero-shot perturbation prediction.

16.
arXiv (CS.CV) 2026-06-17

ProCUA-SFT Technical Report

Training computer-use agents (CUAs) – models that interact with graphical desktops through screenshots and keyboard/mouse actions – requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content – 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs – and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld – an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.

17.
arXiv (CS.AI) 2026-06-16

AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance

arXiv:2606.16292v1 Announce Type: cross Abstract: The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks. To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% "license drift" rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing.

18.
arXiv (CS.AI) 2026-06-19

Sensorimotor World Models: Perception for Action via Inverse Dynamics

arXiv:2606.20104v1 Announce Type: cross Abstract: Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.

19.
arXiv (CS.CV) 2026-06-25

VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks

Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0\% on Wan 2.7 and 74.8\% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2\% and the harmfulness score by 73.4\% on average, while maintaining the model's utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.

20.
arXiv (CS.LG) 2026-06-16

Bayesian Optimization for Learning Nonlinear MPC in Autonomous Agent Navigation

arXiv:2606.14763v1 Announce Type: cross Abstract: Real-time autonomous navigation in dynamic, unknown environments remains a fundamental challenge for mobile robotics. We propose a map-free framework that tightly integrates reactive rolling-horizon planning with nonlinear Model Predictive Control (MPC). At each control cycle, a LiDAR-based Gaussian occupancy representation is constructed and used to generate collision-free trajectories via A* search, which are then tracked by a CasADi/IPOPT MPC formulation incorporating a smooth sigmoid obstacle barrier. To improve robustness to parameter sensitivity, we adopt an offline Bayesian optimization scheme based on Tree-structured Parzen Estimators (TPE), which identifies near-optimal controller parameters with respect to a composite navigation objective. In addition, a Gaussian Process surrogate is used to analyze parameter sensitivity and provide insight into the optimization landscape. The proposed framework is robot-agnostic and is evaluated on the Unitree Go2 quadruped in simulation using Gazebo, followed by deployment on the physical robot. Experimental results show that parameters tuned in simulation transfer effectively to hardware, maintaining comparable performance without additional tuning. The full system achieves up to a 90.0\% navigation success rate when deployed, along with a 38.9\% average improvement in the evaluation metrics across simulated environments.

21.
arXiv (CS.AI) 2026-06-25

The Token Not Taken: Sampling, State, and the Stochasticity of AI Agents

arXiv:2606.08998v2 Announce Type: replace Abstract: Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. At the core of many current agents is a foundation model, a large pretrained model adaptable to many downstream tasks, embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, this tutorial clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.

22.
arXiv (CS.AI) 2026-06-17

Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work

arXiv:2606.17099v1 Announce Type: cross Abstract: AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned work package, and acceptance context, as the unit of analysis for delegated coding work, but did not measure its effects. This paper reports a controlled pilot study of explicit delegation contracts for coding agents. We built a dependency-free TypeScript API task environment with seeded defects and documentation gaps, authored ten tasks across five families, and ran 64 agent executions across two model tiers under three conditions: a realistic issue-style prompt, an explicit delegation contract, and a contract with a required evidence bundle. Each run was scored with hidden acceptance tests, mutation checks, and scope analysis, then reviewed by three independent condition-blinded model-based reviewers using a fixed rubric, for 192 reviews. Explicit contracts did not improve objective task outcomes: all 64 runs passed hidden acceptance checks, with zero scope violations. They did improve reviewability. Evidence sufficiency improved in 22 of 30 paired comparisons and worsened in none (+0.83 on a 5-point scale, p < 0.0001, Cliff's delta = 0.66); reviewer ambiguity decreased (p = 0.035); changed-file lists, known-limitations sections, residual-risk sections, and reviewer checklists appeared mostly or only when demanded by the contract. Contracts cost +13% agent tokens and +38% wall-clock time, with larger effects for the weaker model tier. On these small tasks, delegation contracts bought reviewability rather than correctness.

23.
arXiv (CS.CL) 2026-06-15

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

24.
medRxiv (Medicine) 2026-06-15

HPV Self-Sampling in Cervical Screening: A Rapid Review

Introduction Cervical cancer is the fourth largest cause of cancer deaths in women. HPV self-sampling could increase uptake of cervical screening. This rapid review aimed to determine the accuracy, concordance, uptake and acceptability of self-sampling over clinician-collected samples in high income countries. Method We followed Cochrane Rapid Reviews Methods. Top-up of 4 systematic reviews and meta-analyses was performed. Narrative data synthesis was conducted and meta-analysis where applicable. Databases searched were MEDLINE, EMBASE, CENTRAL and clinical trial registries. Risk of bias was assessed using AMSTAR 2, QUADAS, the Cochrane Risk of Bias (RoB), or the Nudelman and Otto, 2020 tool, depending on the study type. Findings The review included 39 studies for accuracy, 38 studies for concordance, 37 uptake and 48 studies for acceptability. Self-sampling has similar accuracy as clinician-collected samples when PCR-based assays are used. The overall agreement of self-sampling and clinician-collected samples was 87.1%(95%CI;85.6-88.6) with a kappa value of 0.70(95%CI;0.67-0.73). Mail-to-all strategies had higher uptake with participation differences of 11.3%(95%CI:8.4-14.2) in the intention-to-treat analysis and 7.7%(95%CI:4.7-10.8) in the per protocol analysis. Self-sampling is acceptable to non-attendees (91%(95%CI;85.3-94.6). Conclusion and Recommendation Self-sampling shows good performance on the four clinical effectiveness indicators of accuracy, concordance, uptake and acceptability.

25.
arXiv (CS.CL) 2026-06-11

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.