Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-18

Human Intuition vs. Computational Precision: Neurologists, Feature-based Models, and Deep Learning for Stroke Prognosis

Background: Prognostication in large vessel occlusion (LVO) stroke remains challenging. Although several prognostic models exist, their comparison to clinician performance, human-model interaction, and specific sources of human bias remain poorly understood. Methods: Using pre-treatment clinical and CT data from the MR CLEAN trial (n=500), six neurologists predicted three-month modified Rankin Scale (mRS) scores for 40 patients, both unaided and assisted by a validated feature-based model (MR PREDICTS). Human performance was benchmarked against MR PREDICTS and a multimodal, interpretable deep learning (DL) approach using raw imaging data. We explicitly assessed neurologists? ability to estimate model-required imaging features and identified systematic human biases. Models were additionally validated in a larger MR CLEAN trial cohort (n=404). Results: For predicting the full mRS distribution, standalone models achieved good ordinal agreement (MR PREDICTS quadratic weighted kappa (QWK) 0.51 [0.24 to 0.70]; DL model 0.49 [0.25 to 0.67]), significantly outperforming unaided neurologists (QWK 0.27 [0.10, 0.42]). Neurologists showed systematic overoptimism, predicting lower mRS scores than observed. Furthermore, there was poor accuracy in extracting imaging features. Raters? ASPECTS predictions deviated by 3.4 points from the confirmed scores, and collateral score accuracy was 44.6%. However, for predicting binary mRS (0-2 vs. 3-6), accuracy was comparable between unaided neurologists (64.17% [55.42% to 72.92%]) and models (MR PREDICTS 67.50% [52.50% to 82.50%]; DL model 63.16% [47.37% to 78.95%]). Model-assistance modestly improved and harmonized neurologists? predictions (QWK 0.41 [0.22 to 0.55]; binary accuracy 68.75% [58.33% to 78.34%]. Model performance remained robust in the larger cohort. Conclusions: Multimodal prognostic models outperform clinicians in predicting the full range of mRS outcomes, while human error in imaging assessment and systematic optimism bias are primary drivers of prognostic inaccuracy. End-to-end DL models eliminate human-input variability and hold strong potential as an automated second opinion to support prognostication and decision-making in acute LVO stroke.

02.
arXiv (CS.CV) 2026-06-12

Plug-and-Play image restoration with Stochastic deNOising REgularization

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Even if they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that are less and less noisy along the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with noise of the adequate level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we prove that SNORE is competitive with respect to state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.

03.
arXiv (CS.CV) 2026-06-12

On Pitfalls of $RemOve-And-Retrain$: Data Processing Inequality Perspective

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.

04.
arXiv (CS.CV) 2026-06-12

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.

05.
arXiv (CS.AI) 2026-06-15

History of the Muddy Children Puzzle

arXiv:2606.13703v1 Announce Type: new Abstract: The Muddy Children Puzzle is a puzzle about knowledge and ignorance that has been inspiring for the development of epistemic logic. Who came up with it first? This is unclear. We trace the origin of the Muddy Children Puzzle through logical and literary publications over the past two centuries. The puzzle inspired a numerous variations such as involving numbers or coloured hats. We also present a novel hats puzzle involving self-reference.

06.
arXiv (CS.CV) 2026-06-12

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses FCN as the generative network infrastructure and adds a depth-separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module is added to increase the multi-scale perceptual field of the network and reduce the number of parameters in the network; and a multi-scale feature extraction module and a spatial attention residual block are added to combine the channel attention. The multi-scale module adds a multi-scale feature extraction module and a spatial attention residual block, which combine the channel attention mechanism and the residual block to achieve multi-scale feature extraction. The global discriminative network and the local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image by playing off each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaged logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. It is shown that the method in this study can be used for the restoration of micro-resistivity imaging log images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to ensure the smooth advancement of the subsequent interpretation of micro-resistivity imaging log images.

07.
arXiv (CS.CL) 2026-06-18

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.

08.
arXiv (CS.CV) 2026-06-16

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

作者:

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.

10.
arXiv (CS.CV) 2026-06-16

MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder–decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: https://meshloom.github.io/ .

11.
arXiv (CS.CL) 2026-06-12

Agentic MPC for Semantic Control System Resynthesis

While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-level contextual information such as social norms, user intent, or natural language instructions. To address this limitation, this manuscript introduces an agentic MPC framework that enables context-aware, semantically adaptive control synthesis by integrating with large language model-based agents. The agent interprets heterogeneous inputs, including natural language messages, environmental observations, and external knowledge, to resynthesize the control specifications. The effectiveness of the framework is demonstrated in an autonomous driving scenario, where the system aligns with personal preferences or responds to social situations such as emergency vehicle yielding.

12.
arXiv (CS.CL) 2026-06-15

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

13.
medRxiv (Medicine) 2026-06-10

Impact of Early Treatment on Symptom Improvement and Procedural Events among Men with BPH and Bothersome Lower Urinary Tract Symptoms: A Contemporary Analysis of the American Urological Association Quality (AQUA) Registry

PURPOSE: As the armamentarium of BPH therapies continues to expand, it remains imperative to maximize patient satisfaction and minimize decisional regret. We sought to determine the impact of time from BPH diagnosis to index treatment on symptom improvement and subsequent procedural events. MATERIALS AND METHODS: We queried the American Urological Association Quality Registry for men [&ge;] 40 years old with BPH, available IPSS data, and no receipt of prior BPH treatment. Index treatment included medication, surgery, or minimally invasive surgical therapy (MIST). Outcomes included IPSS over 3 years of follow-up, change in percentage of mild lower urinary tract symptoms (LUTS) by 3 months, and time to procedural event. Patients were stratified by time from index diagnosis to treatment by 3 years. Outcomes were compared across time-to-treatment cohorts with appropriate statistical tests with p < 0.05 as significant. RESULTS: 43,919 patients met criteria with 19,642 pursuing treatments. Patients pursued treatment at comparably lower baseline IPSS compared to prior prospective series. Patients undergoing surgery and MIST had significantly higher baseline IPSS, while medical comorbidities were significantly more common among men initiating pharmacotherapy. Early surgery and MIST were associated with significant improvement in IPSS within 6-12 months and an increase in mild LUTS by 3 months. All forms of early treatment were associated with delayed time to procedural events, including catheterization and fulguration. CONCLUSIONS: Early procedural intervention for BPH is associated with early symptom improvement and delayed time to procedural events among real-world, contemporary practice.

14.
arXiv (CS.CV) 2026-06-16

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360{\deg} Video Diffusion

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360{\deg} video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360{\deg} Video Diffusion, a controllable 360{\deg} video generation framework that synthesizes high-fidelity videos from sparse 360{\deg} inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360{\deg} scene generation for downstream simulation and digital-twin applications.

15.
arXiv (CS.AI) 2026-06-17

Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era

arXiv:2605.29179v2 Announce Type: replace-cross Abstract: Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.

16.
medRxiv (Medicine) 2026-06-10

Optimisation of steatotic liver disease screening algorithm for resource-poor settings using machine learning

Background The European Association for the Study of the Liver (ESAL) - Steatotic Liver Disease (SLD) screening algorithm involves two steps; initial screening with FIB-4 followed by referral for vibration-controlled transient elastography (VCTE) in patients likely to have significant fibrosis (SF). However, VCTE is not widely available in resource-limited settings. Aim To optimise the EASL SLD screening algorithm for resource-poor settings using machine learning (ML). Methods We analysed data from 964 adults aged [&ge;]35 years who underwent VCTE at a tertiary referral centre in Sri Lanka between November 2024 and 2025. Multiple ML models using different methods and variable combinations were trained on 80% of the dataset and tested on the remaining 20%. Best models were selected based on performance and externally validated using data from 430 patients who underwent VCTE before November 2024. Model performance was compared with the FIB-4 using confusion matrices. Results A Random Forest model incorporating age, AST, ALT, and platelet count separately, rather than using FIB-4, outperformed. The all-variable ML model showed the best predictive performance for SF, with accuracy of 77.2%, recall of 0.762, precision of 0.778, and AUC-ROC of 0.818. The variables used in the model, in descending order of feature importance, were AST, platelet count, BMI, ALT, age, diabetes mellitus, hypertension, dyslipidaemia, sex, family history, hypothyroidism, diabetes complication and smoking. External validation demonstrated 75.1% accuracy and an AUC of 0.779. When used as the first step of the SLD screening algorithm, the all-variable ML model identified 37 (17.1%) additional true positives and reduced false-negative diagnoses by 50% compared with FIB-4. Conclusions ML-based models were more effective than the FIB-4 score as the first-line screening tool for VCTE referral, substantially improving the identification of patients with significant fibrosis in this South Asian cohort.

17.
arXiv (CS.LG) 2026-06-15

When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs

arXiv:2606.14347v1 Announce Type: new Abstract: Large language models exhibit strong multilingual capabilities, however, their internal representations are difficult to interpret. Understanding these interactions is important for ensuring reliable behavior in multilingual systems. Recent work has shown that causal-geometric structure can explain how certain concepts are encoded as approximately linear and separable directions, but whether this framework extends to multilingual models, where language identity is correlated and hierarchical, is underexplored. We apply causal-geometric analysis to multilingual LLMs, studying 28 bilingual contrasts across three models, allowing us to analyze when languages behave as approximately independent factors and when structured dependencies persist. We find evidence that language concepts admit stable linear representations that are largely separable under a covariance-adjusted (causal) inner product, with structured deviations reflecting linguistic similarity. Moreover, languages within the same family (such as Germanic or Romance) exhibit a simplex-like geometric structure, suggesting hierarchical organization. These results extend causal-geometric interpretability to multilingual settings and provide insight into how separability and similarity may exist in multilingual LLM representations, motivating interpretability analyses that diagnose when and how structured dependencies between concepts can be anticipated. This has implications for trustworthy deployment, as residual structure between languages may lead to unintended cross-lingual effects when models are monitored or intervened upon.

18.
arXiv (CS.LG) 2026-06-16

Amortized mean-shift interacting particles

arXiv:2606.15871v1 Announce Type: cross Abstract: Bayesian inference for inverse problems is run to evaluate integrals – posterior expectations, tail probabilities, and risks – across a stream of observations. The standard estimate averages the integrand over posterior samples, a Monte-Carlo average whose error decays only as the square root of the sample size, so accuracy demands many samples – prohibitive when each one calls a partial-differential-equation forward model. Mean-shift interacting particles need far fewer: they return a small set of signed-weight nodes – a deterministic quadrature whose weighted averages estimate those integrals. Finding the nodes, however, is a per-observation optimization that, in its most accurate form, reads the posterior score at every step – returning the cost it meant to save. We introduce amortized mean-shift interacting particles, a learned map that emits the weighted nodes from an observation and a few posterior samples in a single forward pass. Training asks only for joint parameter-observation samples and a posterior to draw from – a conditional normalizing flow, an empirical conditional, or any reference the user can sample – and the map learns to integrate that posterior from samples alone, evaluating neither its density nor its score. Once trained, it generalizes to unseen observations and integrands at any node budget and improves on independent samples in two ways: by reweighting them, provably no worse than the equal weights of Monte-Carlo; and by moving them, which empirically lowers it further. Across closed-form, sampled, learned, and physics-based posteriors – up to a thousand-coefficient groundwater field – it integrates more accurately than the same number of samples at every budget, and a posterior-whitened, dimension-aware kernel removes the high-dimensional wall. The result is a Pareto improvement on Monte-Carlo integration, not a competitor to drawing more samples.

19.
Nature Medicine 2026-06-10

Dual-target gene therapy in Parkinson’s disease: a multicenter phase 1 trial

作者:

Restoring striatal dopamine synthesis is a promising gene therapy strategy for Parkinson’s disease. Previous adeno-associated virus-mediated aromatic L-amino acid decarboxylase (AADC) monotherapies remain dependent on exogenous levodopa, whereas multigene delivery is constrained by strict adeno-associated virus packaging limits. A ‘dual approach’ targeting the two rate-limiting enzymes, tyrosine hydroxylase (TH) and AADC, offers the potential for autonomous dopamine synthesis. We report the 12-month primary safety and tolerability outcomes of a multicenter, open-label, dose-escalation, phase 1 trial evaluating BBM-P002, a new adeno-associated virus vector—AAVT42—codelivering constitutively active TH and AADC. Ten participants with moderate-to-advanced Parkinson’s disease were enrolled and received bilateral intraputaminal infusions across doses of 4.0 × 1011 vg (Cohort 1; n = 1), 6.0 × 1011 vg (Cohort 2; n = 2), 1.0 × 1012 vg (Cohort 3; n = 2) and 1.2 × 1012 vg (Cohort 4; n = 5). The trial achieved its primary outcome, as BBM-P002 demonstrated a favorable safety and tolerability profile within 12 months post-treatment. No dose-limiting toxicities or drug-related serious adverse events occurred. A total of 23 adverse events were reported, all judged unrelated to BBM-P002 and primarily mild and transient. Systemic toxicity and clinically meaningful immunogenicity were absent. In conclusion, intraputaminal delivery of BBM-P002 was safe and well tolerated in this phase 1 trial, supporting continued clinical development. ClinicalTrials.gov registration: NCT05822739 . Phase 1 results reveal that BBM-P002, a dual-target gene therapy co-delivering TH and DDC, is safe and well tolerated in Parkinson’s disease, with 12-month motor improvements signaling therapeutic potential.

20.
arXiv (CS.LG) 2026-06-15

A theoretical model for task routing in mixture-of-expert transformers

arXiv:2606.14398v1 Announce Type: new Abstract: Mixture-of-experts (MoE) layers enable the scaling of transformer models while keeping the inference compute fixed. While task-expert specialization has been observed in empirical studies of frontier MoE transformer models, existing theoretical work analyzes this using continuous mixture models that cannot be used to model natural language effectively. An important open question is to theoretically explain task-expert specialization in transformer MoE models using discrete models of language. To address this, we represent structured knowledge via syntactic templates and finite key-value dictionaries, and prove formally that a single-layer MoE transformer can encode knowledge by using experts that specialize in the corresponding tasks. Our construction shows how queries are routed to unique, task-specific experts whose size depends solely on the intrinsic complexity of the given task (i.e. the combined size of its syntactic templates and factual dictionary). Our construction provides a theoretical support for empirical results on localized knowledge circuits in MoE models. We support our theoretical findings with experiments evaluating model performance under varying MoE loss functions.

21.
arXiv (CS.AI) 2026-06-12

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

arXiv:2606.12834v1 Announce Type: new Abstract: As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating agent construction as a workflow stage and introduce AgentBuild, which builds a scientific agent from a contract the scientist authors. The contract is a version-controlled rubric, a difficulty-graded curriculum, and a curated external knowledge base. A rubric-driven judge gates a meta-optimizer coding agent that edits the agent within a declared boundary, so the build compiles the agent, not the scientist's judgment. We instantiate this for Rietveld refinement of X-ray diffraction data through GSAS-II behind MCP and A2A, where a blank-harness construction run progresses through a lithium lanthanum zirconium oxide (LLZO) signal-to-noise ladder, reaches the 4 hour scan as a frontier case, and exposes the workflow-scope limits that remain. The same rubric that rewards credible fits also scores trajectory scope, making the frontier a contract failure rather than a pattern-fitting failure. As base models evolve, re-running AgentBuild is a re-tune, not a rebuild, and the scientist's authored contract remains the durable asset.

22.
PLOS Computational Biology 2026-06-01

Challenges and progress in RNA velocity: Comparative analysis across multiple biological contexts

by Sarah Ancheta, Leah Dorman, Guillaume Le Treut, Abel Gurung, Greg Huber, Loïc A. Royer, Alejandro Granados, Merlin Lange Single-cell RNA sequencing is revolutionizing our understanding of cell state dynamics, allowing researchers to capture and quantify the transcriptomic profile of a single cell at a specific timepoint. Among the computational techniques used to predict cellular trajectories, RNA velocity has emerged as a predominant tool for modeling transcriptional dynamics. RNA velocity leverages the mRNA maturation process to generate velocity vectors that predict the likely future state of a cell, offering insights into cellular differentiation, aging, and disease progression. Although this technique has shown promise across biological fields, the performance accuracy varies depending on the RNA velocity method and dataset. We established a comparative pipeline and analyzed the performance of five RNA velocity methods on three datasets based on local consistency, method agreement, identification of driver genes, and robustness to sequencing depth. This benchmark provides a resource for scientists to understand the strengths and limitations of different RNA velocity methods.

23.
arXiv (CS.AI) 2026-06-19

Policy-Embedded Graph Expansion: Networked HIV Testing with Diffusion-Driven Network Samples

arXiv:2601.16233v2 Announce Type: replace-cross Abstract: HIV is a retrovirus that attacks the human immune system and can lead to death without proper treatment. In collaboration with the WHO and the University of Witwatersrand, we study how to improve the efficiency of HIV testing with the goal of eventual deployment, directly supporting progress toward UN Sustainable Development Goal 3.3. While prior work has demonstrated the promise of intelligent algorithms for sequential, network-based HIV testing, existing approaches rely on assumptions that are impractical in our real-world implementations. Here, we study sequential testing on incrementally revealed disease networks and introduce Policy-Embedded Graph Expansion (PEGE), a novel framework that directly embeds a generative distribution over graph expansions into the decision-making policy rather than attempting explicit topological reconstruction. We further propose Dynamics-Driven Branching (DDB), a diffusion-based graph expansion model that supports decision making in PEGE and is designed for data-limited settings where forest structures arise naturally, as in our real-world referral process. Experiments on real HIV transmission networks show that the combined approach (PEGE + DDB) consistently outperforms baselines (e.g., 17.3% improvement in discounted reward and 15.4% more HIV detections with 25% of the population tested) and explore key tradeoffs that drive solution quality.

24.
arXiv (CS.AI) 2026-06-16

Post-Hoc Merging is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing

arXiv:2606.16501v1 Announce Type: new Abstract: Model merging has become a practical post-training strategy for building a single multi-task large language model (LLM) by combining multiple task-specialized models. However, most existing approaches rely on post-hoc merging, in which task-specific models are merged only once after training. This one-shot aggregation often suffers from task interference, leading to information erasure across individual tasks. In this work, we show that replacing post-hoc merging with an iterative many-shot merging protocol is effective in improving multi-task performance. Building on this insight, we propose METIS, Mitigating Erasure from Task Interference for Stable many-shot merging. METIS is a loss-aware many-shot merging method that addresses information erasure in post-hoc merging through task-wise loss-gap weighting and consensus-based masking. Notably, METIS exhibits significant performance improvement on the worst-performing task, effectively mitigating information erasure. (Project page: https://imkyungjin.github.io/METIS/)

25.
arXiv (CS.AI) 2026-06-16

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

arXiv:2606.15385v1 Announce Type: new Abstract: Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B–14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at \href{https://github.com/asparius/verl-agent-safety}{our public repository}.