Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
Nature (Science) 2026-06-17

Analysis of 173,303 exomes and genomes in the Pakistan Genome Resource

Naturally occurring loss-of-function variants in human genes enable drug target discovery because they mimic pharmacological inhibition of proteins. However, the study of these genetic variants is constrained by their rarity. Sequencing of diverse populations, particularly those enriched in familial relatedness, has been postulated to promote discovery of rare genetic variants1–3. Here we present the Pakistan Genome Resource, a South Asian biobank with high familial relatedness comprising 173,303 participants, who collectively carry naturally occurring homozygous loss-of-function variants in 6,476 genes. We describe the genetic architecture of this population, associations between genes and biomarkers, the distribution of loss-of-function variants across molecular pathways, and recall-by-genotype studies of therapeutically relevant genes. The Pakistan Genome Resource expands the catalogue of human genetic variants, provides a comprehensive genetic reference resource for the Pakistani population, and demonstrates the value of studying diverse cohorts to advance human health. The Pakistan Genome Resource compiles biobank data from 173,303 individuals with high familial relatedness, broadening the catalogue of human genetic variation and establishing a population-specific genomic reference for Pakistan.

02.
medRxiv (Medicine) 2026-06-12

A Machine Learning Pipeline for Scalable Annotation of Patient-Ventilator Dyssynchrony from Bedside Ventilator Data

Objective: Patient-ventilator dyssynchrony (PVD) is a common and clinically consequential problem in critically ill patients receiving invasive mechanical ventilation. Yet automated identification of PVD subtypes at scale remains an unmet clinical need, owing to the lack of large annotated bedside waveform datasets. Methods: We developed and validated a semi-supervised algorithm for automated annotation of PVD. In two medical ICUs at a tertiary academic center, bedside devices continuously collected airway flow and pressure waveforms from the ventilators. We developed a software interface with an information retrieval system that grouped similar breaths for expert human review, yielding 1,542,296 labeled breaths across eight categories: 2 labels for breath delivery mode, 5 labels for PVD subtypes, and 1 label denoting a normal breath. Two pulmonary physicians with expertise in ventilator training and education provided the expert reference labels. We trained an initial classification model on a model-derivation set of 771,148 breaths (divided into training and validation) and evaluated it on a hold-out test set of 771,149 breaths A semi-supervised approach was utilized to extend labeling to an additional 12,965,000 unlabeled breaths. Results: The supervised model performed well across all labels, with Macro-F1 scores between 0.96 and 1.00. Semi-supervised learning across 12 rounds expanded the training set from 771,148 to 8,563,995 breaths without significant performance degradation. Conclusion: We developed a practical and scalable system for automated PVD annotation that performed well across all subtypes. This work provides a reproducible foundation for automated PVD labeling to support the development of machine-learning-based clinical decision support systems for identifying patient-level asynchrony.

03.
arXiv (math.PR) 2026-06-11

On the structure of the sandpile identity element on Sierpinski gasket graphs

arXiv:2603.12006v2 Announce Type: replace-cross Abstract: We consider the identity of the abelian sandpile group of finite approximation graphs of the Sierpinski gasket, and we show that the second-order term in the scaling limit converges to the path distance to the nearest corner on the Sierpinski gasket. The proof relies on a decomposition of the identity of the sandpile group into the sum of a constant function and the Laplacian of the graph distance on the approximating graphs.

04.
arXiv (CS.AI) 2026-06-12

Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis

arXiv:2606.12430v1 Announce Type: cross Abstract: Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify which tasks should be automated. Prior research focuses on occupations, overlooking that workers experience varying levels of meaning across tasks within the same role. We address this gap with a task-level analysis grounded in Graeber's theory of bullshit jobs. Using ratings from 202 workers on 171 workplace tasks, we (1) validate a five-item scale of perceived bullshitness, (2) show that perceived bullshitness strongly predicts desire for AI delegation, and (3) find that such tasks are also seen as requiring less human oversight. Together, these findings suggest that tasks perceived as bullshit are natural candidates for AI delegation, aligning worker preferences with perceived feasibility.

05.
arXiv (CS.LG) 2026-06-16

Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

arXiv:2606.14999v1 Announce Type: new Abstract: Scientific user facilities generate X-ray scattering data faster than traditional workflows can process them. We address this challenge across two settings, offline dataset exploration and live on-the-fly analysis. We train a domain-specific attention-based Convolutional Variational Autoencoder (C-VAE) on 1.5 million X-ray scattering images to learn low-dimensional representations capturing structural variation across diverse experimental conditions. The learned latent space reveals well-organized clusters and smooth trajectories reflecting experimental progression. It further supports controlled synthetic scattering image generation across diverse structural states. When deployed without retraining, the model organizes time-resolved film formation experiments at two synchrotron facilities into interpretable latent structures. Benchmarking against DINOv3 (ViT-7B), a general-purpose vision foundation model, demonstrates that domain-specific training yields more interpretable latent organization for scattering data. Both workflows are integrated within Latent Space Explorer, a component of the MLExchange platform, supporting interactive structural exploration across archived datasets and live experiments.

06.
arXiv (CS.AI) 2026-06-17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv:2606.17199v1 Announce Type: cross Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling fail as they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.

07.
arXiv (CS.CV) 2026-06-16

3D Consistency Optimization for Self-Supervised Monocular Video Depth Estimation

Reliable monocular video depth estimation is crucial for downstream 3D reasoning and embodied AI in endoscopic navigation. However, existing self-supervised approaches typically treat video frames independently or rely on weak temporal regularization. These methods, lacking a holistic perception of the underlying 3D scene, inevitably suffer from geometrically inconsistent predictions and severe cross-frame drift. To address these limitations, we introduce a new paradigm that recasts sequential video depth estimation as an unconstrained multi-view 3D reconstruction problem, enabling full exploitation of the powerful geometric priors embedded in recent 3D foundation models. The core of our approach is a 3D consistency optimization framework driven by three constraints: image-level photometric rendering, explicit world-coordinate geometric alignment, and multi-scale temporal gradient consistency. Such unified optimization elegantly anchors isolated frames to a globally coherent 3D structure. Our method has been validated in both the self-supervised training scenarios and challenging zero-shot clinical environments. Results show that the proposed approach achieves state-of-the-art spatial accuracy, outperforming the frame-based, video-based depth estimators and the multi-view 3D reconstruction baselines.

09.
medRxiv (Medicine) 2026-06-15

Multidimensional nutritional assessment in Crohns disease: cross-sectional comparison of active disease and remission

Malnutrition is common in Crohns disease (CD), and its assessment requires multiple tools. Comprehensive evaluation of nutritional status in a population with CD, predominantly characterized by metabolic phenotype, was inadequately reported. This study evaluated the nutritional status of CD patients using anthropometric, clinical, and biochemical measures and compared patients with active disease with those in remission. This cross-sectional study included 127 adults with CD: 63 with active disease and 64 in remission. Disease activity was classified using the Crohns Disease Activity Index, the Simple Endoscopic Score for Crohns Disease, and magnetic resonance enterography. Nutritional assessment included body mass index (BMI), mid-upper arm circumference, calf circumference, triceps skinfold thickness, mid-arm muscle circumference, Mini Nutritional Assessment-Short Form (MNA-SF), and biochemical markers including hemoglobin, serum iron, folate, vitamin B12, albumin, and zinc. Malnutrition was defined using the Global Leadership Initiative on Malnutrition criteria. Overall, 47.2% of participants were malnourished. Malnutrition was significantly more frequent in active disease than in remission (81.0% vs. 14.1%, P

10.
bioRxiv (Bioinfo) 2026-06-11

Revealing trajectories of multi-modal voxel-level changes in neurodegenerative diseases using latent event mapping

Neurodegenerative diseases are driven by pathological mechanisms that can be indirectly measured in vivo using multi-modal neuroimaging. However, current computational methods that aim to reconstruct trajectories of voxel-level changes in the brain are either not computationally scalable or fully interpretable, limiting their ability to reveal associations between disease progression and underlying mechanisms. Here we introduce Latent Event Mapping (LEMING), a generative unsupervised modelling technique that learns a latent map of disease events along a common pseudo-timeline of events. We apply LEMING to amyloid PET and structural MRI data from the Alzheimer's Disease Neuroimaging Initiative to reveal the first voxel-level trajectories of events in Alzheimer's disease. Notably, we show how LEMING can provide new insights into progression-dependent disease mechanisms. We find that acetylcholine receptor density is significantly positively associated with both late-stage amyloid and atrophy events, suggesting that either these receptors are targeted later in disease progression, or that amyloid does not play an active role. This has strong implications for therapeutics that target acetylcholine receptors, particularly for early-stage intervention strategies.

11.
arXiv (CS.CV) 2026-06-18

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experiments results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at https://github.com/kdmoura/proto_hsv.

12.
arXiv (math.PR) 2026-06-15

Scaling limits of multitype Bienaymé trees

arXiv:2507.23241v2 Announce Type: replace Abstract: We consider critical multitype Bienaymé trees that are either irreducible or possess a critical irreducible component with attached subcritical components. These trees are studied under two distinct conditioning frameworks: first, conditioning on the value of a linear combination of the numbers of vertices of given types; and second, conditioning on the precise number of vertices belonging to a selected subset of types. We prove that, under a finite exponential moment condition, the scaling limit as the tree size tends to infinity is given by the Brownian Continuum Random Tree. Additionally, we establish strong nonasymptotic tail bounds for the height of such trees. Our main tools include a flattening operation applied to multitype trees and sharp estimates regarding the structure of monotype trees with a given sequence of degrees.

13.
arXiv (CS.LG) 2026-06-11

Weighted Random Dot Product Graphs

arXiv:2505.03649v4 Announce Type: replace-cross Abstract: Modeling of intricate relational patterns has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model's scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights' distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in other higher-order moments. We derive statistical guarantees for an estimator of the nodal's latent positions adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model's definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG's effectiveness in various network analytic applications.

14.
arXiv (CS.CV) 2026-06-16

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

15.
arXiv (CS.CV) 2026-06-16

Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

Content-Preserving Style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. With a reverse triplet synthesis pipeline to build a million-scale training set and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate ROPE embeddings and causal masking, we observe that such a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture style learning, and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that our Style-CCL achieves state-of-the-art performance in three core metrics: style similarity, content consistency, and aesthetic quality.

16.
arXiv (CS.CL) 2026-06-12

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

17.
arXiv (CS.CL) 2026-06-11

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.

18.
arXiv (CS.AI) 2026-06-16

CmdNeedle: Measuring the Incompleteness of Command Denylists for AI Agents

arXiv:2606.15549v1 Announce Type: cross Abstract: The adoption of AI agents is increasing rapidly. Terminal AI agents, i.e., AI agents that run in terminal environments, are a widely used type of AI agents. Terminal AI agents rely heavily on shell command execution to interact with the host systems. They adopt a three-list command-gating mechanism to mitigate security risks introduced by command execution, with denylists serving as the load-bearing component. However, modern operating systems often ship a large, ever-expanding set of shell commands with complex functionalities. Our observation is that even a built-in denylist of Claude Code, well-maintained by its developers, can overlook bypass commands that invalidate its effectiveness. Such negligence leads to fragile command denylists that cannot even block operations that practitioners expect them to block. This paper presents the first systematic characterization of command denylist fragility in terminal AI agents. The paper formalizes the command denylist fragility problem and proposes an LLM-driven pipeline, CmdNeedle, to detect such fragility. It prompts the LLM to propose possible bypasses and iteratively repairs them using feedback from a validator that executes them in a sandbox. In the evaluation, we applied CmdNeedle to 1,709 real-world command denylists (containing 13,332 denylist rules) collected from GitHub. The evaluation shows several key findings, including that 69.0–98.6% of the denylists are fragile, that this fragility occurs consistently across projects and agents, and the validity of several possible root causes for this fragility. Our pipeline and findings will hopefully facilitate future research and practice regarding the command denylists used by AI agents.

19.
arXiv (CS.AI) 2026-06-12

From AGI to ASI

arXiv:2606.12683v1 Announce Type: new Abstract: Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

20.
arXiv (CS.LG) 2026-06-12

LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning

arXiv:2505.22695v2 Announce Type: replace Abstract: Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems. While the current framework incurs higher computational costs than traditional methods, we show that parallel decomposition and model distillation can reduce latency to production-viable levels for deployment.

21.
arXiv (CS.LG) 2026-06-17

Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv:2602.17894v2 Announce Type: replace-cross Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities - for example, health markers, demographics, or political affiliations - and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size - the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.

22.
arXiv (CS.CL) 2026-06-17

Nothing from Something: Can a Language Model Discover 0?

AI systems based on artificial neural networks are being developed with aspirations of pushing the boundary of human mathematical knowledge. A key question for these systems is how much they can reach beyond their training data. Mathematical discovery requires a strong form of out of distribution generalization; the ability to hypothesize genuinely new - and potentially logically more powerful - mathematical structures. It has been hypothesized that language abilities support such generalizations in human cognition. In this work, we use simple arithmetic as a case study for examining how modern AI models could expand their mathematical horizons, evaluating whether these models can independently discover the concept of "zero". We show that We show that (1) language models of a GPT-2 size are unable to perform this generalization at test time regardless of language pretraining, but (2) models can improve substantially after training on tens or hundreds of examples of zero. Additionally, we find that language pretraining reduces the number of required examples by approximately $50\%$, showing that language abilities can scaffold mathematical discovery in neural models.

23.
arXiv (CS.LG) 2026-06-18

Stealthy World Model Manipulation via Data Poisoning

arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.

24.
arXiv (CS.AI) 2026-06-17

Dissecting model behavior through agent trajectories

arXiv:2606.17454v1 Announce Type: new Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the `intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called `Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we $reproduce or improve on the pass@1$ performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an $analysis of 138k trajectories generated by SSA$, we look beyond the $\texttt{pass@1}$ numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

25.
arXiv (CS.AI) 2026-06-16

Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories

arXiv:2602.02028v2 Announce Type: replace Abstract: Enabling artificial intelligence systems, particularly large language models, to update knowledge and flexibly apply it during reasoning remains a central challenge. Existing knowledge editing approaches emphasize atomic facts, improving factual recall but often failing to integrate updated information into a coherent framework usable across contexts. In this work, we argue that knowledge update is fundamentally a reasoning problem rather than a memorization problem. Consequently, a model should be trained in situations where the new information is instrumental to solving a task, combined with pre-existing knowledge, and exercised through multi-step reasoning. Based on this insight, we propose a training strategy based on three principles. First, new knowledge is introduced as a coherent background story that contextualizes novel facts and explains their relation to existing knowledge. Second, models are trained using self-generated multi-hop questions that require multi-step reasoning involving the new information. Third, training is done using knowledge distillation, forcing a student model to internalize the teacher's reasoning behavior without access to the novel information. Experiments show that models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions that require combining multiple new facts.