Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (math.PR) 2026-06-18

First to reach $n$ game

arXiv:2506.08782v4 Announce Type: replace Abstract: We consider a game with two players, consisting of a number of rounds, where the first player to win $n$ rounds becomes the overall winner. Who wins each individual round is governed by a certain urn having two types of balls (type 1 and type 2). At each round, we randomly pick a ball from the urn, and its type determines which of the two players wins. We study the game under three regimes. In the first and the third regimes, a ball is taken without replacement, whilst in the second regime, it is returned to the urn with one more ball of the same colour. We study the properties of the random variables equal to the properly defined overall net profits of the players, and the results are drastically different in all three regimes.

02.
arXiv (CS.AI) 2026-06-16

A Multi-Level Architecture for Reusable Materials Ontologies – The OntoCrafter Ceramics Ontology (OCO) as Reference Implementation

arXiv:2606.14814v1 Announce Type: cross Abstract: The Materials Science and Engineering ontology landscape is fragmented along multiple axes simultaneously. Horizontally: a recent survey identified 94 ontologies of which over 40 are structurally incompatible; each new application domain – ceramics, polymers, batteries, smart materials – typically restarts ontology design from scratch. Vertically: EU regulation (CSRD, CSDDD, PPWR, CBAM, R2R, AI Act, ESPR) forces material, manufacturing, supply-chain, and lifecycle data into integrated digital product passports, leaving ontologies that only address horizontal fragmentation incomplete for any contemporary consumer. And mechanistically: a vocabulary that records that BNT-BT has $d_{33} \approx 580$ pC/N stores a fact but cannot surface why – Bi-6s$^2$ lone-pair stereo-activity, anomalous Born effective charges, soft modes, defect chemistry – without a systematic explanation skeleton. We propose a multi-level modular architecture with two independent classification axes – level of abstraction (L0 bridges, L1 material-agnostic laboratory-notebook, L2 material-class-specific, L3 categorical reasoning) and consumer audience (material vs. compliance) – in which the material-specific level is internally organised by a seven-tier mechanistic-explanation skeleton (Symmetry, Energy/DFT, Thermo/CALPHAD, Kinetics, Microstructure, Defect chemistry, Bonding) applicable to any crystalline ionic oxide. The level-and-audience modularity dissolves the horizontal fragmentation, the compliance audience absorbs the vertical regulation pressure, and the seven-tier organisation of Level 2 delivers the mechanistic explanation depth. We instantiate the architecture as the OntoCrafter Ceramics Ontology (OCO v0.94): 5,196 classes across 44 modules; 167,348 OWL axioms (40,454 logical); 1,674 properties; 829 cross-ontology bridge mappings; 1,172 SHACL shapes; 163 published competency questions.

04.
arXiv (CS.LG) 2026-06-19

A Solver-Free Training Method for Predict-then-Optimize

arXiv:2606.19587v1 Announce Type: cross Abstract: We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.

05.
arXiv (quant-ph) 2026-06-11

Generating function and Bloch representation for quantum Fisher tensor

arXiv:2603.04615v2 Announce Type: replace Abstract: The Uhlmann relative amplitude between two density matrices is shown to be a generating function, through which the quantum Fisher tensor that contains both the quantum Fisher information matrix and the mean Uhlmann curvature can be obtained via differentiation over system parameters. In the pure state limit, our generating function recovers that of the quantum geometric tensor proposed by Het\'{e}nyi and L\'{e}vay, and also clarifies the fidelity and phase between two quantum states as the generating functions of the quantum metric and Berry curvature, respectively. A generic expression for the quantum Fisher tensor in terms of the Bloch representation of density matrices is derived, which facilitates the calculation of the tensor, mean Uhlmann curvature, and geometric properties derived from the quantum Fisher information matrix. Canonical ensembles of spins are adopted to demonstrate our formalism, which reveals a constant Ricci scalar, a vacuum Einstein equation, and a cosmological constant on the 3D Euclidean manifold of the magnetic field.

06.
arXiv (CS.LG) 2026-06-17

Conditional Attribution for Root Cause Analysis in Time-Series Anomaly Detection

arXiv:2604.17616v3 Announce Type: replace Abstract: Root cause analysis (RCA) for time-series anomaly detection is critical for the reliable operation of complex real-world systems. Existing explanation methods often rely on unrealistic feature perturbations and ignore temporal and cross-feature dependencies, leading to unreliable attributions. We propose a conditional attribution framework that explains anomalies relative to contextually similar normal system states. Instead of using marginal or randomly sampled baselines, our method retrieves representative normal instances conditioned on the anomalous observation, enabling dependency-preserving and operationally meaningful explanations. To support high-dimensional time-series data, contextual retrieval is performed in learned low-dimensional representations using both variational autoencoder latent spaces and UMAP manifold embeddings. By grounding the retrieval process in the system's learned manifold, this strategy avoids out-of-distribution artifacts and ensures attribution fidelity while maintaining computational efficiency. We further introduce confidence-aware and temporal evaluation metrics for assessing explanation reliability and responsiveness. Experiments on the SWaT and MSDS benchmarks demonstrate that the proposed approach consistently improves root-cause identification accuracy, temporal localization, and robustness across multiple anomaly detection models. These results highlight the practical utility of conditional attribution for explainable anomaly diagnosis in complex time-series systems. Code and models are available at: https://github.com/dfki-av/Conditional-Attribution-for-Root-Cause-Analysis-in-Time-Series-Anomaly-Detection.

07.
arXiv (CS.LG) 2026-06-17

Non-negative Matrix Factorisation with Topological Regularisation

arXiv:2606.17531v1 Announce Type: new Abstract: We investigate the learning of interpretable bases in non-negative matrix factorisation (NMF) by regularising the topology of the learned basis functions. Our approach is motivated by the observation that many data modalities can be viewed as non-negative functions on a structured domain, where the quality of a basis is intrinsically linked to its topology. However, naive methods for incorporating the topology of the support are often hindered by discreteness and threshold dependence, rendering them unsuitable for continuous optimisation. We address these challenges by employing persistent homology as a stable, threshold-free topological quantifier and by designing topological scores that integrate into the NMF objective as regularisers. The resulting framework encompasses spatially coherent image components, periodic time-series structures, and clique-like graph signals within a unified modelling language.

08.
medRxiv (Medicine) 2026-06-17

Treatment of Multi-Drug-Resistant Tuberculosis with Second-Line All-Oral Drugs in Ghana: Incidence of Adverse Events.

Introduction: The treatment of multidrug-resistant tuberculosis (MDR-TB) remains challenging due to the toxicity of second-line medications and suboptimal treatment outcomes. This study aimed to determine the incidence of adverse events and identify factors associated with these events in patients undergoing treatment for MDR-TB with second-line all-oral drugs in Ghana. Methods: This retrospective cohort study reviewed the medical records of 384 MDR-TB patients treated with second-line all-oral drugs at selected health facilities in Ghana, including the Greater Accra Regional Hospital, Eastern Regional Hospital, and Kumasi South Hospital. Data were extracted using the Kobo Collect tool, capturing patient demographics, baseline clinical and laboratory characteristics, treatment regimens, and adverse events. The study period spanned from 2020 to August 2024. Results: The study included a total of 384 MDR-TB patients, with a mean age of 45 years (SD = 15). The majority of patients were male (65.78%), and most were within the 45-64 years age group (33.85%), followed by those aged 25-44 years (31.25%). Regionally, the highest number of cases were reported from the Greater Accra Region (39.06%), followed by the Eastern Region (31.25%) and Kumasi South Hospital (29.69%). Approximately one in four patients (25%) presented with comorbidities, with HIV being the most common (19.5%). The most frequently reported adverse events were diarrhea (14%), dizziness (13.7%), and vomiting (12.3%). Most of these were mild to moderate in severity and tended to decrease as treatment progressed. Severe adverse events, such as leukopenia and acute kidney injury, were rare, occurring in less than 5% of patients. Over the course of treatment, gastrointestinal adverse events such as vomiting and nausea showed a significant decline, indicating possible patient adaptation or improved clinical management. Results from the multivariate Poisson regression analysis revealed that age and comorbidities were significant predictors of adverse events. Patients aged 65 years and above had a 56% lower risk of developing adverse events compared to younger patients (Adjusted Risk Ratio [aRR] = 0.44, 95% CI: 0.25-0.79, p = 0.005). Conversely, patients with comorbid conditions such as diabetes or hypertension were approximately 2.6 times more likely to experience adverse events compared to those without comorbidities (aRR = 2.65, 95% CI: 1.58-4.43, p < 0.001). The effect of sex was not statistically significant after adjustment (aRR = 1.03, 95% CI: 0.70-1.50, p = 0.86). At the end of the treatment period, 74.9% of patients achieved successful outcomes, including both those who were cured and those who completed treatment without being classified as cured. However, 25.1% had unsuccessful outcomes, which included treatment failure, relapse, or death. Conclusion: In conclusion, adverse events are common in the treatment of MDR-TB with second-line All-Oral drugs, with gastrointestinal adverse events being the most prevalent. These findings highlight the importance of monitoring and managing adverse events to optimize treatment outcomes for MDR-TB patients in Ghana.

09.
arXiv (quant-ph) 2026-06-17

A matching decomposition algorithm for simulating quantum walk Hamiltonians

arXiv:2601.11418v3 Announce Type: replace Abstract: In this work, we present a new algorithm for generating quantum circuits that efficiently implement continuous time quantum walks on arbitrary simple sparse graphs. The algorithm, called matching decomposition, works by decomposing a continuous-time quantum walk Hamiltonian into a collection of exactly implementable Hamiltonians corresponding to matchings in the underlying graph followed by a novel graph compression algorithm that merges edges in the graph. We develop a greedy matching heuristic and a compression-aware matching heuristic, both of which can be used in the quantum circuit algorithm. Lastly, we convert the walks to a circuit and Trotterize over these components. The dynamics of the walker on each edge in the matching can be implemented in the circuit model as sequences of CX and CRx gates. We do not use Pauli decomposition when implementing walks along each matching. Furthermore, we compare greedy (compression-aware) matching decomposition to a standard Pauli-based simulation pipeline and find that greedy (compression-aware) matching decomposition consistently yields substantial resource reductions, requiring up to 43$\%$ (70\%) fewer controlled gates and up to 54$\%$ (75\%) shallower circuits than Pauli decomposition across multiple graph families. Finally, we also present examples and theoretical results for when matching decomposition can exactly simulate a continuous-time quantum walk on a graph.

10.
arXiv (CS.CL) 2026-06-12

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types – nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

11.
arXiv (CS.AI) 2026-06-16

Do Large Language Models Have Emotions?

arXiv:2606.14742v1 Announce Type: cross Abstract: Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has 'functional emotions.' We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude's representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.

12.
arXiv (quant-ph) 2026-06-17

Quantum mechanics in configuration space in context

arXiv:2606.17622v1 Announce Type: new Abstract: To enhance the way in which wave-particle duality is implemented in the modelling of quantum mechanical systems, Bukhari et al. [New J. Phys. 27, 084501 (2025)] recently introduced an alternative approach to quantum mechanics, namely quantum mechanics in configuration space. This formalism is based on a physically motivated quantisation of Newtonian mechanics and promotes the classical position-velocity states (x,v) to pairwise distinguishable quantum states. The resulting |x,v> states form the basis of the Hilbert space of individual quantum mechanical particles and evolve along classical trajectories. In this paper, we consider the modelling of a mechanical particle in free space and put quantum mechanics in configuration space into context. It is shown that this formalism increases the continuity between quantum and classical mechanics by avoiding a conceptual inconsistency associated with the definition of momentum in canonical quantisation. In addition, we emphasise that standard quantum mechanics and quantum mechanics in configuration space are based on two distinct formulations of classical mechanics.

13.
arXiv (CS.LG) 2026-06-12

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

arXiv:2606.12735v1 Announce Type: new Abstract: Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.

14.
arXiv (CS.CL) 2026-06-18

REVES: REvision and VErification–Augmented Training for Test-Time Scaling

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that model can learn from correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (``near-miss'' answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n\_queens and mini\_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

15.
arXiv (CS.CV) 2026-06-16

BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.

16.
arXiv (CS.LG) 2026-06-12

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

arXiv:2606.12615v1 Announce Type: new Abstract: ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, the our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

17.
arXiv (CS.LG) 2026-06-17

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

arXiv:2603.20775v2 Announce Type: replace Abstract: In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling how customer behavior would change under alternative treatments using counterfactual analysis. However, real-world marketing data often exhibit various biases, such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can adversely affect both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, there remains a lack of systematic studies evaluating how different models and metrics perform under such biased conditions. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth. This limitation renders the direct validation of evaluation metrics infeasible and prevents the precise quantification of biases. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking. This approach effectively bridges the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) the stability of evaluation metrics is linked to their mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and evaluation metrics under real-world data imperfections.

18.
arXiv (CS.CV) 2026-06-16

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

19.
arXiv (CS.AI) 2026-06-11

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

arXiv:2505.03296v2 Announce Type: replace-cross Abstract: We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at https://midigap.cs.uni-freiburg.de.

20.
arXiv (CS.CL) 2026-06-12

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

21.
arXiv (CS.AI) 2026-06-17

Mental Health AI Safety Claims Must Preserve Temporal Evidence

arXiv:2605.08827v2 Announce Type: replace Abstract: The safety of mental health AI is often judged at the wrong temporal scale. Current evaluations typically score isolated responses, endpoint outcomes, or aggregate dialogue quality, while clinically consequential failures may arise from the order and accumulation of interactions themselves, including delayed escalation, repeated reinforcement, dependency formation, failed repair, and gradual deterioration across turns. This paper argues that this mismatch is not merely a limitation of evaluation coverage but a source of invalid safety conclusions. We introduce Temporal Safety Non-Identifiability, a formal account of why safety properties that depend on sequence, timing, accumulation, or recovery cannot be certified by protocols that discard those features. From this formalization, we develop SCOPE (Safety Claims Over Preserved Evidence) as a general principle for aligning safety claims with the evidence an evaluation actually retains, and instantiate it as SCOPE-MH, a mental-health instantiation of this reporting standard. We operationalize SCOPE-MH through a proof-of-concept on the AnnoMI dataset of expert-annotated motivational interviewing conversations, which reveals mechanisms of failure that per-turn behavior scoring does not represent. We propose SCOPE-MH as a diagnostic complement to existing evaluation infrastructure and argue that evaluation preserving temporal evidence is necessary, not optional, for safety-critical mental health AI deployment.

22.
arXiv (CS.CV) 2026-06-12

Possibilistic Predictive Uncertainty for Deep Learning

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

23.
arXiv (CS.AI) 2026-06-16

The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

arXiv:2606.16974v1 Announce Type: new Abstract: The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64% Building on empirical reproducibility rates from a prior study, we estimate - inferred from documentation practices, not direct testing - that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

24.
arXiv (CS.LG) 2026-06-15

Operator Calculus for Population-Based Optimization: A Mean-Field Convergence Theory

arXiv:2606.14289v1 Announce Type: cross Abstract: Population-based and distributional optimization methods, from evolution strategies and consensus-based optimization to covariance-matrix adaptation and stochastic gradient methods viewed as distributional dynamics, are widely used for nonconvex or black-box problems, yet their convergence analyses remain fragmented across algorithm-specific techniques. We introduce an operator calculus in which a broad class of such methods, after choosing an appropriate state space and, where necessary, augmenting the state by memory or strategy variables, is described as a composition of three elementary operators (mutation, selection, and recombination) acting on probability measures. Under explicit stability and regularity conditions, the composite operator admits a pre-generator whose continuous-time limit is a transport-reaction-jump (TRJ) PDE that preserves the operator splitting. On this foundation we establish a modular Lyapunov principle. If a state-space Lyapunov function both dissipates under the full generator and controls the relevant search-space gauges, then the state-space Lyapunov functional and the induced search errors decay exponentially. The additive generator structure allows dissipation estimates to be assembled operator by operator, providing a toolkit for certifying convergence of composite mean-field algorithms.

25.
arXiv (CS.CV) 2026-06-12

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.