Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-12

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.

02.
arXiv (CS.CL) 2026-06-12

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce the novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.

03.
arXiv (CS.CV) 2026-06-18

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.

04.
medRxiv (Medicine) 2026-06-24

Cognitive and Neuroimaging Biomarker Intra-Individual Variability in Alzheimer's Disease

Background Greater cognitive intra-individual variability (IIV) reflects increased heterogeneous performance across cognitive domains and has been linked to a higher risk of Alzheimer's disease (AD). However, it remains unclear whether cognitive IIV is linked to heterogeneous dispersion of regional AD pathology. Hence, we aimed to examine the association between cognitive IIV and AD neuroimaging biomarker IIV. Methods This study included participants with normal cognition (CN) and mild cognitive impairment (MCI) from the Alzheimer's Disease Neuroimaging Initiative. Cognitive IIV was computed as the within-person standard deviation of five domain-specific neuropsychological test z-scores. Four neuroimaging biomarker IIV metrics were similarly derived using regional amyloid-{beta} (n = 1,021), tau (n = 719), cortical thickness (n = 2,148), and combined amyloid-tau-neurodegeneration (ATN, n = 258). Associations between cognitive IIV and each biomarker IIV were evaluated using linear regression models, adjusted for relevant covariates. Results Higher cognitive IIV was associated with greater biomarker IIV across amyloid-{beta} ({beta} = 0.039, SE = 0.014, p = .006), tau ({beta} = 0.196, SE = 0.033, p < .001), cortical thinning ({beta} = 0.036, SE = 0.008, p < .001), and ATN ({beta} = 0.176, SE = 0.043, p < .001). Interaction analyses revealed that the associations of cognitive IIV with tau IIV, cortical thickness IIV, and ATN IIV were stronger in MCI than CN individuals. Significant interactions between cognitive IIV and biomarker positivity status showed that the effect with amyloid-{beta} IIV was attenuated in A- ({beta} = 0.004, SE = 0.014, p = .78) but that the effect with tau IIV remained robust even in T- individuals ({beta} = 0.088, SE = 0.022, p < .001). Conclusion Elevated cognitive IIV is associated with greater heterogeneity in cortical dispersion of AD-related pathology, particularly in prodromal AD and in the presence of abnormal pathology. As a novel measure that captures variation in topographical scattering of AD pathological burden across the cortex, AD biomarker IIV may offer research and clinical utility beyond evaluating absolute biomarker load or thresholds.

05.
medRxiv (Medicine) 2026-06-15

Two Blood-based Endotypes Reveal Divergent Clinical Outcomes of Fibrotic Hypersensitivity Pneumonitis

Rationale: Fibrotic hypersensitivity pneumonitis (fHP) is an antigen-driven, life-threatening interstitial lung disease characterized by heterogeneous radiologic features, clinical outcomes, and treatment responses. Objectives: To identify blood-based fHP endotypes that inform mechanism, prognosis and therapeutic response. Methods: We performed integrative analyses of multi-compartment transcriptomic data derived from whole blood, peripheral blood mononuclear cells, bronchoalveolar lavage, and surgical lung biopsies, alongside circulating plasma proteomics. Multiple clustering algorithms were cross-compared to ensure robustness and reproducibility of endotypes identification. Immune cell composition was inferred using bulk RNA-seq deconvolution and annotated with BAL single-cell RNA-seq. Pathway activities were characterized using Gene Set Enrichment Analysis. Transplant-free survival (TFS) was evaluated for endotype and corticosteroid exposure by Kaplan-Meier methods, with hazard ratios analyzed using multivariable Cox proportional hazards models. Results: Two molecular endotypes, lymphocytic-associated (L-fHP) and non-lymphocytic-associated (N-fHP), were identified and validated. L-fHP showed enrichment of adaptive immune signaling and lymphocyte predominance, whereas N-fHP demonstrated myeloid-cell activation with neutrophil and macrophage predominance. Corticosteroid exposure was associated with worse TFS in L-fHP but not in N-fHP after adjusting for age, sex, and baseline pulmonary function. Compared to L-fHP, N-fHP had poorer baseline pulmonary function, faster 12-month FVC decline, and shorter TFS. N-fHP also exhibited elevated neutrophil-associated markers, including matrix metalloproteinase-9, across paired transcriptomic and proteomic datasets, supporting a neutrophil-driven, cross-compartment disease process. Conclusion: Multi-omic, multi-compartment analysis identifies two reproducible fHP endotypes with distinct clinical outcomes and corticosteroid responses, supporting a precision medicine approach beyond current clinical and radiologic classification.

06.
arXiv (CS.CV) 2026-06-25

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.

07.
medRxiv (Medicine) 2026-06-24

IgG2 Galactosylation is related to higher antibody dependent enhancement for dengue in cross-reactive antibodies from Sars-CoV-2

Cross-reactive antibodies against dengue virus are known to cause antibody-dependent enhancement (ADE) of infection or disease severity under specific conditions. In our previous study, we showed that primary immunization with the COVID-19 vaccine induces induces cross-reactive IgG causing ADE against dengue. In the present study, we investigated the influence of IgG Fc-glycosylation (analyzed by LC-MS/MS) on ADE mediated by cross-reactive IgG against dengue from IgG against SARS-CoV-2. We found a clear correlation between anti-DENV2 E IgG2 galactosylation and the ADE capacity of cross-reactive IgG against dengue in individuals vaccinated against COVID-19. IgG2 sialylation increased over time; however, it was not correlated with ADE capacity. This phenomenon was restricted to IgG2, whereas anti-DENV2 E IgG1 Fc-glycosylation remained stable after COVID-19 vaccination.

08.
arXiv (CS.AI) 2026-06-18

Recursive Joint Simulation in Games

arXiv:2402.08128v3 Announce Type: replace Abstract: Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds ``from the inside'' – meaning, for an agent that finds itself inside the game and has self-locating uncertainty.

09.
arXiv (CS.CL) 2026-06-11

Neuron-based Personality Trait Induction in Large Language Models

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training and modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves comparable performance as fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at https://github.com/RUCAIBox/NPTI.

10.
arXiv (CS.AI) 2026-06-19

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

arXiv:2606.20280v1 Announce Type: cross Abstract: Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored the grain blindness when adapting the contrastive paradigm into retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce a simple but effective framework, called ELVA, a novel rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

11.
Nature Medicine 2026-06-24

Automated reanalysis of genomic data for rare disease diagnostics at scale

Reanalysis of genomic data in rare disease is highly effective in increasing diagnostic yields but remains limited by manual approaches. Automation and optimization for high specificity will be necessary to ensure scalability, adoption and sustainability of iterative reanalysis. We developed Talos, an open-source tool that automates variant prioritization by integrating dynamically updated gene−disease and variant-level evidence with inheritance-aware filtering and validated its performance using data from 1,089 individuals with rare disease. Trio-based analysis identified 90% of known diagnoses, returning 1.3 variants per case on average. Variant burden reduced to one variant per 200 cases on iterative monthly reanalysis. Application to an unselected cohort of 4,735 undiagnosed individuals identified 241 diagnoses (5.1% yield): 78 (32%) due to new gene−disease relationships, 54 (22%) due to new variant-level evidence and 109 (45%) due to improved analysis strategies. Our automated, iterative reanalysis model demonstrates the feasibility of delivering frequent, systematic reanalysis at scale. Talos, a new tool for the automated analysis of genomic data, demonstrates the feasibility and diagnostic utility of systematic reanalyses of data in rare diseases.

12.
arXiv (CS.AI) 2026-06-16

Deep Q-Learning on Hölder Spaces

作者:

arXiv:2606.16846v1 Announce Type: cross Abstract: We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness–complexity trade-off as the time step $\delta \to 0$. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.

13.
arXiv (CS.CL) 2026-06-16

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show PCP improves arousal timing accuracy by 10.68% over baselines, reduces peak count error to 7.07%.

14.
arXiv (CS.CV) 2026-06-24

Quantum CT via Dynamic Interval Encoding and Prior-Balanced QUBO Reconstruction

Quadratic unconstrained binary optimization (QUBO)-based quantum computed tomography (CT) casts reconstruction as a binary quadratic problem for quantum annealing and hybrid quantum–classical solvers. For grayscale CT, however, image encoding is constrained by the binary-variable budget: fixed global bit-plane encodings increase QUBO size and coupling complexity as gray-level precision improves, whereas low-bit encodings introduce quantization error. We propose a QUBO-based grayscale CT reconstruction framework that combines dynamic interval encoding with prior-balanced optimization. Each refinement round encodes active pixels only within local gray-level intervals around the current estimate, and a boundary-hit-guided update rule adaptively switches between search expansion and local refinement. To improve optimization stability, the method balances projection-domain data consistency and an edge-preserving quadratic prior before forming the final QUBO. Sparse-view and limited-angle fan-beam CT experiments show that the proposed method recovers structures and gray-level distributions more faithfully than the evaluated analytic, iterative, variational, and representation-based baselines. Expressivity analysis and ablation studies further indicate that the improvement mainly arises from effective gray-level representation through dynamic local encoding and more stable data-fidelity–prior coupling. Experiments on the D-Wave hybrid binary quadratic model (BQM) solver further demonstrate that the formulation is executable on a hardware-backed hybrid quantum–classical backend.

15.
arXiv (CS.CV) 2026-06-16

Navigating Distribution Shifts in Medical Image Analysis: A Survey

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on others from varying hospitals, or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints – such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols – with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.

16.
arXiv (CS.AI) 2026-06-18

Do Neural Networks Lose Plasticity in a Gradually Changing World?

arXiv:2602.09234v2 Announce Type: replace-cross Abstract: Continual learning has become a trending topic in machine learning. Recent studies have discovered an interesting phenomenon called loss of plasticity, referring to neural networks gradually losing the ability to learn new tasks. However, existing plasticity research largely relies on benchmarks with abrupt task transitions, without examining whether the abruptness itself contributes to the observed plasticity loss. In this paper, we investigate the role of transition abruptness by simulating gradually changing environments through input/output interpolation and task sampling. We perform theoretical and empirical analysis, showing that the severity of plasticity loss is closely tied to the abruptness of task transitions, and can be substantially reduced when the environment changes gradually.

17.
arXiv (CS.AI) 2026-06-12

Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof

arXiv:2605.29151v2 Announce Type: replace-cross Abstract: We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \] of the Deligne–Mumford moduli space $\overline{\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi–Chen–Marcolli. The proof starts from the Keel–Manin–Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincaré polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t

18.
medRxiv (Medicine) 2026-06-10

Developing a Unified Criminal Justice Pathway into Drug and Alcohol Treatment from Police Custody: A Public Health Service Evaluation and Pathway-Design Project in Blackpool, United Kingdom

Introduction: Blackpool, England's most deprived local authority, has the highest drug-related death rate in the country. People in police custody with problem substance use are a key Core20PLUS5 inclusion-health group, yet referral from the police into structured drug and alcohol treatment is fragmented and relies heavily on self-report. We evaluated the current police-to-treatment route in Blackpool and designed an evidence-informed unified pathway. Materials and Methods: A mixed-methods service evaluation and pathway-design project was conducted during a six-month General Practice / Public Health rotation. Routinely collected referral data from Horizon (the local specialist drug and alcohol service) covering the 47-month period from December 2019 to October 2023 were analysed. Findings were triangulated with national policy, the Project ADDER and Liaison and Diversion evaluations, and the international evidence on police-led pre-arrest diversion. Results: Of 5,900 total referrals into Horizon over 47 months, only 269 (4.56%) originated from the police. Police referrals accounted for fewer than 5% of monthly referrals in 30 of 47 months, for 5 to 9.9% in 16 months, and for >/= 10% in only one month (10.8%, December 2022). Blackpool recorded 76 drug-misuse deaths in 2019-21 (19.4 per 100,000, approximately four times the England rate). A six-step unified pathway is proposed: Initiate Referral (opt-out, from ADDER Police and Liaison and Diversion); Initial Assessment; Tailored Treatment Plan; Continuous Support; Collaboration and Monitoring; and Evaluation and Adjustment. Conclusions: Police contact is markedly under-used as a gateway to treatment despite Blackpool having the highest drug-related mortality in England. An opt-out, multi-agency pathway anchored in Core20PLUS5 has the potential to narrow the treatment gap, reduce re-offending, and address the structural health inequalities that drive premature mortality.

19.
arXiv (CS.LG) 2026-06-16

When to use what Schatten-$p$ norm in deep learning?

arXiv:2606.15268v1 Announce Type: new Abstract: Schatten-$\infty$ based optimizers such as Muon have shown promising empirical performance, but there remains seemingly conflicting observations regarding whether they are beneficial. We resolve this conflict by showing that the conclusion is regime dependent. Even when the objective is smooth in the Schatten-$\infty$ geometry, smaller Schatten-$p$ geometries can be optimal, specifically in the low-dimensional regime, which we show includes Chinchilla scaling. This conclusion follows from a new noise-robust acceleration result for the SODA framework for $p>2$. The same analysis explains why Muon-like methods do not require warmup, why they naturally favor large batches, and yields a batch size scaling rule for arbitrary $p$.

20.
arXiv (CS.AI) 2026-06-24

AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming

arXiv:2606.24245v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly automate complex tasks by integrating language models with external tools and environments. However, their autonomy poses significant safety risks: agents may execute destructive commands, leak sensitive data, or violate domain constraints. Existing safety approaches face a fundamental tradeoff: hand-crafted rules are interpretable but brittle, with overly conservative rules blocking safe operations (high false positives) while permissive rules miss unsafe behaviors (high false negatives). Neural classifiers lack the interpretability required for safety-critical deployments. We present AutoSpec, a framework that automatically evolves deployed expert-designed safety rules from user safe/unsafe annotations through counterexample-guided inductive synthesis (CEGIS) guided by inductive logic programming (ILP). Starting from the expert rules and a stream of annotated traces, AutoSpec iteratively evaluates rules, mines false-positive and false-negative counterexamples, uses ILP to learn which predicates discriminate them, generates candidate rule edits, and verifies candidates to select the best revision. The key insight is that ILP efficiently identifies predicates that appear frequently in false negatives but rarely in false positives (or vice versa), dramatically pruning the exponential search space of rule edits. This continues until convergence, producing interpretable rules that balance precision and recall. We evaluate AutoSpec on 291 execution traces spanning code execution and embodied agent domains. AutoSpec raises rule F1 to 0.98 and 0.93 across the two domains, achieving up to 94% false positive reduction while maintaining high recall, and converges within 4-5 iterations. The ILP-guided approach achieves up to 4.8x higher F1 than heuristic CEGIS. The learned rules are human-readable, auditable, and generalize to unseen scenarios.

21.
arXiv (CS.LG) 2026-06-16

Discrimination-free Insurance Pricing with Privatized Sensitive Attributes

arXiv:2504.11775v3 Announce Type: replace-cross Abstract: Fairness has become an important concern in insurance pricing as insurers increasingly rely on machine learning models to predict expected losses. At the same time, regulatory and privacy constraints often restrict insurers' ability to access or use sensitive attributes such as gender or race. Recent actuarial research addresses fairness in this context through the concept of the discrimination-free premium, which removes both the direct and indirect effects of sensitive attributes while preserving actuarial consistency. However, implementing this approach typically requires access to the sensitive attributes themselves, which may not be available in practice. This paper studies the estimation of discrimination-free insurance premiums when sensitive attributes are observed only in privatized or noise-perturbed form. We consider a multi-party data setting in which insurers observe non-sensitive attributes and outcomes, while a trusted third party holds privatized sensitive attributes generated through a privacy mechanism. Within this framework, we develop statistical methods for estimating discrimination-free premiums using only the privatized attributes. We study two settings of practical relevance: when the privacy mechanism is known and when its noise level is unknown. For both cases, we establish theoretical guarantees for the proposed estimators. Numerical experiments and empirical applications demonstrate that the proposed approach enables fair insurance pricing while respecting privacy and regulatory constraints.

22.
arXiv (CS.LG) 2026-06-16

Peak-Based Nuclide Identification in HPGe $\gamma$-Spectrometry with Machine Learning and SHAP

arXiv:2606.14874v1 Announce Type: cross Abstract: High-purity germanium gamma spectra often require time-consuming analyses from subject matter experts. Photopeaks within these spectra are carefully fitted and numerical methods are employed to assist with nuclide identification (NID) and quantification. Amending the list of nuclides identified by analysis software can be nontrivial. When many samples need to be analyzed, it is therefore challenging to make timely and correct decisions. Supervised machine-learning-based NID can serve as an expert-informed, automated tool to improve the initial set of radionuclides suggested to an analyst and more effectively drive subsequent quantification. To that end, we implemented machine learning models that map photopeaks carefully fitted by analysts to NID results for experimental spectra containing various isotopic combinations drawn from a set of 65 isotopes. The best model achieved an F1 score of 0.97, markedly surpassing the F1 score of 0.84 achieved by traditional software when compared using a nuclide library comprising the same 65 isotopes assessed by the models. Finally, we illustrated the most important input features for model predictions using Shapley Additive Explanations. These explanations revealed that the models use physically relevant photopeaks when making predictions for the isotopes in our nuclide library.

23.
arXiv (quant-ph) 2026-06-24

Introduction to matrix-product states and tensor networks

arXiv:2606.24803v1 Announce Type: cross Abstract: These notes provide an introduction to tensor-network methods in quantum many-body physics, with an emphasis on matrix-product states (MPS). They develop the basic tensor-network language, including graphical notation, virtual indices, bond dimensions, gauge freedom, canonical forms, QR and singular-value decompositions, and the role of entanglement in controlling the efficiency of the representation. The main MPS algorithms are then introduced, including contractions, correlation functions, matrix-product operators, DMRG, and time-evolution methods. The notes also briefly discuss projected entangled-pair states (PEPS) as a higher-dimensional generalization of MPS, together with the basic ideas behind approximate PEPS contraction. Finally, tensor-network representations of mixed states, quantum channels, and Lindblad dynamics are presented, with applications to thermal states and open quantum systems. The presentation is accompanied by short Julia code examples based on ITensor, ITensorMPS, and TensorMixedStates. These notes were written for the 9th Les Houches Summer School on Computational Physics: Open Quantum Systems, held in June 2026.

24.
arXiv (CS.AI) 2026-06-18

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv:2606.19047v1 Announce Type: new Abstract: Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary – where successes and failures are roughly balanced – contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.

25.
arXiv (CS.CV) 2026-06-11

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes the accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact of an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enable more reliable reasoning with vision-language models than direct prompting alone.