Academic Intelligence · Curated Daily

Explore the Global Frontiers of Academic Research

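AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.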

01.
medRxiv (Medicine) 2026-06-10

Prediction of immunotherapy response using live tumor fragments from routine clinical biopsies

Functional ex vivo assays using live tumor tissues have demonstrated strong predictive accuracy for response to immune checkpoint inhibitors (ICIs) but are not scalable, requiring manual processing of large resections collected at academic centers. Here, an ex vivo live tumor fragment (LTF) platform was developed using standard-of-care biopsies from 228 patients with suspected malignancy collected across prospective, multicenter observational trials and biobanks. Hierarchical clustering of ICI-mediated changes in cytokine production identified two groups: responders and nonresponders. A binary classifier (elive index) using 8 cytokines achieved an AUC of 0.99 for cluster prediction. elive index correctly predicted clinical benefit in 93% (26/28) of patients (P = 3.2 × 10⁻⁵) and accurately identified 83% (10/12) of objective responders. Critically, elive responders were identified among biomarker-negative patients, highlighting the platform as a scalable approach that complements existing companion diagnostics and expands the population of patients identified to benefit from ICI therapy.

02.
medRxiv (Medicine) 2026-06-19

Specific epigenetic age acceleration measures are associated with oral health outcomes in U.S. adults

Objectives: Oral health conditions impact a significant proportion of the global population. Chronological age is a known risk factor; however, characterization of epigenetic age remains limited and is expected to provide additional insight into biological mechanisms. Materials and Methods: The National Health and Nutrition Examination Survey (NHANES) was used to analyze the effect of the epigenetic age measure DunedinPoAm and the epigenetic age acceleration (EAA) of Horvath, Hannum, Weidner, Lin, Vidal-Bralo, PhenoAge, GrimAge, and GrimAge2 on various oral health outcomes from survey and examination results. Univariable and multivariable logistic regression were performed, adjusting for sex, race-ethnicity, education, poverty income ratio categories, and dental insurance coverage status. Results: DunedinPoAm was associated with the last dental appointment being for an existing issue (p=0.0093), poor general oral condition (p=0.0226), limiting food due to teeth problems (p=0.0031), and recommendation to see a dentist within the next two weeks (p=0.0171). EAAs for PhenoAge, GrimAge, and GrimAge2 were associated with a smaller number of oral health outcomes, whereas EAAs for Horvath, Hannum, Weidner, Lin, and Vidal-Bralo showed no associations. Conclusions: In a representative U.S. population, DunedinPoAm was most consistently positively associated with different adverse oral health outcomes compared with other epigenetic aging measures. Tracking specific epigenetic ages such as DunedinPoAm, EAA GrimAge, EAA GrimAge2, and PhenoAge may aid in additional monitoring of oral health outcomes. Understanding specific aging-related CpGs associated with oral health may aid in elucidating underlying molecular mechanisms.

03.
arXiv (CS.CV) 2026-06-19

World Engine: Towards the Era of Post-Training for Autonomous Driving

Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

04.
arXiv (CS.CL) 2026-06-15

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

As language models take integrated roles across many domains, the response of LLMs to user pushback becomes a critical alignment property. Yet many existing evaluations treat compliance as unidirectional, measuring whether models resist pressure but not whether they resist it selectively. We introduce Compliance Asymmetry (A = BCR/HCR), a bidirectional diagnostic that compares beneficial output change under helpful nudges with harmful change under misleading nudges. Across 9 models and 972,000 nudge-condition responses, we find that this selectivity differs in factual and moral judgments: models follow helpful nudges more than harmful ones on factual questions (A = 1.58), but follow both directions at nearly identical rates on moral questions (A = 1.04). This phenomenon persists across model families, capability levels, and nudging types. Interestingly, we also find that chain-of-thought prompting amplifies helpful and harmful compliance together, while identity-based prompting suppresses both by nearly identical margins. These results identify direction-blind moral compliance as a distinct failure mode in current LLMs and suggest that alignment should target directionally calibrated updating rather than lower compliance alone.
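The ratio A = BCR/HCR can be illustrated with a small calculation. The sketch below assumes that BCR is the rate of beneficial output changes under helpful nudges and HCR the rate of harmful changes under misleading nudges, as the abstract's wording suggests; the function names and all counts are hypothetical and are not taken from the paper.

```python
def change_rate(changed: int, total: int) -> float:
    """Fraction of responses whose answer changed after a nudge."""
    if total <= 0:
        raise ValueError("total must be positive")
    return changed / total


def compliance_asymmetry(bcr: float, hcr: float) -> float:
    """A = BCR / HCR; values near 1 mean both nudge directions are followed alike."""
    if hcr == 0:
        raise ValueError("HCR is zero, so the ratio is undefined")
    return bcr / hcr


# Hypothetical counts, for illustration only.
selective = compliance_asymmetry(change_rate(500, 1000), change_rate(250, 1000))
direction_blind = compliance_asymmetry(change_rate(400, 1000), change_rate(400, 1000))

print(selective)        # helpful nudges followed twice as often as harmful ones
print(direction_blind)  # both directions followed at the same rate
```

Under this reading, the reported A = 1.58 on factual questions sits between these two toy cases, while A = 1.04 on moral questions is close to the direction-blind case.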

05.
arXiv (CS.AI) 2026-06-15

The Journal of Prompt-Engineered (Moral) Philosophy Or: Why AI-Assisted Ethics Research Requires Process Transparency

arXiv:2511.08639v4

Existing AI disclosure mandates in scholarship require that AI assistance be reported but leave transparency philosophically unspecified: they fix the duty without explaining what the duty serves. We argue that ethical inquiry is essentially contested at two independent levels – about what it is, and about what it demands of the inquirer – defeating output-only evaluation and welfare-economic dismissal of the transparency question, and, by extension, reproducibility framings imported from the empirical sciences. The transparency duty is grounded instead in agent-integrity: the legibility, before a community of inquiry, of the identity-constituting commitments that the author's mode of philosophising expresses. Because the standards for evaluating such work are not communally settled, the achievable goal for transparency is not evaluation against agreed criteria but tracking – accumulating the evidentiary record that lets each tradition assess the work on its own terms and makes future normative judgments possible. We develop a documentation-adequacy framework that operationalises Meaningful Human Control through five transparency elements – declaration, navigation, documentation account, process documentation, and development records – demonstrated by the paper itself, whose full documentation record is archived at a persistent identifier. The framework is a first iteration subject to revision, not a settled standard.

06.
arXiv (math.PR) 2026-06-16

Risk-averse mean field games: exploitability and non-asymptotic analysis

arXiv:2301.06930v5

In this paper, we use mean field games (MFGs) to investigate approximations of $N$-player games ($N$pGs) with uniformly symmetrically continuous heterogeneous closed-loop actions. To incorporate agents' risk aversion (beyond the classical expected utility of total costs), we use an abstract evaluation functional for their performance criteria. Centered around the notion of exploitability, we conduct non-asymptotic analysis on the approximation capability of MFGs from the perspective of state-action distributions without requiring the uniqueness of equilibria. Under suitable assumptions, we first show that scenarios in the $N$pGs with large $N$ and small average exploitabilities can be well approximated by approximate solutions of MFGs with relatively small exploitabilities. We then show that $\delta$-mean field equilibria can be used to construct $\varepsilon$-equilibria in $N$pGs. Furthermore, in this general setting, we prove the existence of mean field equilibria. This proof reveals a possible avenue for incorporating penalization for randomized action into MFGs.

07.
arXiv (CS.AI) 2026-06-24

MultiMem: Measuring and Mitigating Memorization in Multi-Modal Contrastive Learning

arXiv:2606.22220v2

Memorization in machine learning models enables high performance on rare in-distribution samples by capturing their atypical patterns. However, it also causes harmful retention of noise and outliers, degrading generalization. While memorization has been extensively studied in both supervised and self-supervised learning in the vision domain, it remains unexplored in multi-modal contrastive learning. We address this gap by introducing MultiMem, the first metric designed to quantify memorization in multi-modal contrastive learning. Through our systematic analysis, we demonstrate that cross-modal semantic misalignment has the strongest influence on memorization, with text being the dominant modality driving memorization, followed by video, image, and audio. We show that targeted augmentations applied across all modalities effectively reduce memorization as measured by our MultiMem metric and improve model performance. Overall, this work establishes the first framework for measuring and mitigating memorization in multi-modal contrastive learning, preventing harmful data retention and contributing to higher-performing models.

08.
arXiv (CS.AI) 2026-06-17

Rethinking Multimodal Fusion for Time Series: Text Modalities Need Constrained Fusion

arXiv:2603.22372v2

Recent advances in multimodal learning have motivated the integration of auxiliary modalities such as text or vision into time series (TS) forecasting. However, most existing methods provide limited gains, often improving performance only in specific datasets or relying on architecture-specific designs that limit generalization. In this paper, we show that multimodal models with naive fusion strategies (e.g., simple addition or concatenation) often underperform unimodal TS models, which we attribute to the uncontrolled integration of auxiliary modalities which may introduce irrelevant information. Motivated by this observation, we explore various constrained fusion methods designed to control such integration and find that they consistently outperform naive fusion methods. Furthermore, we propose Controlled Fusion Adapter (CFA), a simple plug-in method that enables controlled cross-modal interactions without modifying the TS backbone, integrating only relevant textual information aligned with TS dynamics. CFA employs low rank adapters to filter irrelevant textual information before fusing it into temporal representations. We conduct over 20K experiments across various datasets and TS/text models, demonstrating the effectiveness of the constrained fusion methods. Code is available at: https://github.com/seunghan96/cfa.

09.
medRxiv (Medicine) 2026-06-15

Toward a National Registry for Inborn Errors of Immunity in Peru: A Qualitative Implementation Study

Background: Peru lacks an integrated information system for patients with Inborn Errors of Immunity (IEI). Although disease registries are essential tools for data management and health planning, their success depends on implementation science approaches that account for local contextual factors. This study reports Phase I of a three-phase mixed-methods implementation project to design and develop a national IEI registry. Methods: Phase I consisted of a phenomenological qualitative study exploring stakeholder perspectives. Semi-structured focus groups and in-depth interviews were conducted with 29 key stakeholders across four groups: policy-makers, clinical experts, end-users (immunologists, residents, allied health personnel), and patient organization representatives. Interviews followed a guide structured around four a priori domains (structure, navigation, feasibility, and perception of existing systems). Discussions were conducted in Spanish, audio-recorded, transcribed verbatim, and coded using ATLAS.ti. A hybrid thematic analysis combining deductive and inductive coding was performed. Data elements proposed for the registry were triangulated with qualitative findings. Results: Thirty-six initial codes were consolidated into 15 categories, which were further integrated into four overarching themes conceptualized as pathways toward intention to use: (1) Environment, where governance, regulatory backing, and sustainable financing were identified as key enablers, while limited interoperability emerged as a structural barrier; (2) Technical Dimension, emphasizing usability, alignment with clinical workflow, and a hierarchical data architecture (demographic, clinical, therapeutic); (3) Users, highlighting clinical leadership, protected time, digital readiness, and perceived usefulness as stronger motivators than financial incentives; and (4) Patients, underscoring data protection, transparency, trust, and advocacy as essential for legitimacy and sustainability. 
Conclusions: A national IEI registry in Peru is perceived as necessary and feasible if implemented with strong regulatory foundations, interoperable design, robust data security, and user-centered architecture. These findings informed the development of an initial functional prototype and the operational plan for Phase II, focused on usability evaluation.

10.
arXiv (CS.AI) 2026-06-16

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

arXiv:2606.15141v1

While large audio-language models (LALMs) show promise on audio question answering, they fail to focus on question-relevant audio segments or to provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.

11.
Science (Express) 2026-05-21

Observation of quantum vortex core fractionalization and skyrmion formation in a superconductor

Authors: unknown

Magnetic fields can penetrate a superconductor in the form of quantum vortices, which consist of a core singularity with circulating currents. London’s quantization implies that there is one core singularity per quantum of magnetic flux in single-component superconductors. Here, we report signatures of quantum vortex core fractionalization on the potassium-terminated surface of a multiband superconductor KFe$_2$As$_2$. The observed splitting of single integer-flux vortices into several fractional vortices results in a disparity between the numbers of flux quanta and vortex cores. These fractional vortices often arrange in chains, which calculations show are characterized by a $\mathbb{C}P^2$ skyrmionic topological invariant; this constitutes a different type of topological defect: the chiral skyrmion. The disparate natures of integer and fractional vortices comprising skyrmions lead to distinct spectroscopic signatures.

12.
arXiv (CS.LG) 2026-06-18

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

arXiv:2606.18774v1

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.

13.
arXiv (CS.CV) 2026-06-19

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aimed to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from R² = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. They also aligned with present morphometric variations. Several saliency-based scores differed significantly between incident AD and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful risk-related structural changes that mirror potential AD vulnerability.
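The negative lower bound reported for R² may look surprising. The sketch below assumes the standard coefficient of determination, R² = 1 - SS_res/SS_tot (the abstract does not state which definition was used), and shows with made-up numbers that a model predicting worse than the sample mean yields R² below zero. The function name and data are illustrative only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true is constant, so R^2 is undefined")
    return 1 - ss_res / ss_tot


y = [1.0, 2.0, 3.0]                    # made-up targets
print(r_squared(y, [1.0, 2.0, 3.0]))   # perfect predictions
print(r_squared(y, [2.0, 2.0, 2.0]))   # always predicting the mean
print(r_squared(y, [3.0, 2.0, 1.0]))   # worse than the mean: negative
```

A value such as -0.0291 therefore indicates a predictor performing marginally worse than a constant mean prediction for that factor.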

14.
arXiv (CS.LG) 2026-06-24

Accelerated Stochastic Min-Max Optimization Based on Bias-corrected Momentum

arXiv:2406.13041v3

Lower-bound analyses for nonconvex strongly-concave minimax optimization problems have shown that stochastic first-order algorithms require at least $\mathcal{O}(\varepsilon^{-4})$ sample complexity to find an $\varepsilon$-stationary point. Some works indicate that this complexity can be improved to $\mathcal{O}(\varepsilon^{-3})$ when the stochastic loss gradient is Lipschitz continuous. The question of achieving enhanced convergence rates under distinct conditions remains open. In this work, we address this question for optimization problems that are nonconvex in the minimization variable and strongly concave or Polyak-Lojasiewicz (PL) in the maximization variable. We introduce novel bias-corrected momentum algorithms utilizing efficient Hessian-vector products. We establish convergence conditions and demonstrate a lower iteration complexity of $\mathcal{O}(\varepsilon^{-3})$ for the proposed algorithms. The effectiveness of the proposed method is validated through applications to robust logistic regression and robust adaptive cruise control.

15.
arXiv (math.PR) 2026-06-15

Longest weakly increasing subsequences of discrete random walks on the integers with heavy tailed distribution of increments

arXiv:2603.29047v2

We investigate the behavior of the length of the longest weakly increasing subsequences (weak LIS) of $n$-step random walks with nonzero integer increments $k = \pm 1, \pm 2, \dots$ given by a symmetric heavy tailed mass distribution proportional to $|k|^{-1-\alpha}$ for several values of the real parameter $\alpha > 0$ together with that of the simple random walk ($k=\pm 1$), to which the $n$-step heavy tailed walks reduce when $\alpha$ grows large enough that step jumps beyond $\pm 1$ become essentially absent on the scale of $n$. By means of exploratory fits, weighted nonlinear least squares, and nested-model comparisons, we found that the sample average length $\langle{L_{n}}\rangle$ scales like $\langle{L_{n}}\rangle \sim \sqrt{n}\log{n}$ when the distribution of increments has finite variance ($\alpha > 2$) and $\langle{L_{n}}\rangle \sim n^{\theta}$ with a varying exponent $\theta > 0.5$ when the variance is infinite ($\alpha \leq 2$). Distributional diagnostics indicate that the bulk of the $L_{n}$ distribution is very well-approximated by a lognormal model, though systematic deviations are observed in the tails. Our results corroborate and expand upon previous results for the LIS of other types of heavy-tailed random walks and raise a conjecture as to whether the distribution of $L_{n}$ is given, or can be effectively described, by a lognormal distribution.
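The quantity studied above can be explored with a short simulation. The following is a minimal sketch, not the authors' code: it computes the weak LIS length with the standard patience-sorting technique (using `bisect_right` so that equal values may extend a subsequence) and samples increments with probability proportional to $|k|^{-1-\alpha}$. The names `weak_lis_length`, `heavy_tailed_walk`, and the truncation parameter `k_max` are assumptions made for illustration; in particular, truncating the support at `k_max` is a simplification that the abstract does not mention.

```python
import bisect
import random


def weak_lis_length(seq):
    """Length of the longest weakly increasing (non-decreasing) subsequence."""
    tails = []  # tails[i]: smallest possible last value of a subsequence of length i + 1
    for x in seq:
        # bisect_right places x after equal values, so ties extend a subsequence.
        i = bisect.bisect_right(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def heavy_tailed_walk(n, alpha, k_max=10_000, rng=None):
    """Positions after each of n steps; P(step = +/-k) proportional to k**(-1 - alpha).

    The support is truncated at k_max (an assumption made for this sketch).
    """
    rng = rng or random.Random()
    magnitudes = list(range(1, k_max + 1))
    weights = [k ** (-1.0 - alpha) for k in magnitudes]
    steps = rng.choices(magnitudes, weights=weights, k=n)
    position, walk = 0, []
    for step in steps:
        position += step if rng.random() < 0.5 else -step  # symmetric sign
        walk.append(position)
    return walk


walk = heavy_tailed_walk(n=1000, alpha=1.5, rng=random.Random(0))
print(weak_lis_length(walk))
```

Averaging `weak_lis_length` over many independent walks for increasing `n` would give an estimate of $\langle L_n \rangle$ that could be compared against the scaling forms quoted in the abstract.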

16.
arXiv (CS.CV) 2026-06-17

UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning

Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for "a Universal fingerprint foundation model based on large-scale Unsupervised learning," which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at https://github.com/XiongjunGuan/UoU.

17.
arXiv (CS.CV) 2026-06-16

Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose MuDuo, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at https://github.com/Wu-beining/MuDuo.

18.
arXiv (CS.AI) 2026-06-24

Repeated Shared Access Enables Grokking, but Edit Propagation Depends on an Addressable Memory

arXiv:2606.20737v2

We study factual edit propagation in a controlled synthetic knowledge-graph QA setting using a 2x2 grid that crosses loop recurrence with shared-memory access: a dense transformer (Dense), a looped transformer (Loop), a dense backbone with shared memory (Dense+Mem), and a looped backbone with shared memory (loop-memory coupling, LMC). The two factors dissociate. For learning, both routes to repeated shared access – looped recomputation and repeated memory rereading – cross the out-of-distribution (OOD) grokking barrier that Dense fails, so repeated shared access is the behavioral regularity, not a specific architecture. For editing, the substrates split along a different axis: applying a single localized factual edit (conditioned on direct success) and measuring 2-hop propagation on a shared pre-edit-correct set, the edit propagates strongly in both memory-bearing cells (LMC 0.78-0.92, Dense+Mem 0.71-0.96) and only weakly in the memory-free ones (Loop 0.04-0.30, Dense 0.00-0.03). The split is along the memory axis, not the loop axis: every memory-bearing seed exceeds every memory-free seed, with no detectable difference between the two memory cells. Crucially Dense+Mem has no recurrence, so the propagating ingredient is an addressable site that an edit can write to and later computation rereads, not loop recomputation; Loop is at best a partial intermediate. The affordance survives coarsening the store (N=128 to N=13): propagation attenuates but the memory/no-memory split persists, so fine granularity buys precision rather than the affordance itself. These results dissociate learning competence from editing affordance – repeated shared access suffices to grok, but edit propagation depends on whether the substrate exposes an addressable memory that the forward computation can write to and later reread, an affordance that loop recurrence provides only partially.

19.
arXiv (quant-ph) 2026-06-24

Quantum Correlations of Neutrinos in the Kerr-Newman Space-time

arXiv:2605.10424v2

Quantum phases provide a connection between gravitation and quantum information, offering a novel avenue to explore the properties of space-time. In this paper, we investigate the quantum correlations (QCs) of neutrinos in the Kerr–Newman space-time. Both radial and non-radial propagations are considered under the weak-field approximation. The results show that, for inward propagations, the oscillation probabilities and QCs differ significantly from those obtained in the Schwarzschild metric. In the case of radial outward propagation, a larger angular momentum $a$ increases the oscillation period of the survival probability $P_{ee}$, entanglement, and monogamy of nonlocality, whereas a larger charge $Q$ decreases the corresponding periods. For non-radial propagations, $M$ and $a$ can noticeably modulate the amplitudes of the considered QCs, which is not observed in the case of radial propagations. Furthermore, we find that, despite differences in their variation ranges, entanglement and coherence exhibit highly consistent oscillation behaviors in both radial and non-radial propagation cases. These findings provide a comprehensive understanding of neutrino-based relativistic quantum information.

20.
arXiv (CS.LG) 2026-06-12

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

arXiv:2606.13589v1

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" – the mathematical reality that the L1 norm is constant on the simplex and fails to prune – by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
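The "L1-simplex paradox" mentioned above can be checked numerically. In the sketch below, every weight vector on the probability simplex has L1 norm 1, so an L1 penalty cannot distinguish dense from sparse solutions. The abstract does not give the exact form of the concave quadratic penalty, so the choice $\sum_i w_i (1 - w_i)$ used here is an assumption picked for illustration; the function names are likewise hypothetical.

```python
def l1_norm(weights):
    """L1 norm; equals 1 for every point on the probability simplex."""
    return sum(abs(w) for w in weights)


def concave_quadratic_penalty(weights):
    """Illustrative concave penalty sum_i w_i * (1 - w_i) (assumed form, not from the paper)."""
    return sum(w * (1.0 - w) for w in weights)


uniform = [0.25, 0.25, 0.25, 0.25]   # standard bagging: equal votes
two_active = [0.5, 0.5, 0.0, 0.0]    # half of the estimators pruned
one_active = [1.0, 0.0, 0.0, 0.0]    # a vertex of the simplex

for w in (uniform, two_active, one_active):
    print(l1_norm(w), concave_quadratic_penalty(w))
```

The L1 column stays constant while the concave penalty shrinks as weight concentrates on fewer estimators, which is why minimizing a penalty of this kind can favor sparse weight vectors where L1 regularization cannot.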

21.
arXiv (CS.CL) 2026-06-17

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

22.
arXiv (CS.LG) 2026-06-16

Benchmarking Instance-Dependent Label Noise with Controlled Corruptions

arXiv:2606.14965v1. Synthetic instance-dependent label noise (IDN) benchmarks are widely used to evaluate noisy-label learning methods, yet existing approaches typically generate noise through imperfect annotators or classifier raters, leaving the source of ambiguity implicit. We introduce CILN, a benchmark generation framework that creates IDN through controlled input corruptions. A diverse voter pool labels corrupted instances, producing benchmark datasets in which both the source and severity of ambiguity are explicit and controllable. Using CIFAR-10, MNIST, and Adult, we construct 90 benchmark settings spanning multiple corruption families and severity levels. Our experiments show that the resulting benchmarks exhibit genuine instance-dependent noise, provide diverse confusion structures, and, on CIFAR-10, can produce label distributions that are closer to human uncertainty than an existing synthetic IDN benchmark. We further demonstrate that corruption-mediated IDN can expose failure modes of popular noisy-label learning methods, including Co-Teaching and DivideMix, that are not observed under comparable levels of rater-fallibility noise. These findings suggest that noise structure, not only noise rate, plays an important role in benchmark difficulty and algorithm behavior. By making ambiguity generation explicit and controllable, CILN provides a complementary benchmarking framework for studying noisy-label learning under diverse sources of instance difficulty.

23.
arXiv (CS.CL) 2026-06-11

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

24.
arXiv (CS.AI) 2026-06-24

Uncertainty-Aware Longitudinal Forecasting of Alzheimer's Disease Progression Using Deep Learning

arXiv:2606.24604v1. Longitudinal modelling of Alzheimer's disease progression is clinically useful only if it can describe not just the most likely next diagnosis, but how a patient may evolve over time and how reliable that forecast is. Most deep learning approaches reduce this problem to single-step classification, treating cognitively normal, mild cognitive impairment, and dementia as flat categories while providing limited insight into how uncertainty accumulates across future visits. We propose a probabilistic framework that combines ordinal diagnosis prediction, multi-horizon trajectory generation, and decomposed uncertainty estimation. A Temporal Fusion Transformer encoder is adapted with a CORAL ordinal output layer, asymmetric loss weighting, and converter oversampling to respect disease-stage ordering and improve sensitivity to MCI-to-dementia transitions. Conditioned on the learned patient-context representation, an autoregressive Mixture Density Network generates five-year probabilistic trajectories for diagnosis state, CDR Sum of Boxes, MMSE orientation, and hippocampal volume. On ADNI, the model outperforms linear, recurrent, and transformer baselines for next-visit diagnosis prediction, with the strongest gains on MCI-versus-dementia discrimination. Generated trajectories achieve near-nominal 90% credible interval coverage, widening uncertainty across the forecast horizon, and biomarker dynamics consistent with expected Alzheimer's disease progression. We further separate aleatoric from epistemic uncertainty using analytic mixture variance and a five-member bootstrap ensemble, which provides the strongest encoder diversity and output-level epistemic signal. Epistemic uncertainty is higher for rare progression archetypes, MCI and dementia patients, and under external evaluation on OASIS-3, where it increases alongside prediction error.
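
One common way to split uncertainty with a mixture density output and an ensemble can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes each ensemble member emits a one-dimensional Gaussian mixture, takes aleatoric uncertainty as the average analytic mixture variance, and takes epistemic uncertainty as the variance of the member means. All function names and numbers are hypothetical.

```python
def mixture_mean_and_variance(weights, means, variances):
    """Analytic mean and variance of a 1-D Gaussian mixture.

    Uses Var[X] = E[X^2] - E[X]^2, where the second moment of each
    component is variance_k + mean_k^2.
    """
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (v + m * m)
                        for w, m, v in zip(weights, means, variances))
    return mean, second_moment - mean * mean


def decompose_uncertainty(members):
    """Split ensemble uncertainty into (aleatoric, epistemic).

    `members` is a list of (weights, means, variances) tuples, one per
    ensemble member. Aleatoric: average within-member mixture variance.
    Epistemic: variance of the member means around their average.
    """
    stats = [mixture_mean_and_variance(*m) for m in members]
    n = len(stats)
    aleatoric = sum(var for _, var in stats) / n
    grand_mean = sum(mean for mean, _ in stats) / n
    epistemic = sum((mean - grand_mean) ** 2 for mean, _ in stats) / n
    return aleatoric, epistemic


# Two hypothetical ensemble members predicting the same quantity.
members = [
    ([0.5, 0.5], [0.0, 2.0], [1.0, 1.0]),  # bimodal prediction
    ([1.0], [3.0], [0.5]),                 # single confident component
]
aleatoric, epistemic = decompose_uncertainty(members)
print(aleatoric, epistemic)
```

Under this decomposition, disagreement between members raises only the epistemic term, which matches the abstract's observation that epistemic uncertainty grows on rare or out-of-distribution cases.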

25.
arXiv (CS.LG) 2026-06-19

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

arXiv:2606.09547v2. Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.