Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-15

Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

arXiv:2606.14149v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved. This study examine whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent-based method for reducing such errors. We developed a five-agent "Trust but Verify" system using a single LLM backbone. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances. This scale ensures statistical significance across various therapeutic classes. We evaluated three open-access model families (GPT-OSS, Llama-3, Falcon-3) under vanilla and agentic conditions. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate (HER), and Component Fidelity (CF) score. We also observed clinical safety regression in proprietary models. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns. Our proposed agentic architecture reduced HER by approximately 53% across models. Pointwise scores shifted from -0.25 (unsafe recommendation) toward 0.0 (appropriate refusal). The safety audit intercepted dangerous outputs even when models' parametric knowledge favored the banned substance. The proposed multi-agent framework offers a model-agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation. Our work demonstrates a practical approach for deploying autonomous AI systems in safety-critical healthcare settings. It shows how real-time regulatory data can be integrated into LLM pipelines to support clinical decision-making.

02.
arXiv (CS.AI) 2026-06-18

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

arXiv:2606.18698v1 Announce Type: cross Abstract: The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

03.
arXiv (CS.LG) 2026-06-18

MOLAR: Learning Multimodal Molecular Representations from Noisy Labels

arXiv:2606.18390v1 Announce Type: new Abstract: Motivation: Noisy labels are a common challenge in molecular property prediction because molecular annotations are often obtained from assays, curated databases, or weak annotation pipelines rather than directly observed clean biological states. Treating recorded labels as reliable supervision can cause models to memorize corrupted observations and learn misleading molecular evidence. In multimodal molecular representation learning, this issue can be amplified by graph-text fusion or alignment, which may propagate label-induced errors across modalities. Results: We propose MOLAR, a noise-aware framework for learning multimodal molecular representations from noisy labels. MOLAR separates latent clean-property inference from recorded-label observation: graph and text views contribute residual evidence to a clean-property distribution, and a categorical label-observation channel maps this distribution to recorded labels for training. This formulation derives posterior label reliability and modality-specific molecular evidence from the model. Experiments on naturally noisy molecular benchmarks and controlled label-flipping benchmarks show that MOLAR consistently outperforms representative baselines. Visualization analyses further show that MOLAR provides interpretable reliability and modality-evidence diagnostics.

04.
arXiv (CS.LG) 2026-06-18

Detecting Hidden ML Training With Zero-Overhead Telemetry

arXiv:2606.19262v1 Announce Type: new Abstract: Hardware-enabled monitoring of GPU workloads underpins many proposals for AI compute governance, but if developers can defeat monitoring mechanisms, such schemes are unworkable. We evaluate the adversarial robustness of GPU workload classification using only zero-overhead, privacy-preserving NVML telemetry: content-agnostic signals that observe physical effects of computation without accessing model weights, training data, or hyperparameters. Across 5 rounds of monitor-evader iteration, we evaluate 20 evasion strategy families on 9 GPU models spanning 4 architecture generations. We develop a classifier that achieves 98.2% binary accuracy at identifying training workloads across the whole corpus, and 43-87% accuracy against the most challenging unexpected workloads even when they are adversarially disguised.

05.
medRxiv (Medicine) 2026-06-22

Exploring the association of Obesity on Cold and Warm Autoimmune Hemolytic Anemia in San Joaquin Valley: A Retrospective Cross-Sectional Study

The relationship between obesity and specific autoimmune diseases haas been well-established, specifically due to obesity's role in promoting pro-inflammatory states. Although not much literature has been documented regarding obesity association with AIHA. As such, this study aims to assess any correlations in patients with elevated body mass index (BMI) and autoimmune hemolytic anemia (AIHA). Here we present a retrospective cross-sectional study conducted over a four-year period, across four medical centers during which a new electronic medical record was implemented. The study included 25 patients who had a previously documented history of AIHA from another facility, DAT positive with indicators of hemolysis, or DAT positive with monomer specific antisera. The patients BMI was recorded at the time of presentation to the hospital. However, for patients with a prior history of AIHA or those transferred from another facility, the BMI that was closest to the time period of when the patient was diagnosed with AIHA was used as an adjunct. Our results show that there is an association of patients with elevated BMI (>25) and AIHA; however, various other confounding variables should be taken into consideration, and further research should be done to establish a causal relationship.

06.
arXiv (CS.CL) 2026-06-15

GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge

Large language model (LLM) reasoning is ephemeral: chains of thought vanish with the context window, pruned search branches leave no record, and memory buffers cannot be diffed, merged, or audited. Every other complex software process (code, infrastructure, data, experiments) is version-controlled; reasoning is not. We introduce GitOfThoughts, which stores an agent's reasoning tree as a git repository: every scored thought is a commit, scores are notes, outcomes are tags, and retrieval is "git log" over the agent's own history. This makes reasoning replayable, auditable, and mergeable across agents at near-zero engineering cost. We then ask the harder question: does memory, in any substrate, actually improve accuracy? Across five substrates (none, markdown, vector, graph, git), two benchmarks, two model scales, and pre-registered replications, the answer for novel problems is no. No memory format reliably helps, and a promising early result collapsed under its own pre-registered replication. Memory pays only above what we call the copyability threshold: when the retrieved case is a near-duplicate of the current problem (similarity >~ 0.8), accuracy jumps sharply; below it, nothing. The gain is answer retrieval, not method transfer: a 4.5x larger model doubles the near-duplicate payoff yet still cannot extract a transferable method from a worked example. The only general lever we find is test-time sampling. The case for git-as-substrate is therefore auditability, provenance, and mergeability at accuracy parity. We document a retracted result and a refuted hypothesis to model the evaluation standard we hold ourselves to.

07.
arXiv (CS.LG) 2026-06-12

QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS

arXiv:2401.08301v2 Announce Type: replace-cross Abstract: In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.

08.
PLOS Computational Biology 2026-06-18

scMagnifier: Resolving fine-grained cell subtypes via GRN-informed perturbations and consensus clustering

作者:

by Zhenhui He, Dong Kangning Resolving fine-grained cell subtypes in single-cell RNA sequencing (scRNA-seq) data remains challenging, as their subtle transcriptional differences are often obscured by technical noise and data sparsity. Here, we present scMagnifier, a consensus clustering framework that leverages gene regulatory network (GRN)-informed in silico perturbations to amplify subtle transcriptional differences and uncover latent cell subpopulations. scMagnifier perturbs candidate transcription factors (TFs), propagates perturbation effects through cluster-specific GRNs to simulate post-perturbation expression profiles, and integrates clustering results across multiple perturbations into stable subtype assignments. Additionally, scMagnifier introduces regulatory perturbation consensus UMAP (rpcUMAP), a perturbation-aware visualization that provides clearer separation between cell subtypes and guides the selection of the optimal number of clusters. In both single-batch and multi-batch benchmarks, scMagnifier consistently improves the resolution and accuracy of fine-grained cell type identification. Notably, when integrated with spatial clustering methods such as STAGATE, scMagnifier is compatible with spatial transcriptomics workflows and effectively reveals tumor cell subtypes and their spatial organization in ovarian cancer.

10.
arXiv (CS.LG) 2026-06-18

TimeLAVA: Learning-Agnostic Data Valuation for Time Series

arXiv:2606.18729v1 Announce Type: cross Abstract: Data valuation quantifies the intrinsic quality of individual samples to enable principled data curation, quality control, and robust learning. For time series in critical domains such as healthcare, finance, and industrial monitoring, effective valuation methods are essential yet fundamentally lacking. Existing approaches are either model-dependent, limiting their generalizability, or designed for i.i.d. data and thus fail to capture temporal dependencies, multi-scale patterns, and non-stationary dynamics inherent to sequential data. We introduce TimeLAVA, a learning-agnostic framework that values temporal segments by their marginal contribution to minimizing distributional discrepancy between evaluated and reference data. At its core is a novel Selective Wavelet-based Wasserstein discrepancy combining multi-scale wavelet transforms for temporal localization with unbalanced optimal transport for robustness to distributional shifts. Segment values are efficiently computed via sensitivity analysis without requiring model training and aggregated into point-wise scores. We provide theoretical guarantees linking valuation to model-agnostic generalization and prove bounded sensitivity to outlier contamination. Extensive experiments across anomaly detection, data pruning, and label noise detection demonstrate that TimeLAVA produces significantly more informative value scores than existing methods on diverse real-world datasets.

11.
arXiv (quant-ph) 2026-06-16

Towards Quantum Limited Spatial Resolution of NV-Diamond Magnetometry

arXiv:2508.13438v2 Announce Type: replace Abstract: Optically addressable ensembles of solid-state defects, such as nitrogen vacancy (NV) centers, are a leading modality for imaging-based magnetometry, thermometry and strain sensing. However, monitoring the fluorescence of individual defects within a sub-diffraction ensemble remains an outstanding challenge that currently limits access to atomic-scale features and dynamics. For compact clusters of NVs, we formulate imaging-based atomic sensing as a low-dimensional multiparameter estimation task in which one seeks to localize each defect and quantify the field strength in its immediate vicinity. In this work, we employ optical spatial mode demultiplexing (SPADE) to enhance localization and brightness estimation accuracy at sub-diffraction scales. Specifically, we develop a two-stage sensing protocol that augments direct imaging by projecting the incoming optical field onto point spread function (PSF)-adapted, i.e., PAD spatial modes and Yuen-Kennedy-Lax (YKL) spatial modes enabling efficient extraction of emitter positions and brightnesses. The YKL-SPADE measurement employed for brightness estimation is shown to be quantum-optimal in the case of two emitters and establishes a new connection between quantum detection and estimation theories. We numerically evaluate the statistical performance of our protocol for sub-diffraction optically detected magnetic resonance (ODMR) and Rabi sensing experiments. Compared to conventional focal plane intensity measurements, our protocol improves emitter localization accuracy by 6$\times$ and brightness estimation accuracy by 2$\times$ for tightly confined ensembles, residing well below the diffraction limit.

12.
arXiv (CS.CV) 2026-06-11

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

13.
arXiv (CS.CV) 2026-06-11

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision–language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).

14.
arXiv (CS.CL) 2026-06-18

GraphPO: Graph-based Policy Optimization for Reasoning Models

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

15.
arXiv (CS.AI) 2026-06-18

X+Slides: Benchmarking Audience-Conditioned Slide Generation

arXiv:2606.19256v1 Announce Type: new Abstract: Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

16.
arXiv (math.PR) 2026-06-18

Second-Order Approximation of Limit Order Books in a Single-Scale Regime

arXiv:2308.00805v3 Announce Type: replace-cross Abstract: We establish a first- and second-order approximation for an infinite dimensional limit order book model in a single (critical) scaling regime where market and limit orders arrive at a common time scale. With our choice of scaling we obtain non-degenerate first- and second-order approximations for the price and volume dynamics. While the first-order approximation is given by a coupled ODE-PDE system, the second-order approximation is described in terms of an infinite-dimensional stochastic evolution equation driven by a cylindrical Brownian motion. The driving noise processes exhibit a non-trivial correlation in terms of the model parameters. We prove that the evolution equation has a unique solution and that the sequence of standardized limit order book models converges weakly to the solution of the evolution equation. The proof uses a non-standard martingale problem. We calibrate a linearized model to market data and explain how our model can be used for deriving confidence intervals of portfolio liquidation values.

17.
arXiv (quant-ph) 2026-06-19

Maximum entropy principle for quantum processes

arXiv:2506.24079v3 Announce Type: replace Abstract: The maximum entropy principle, as applied to quantum systems, is a fundamental prescript positing that for a quantum system for which we only have partial knowledge, the maximum entropy state consistent with the partial knowledge is a valuable choice as the system's state. An intriguing result is that in case the only prior knowledge is of a fixed energy, the maximum entropy state turns out to be the thermal state, a ubiquitous state in several arenas, especially in statistical mechanics. We extend the consequences of this principle from static quantum states to dynamic quantum processes. We establish that a quantum channel attains maximal output entropy under a fixed energy constraint if and only if it is an absolutely thermalizing channel, where the fixed output is the thermal state corresponding to that energy. Our results have potential implications for understanding the informational and thermodynamic utility of quantum channels under physical constraints. As an application, we examine the consequences for private randomness distillation from fixed energy constrained quantum processes.

18.
arXiv (CS.AI) 2026-06-17

Symplectic Transversality and Endpoint Green Estimates for Finite-Horizon Pontryagin Systems

arXiv:2606.17762v1 Announce Type: cross Abstract: We study horizon-uniform local branches of finite-horizon discrete-time Pontryagin boundary value systems after smooth control elimination. The central input is a two-point endpoint inverse for the linearization. We verify this inverse from scaled stable–unstable boundary transversality, prove the associated endpoint-corrected Green estimate, and combine it with weighted contractions to obtain existence, uniqueness, Lipschitz dependence, and first-order expansions with constants independent of the horizon. The framework covers smooth nonlinear endpoint maps, including the original Pontryagin rows that fix the initial state and couple the terminal costate to the terminal state. Symplectic and Riccati criteria verify the inverse hypothesis at the level of the matrix data; in particular, every stabilizable linear-quadratic system with invertible dynamics and definite weights is covered, including noncommuting coupled data. A numerical section illustrates the certificates and the horizon-uniform first-order expansion.

19.
arXiv (CS.CV) 2026-06-15

Memento: Reconstruct to Remember for Consistent Long Video Generation

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

20.
arXiv (CS.AI) 2026-06-19

SIMBA: ABidirectional Retrieval Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications

arXiv:2606.19943v1 Announce Type: cross Abstract: Hyperspectral infrared observations are an important data source for numerical weather prediction (NWP) because they provide rich information on the vertical structure of atmospheric temperature and humidity. However, most existing deep learning methods mainly focus on one-way retrieval from radiances to atmospheric profiles, while the reverse radiance simulation process and the consistency between atmospheric state space and radiance observation space are insufficiently considered. In this study, we propose SIMBA, a unified bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling toward NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, introduces a cycle-consistency constraint to strengthen the coupling between the two processes, and employs a bidirectional Mamba state-space module to capture long-range dependencies along pressure levels. Using collocated FY-4A GIIRS observations and ERA5 reanalysis data, the proposed method is evaluated for temperature retrieval, specific humidity retrieval, long-wave radiance reconstruction, and medium-wave radiance reconstruction. Experimental results show that SIMBA outperforms several representative deep learning baselines across both retrieval and reconstruction tasks, while ablation experiments confirm the contribution of the bidirectional design and cycle-consistency mechanism. These results demonstrate that the proposed framework is effective for joint atmospheric profile retrieval and hyperspectral infrared radiance modeling, and suggest potential for future Jacobian-related analysis and NWP-oriented extensions.

21.
arXiv (quant-ph) 2026-06-17

Cavity method for permutation models on Cayley trees

arXiv:2606.17751v1 Announce Type: new Abstract: Motivated by permutation statistical models arising in random tensor networks, we study permutation models on a Cayley tree whose variables take values in the symmetric group $\Sn$. The pair interaction is assumed to depend only on the cycle type of the relative permutation. Then the Boltzmann weight is written as a class function on $\Sn$. This property diagonalizes the edge convolution operator in irreducible representation sectors. As a result, the linear stability of the uniform paramagnetic cavity solution is controlled by the character eigenvalue ratios. For cycle-factorized weights, these eigenvalues can be expressed as specializations of Schur functions. We derive the instability criteria and also verify their validity by comparison with direct numerical iterations of the cavity equation.

22.
arXiv (CS.AI) 2026-06-17

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

arXiv:2606.17383v1 Announce Type: cross Abstract: Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black–Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.

23.
arXiv (CS.CV) 2026-06-17

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method positively improves zero-shot action recognition performance over the baseline, highlighting the influence of a heterogeneous deliberation step, showing that the gain stems from decorrelated model priors rather than from additional compute.

24.
medRxiv (Medicine) 2026-06-22

Artificial Intelligence-Enabled Cardiac Function Estimation from Phone Videos of Echocardiograms

Importance: Mobile phone-recorded echocardiogram videos are commonly used in point of care, telemedicine, and resource-limited workflows, but artificial intelligence models for left ventricular ejection fraction (LVEF) estimation have primarily been evaluated on native Digital Imaging and Communications in Medicine (DICOM) videos. Objective: To evaluate whether previously described artificial intelligence models for LVEF estimation retain performance when applied to mobile phone-recorded echocardiographic videos. Design: Multicenter model validation study comparing model-estimated LVEF with clinician reported LVEF. Setting: Three medical centers: Kaiser Permanente Northern California, Beth Israel Deaconess Medical Center through MIMIC-IV-ECHO, and Cedars-Sinai Medical Center. Participants: Source studies with clinician reported LVEF and apical 4-chamber or apical 2-chamber views, yielding 6209 phone-recorded videos from 2648 studies and 2611 patients. Exposures: Mobile phone recording of native echocardiographic videos and fine-tuning of pretrained models using mobile phone-recorded videos from the Kaiser Permanente Northern California training cohort. Main Outcomes and Measures: Mean absolute error in ejection fraction percentage points, R^2 for continuous estimation, and area under the receiver operating characteristic curve for identifying ejection fraction greater than 50%. Results: The study included 6209 mobile phone recorded echocardiographic videos from 2648 studies and 2611 patients; the weighted mean age was 68.4 years, and 1031 patients were male (39.5%). Without phone-video fine-tuning, the primary model achieved a mean absolute error of 7.00 percentage points, coefficient of determination of 0.49, and area under the receiver operating characteristic curve of 0.91 on phone-recorded videos; corresponding native DICOM performance was 6.08 percentage points, 0.60, and 0.93, respectively. On the 2396-video fine-tuning evaluation cohort, fine-tuning improved primary model performance to a mean absolute error of 6.96 percentage points, coefficient of determination of 0.61, and area under the receiver operating characteristic curve of 0.93. Fine-tuning the public EchoNet-Dynamic model improved performance from 9.36 percentage points, 0.37, and 0.84 to 7.86 percentage points, 0.50, and 0.89, respectively. Progressive central zoom preprocessing degraded model performance. Conclusions and Relevance: These findings suggest that artificial intelligence assisted left ventricular ejection fraction estimation from mobile phone-recorded echocardiograms may be feasible when native image export is unavailable, although prospective evaluation is needed before clinical deployment.

25.
arXiv (CS.CV) 2026-06-12

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning~(DP) methods which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distribution. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40\% without sacrificing accuracy on ImageNet-1k with ResNet-50.