Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-19

Exit-and-Join Dynamics for Decentralized Coalition Formation

Authors:

arXiv:2606.19683v1 Announce Type: new Abstract: This paper studies coalition formation as a decentralized dynamical process driven by unilateral exit-and-join decisions. Agents evaluate local moves using the Aumann-Dreze value, so payoffs are computed within the agent's current coalition rather than through a globally negotiated coalition structure. The resulting model links cooperative payoff allocation with noncooperative best-response behavior: a terminal partition is precisely a coalition structure with no admissible, individually profitable exit-and-join deviation. We establish equilibrium characterizations, identify conditions under which the dynamics admit scalar Lyapunov or exact-potential representations, and analyze how switching and acceptance costs shape local stability. Numerical experiments test finite-time stabilization, cost sensitivity, and a special convex-game benchmark.

02.
arXiv (CS.AI) 2026-06-19

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

arXiv:2509.19658v2 Announce Type: replace-cross Abstract: In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn – a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving improved generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test-time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

04.
arXiv (CS.LG) 2026-06-16

Priority-Aware Shapley Value

arXiv:2602.09326v2 Announce Type: replace Abstract: Shapley values are widely used for model-agnostic data valuation and feature attribution, yet they implicitly assume contributors are interchangeable. This can be problematic when contributors are dependent (e.g., reused/augmented data or causal feature orderings) or when contributions should be adjusted by factors such as trust or risk. We propose Priority-Aware Shapley Value (PASV), which incorporates both hard precedence constraints and soft, contributor-specific priority weights. PASV is applicable to general precedence structures, recovers precedence-only and weight-only Shapley variants as special cases, and is uniquely characterized by natural axioms. We develop an efficient adjacent-swap Metropolis-Hastings sampler for scalable Monte Carlo estimation and analyze limiting regimes induced by extreme priority weights. Experiments on data valuation (MNIST/CIFAR10) and feature attribution (Census Income) demonstrate more structure-faithful allocations and a practical sensitivity analysis via our proposed "priority sweeping".

05.
arXiv (CS.LG) 2026-06-15

Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

arXiv:2606.14334v1 Announce Type: new Abstract: High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant cost, amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of $k$-NN-based diffusion geometry estimators, while enabling amortized inference that is up to $400\times$ faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

06.
PLOS Computational Biology 2026-06-18

A comparison of contact patterns derived from the population structure in agent-based models and empirical contact survey data

Authors:

by Janik Suer, Johannes Ponge, Michael Brüggemann, Jan Pablo Burgard, Vitaly Belik, Bernd Hellingrath, Alejandra Rincón Hidalgo, Andrzej K. Jarynowski, Richard Pastor, Huynh Thi Phuong, Steven Schulz, Ashish Thampi, Chao Xu, Marlli Zambrano, Rafael Mikolajczyk, André Karch, Veronika K. Jaeger, on behalf of the OptimAgent Consortium Agent-based models (ABMs) are powerful tools for simulating disease spread, relying on individual-level interaction rules from which emergent dynamics arise. An important component in ABMs is contact behaviour. To reduce computational complexity, contact behaviour in ABMs is often assumed as random mixing within structurally defined settings (as, e.g., workplaces). with setting composition typically based on empirical data such as census information. However, the validity of this approach to represent contacts remains unclear. To address this gap, we compare the contact structure derived through this approach in a large-scale ABM with empirical contact survey data with respect to age contact matrices for households, schools, workplaces, all remaining contact settings, and all contacts combined (based on difference matrices and sum of squared errors (SSE)). Our results demonstrate that random mixing in settings with known age compositions like households (SSE:0.7(95%CI0.4–0.9)), schools (SSE:0.7(95%CI:0.3–1.1)) and workplaces (SSE:0.5(95%CI:0.2-0.7)), captures basic interaction patterns but fails to account for age-related variation in contact numbers. The largest differences arise for contacts outside these settings (SSE:3.8(95%CI:1.2–6.5)), as ABMs typically use random regional contacts that do not capture age-structured behaviour observed in contact surveys. Applying contact matrices from both approaches to an age-structured compartmental model, leads to noticeable differences in simulated epidemic outcomes regarding reproduction numbers and spreading dynamics between age groups. Our results suggest that naïve approaches to represent contact behaviour in ABMs based on population structure can be valid in settings with defined age-structures while settings with low a priori structure require more advanced methods to represent contact behaviour observed in contact surveys.

07.
arXiv (CS.CV) 2026-06-18

RUB: Evaluating Residual Knowledge in Unlearned Models

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

08.
arXiv (CS.LG) 2026-06-16

Hidden Degradation Costs in Energy-Cost-Only HEMS Optimisation: Study on Battery and PV Sensitivity

arXiv:2606.16051v1 Announce Type: cross Abstract: Residential battery energy storage systems (BESS) are increasingly deployed alongside photovoltaic (PV) generation to reduce household energy costs under volatile time-of-use (TOU) tariffs. Model predictive control (MPC) is a widely adopted optimisation strategy for home energy management systems (HEMS), typically formulated to minimise net energy cost, subject to physical and operational constraints. However, battery degradation is rarely embedded in the optimisation objective, meaning its cost is unquantified and aggressive; high-cycle-count strategies could incur significant losses once deployed to physical systems. This paper presents a receding-horizon mixed-integer linear programming (MILP) baseline for a UK residential HEMS, using demand data from the REFIT dataset. A 3 by 3 sensitivity study is conducted across three battery sizes and three PV array sizes, with post-hoc degradation cost estimated using the Naumann stress model and rainflow cycle counting. Results show that degradation remains constant for each battery size and can exceed energy cost savings by up to 1,060 %. These results demonstrate that energy-cost-only optimisation systematically underestimates the true system cost, motivating a degradation-aware control formulation.

09.
arXiv (CS.CL) 2026-06-19

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

Authors:

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials partition function, free energy, spectral entropy, heat capacity together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i)~Lipschitz stability of Fes under attention perturbation; (ii)~an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii)~a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.

10.
arXiv (CS.AI) 2026-06-11

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

arXiv:2605.23243v2 Announce Type: replace-cross Abstract: We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex~5.3, Claude Opus~4.6, Sonnet~4.6, Gemini~3.1~Pro and Gemini~3~Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1)~every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2)~in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3)~structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4)~a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces end-to-end request/response sequences, failure-heavy data, and multi-step attack chains as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

11.
arXiv (CS.CV) 2026-06-16

UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at https://github.com/romiyal/UtVAA

12.
arXiv (CS.AI) 2026-06-19

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

arXiv:2606.20074v1 Announce Type: cross Abstract: Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assessing clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding respectively, supporting FMs for scalable EEG monitoring in ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy with respect to frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretraining FM representations when adapted to burst detection with limited labeled data.

13.
medRxiv (Medicine) 2026-06-18

Maternal and fetal HLA heterozygosity in preeclampsia: Insights from a large multi-ancestry pregnancy cohort

Preeclampsia (PE) is a leading cause of maternal and neonatal morbidity, with immune dysregulation at the maternal-fetal interface central to its pathogenesis. The highly polymorphic human leukocyte antigen (HLA) region mediates maternal immune tolerance of the semi-allogeneic fetus, yet the contribution of HLA diversity to PE risk remains poorly defined. Whether the HLA heterozygote advantage observed in other immune disorders is relevant to PE has not been systematically evaluated. Using data from the multi-ancestry TOPMed Boston-Colombia Collaborative for Adverse Pregnancy Outcomes (n = 12,790; 4,770 PE, 8,020 controls; 10,808 maternal, 1,982 fetal, including 1,848 pairs), we evaluated associations between heterozygosity across eight classical HLA loci and PE and four sub-phenotypes, adjusting for genetic ancestry. HLA heterozygosity was common across most loci (>80%). No individual maternal HLA locus was associated with overall PE; however, heterozygosity across class I loci showed a protective effect in preterm PE (OR=0.82, 95%CI:0.69-0.97), with a similar pattern for HLA-A heterozygosity (OR=0.78, 95%CI:0.64-0.96). In contrast, fetal heterozygosity at HLA-DQB1 was nominally associated with increased risk of PE (OR=1.36, 95%CI:1.03-1.79) and preterm PE (OR=1.73, 95%CI:1.13-2.73). No individual maternal or fetal HLA alleles were associated with PE. Maternal-fetal mismatch analysis demonstrated locus-specific associations with preterm PE, including increased risk with HLA-DQA1 mismatch and reduced risk with HLA-C mismatch. These findings highlight distinct maternal and fetal immunogenetic contributions to PE risk and underscore the importance of considering HLA diversity-rather than individual alleles alone-in studies of PE etiology.

14.
arXiv (CS.LG) 2026-06-16

Context-Aware Markov VAE for CSI Compression in Wireless Systems

arXiv:2606.16607v1 Announce Type: cross Abstract: This paper considers neural channel state information (CSI) compression for time-varying massive multiple-input multiple-output (MIMO) channels in frequency division duplex (FDD) systems with limited feedback resources. The main challenge lies in obtaining a compact and efficient representation of the CSI given that it exhibits strong temporal correlation across successive snapshots. Existing memoryless compression models do not exploit this property, while simple temporal extensions often incorporate multiple observations without explicitly modeling the latent dynamics. We propose a context-aware compression framework based on a k-memory Markov variational autoencoder (k-MMVAE), which uses a finite temporal window to capture the evolution of CSI in the latent space. The model introduces Markov-structured latent dynamics with finite memory, enabling efficient use of temporal dependencies for compression. Simulation results show that the proposed approach improves target CSI reconstruction performance compared to memoryless and weakly sequential baselines, particularly at low and moderate compression rates. These results suggest that explicit latent temporal modeling can provide an effective mechanism for CSI compression under limited feedback constraints.

15.
arXiv (CS.LG) 2026-06-16

The Algebra of Units: From Buckingham's Pi-grec Theorem to Latent-Variable Learning

arXiv:2606.16737v1 Announce Type: cross Abstract: Engineers often measure many quantities-speed, pressure, temperature, length-expressed in different physical units. The Buckingham Pi-grec theorem states that these variables can always be combined into a smaller set of dimensionless numbers whose values fully determine the system's behaviour. Identifying the appropriate dimensionless groups has traditionally required expert knowledge and physical insight. This paper shows that they can instead be discovered automatically from data, without prior knowledge of the governing physics. The key observation is that, after logarithmic transformation, measurements collected under different scalings of the same system lie on a low-dimensional manifold whose geometry is determined by the underlying dimensionless groups. Singular value decomposition (SVD) identifies this manifold directly from data. A subsequent search over integer-exponent combinations recovers candidate dimensionless quantities, while a repeating-variable filter retains only those constructed from the machine's characteristic scales. This procedure recovers familiar engineering groups, including the flow coefficient, head coefficient, and Mach number, while excluding equivalent but less interpretable alternatives. The method is demonstrated on a synthetic compressor dataset containing 16,000 measurements. Starting from raw dimensional variables and no physics input, it recovers the correct dimensionless groups to numerical precision and reproduces the compressor performance map with an error below 0.01%. More broadly, the work reveals a close connection between classical dimensional analysis and modern data-driven learning. Both rely on the same underlying algebraic structure, suggesting new approaches for building physical models that are simultaneously interpretable, scalable, and data-efficient.

16.
PLOS Computational Biology 2026-06-05

StPedf: Cell trajectory inference of spatial transcriptomics via spatial proximity embedding and spatial density-adaptive fusion

Authors:

by Yuan Zhang, Ziyan Sun, Zhixin Shi, Mengdi Nan, Yuhan Fu, Qing Ren, Jie Gao Spatial transcriptomics is transforming our multidimensional understanding of cellular spatial organization and its functional mechanisms in processes such as development and disease by systematically resolving the spatial heterogeneity of gene expression within tissues. To delve deeper into the dynamic processes underlying spatial expression patterns, spatial trajectory inference integrates genetic and spatial information to reconstruct the spatial developmental trajectories of cells within tissues. This approach reveals the patterns of differentiation and dynamic changes as cellular states evolve continuously along spatial axes. However, existing methods often struggle to uniformly model the complex, nonlinear interactions between high-dimensional gene expression and spatial coordinates. Here, we introduce StPedf, whose core lies in employing a neural network with a masking mechanism to capture complex nonlinear interactions between high-dimensional genes and spatial positions. It further leverages spatial proximity information as a guiding cue, dynamically and adaptively adjusting the embedding of gene and spatial information and the weighting of spatial proximity information based on spatial density. This enables trajectory inference guided by spatial information. This enables optimal transport to derive intercellular transition matrices, reconstruct cellular differentiation trajectories, and construct pseudo-spatiotemporal maps. StPedf demonstrates superior performance over existing methods on five structurally distinct simulated datasets. Using StPedf, we successfully mapped distinct lineages in the spatial trajectories of telencephalon regeneration in the Ambystoma mexicanum, multiple malignant lineages expanding within primary tumors, and developmental spatial trajectories and pseudo-spatiotemporal maps in human dorsolateral prefrontal cortex (DLPFC). StPedf significantly enhances the accuracy and interpretability of spatial trajectory inference, providing critical technical support for revealing the dynamic patterns of cellular fate transitions within tissue microenvironments.

17.
arXiv (CS.AI) 2026-06-18

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

arXiv:2606.18266v1 Announce Type: cross Abstract: EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in acinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design, frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

18.
medRxiv (Medicine) 2026-06-22

The Protective Role of Belonging and Socioeconomic Status in Dropout Intent Among Minority Ethnic Students: A Mixed Methods Study

Improving minority ethnic student retention is a global higher education priority. This mixed-methods study investigated how institutional belonging and socioeconomic status interact to shape dropout intentions among minority university students in the UK (N = 182). Quantitative results revealed that perceived course difficulty and lower subjective socioeconomic status were the strongest predictors of dropout intent. While the interaction between socioeconomic status and difficulty was non-significant, qualitative accounts showed distinct structural vulnerabilities. Financial strain restricted social integration, turning socioeconomic disparities into campus isolation. Conversely, representative curricula, diverse peer networks, and stable cultural in-groups (e.g., religious affiliations, living in the parental home) functioned as essential psychological buffers against academic exhaustion and alienation. Universities must shift from transactional models to sustained structural equity to protect vulnerable student groups.

19.
arXiv (CS.CL) 2026-06-15

SuperThoughts: Reasoning Tokens in Superposition

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stability and fail to scale to complex, long-horizon tasks due to lack of supervision signal. We propose SuperThoughts, which compresses pairs of consecutive CoT tokens into single latent representations and decodes two tokens per step via a lightweight Multi-Token Prediction (MTP) module. This preserves discrete token supervision at training time while doubling throughput at inference time. We finetune Qwen2.5-Math-1.5B-Instruct, Qwen2.5-Math-7B-Instruct, Qwen2.5-Math-14B-Instruct, and evaluate on MATH500, AMC, OlympiadBench, and GPQA-Diamond. With a confidence-based adaptive mechanism that falls back to standard decoding when uncertain, SuperThoughts achieves $\sim$20–30\% CoT length reduction while maintaining accuracy with minimal degradation (1-2 points accuracy drop on most tasks).

20.
arXiv (CS.LG) 2026-06-18

Optimal scenario design for climate emulation

arXiv:2606.19302v1 Announce Type: cross Abstract: As deep learning for physical systems continues to grow in popularity, efforts to improve generalizability have primarily focused on designing architectures that embed physical constraints. However, for machine-learning surrogate climate models (emulators), we show that the low structural diversity in existing scenarios commonly used to generate training data places a ceiling on predictive skill. Here, we examine whether training datasets themselves can be optimized to improve generalization. We introduce a method to create datasets that produce emulators capable of generalizing to new, structurally different scenarios absent from the training data. We use a differentiable Simple Climate Model (SCM) to calculate the sensitivity of emulator loss to perturbations in the training data, iteratively updating the training data to maximize emulator skill. For an SCM, training on one scenario optimized in this fashion outperforms an emulator trained on six standard ScenarioMIP pathways. We achieve this higher predictive skill despite training on a smaller dataset, finding that our emulator successfully isolates distinct physical behaviors of different climate forcing agents (e.g., greenhouse gases vs. aerosols) without single-forcing runs. We then demonstrate that scenarios optimized using an SCM, when used to drive an intermediate-complexity climate model, produce a training dataset that yields a more skillful emulator than training on ScenarioMIP outputs. Our results suggest that, in the compute-constrained environment of running full-scale climate models, generating a small number of dynamically rich scenarios provides greater marginal value for emulation and characterizing system responses than expanding the suite of traditional emissions pathways.

21.
arXiv (CS.CV) 2026-06-17

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

Continual Test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm – on-demand TTA – which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable and even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.

22.
arXiv (CS.AI) 2026-06-17

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

arXiv:2606.17383v1 Announce Type: cross Abstract: Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black–Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.

23.
arXiv (CS.CL) 2026-06-19

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

Target Safety Assessment (TSA) requires systematic integration of genetic, transcriptomic, target homology, pharmacological, and clinical data to evaluate potential safety liabilities of therapeutic targets. This process is labor-intensive and expert-dependent, posing challenges in scalability and reproducibility. We present TSAssistant, a human-in-the-loop multi-agent framework that decomposes TSA report generation into a workflow of specialized subagents: Research Subagents that each ground and cite a single TSA domain, and Synthesis Subagents that integrate findings across domains. Subagents retrieve and synthesize evidence from curated biomedical sources through standardized tool interfaces and produce individually citable, evidence-grounded sections, with behavior shaped by a hierarchical instruction architecture that separates coordination logic from domain expertise and user intent. To complement these soft constraints, programmatic execution hooks and persistent memory stores enforce hard constraints across the workflow, while an interactive refinement loop allows experts to review and revise individual sections with full conversational context preserved across iterations. Rather than a single holistic comparison, we decompose report quality into reproducibility, evidential grounding, task-level accuracy, and controllability under expert oversight, finding high reproducibility and grounding, substantial agreement with the human reference, and net-positive expert-driven refinement.

24.
arXiv (CS.CV) 2026-06-16

When the Past Matters: FlashBack Memory for Precipitation Nowcasting

Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.

25.
arXiv (CS.AI) 2026-06-16

AI Engram: In Search of Memory Traces in Artificial Intelligence

arXiv:2606.14997v1 Announce Type: new Abstract: Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such "AI engrams" by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.