Academic Intelligence · Curated Daily

Explore the frontiers of global academic research

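AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.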

01.
arXiv (CS.AI) 2026-06-24

Exploring the relationship between human-centric AI and firm idiosyncratic risks

arXiv:2606.24224v1 Announce Type: new Abstract: Despite the extensive discussions of human-centric AI (HCAI) in Industry 5.0, its effects on firms' idiosyncratic risks (IR) remain underexplored. This is an imperative issue for firms navigating financial risks during the current technological revolution, as IR reflects investor reactions to corporate heterogeneous AI strategies and implementations by isolating firm-level stock volatility from systematic factors. Integrating situated AI theory with socio-technical systems theory, we conceptualise HCAI as a situated AI strategy that reduces AI-related ethical risks and fosters AI-Human synergies in firms' business operations, ultimately reducing IR by aligning with stakeholders' diverse expectations. Moreover, socio-technical factors, namely digitalisation, operational efficiency, executive shareholding, and CEOs with IT background, may moderate the HCAI-IR relationship. Using a multi-source panel dataset of Chinese listed firms from 2015 to 2023, we find that HCAI is associated with lower firm IR. Furthermore, digitalisation and executive shareholding strengthen this risk-reducing effect, whereas operational efficiency and CEOs with IT background surprisingly attenuate it. Our findings offer theoretical contributions and practical insights for both ethical AI governance and firm financial risk management in the AI era.

02.
bioRxiv (Bioinfo) 2026-06-08

TRACEY: an updated resource for SNARE protein domain annotation with improved HMMs and expanded sequence coverage

Motivation: SNARE proteins catalyse membrane fusion across the eukaryotic endomembrane system, from synaptic vesicle exocytosis to intracellular trafficking, endosomal and vacuolar transport, and autophagy, and their accurate domain annotation depends on the quality of profile models and the sequence diversity behind them. The original SNARE domain classification predates the recent expansion of eukaryotic sequence data, leaving its HMM profiles and subgroup coverage unable to resolve divergent and lineage-specific paralogs. Results: We present an updated release of TRACEY built on a resynchronized, non-redundant collection of 18,915 curated SNARE proteins spanning 1,188 species, together with a consolidated set of 83 HMM profiles, including 43 models for newly defined subgroups, reconstructed through an iterative, mixture-model-driven procedure. In direct comparison with the legacy models, at least ~75% of sequences in every overlapping group scored better with the new HMMs, indicating systematic gains in domain detection. A redesigned web interface adds multiparameter querying, FASTA download, and direct scanning of user-submitted sequences against the curated profiles. Availability and implementation: TRACEY is freely available at https://tracey.unil.ch.

03.
arXiv (CS.CV) 2026-06-15

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework (FAST-AR) for FAST-AutoRegressive diffusion, consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to x5 - x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

04.
arXiv (quant-ph) 2026-06-17

Variational Quantum Eigensolver-Based Quantum Bootstrap Embedding for Molecules

arXiv:2606.17095v1 Announce Type: cross Abstract: Simulating strongly correlated molecular systems on near-term quantum hardware remains challenging due to modern hardware's limited quantum volume and moderate-fidelity qubits. One potential way to circumvent this challenge is through bootstrap embedding (BE). Bootstrap embedding breaks molecules into smaller fragments that are then embedded into the "bath" of other fragments in an iterative way. Bootstrap embedding is appealing for quantum simulation because fragmenting the system reduces the qubit requirements for any given fragment. In this work, we develop a quantum bootstrap embedding (QBE) workflow that uses variational quantum eigensolver (VQE) fragment solvers and study the algorithmic choices that determine the overall VQE-QBE algorithm's success. To improve efficiency, we introduce FastAdaptVQE, a sparse matrix-accelerated form of the adaptive variational quantum eigensolver (ADAPT-VQE) that replaces symbolic commutator evaluation with direct statevector linear algebra, and MatrixFreeAdaptVQE, a matrix-free extension that removes the sparse-matrix memory bottleneck that appears when treating larger fragments. We also modify the ADAPT-VQE operator selection step by replacing the purely greedy choice with a look-ahead strategy. Benchmarks on $H_4$ and $F_2$ reach chemical accuracy, within 1 kcal/mol of bootstrap embedding results using a full configuration interaction (FCI) solver. These results show that combining QBE with VQE can accurately calculate energies of molecular systems. This research lays the foundation for extending energy calculations to larger molecular systems and quantum materials on near-term quantum hardware.

05.
arXiv (CS.LG) 2026-06-16

Near-Optimal Stochastic Linear Bandits with Delay

arXiv:2606.16656v1 Announce Type: new Abstract: We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but potentially depends on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension. (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.

06.
arXiv (CS.AI) 2026-06-17

Temporal Preference Optimization for Unsupervised Retrieval

arXiv:2606.17664v1 Announce Type: cross Abstract: Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance, retrieving semantically related but temporally misaligned documents, an important aspect when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which is not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. In experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at https://github.com/agwaBom/TPOUR.
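
For readers unfamiliar with the nDCG@5 metric reported above, the following is a minimal sketch of one common formulation (linear gain, log2 rank discount). It is not taken from the TPOUR codebase. The function names are hypothetical, and the ideal ranking is computed from the supplied relevance list only, which assumes that list contains every judged document for the query.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: the gain at 0-based rank r is divided by log2(r + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    # Normalize by the DCG of the ideal (descending) ordering of the same labels.
    # Assumption: `relevances` covers every judged document for the query.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists graded relevance labels in the order the retriever returned the documents. A ranking that places the only relevant document second scores below one that places it first.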

07.
arXiv (CS.CL) 2026-06-12

SupraBench: A Benchmark for Supramolecular Chemistry

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of Supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source codes and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

08.
medRxiv (Medicine) 2026-06-22

Rare loss-of-function variants in POLD1, PMS1 and FAN1 modify age at onset of motor symptoms in Huntington's disease

Huntington's disease is a rare neurodegenerative disease whose primary risk factors are inherited expansions of a CAG repeat tract in the HTT gene. Somatic expansion of these tracts leads to neuronal toxicity, neuronal death and clinical disease progression. To identify genetic factors with a major impact on disease onset and progression, we genome sequenced 18,825 individuals for the ENROLL-HD study. Our results show rare inactivating mutations in three genes, all involved in DNA damage repair, are major determinants of age of onset for motor symptoms (n=10,610) and other clinical manifestations. Heterozygote carriers of predicted loss-of-function (pLoF) variants in POLD1 and PMS1 developed motor symptoms an average of 20 years (n=3; P=1×10^-5) and 7 years (n=6; P=2×10^-3) later than non-carriers, respectively. Conversely, heterozygote carriers of pLoF variants in FAN1 (n=30) developed symptoms 10 years earlier (P=2×10^-10). Our findings highlight therapeutic strategies and help predict age of onset for at-risk individuals.

09.
arXiv (CS.CL) 2026-06-16

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight – 3.2x the collapse observed in text-only self-evaluation – while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion – the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (N=53 total independent repetitions, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T->V}=1.176, gamma_{V->T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); high round counts (DashScope, 50 rounds) cause collapse to single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity – 97% of runs (N=30, DeepSeek-chat) yield exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

10.
arXiv (CS.AI) 2026-06-17

Optimism Stabilizes Thompson Sampling for Adaptive Inference

arXiv:2602.06014v2 Announce Type: replace-cross Abstract: Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by \citet{halder2025stable}, using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
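
As a rough illustration of the setting described above, the sketch below simulates Gaussian Thompson sampling with an inflated index variance and then forms a Wald interval from an arm's sample mean. It is not the authors' algorithm or analysis. The `inflation` multiplier, unit-variance Gaussian rewards, the one-pull-per-arm initialization, and all function names are assumptions made for illustration.

```python
import math
import random

def variance_inflated_ts(means, horizon, inflation=2.0, seed=0):
    """Simulate Gaussian Thompson sampling; `means` is used only to draw rewards.

    `inflation` is a hypothetical multiplier on the index variance
    (1.0 would correspond to plain Gaussian Thompson sampling).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each sample mean is defined
        else:
            # Randomized index: sample mean plus Gaussian noise with variance inflation / n_a.
            indices = [
                rng.gauss(sums[a] / counts[a], math.sqrt(inflation / counts[a]))
                for a in range(k)
            ]
            arm = max(range(k), key=indices.__getitem__)
        reward = rng.gauss(means[arm], 1.0)  # assumed unit-variance Gaussian rewards
        counts[arm] += 1
        sums[arm] += reward
    estimates = [sums[a] / counts[a] for a in range(k)]
    return counts, estimates

def wald_interval(estimate, n, sigma=1.0, z=1.96):
    # Standard Wald interval: estimate +/- z * sigma / sqrt(n).
    half_width = z * sigma / math.sqrt(n)
    return estimate - half_width, estimate + half_width
```

Calling `variance_inflated_ts([0.0, 1.0], 2000)` returns per-arm pull counts and sample means, and `wald_interval(estimates[a], counts[a])` gives the corresponding interval. Whether such intervals are asymptotically valid under adaptive sampling is precisely the question the paper addresses, so the sketch shows only the mechanics.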

11.
arXiv (CS.AI) 2026-06-24

Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation

arXiv:2606.24042v1 Announce Type: new Abstract: Recommender systems often induce filter bubbles and semantic homogenization by monolithically optimizing for immediate user engagement. Standard single-objective models, including traditional Deep Q-Networks, are ill-equipped to navigate the trade-offs between platform retention and critical societal values like information diversity and provider fairness. To address these limitations, we introduce a multi-objective reinforcement learning framework that formalizes recommendation as a semantic multi-objective Markov decision process. By integrating high-fidelity semantic embeddings with a Pareto-DQN agent, our architecture treats engagement, diversity, and fairness as distinct, non-aggregable reward signals, avoiding the pitfalls of static reward scalarization. Empirical evaluations on the MovieLens small dataset show that our hypervolume-based action selection disrupts the feedback loops responsible for semantic collapse. By sustaining high state-trajectory variance, the Pareto-DQN effectively maps the Pareto frontier, achieving gains in auxiliary societal objectives with only marginal impacts on engagement. This work provides a path toward intrinsically aligned, responsible recommender systems.

12.
arXiv (CS.AI) 2026-06-17

Membership Inference Attacks against Large Audio Language Models

arXiv:2603.28378v2 Announce Type: replace-cross Abstract: We present the first systematic Membership Inference Attack (MIA) evaluation of LALMs. Using Multi-modal Blind Baselines based on textual, spectral and prosodic features, we demonstrate that common audio datasets exhibit near-perfect train/test separability (AUC ~ 1.0) even without model inference, thus MIA may primarily detect distribution shift. We therefore introduce a blind-baseline protocol to control for this confound. Under this protocol, we identify that the distribution-matched datasets enable reliable MIA evaluation without distribution-shift artifacts. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations. Our codebase is available at https://github.com/snooow1029/ALM_MIA.

13.
arXiv (CS.AI) 2026-06-24

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

arXiv:2606.24231v1 Announce Type: new Abstract: Multimodal driving planning faces a long-standing tension between two paradigms: scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet suffer from sparse supervision constrained to a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. By learning the reward-conditioned action distribution from dense trajectory-reward pairs with a flow-matching decoder, FlowR2A unifies the dense supervision of scoring-based methods with the proposal generation of anchor-based methods in a single generative model, forcing the model to internalize the correlation between an action and its outcomes in safety, progress, comfort, and rule compliance. To balance hard safety constraints against soft progress objectives, we introduce fine-grained per-timestep reward conditioning and reward noise augmentation. The generative formulation naturally supports controllable test-time sampling via reward guidance and anchored sampling, producing high-quality proposals. FlowR2A achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, with multimodal proposals of substantially higher quality than prior methods.

14.
arXiv (CS.LG) 2026-06-18

Generative models for decision-making under distributional shift

arXiv:2604.04342v2 Announce Type: replace Abstract: Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.

15.
arXiv (CS.CL) 2026-06-24

Are We Ready For An Agent-Native Memory System?

Memory for large language model (LLM) agents has rapidly evolved from simple retrieval-augmented mechanisms into a data management system that supports persistent information storage, retrieval, update, consolidation, and dynamic lifecycle governance throughout agent execution. Despite this evolution, existing evaluations still benchmark agent memory mainly through end-to-end task success metrics (e.g., F1, BLEU), while treating the underlying system as a monolithic black box. As a result, critical system-level concerns, including operational costs, architectural trade-offs across memory modules, and robustness under dynamic knowledge updates, remain insufficiently explored. In this paper, we present a systematic experimental study of agent memory from a data management perspective. We propose an analytical framework that decomposes agent memory into four core modules: memory representation and storage, extraction, retrieval and routing, and maintenance. Under this framework, we evaluate 12 representative memory systems and two reference baselines across five benchmark workloads spanning 11 datasets. Our extensive end-to-end evaluation shows that no single architecture dominates across all scenarios; instead, effectiveness depends heavily on how well the memory structure aligns with the workload bottleneck. Furthermore, through fine-grained ablation studies, we quantify their individual effects on representation fidelity, retrieval precision, update correctness, and long-horizon stability. Finally, we reveal cost-performance trade-offs under realistic workloads, showing localized maintenance is more cost-efficient than global reorganization. Based on these findings, we identify promising directions towards building truly agent-native memory systems. The code is publicly available at https://github.com/OpenDataBox/MemoryData.

16.
arXiv (CS.LG) 2026-06-11

Learning Object Manipulation from Scratch via Contrastive Interaction

arXiv:2606.11525v1 Announce Type: cross Abstract: Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: IWR-arxiv.github.io.

17.
arXiv (CS.LG) 2026-06-16

Dynamic Link Prediction with Temporally Enhanced Signed Graph Neural Networks

arXiv:2605.26290v2 Announce Type: replace Abstract: Temporal signed networks (TSNs) model the time evolution of cooperative and adversarial relationships that arise in applications such as social media analysis, trust and reputation systems, and financial transaction networks. While graph neural networks (GNNs) perform well for static or unsigned link prediction, effective learning in temporal signed graphs remains challenging due to the interaction of signed relations, evolving structure, and balance-theoretic constraints. To address this gap, we propose a modular temporal enhancement framework for signed GNNs that integrates historical context into otherwise static architectures. The framework introduces a Historical Context Integration Module (HCIM) that combines learnable recency-aware temporal weighting, LSTM-based embedding trajectory modeling, and multi-head temporal attention to capture both short- and long-term signed interaction dynamics. Historical information is fused with current node representations using either global or node-adaptive weighting, allowing the architecture-agnostic framework to accommodate heterogeneous temporal behaviors. We instantiate the approach on the Self-Explainable Signed Graph Transformer (SE-SGformer), preserving interpretability while extending it with temporal awareness. Experiments on real-world and synthetic TSNs, including Bitcoin OTC, Bitcoin Alpha, Reddit, and small-world network models, demonstrate consistent and statistically significant improvements over the static baseline.

18.
medRxiv (Medicine) 2026-06-23

Associations Among Changes in Inflammatory Biomarkers, Pain Intensity, and Health-Related Quality of Life Following a 12-Week Aerobic Exercise Programme in Individuals with Non-Specific Chronic Low Back Pain

Abstract Background: Non-specific chronic low back pain (NSCLBP) is associated with persistent pain, reduced health-related quality of life (HRQoL), and low-grade systemic inflammation. This study examined associations among changes in inflammatory biomarkers, pain intensity, and HRQoL following a 12-week aerobic exercise programme. Methods: This secondary analysis used data from a randomized controlled trial involving 41 participants with NSCLBP (intervention, n = 21; control, n = 20). Participants received either supervised aerobic exercise plus health education or health education alone for 12 weeks. Change scores for tumour necrosis factor-alpha (TNF-α), interleukin-6 (IL-6), high-sensitivity C-reactive protein (hs-CRP), pain intensity, and HRQoL domains were analysed using correlation and multiple regression analyses. Results: Improvements in IL-6 (r = 0.434, p = 0.005) and hs-CRP (r = 0.444, p = 0.004) were significantly associated with improvements in pain intensity. No significant associations were observed between biomarker changes and HRQoL domains. Treatment allocation was the strongest independent predictor of improvement in physical HRQoL (β = 0.492, p = 0.017) and pain intensity (β = -0.512, p = 0.006). Conclusions: Improvements in IL-6 and hs-CRP were associated with reductions in pain intensity but not with improvements in HRQoL. Treatment allocation was the strongest predictor of clinical improvement, suggesting that mechanisms beyond systemic inflammation may contribute to the benefits of aerobic exercise in NSCLBP. Keywords: non-specific chronic low back pain; aerobic exercise; inflammation; interleukin-6; high-sensitivity C-reactive protein; pain intensity; health-related quality of life.

19.
arXiv (CS.AI) 2026-06-11

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

arXiv:2606.11640v1 Announce Type: cross Abstract: Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide a structural and semantic prior for constructing a semantic graph. A semantic graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes the structural and semantic prior by constructing and refining a task-adaptive semantic graph from this prior, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationship between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by the hallucination of LLMs, TAROT introduces Task-adaptive Semantic Graph Refinement that prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

20.
arXiv (CS.CL) 2026-06-16

A Unified Definition of Hallucination: It's The World Model, Stupid!

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

21.
arXiv (CS.AI) 2026-06-18

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

arXiv:2606.19004v1 Announce Type: cross Abstract: Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69–77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4$-$6.4\times$ while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution $512\times512$ and $1280\times1280$.

22.
arXiv (CS.AI) 2026-06-19

Latent Confounded Causal Discovery via Lie Bracket Geometry

arXiv:2606.19610v1 Announce Type: cross Abstract: Recent work on Kan-Do-Calculus (KDC) has established that the boundary between passive observation and active intervention in causal inference is a category-theoretic bi-adjunction, with interventions modeled by left Kan extensions and conditioning by right Kan extensions. This paper introduces two causal discovery algorithms under latent confounding, building on the information-geometric and categorical consequences of KDC. In smooth statistical settings, Radon-Nikodym derivatives between observational and interventional measures induce local causal vector fields; failures of these fields to close under Lie brackets become computable Frobenius residuals, which we interpret as witnesses of failed visible integrability and possible latent or unmodeled structure. Our first algorithm, BRIDGE (Bracket Residuals for Interventional Discovery and Geometric Estimation), combines an interventional density or Radon-Nikodym-ratio engine with a geometric screen that proposes a high-recall family of admissible arrows, identifies non-closing visible pairs as latent-obstruction candidates, and passes the reduced family to downstream score-based or differentiable discovery routines. The second algorithmic contribution, Spectral Kan-Do Flow Matching (SKFM), learns amortized intervention fields and factors latent curvature spectrally, exposing the direct Lie-space endpoint toward which BRIDGE points. A detailed set of experiments shows that both algorithms are capable of discovering causal models with latent confounders while collapsing the super-exponential space of possible DAGs by many orders of magnitude. This paper introduces a new paradigm in causal discovery, where latent structure is inferred directly from the geometry of intervention-induced flows.

23.
arXiv (CS.LG) 2026-06-16

AI for Social Good: An Investigation of the Causal Relationship Between Environmental Regulations and Their Effects on Air Pollution in London, UK

arXiv:2606.15257v1 Announce Type: new Abstract: Air pollution regulation is central to urban public health governance, but estimating its effects is difficult because policies are implemented non-randomly and pollution trajectories are shaped by meteorology, socioeconomic change, temporal trends, and overlapping interventions. This study develops an uncertainty-aware Bayesian deep learning framework to estimate the aggregate effect of air pollution regulations on PM$_{2.5}$ concentrations in London from 2010 to 2020. The framework integrates daily PM$_{2.5}$ observations from Inner London monitoring stations, meteorological covariates, annual socioeconomic indicators, month-of-year and day-of-week indicators, and daily regulation status data for 32 policy measures. A Bayesian LSTM captures temporal dependencies in environmental and socioeconomic covariates, Bayesian embedding layers represent temporal and regulation status inputs, and a regulation status prediction branch supports propensity score-based adjustment for non-random policy implementation. Regulatory effects are estimated by comparing observed PM$_{2.5}$ concentrations with counterfactual predictions under a hypothetical no-regulation scenario, with uncertainty summarized across repeated Bayesian training runs and bootstrap resampling. Results show that London's regulations were associated with an average PM$_{2.5}$ reduction of 1.88 $\mu$g/m$^3$, a relative reduction of 12.35%, with a 95% confidence interval of 1.64-2.12 $\mu$g/m$^3$. Estimated effects were limited before 2013, became clearer from 2013 to 2017, and were strongest in 2018 and 2019. The findings suggest that sustained and cumulative regulatory interventions contributed to measurable improvements in London's air quality. This study demonstrates how uncertainty-aware causal AI can support environmental accountability, public health protection, and evidence-based governance for environmental decision-making.

24.
arXiv (CS.AI) 2026-06-16

SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation

arXiv:2606.16244v1 Announce Type: cross Abstract: Large language models routinely generate code with exploitable security flaws. Prior literature attributes this limitation to a lack of security expertise, steering current defense mechanisms toward heavy fine-tuning or external knowledge retrieval, which introduces significant computational overhead and data bias through redundant code examples. Contrary to this view, we argue that pretraining corpora are already rich in security material. The bottleneck is activation: without an explicit and brief cue, statistical pressure toward common training-distribution patterns suppresses the model's safety-relevant representations. We present SPARK, an inference-time security harness that activates this latent knowledge without any retraining. The harness has two parts. Component I retrieves a few of the relevant Common Weakness Enumeration (CWE) entries for each coding task and appends a short structured cue to the prompt; this alone is enough to surface the model's existing security representations. Component II adds a precomputed token bias to the logits at every decoding step. We obtain the bias by projecting a safe-direction vector, the unit difference between the mean safe and mean unsafe last-layer hidden states, through the language model head. The bias is computed once offline; applying it costs a single vector addition per generated token. We evaluate SPARK on 9 open-source models across C++, Java, and Python, and compare with 7 baselines spanning fine-tuning and retrieval-augmented methods. SPARK matches or improves on the best baseline in every setting while preserving HumanEval utility. We further test Component I in a black-box setting on 7 of today's strongest models, including Claude, DeepSeek, and GPT, demonstrating the bottleneck of insecure code generation and the improvements enabled by our method.
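
The logit-bias computation described for Component II can be sketched in a few lines. This is an illustration of the arithmetic as stated in the abstract, not the authors' code. The toy hidden states, the tiny LM-head matrix, and the `scale` parameter are assumptions introduced here (the abstract does not say whether the bias is scaled).

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def safe_direction(safe_states, unsafe_states):
    """Unit difference between the mean safe and mean unsafe hidden states."""
    diff = [s - u for s, u in zip(mean_vector(safe_states), mean_vector(unsafe_states))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean safe and unsafe states coincide; direction undefined")
    return [d / norm for d in diff]


def token_bias(lm_head, direction, scale=1.0):
    """Project the direction through the LM head (vocab_size x hidden_dim rows).

    `scale` is a hypothetical strength knob, not taken from the abstract.
    """
    return [scale * sum(w * d for w, d in zip(row, direction)) for row in lm_head]


def apply_bias(logits, bias):
    """Per-step cost: one vector addition over the vocabulary."""
    return [l + b for l, b in zip(logits, bias)]


# Toy data (assumed): hidden_dim = 2, vocab_size = 3.
safe_states = [[1.0, 0.0], [3.0, 0.0]]
unsafe_states = [[0.0, 0.0], [0.0, 0.0]]
lm_head = [[2.0, 0.0], [-1.0, 5.0], [0.0, 1.0]]

direction = safe_direction(safe_states, unsafe_states)   # computed once, offline
bias = token_bias(lm_head, direction)                    # computed once, offline
biased_logits = apply_bias([0.5, 0.5, 0.5], bias)        # repeated at every decoding step
print(direction, bias, biased_logits)
```

Only `apply_bias` runs during decoding, which matches the abstract's claim that applying the bias costs a single vector addition per generated token.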

25.
arXiv (CS.LG) 2026-06-12

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

arXiv:2606.12639v1 Announce Type: new Abstract: Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa – the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner – to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
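
The non-parametric retrieval stage mentioned in the abstract (a Tanimoto-weighted average over a held-out compound's nearest training compounds) can be illustrated with a small sketch. The fingerprints here are toy sets of on-bit indices rather than real Morgan fingerprints. The choice of `k` and the fallback to the mean training response when no neighbour overlaps are assumptions, not details from the paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_predict(query_fp, train_fps, train_responses, k=2):
    """Predict a per-gene response as the similarity-weighted mean of the k nearest neighbours."""
    sims = [tanimoto(query_fp, fp) for fp in train_fps]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    total = sum(sims[i] for i in top)
    n_genes = len(train_responses[0])
    if total == 0.0:
        # Assumed fallback: no structural overlap, so use the mean training response.
        return [
            sum(r[g] for r in train_responses) / len(train_responses)
            for g in range(n_genes)
        ]
    return [
        sum(sims[i] * train_responses[i][g] for i in top) / total
        for g in range(n_genes)
    ]


# Toy training set (hypothetical): three compounds, two genes.
train_fps = [{1, 2, 3, 4}, {1, 2}, {9}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

pred = retrieve_predict({1, 2, 3}, train_fps, train_responses, k=2)
print(pred)
```

The query shares bits with the first two compounds only, so the third compound contributes nothing and the prediction is a blend of the first two responses, weighted by similarities 3/4 and 2/3.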