Paper Plaza - AcademicHub

01.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.12411

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

Authors:

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

02.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.15467

Hardy and Cabello Arguments in Spatial and Temporal Frauchiger-Renner Scenarios

Authors:

Ehsan Erfani Maharat, Ali Ahanj, Mohsen Sarbishaei

We investigate Hardy- and Cabello-type logical structures within spatial and temporal extensions of the Frauchiger–Renner (FR) framework, embedding these constructions directly into the FR multi-observer architecture. In the spatial multi-observer scenario, both Hardy and Cabello contradictions arise, with the Cabello construction yielding the stronger violation, $\Delta_{\mathrm{Cabello}}^{\max}=0.1078$, which exceeds the maximal Hardy probability $P_{H}^{\max}=\frac{5\sqrt{5}-11}{2}\approx 0.09017$. We then develop a sequential temporal FR protocol based on coherent multi-observer measurements performed on a single spin-$\tfrac12$ system. In this temporal setting, the Hardy contradiction disappears identically due to dynamical constraints imposed by sequential state updates, whereas a finite Cabello-type violation survives, $\Delta_{\mathrm{Cabello}}^{\max}\approx 0.0674$. Our results establish a fundamental structural distinction between spatial entanglement and temporal multi-observer correlations in FR-type logical scenarios, and demonstrate that certain observer-independent description failures persist even without spacelike separation.
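
As a quick sanity check on the numbers quoted in this abstract, the closed-form Hardy bound can be evaluated directly. This is only an arithmetic sketch: the two Cabello values are copied from the abstract rather than recomputed, and the variable names are illustrative.

```python
import math

# Closed-form maximal Hardy probability quoted in the abstract.
p_hardy_max = (5 * math.sqrt(5) - 11) / 2

# Cabello-type values as quoted in the abstract (copied, not derived here).
delta_cabello_spatial = 0.1078
delta_cabello_temporal = 0.0674

print(f"P_H^max = {p_hardy_max:.5f}")
print(f"spatial Cabello / Hardy ratio = {delta_cabello_spatial / p_hardy_max:.3f}")
```

The first printed value should agree with the abstract's approximation of 0.09017 to the digits shown.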

03.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2502.18795

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs

Authors:

Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Gotlieb Wilcox

Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, while its performance on the generalization test does. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.

04.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14049

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

Authors:

Shiyao Wang, Xijuan Zeng, Hui Wang, Shiwan Zhao, Feng Deng, Chen Zhang, Yong Qin

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

05.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.12066

Performance Analysis of YOLOv11 and YOLOv8 for Mixed Traffic Object Detection under Adverse Weather Conditions in Developing Countries

Authors:

Quoc Thuan Nguyen, Ha Anh Vu, Ngo Dang Thanh Ngan, Minh Phuc Hoang Ngoc

In modern vehicular systems, robust performance under harsh conditions has become a critical problem of autonomous driving. Our study delivers a comprehensive evaluation of the newest iteration of the YOLO series, which is YOLOv11 Nano architecture benchmarked against the widely adopted YOLOv8 Nano as a baseline on a custom fused dataset that combines the Indian Driving Dataset (IDD) [1] and Berkeley Deep Drive Dataset (BDD100K) [2]. We have analyzed the trade-offs among detection accuracy, inference speed, and computational efficiency in high-entropy scenarios involving dense mixed traffic, rain, and low-light conditions. Specifically, YOLOv11n achieves a mean Average Precision (mAP@50) of 46.6%, with a notable 3.2% improvement in Precision over the baseline, effectively reducing false positives in cluttered scenes. Furthermore, the proposed model exhibits enhanced energy efficiency, requiring 22% fewer FLOPs (6.3G vs. 8.1G) while maintaining real-time inference speed of 70.9 FPS on a Tesla T4 GPU, offering an optimal trade-off for safety-critical edge deployment.
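
The efficiency figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the per-frame latency is derived from the reported FPS and is not a figure stated by the authors.

```python
# FLOP counts (in GFLOPs) and throughput as quoted in the abstract.
flops_yolov11n = 6.3
flops_yolov8n = 8.1
fps_t4 = 70.9

# Relative FLOP reduction of YOLOv11n with respect to the YOLOv8n baseline.
flop_reduction = (flops_yolov8n - flops_yolov11n) / flops_yolov8n

# Per-frame latency implied by the reported throughput (derived, not reported).
latency_ms = 1000.0 / fps_t4

print(f"FLOP reduction: {flop_reduction:.1%}")
print(f"Implied latency: {latency_ms:.1f} ms per frame")
```

The computed reduction rounds to the 22% stated in the abstract.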

06.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2605.25796

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

Authors:

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

07.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15778

DYNA: Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

Authors:

Ali Sarabadani, Mahtab Tajvidiyan

Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM's response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.
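
To make the data structure in this abstract concrete, here is a minimal sketch of a temporal knowledge graph with events as nodes and directed, timestamped edges. It is not DYNA's implementation: the class name, method names, and the toy events are hypothetical, and a plain degree count stands in for the random-walk and centrality retrieval the abstract mentions.

```python
from collections import defaultdict


class TemporalKG:
    """Toy temporal knowledge graph: event nodes, timestamped directed edges."""

    def __init__(self):
        # Maps a source event to a list of (target event, relation, timestamp).
        self.edges = defaultdict(list)
        self.nodes = set()

    def add_edge(self, src, dst, relation, timestamp):
        """Record a directed, timestamped relation between two events."""
        self.nodes.update((src, dst))
        self.edges[src].append((dst, relation, timestamp))

    def degree(self, node):
        """In-degree plus out-degree, a simple stand-in for a centrality score."""
        out_deg = len(self.edges.get(node, []))
        in_deg = sum(
            1 for outs in self.edges.values() for dst, _, _ in outs if dst == node
        )
        return out_deg + in_deg

    def successors_in_time_order(self, node):
        """Targets of outgoing edges, sorted by edge timestamp."""
        outs = sorted(self.edges.get(node, []), key=lambda e: e[2])
        return [dst for dst, _, _ in outs]


# Hypothetical events, used only to exercise the structure.
kg = TemporalKG()
kg.add_edge("moved_to_paris", "started_job", "before", 2021)
kg.add_edge("started_job", "got_promoted", "before", 2023)
kg.add_edge("moved_to_paris", "met_alex", "before", 2020)

print(kg.successors_in_time_order("moved_to_paris"))
print(kg.degree("started_job"))
```

Because the graph lives outside the model, new events can be appended with `add_edge` without touching the frozen LLM, which is the property the abstract emphasizes.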

08.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2605.09169

Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

Authors:

Ankit Hemant Lade, Sai Krishna Jasti, Indar Kumar, Aman Chadha

A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout $S = |W_{out} W_{in}|$, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at $p < 10^{-5}$. We package the protocol used to test that claim – standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics ($do(X=c)$, soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms – as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard $do(X=c)$ interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger – the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
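
The readout $S = |W_{out} W_{in}|$ named in this abstract can be illustrated on a toy linear bottleneck. The sketch below rests on assumptions the abstract does not state: that the absolute value is taken element-wise, that `w_in` has shape (hidden, variables) and `w_out` has shape (variables, hidden), and that entry `S[i][j]` is read as the strength of variable `j` in predicting variable `i`. The weights are hand-picked, not learned.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def bottleneck_readout(w_out, w_in):
    """Element-wise absolute value of the product w_out @ w_in (assumed convention)."""
    return [[abs(v) for v in row] for row in matmul(w_out, w_in)]


# Toy weights: 3 observed variables, 2 hidden units.
w_in = [
    [1.0, 0.0, 0.0],  # hidden unit 0 reads variable 0
    [0.0, 0.0, 2.0],  # hidden unit 1 reads variable 2
]
w_out = [
    [0.5, 0.0],   # next value of variable 0 depends on hidden unit 0
    [-1.0, 0.0],  # next value of variable 1 depends on hidden unit 0
    [0.0, 0.25],  # next value of variable 2 depends on hidden unit 1
]

S = bottleneck_readout(w_out, w_in)
for row in S:
    print(row)
```

In this toy case the only nonzero off-diagonal entry is `S[1][0]`, which under the assumed convention would be read as a candidate edge from variable 0 to variable 1. The abstract's point is that such a readout works equally well for a plain linear bottleneck, so it should not be credited to the Mamba architecture.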

09.

medRxiv (Medicine) 2026-06-17 DOI: HASH:7cbc1d15a0124de420954af06f63b3eb

The Unreliable Judges: Assessing Reproducibility and Self-Preference Bias of LLMs as Free-Text Evaluators

Authors:

Alvarez-Arenas, J. I, mananes, jimenez-carretero, Sanchez-Cabo

Large Language Models (LLMs) are transforming clinical practice and research, but their adoption requires rigorous evaluation. While human assessment is ideal, its cost has driven the widespread use of LLMs as evaluators. We introduce an open-source reciprocal framework comparing 71 human experts against six LLMs. AI evaluators show a strong self-preference bias, yet neither group reliably identified whether a response was human- or AI-generated. AI scores correlated with surface features such as length and lexical diversity, whereas human scores did not. By probing the evaluator's hidden states and applying targeted steering, we show that verbosity is a major causal driver of the bias. Moreover, shuffling question-response pairings shows that long responses keep high scores even when they no longer answer the question, whereas short ones do not, demonstrating that AI judges reward verbosity largely independently of content alignment. Finally, API-based and batch inference inflate stochasticity, underscoring the need for controlled deployment.

10.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14048

WAM4D: Fast 4D World Action Model via Spatial Register Tokens

Authors:

Ying Li, Xiaobao Wei, Jiajun Cao, Hao Wang, Xiaowei Chi, Chengyu Bai, Qianpu Sun, Jiajun Li, Xiaojie Zhang, Jian Tang, Sirui Han, Shanghang Zhang, …

World action models (WAMs) have recently shown promise in jointly modeling future observations and executable robot actions. However, most existing WAMs still operate in 2D video or latent spaces, where visually plausible rollouts miss the 3D spatial constraints and occluded contact geometry required for precise manipulation. While geometric foundation models offer strong priors for recovering dense 3D structure and motion from visual observations, forcing WAMs to predict the dense 4D representation introduces costly geometric decoding and slows down causal action generation. To address the trade-off, we present WAM4D, a fast 4D world action model that uses lightweight spatial register tokens as training-time future-depth readouts to transfer pretrained geometric priors into a causal video-action transformer, then removes the register branch for lightweight action inference. To prevent non-causal shortcuts, we further design causal mixture attention for the Mixture-of-Transformers (MoT) WAM backbone, defining modality-specific visibility among video, action, and geometry tokens. Comprehensive experiments on RoboTwin 2.0 and challenging real-world manipulation tasks show that WAM4D improves spatial consistency and achieves competitive action prediction while maintaining efficient inference.

11.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19899

Measuring Biological Capabilities and Risks of AI Agents

Authors:

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations – drawing from our own evaluations – that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

12.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.15142

MotionVLA: Vision-Language-Action Model for Humanoid Motion

Authors:

Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
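
The energy-compaction mismatch described in this abstract (a few DCT coefficients capture most position energy but much less velocity energy) can be reproduced qualitatively on a synthetic signal. The sketch below does not use the paper's data or tokenizer: the signal, its length, and the helper names are assumptions chosen for illustration, and velocity is approximated by a first difference.

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II, so that the sum of squared coefficients equals signal energy."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs


def low_freq_energy_fraction(x, keep):
    """Fraction of total energy carried by the first `keep` DCT coefficients."""
    coeffs = dct2_ortho(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


# Synthetic "joint position": a large slow component plus a small fast one.
N = 64
position = [
    math.cos(math.pi * (i + 0.5) * 1 / N) + 0.1 * math.cos(math.pi * (i + 0.5) * 10 / N)
    for i in range(N)
]
# Finite-difference "velocity": differencing boosts the fast component.
velocity = [position[i + 1] - position[i] for i in range(N - 1)]

frac_pos = low_freq_energy_fraction(position, keep=5)
frac_vel = low_freq_energy_fraction(velocity, keep=5)
print(f"position energy in first 5 coefficients: {frac_pos:.3f}")
print(f"velocity energy in first 5 coefficients: {frac_vel:.3f}")
```

Differentiation scales each component by its frequency, so the small fast component contributes far more to velocity than to position. That is the intuition behind compressing the two streams separately rather than with one shared codebook.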

13.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2506.06542

Direct Fisher Score Estimation for Likelihood Maximization

Authors:

Sherman Khoo, Yakun Wang, Song Liu, Mark Beaumont

We study the problem of likelihood maximization when the likelihood function is intractable but model simulations are readily available. We propose a sequential, gradient-based optimization method that directly models the Fisher score based on a local score matching technique which uses simulations from a localized region around each parameter iterate. By employing a linear parameterization to the surrogate score model, our technique admits a closed-form, least-squares solution. This approach yields a fast, flexible, and efficient approximation to the Fisher score, effectively smoothing the likelihood objective and mitigating the challenges posed by complex likelihood landscapes. We provide theoretical guarantees for our score estimator, including bounds on the bias introduced by the smoothing. Empirical results on a range of synthetic and real-world problems demonstrate the superior performance of our method compared to existing benchmarks.

14.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2412.00107

Virtual Sensing to Enable Real-Time Monitoring of Inaccessible Locations & Unmeasurable Parameters

Authors:

Kazuma Kobayashi, Farid Ahmed, Jaewan Park, Subhankar Sarkar, Souvik Chakraborty, Syed Bahauddin Alam

Real-time monitoring of safety-critical interior states remains an open problem in energy systems where physical instrumentation is infeasible. Existing approaches rely on explicit governing equations, finite-dimensional state vectors, or per-instance retraining, which prevents mesh-independent, field-level inference at arbitrary interior coordinates under real-time constraints. We introduce operator-based virtual sensing for nuclear-grade thermal-fluid systems: we use the neural-operator framework to learn solution operators that map sparse boundary measurements to coupled internal fields in physically inaccessible regions, framing the problem class explicitly to distinguish it from classical state estimation and pointwise soft sensing. We instantiate this framework with MIMONet, a branch-trunk operator extended with three practical choices: multi-modal branch encoders for heterogeneous (scalar and function-valued) inputs; multiplicative branch fusion to preserve the bilinear PDE coupling structure; and shared-latent multi-field decoding with per-channel basis projections at the trunk's final layer. Evaluated across escalating complexity, from canonical lid-driven cavity flow to pressurized water reactor subchannels to fully coupled heat exchangers, MIMONet achieves below 5% relative errors and sub-millisecond inference on data-center accelerators (0.35 ms / 46 mJ per heat-exchanger inference on an NVIDIA H200, and sub-millisecond across the A40-H200-GH200 range), while remaining stable under 50% sensor noise. By staying accurate as geometric confinement and physics coupling intensify, MIMONet shows that operator-based virtual sensing can restore observability where physical instrumentation fails, establishing simulation-based feasibility within the evaluated operating envelopes as a step toward future experimental and cross-solver validation for safety-critical energy systems.

15.

bioRxiv (Bioinfo) 2026-06-14 DOI: HASH:92aacd3ee3b53628f781ce995bb67381

Virtual phenotypic screening discovers novel scaffolds inhibiting the PI3K/mTOR pathway

Authors:

Wu, A. P, Yao, Hoeckendorf, Gaskins, Kosaisawe, Lu, Hanslovsky, Mayba, Skelton, Scalia, Moffat, …

Phenotypic drug discovery has yielded many first-in-class small-molecule drugs by discovering modulators of disease phenotypes in physiologically relevant cellular systems. However, high-content phenotypic assays lack the ultra-high-throughput scalability of target-based screens. Recent advances in virtual screening present an opportunity to address this bottleneck, but have been limited to simple phenotypes like viability, restricted to small repurposing libraries, or lack in-depth biological validation. Here, we present PhenoCompass, a multimodal co-embedding model that aligns compound structures and high-content phenotypic imaging to enable virtual phenotypic screening over billion-compound libraries. Following training on the Joint Undertaking in Morphology dataset with more than 100,000 Cell Painting compound profiles, retrospective validation with historical biochemical high-throughput screening data demonstrates that PhenoCompass ranks compounds according to their biochemical target engagement. Leveraging PhenoCompass, we performed a prospective screen of 3.8 billion Enamine REAL compounds for inhibitors of PI3K/mTOR pathway, a critical signaling cascade whose aberrant activation is a common tumor driver. This search identified 11 novel compounds with pathway-consistent Cell Painting readout and diverse scaffolds, a 54-fold enrichment over the training set. Orthogonal validation experiments using a FOXO3A reporter assay and direct kinase inhibition confirmed seven structurally novel inhibitors with distinct mechanisms of action. These results highlight the convergence of diverse molecular target profiles onto a shared morphological pathway signature and establish PhenoCompass as a robust framework for high-content phenotypic virtual screening.

16.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2602.04396

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Authors:

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M–$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.

17.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14686

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

Authors:

Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

Globally, cotton is a highly economically beneficial crop, as the textile industry heavily depends on it. So, the precise identification and detection of cotton leaf disease is crucial for economic stability. The development goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. With this goal, we have evaluated multiple pretrained Deep Convolutional Neural Networks, including DenseNet201, InceptionV3, and VGG19 on a publicly available cotton leaf disease image dataset. This image dataset includes seven classes, six disease classes, and one healthy class, collected under various field conditions reflecting real-world challenges. Among these pretrained models, with DenseNet201, we have achieved the highest classification accuracy of 98%. To enhance the model reliability and interpretability, we have implemented different techniques and methods such as Gradient-weighted Class Activation Mapping (Grad-CAM), occlusion sensitivity analysis and adversarial training to increase the noise resistance of the model. Finally, we have developed a prototype in order to utilize the model's capabilities on real life agriculture. This paper shows the deep learning model's capabilities to classify the disease in real-life cotton disease management situations.

18.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2606.14313

Nonlocal Bayesian Modeling of Continuous Spatio-Temporal Dynamics

Authors:

Jaeyeong Lee, Heeyoung Kim

Real-world spatio-temporal forecasting must handle irregular time points, spatially sparse observations, and the need for uncertainty quantification. This setting is often further compounded by nonlocal interactions (long-range spatial coupling). Modeling continuous-space, continuous-time nonlocal dynamics naturally leads to infinite-dimensional integro-differential equations (IDEs), making principled Bayesian inference intractable. We propose the NonLocal Bayesian Spatio-Temporal model (NLBST), a hierarchical Bayesian framework for continuous spatio-temporal fields that learns explicit nonlocal coupling while retaining tractable inference. NLBST represents the latent field via a coordinate-based spatial basis expansion and models the coefficient process with a continuous-time ODE whose learnable linear operator corresponds to a Galerkin reduction of a nonlocal IDE; a Neural ODE residual captures additional nonlinear dynamics. A linear-Gaussian observation model enables Kalman-style sequential updates under missing and irregular observations, while the spatial basis representation enables inductive prediction at unmeasured locations without retraining. Global parameters are learned via variational inference, and uncertainty is handled through a Bayesian hierarchy. Experiments on synthetic and real-world datasets demonstrate strong forecasting and spatial generalization with well-calibrated uncertainty, yielding substantial gains over baselines in strongly nonlocal and partially observed regimes.

19.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.13385

Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents

Authors:

Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.

20.

medRxiv (Medicine) 2026-06-22 DOI: HASH:04ed6cdb383bd67a060af68674fb4810

Agentic Artificial Intelligence for Hospital Readmission Review: A Single-Center Blinded Evaluation and Exploratory Qualitative Analysis

Authors:

Gensheimer, M. F, Adhikari, Parmer-Chow, Liu, Ma, Shieh

Background: Manual review of 30-day hospital readmissions can identify actionable quality and safety problems, but it is labor-intensive. We developed and evaluated an agentic AI workflow for evidence-grounded readmission review. Materials and methods: We studied adult patients with unplanned 30-day readmission after discharge from a medicine hospitalist service at a single academic health system. An AI agent using a large language model queried a database containing notes, encounters, procedures, laboratory results, and other clinical data, and completed the same structured readmission-review rubric used by physicians. In the primary comparative evaluation, 20 randomly selected readmissions from 2025 were each reviewed by two physicians and the AI system. Blinded physician evaluators rated review quality. After rubric refinement, the AI workflow was applied to 100 recent readmissions in an exploratory expanded-cohort analysis of recurring improvement opportunities. Results: In the primary comparative evaluation, the AI classified 9/20 readmissions (45%) as preventable, compared with 19/40 physician reviews (47.5%). Blinded overall quality ratings were similar for AI and physician reviews (4.35 vs. 4.20 on a 1-5 scale; mean difference 0.15, 95% CI -0.20 to 0.48; p=0.49), as were factuality/support and usefulness/actionability ratings. No AI hallucinations were identified during factuality review. Agreement on preventability and primary readmission category was low for both AI-human and human-human comparisons. The AI system cost $0.23 per chart; physician reviewers took a median of 15 minutes, corresponding to an estimated $42.43 per chart. In the exploratory expanded-cohort analysis, AI-assisted review identified recurring vulnerabilities in post-discharge follow-up plans, incomplete inpatient workups, medication-safety transitions, and indwelling-device transitions. 
Conclusions: Agentic AI produced readmission reviews with similar blinded quality ratings to physician reviews in this small single-center primary comparative evaluation and supported identification of recurring quality-improvement themes in the exploratory expanded-cohort analysis. Preventability judgments remained variable among both AI and physicians, underscoring the need for human oversight and prospective evaluation before operational use.


21.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11357

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

Authors:

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
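To make the idea of fusing unpacking, dequantization, and GEMV concrete, here is a minimal NumPy sketch that runs on a CPU. It is not TileFuse's kernel and does not reproduce its interleaved pre-tiling layout or AIE dataflow. The function names, the low-nibble-first packing order, and the per-group scale and zero-point layout are all assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np


def pack_int4(q):
    """Pack unsigned 4-bit values two per byte (assumed order: low nibble first)."""
    q = q.astype(np.uint8)
    return (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)


def fused_w4a16_gemv(packed, scales, zeros, x, group_size):
    """Compute y = W x, unpacking and dequantizing W one group at a time.

    The full-precision weight matrix is never materialized: each group of
    columns is unpacked, dequantized to float16, and consumed immediately.
    """
    out_features = packed.shape[0]
    in_features = packed.shape[1] * 2
    y = np.zeros(out_features, dtype=np.float32)
    for g in range(in_features // group_size):
        lo, hi = g * group_size, (g + 1) * group_size
        chunk = packed[:, lo // 2 : hi // 2]
        # Unpack: two 4-bit weights per byte.
        q = np.empty((out_features, group_size), dtype=np.uint8)
        q[:, 0::2] = chunk & 0x0F
        q[:, 1::2] = chunk >> 4
        # Dequantize with this group's scale and zero point (hypothetical layout).
        w = (q.astype(np.float16) - zeros[:, g : g + 1]) * scales[:, g : g + 1]
        # Accumulate the partial dot product in float32.
        y += w.astype(np.float32) @ x[lo:hi].astype(np.float32)
    return y


# Small synthetic example: 4 output rows, 16 input columns, groups of 8.
rng = np.random.default_rng(0)
group_size = 8
q = rng.integers(0, 16, size=(4, 16))
scales = rng.uniform(0.01, 0.1, size=(4, 2)).astype(np.float16)
zeros = np.full((4, 2), 8, dtype=np.float16)
x = rng.normal(size=16).astype(np.float16)

y = fused_w4a16_gemv(pack_int4(q), scales, zeros, x, group_size)

# Unfused reference: dequantize the whole matrix first, then multiply.
w_ref = (q.astype(np.float16) - np.repeat(zeros, group_size, axis=1)) * np.repeat(
    scales, group_size, axis=1
)
y_ref = w_ref.astype(np.float32) @ x.astype(np.float32)
```

The fused and unfused paths should agree up to float32 summation order. The point of the fused form is that only one group of dequantized weights exists at a time, which is the memory-traffic saving the abstract attributes to a single kernel flow.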


22.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2512.18295

AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning

Authors:

Xuling Zhang, Jindong Li, Yifei Zhang, Mingqi Yang, Menglin Yang

Continual graph learning (CGL) aims to enable graph neural networks to learn incrementally from a stream of graph-structured data without forgetting previously acquired knowledge. Existing methods, particularly those based on experience replay, typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches have significant limitations, including privacy concerns and inefficiency. In this work, we propose AL-GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL-GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed-form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one-pass training for each task and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL-GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation-free design.
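The recursive least squares idea can be illustrated with a generic ridge-regression classifier head that is updated one task at a time. This is a sketch of standard block RLS, not the authors' implementation: the class name `AnalyticHead`, the regularizer `gamma`, and the assumption that features come from a frozen encoder are all hypothetical, and the random matrices below merely stand in for graph embeddings.

```python
import numpy as np


class AnalyticHead:
    """Linear classifier updated in closed form, with no stored past samples."""

    def __init__(self, feat_dim, num_classes, gamma=1.0):
        # R is the inverse of the regularized feature autocorrelation
        # matrix (X^T X + gamma * I); before any data it is I / gamma.
        self.R = np.eye(feat_dim) / gamma
        self.W = np.zeros((feat_dim, num_classes))

    def update(self, X, Y):
        """Absorb one task: X is (n, feat_dim), Y is (n, num_classes) one-hot."""
        n = X.shape[0]
        # Woodbury identity: update the inverse without re-inverting X^T X.
        K = np.linalg.inv(np.eye(n) + X @ self.R @ X.T)
        self.R = self.R - self.R @ X.T @ K @ X @ self.R
        # Correct the weights using only the new task's residual.
        self.W = self.W + self.R @ X.T @ (Y - X @ self.W)

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)


def joint_ridge(X, Y, gamma=1.0):
    """Reference solution trained on all data at once."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)


# Two synthetic "tasks" with random features standing in for embeddings.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(20, 8)), rng.normal(size=(30, 8))
Y1 = np.eye(3)[rng.integers(0, 3, size=20)]
Y2 = np.eye(3)[rng.integers(0, 3, size=30)]

head = AnalyticHead(feat_dim=8, num_classes=3)
head.update(X1, Y1)  # task 1; its samples can then be discarded
head.update(X2, Y2)  # task 2 sees only its own data plus R and W

W_joint = joint_ridge(np.vstack([X1, X2]), np.vstack([Y1, Y2]))
max_gap = float(np.max(np.abs(head.W - W_joint)))
```

In this sketch the sequentially updated weights match the jointly trained ridge solution up to floating-point error, which is why a scheme of this kind needs neither a replay buffer nor backpropagation through the classifier.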


23.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11560

LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems

Authors:

Arijit Khan, Longxu Sun, Xin Huang

Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for graph-native, synergistic artificial intelligence (AI) systems. Graph-structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context-rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi-step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM-graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next-generation graph-native AI systems.


24.

arXiv (CS.LG) 2026-06-11 DOI: arXiv:2606.11256

My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents

Authors:

César Ojeda, Darius A. Faroughy, Maryam Karimi, Payam Zarrintaj, Mir Mehdi Seyedebrahimi, Martín Carballo-Pacheco

Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task-specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase (sEH) proxy task, our LLM agent improves over single-pass LLM and deterministic controllers, reaching state-of-the-art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine-tuning, or dedicated generative models.


25.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.20075

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Authors:

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at https://github.com/EIT-NLP/Supervision-in-Latent-CoT.

