论文广场 - AcademicHub

01.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15141

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

作者:

Siyuan Zhang ↗Jian Zong ↗Junyu Wang ↗Peiyuan Jiang ↗Jiahao Yan ↗Jingyu Zhang ↗Tianrui Wang ↗Xiaobao Wang ↗Longbiao Wang ↗Jianwu Dang ↗

arXiv:2606.15141v1 Announce Type: cross Abstract: While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on MMAR benchmark show EChO-Agent improves both accuracy and rubric scores over baseline and ablation studies show evidence integration is the key factor.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2603.10444

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

作者:

Hengjie Cao ↗Zhendong Huang ↗Mengyi Chen ↗Yifeng Yang ↗Fang Dong ↗Anrui Chen ↗Ruijun Huang ↗Xin Zhang ↗Mingzhi Dong ↗Yujiang Wang ↗Jinlong Hou ↗Qin Lv ↗…

arXiv:2603.10444v2 Announce Type: replace-cross Abstract: FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17453

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

作者:

Lubin Bai ↗Mengyu Cao ↗Sixue Wang ↗Zhongwei Wan ↗Yue Pan ↗Jiale Hou ↗Xiang Li ↗Xiuyuan Zhang ↗

arXiv:2606.17453v1 Announce Type: new Abstract: Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth from five dimensions and enables full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2606.18356

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

作者:

Yuchuan Tian ↗Mengyu Zheng ↗Haocheng Mei ↗Ye Yuan ↗Chao Xu ↗Xinghao Chen ↗Hanting Chen ↗Yu Wang ↗

arXiv:2606.18356v1 Announce Type: cross Abstract: Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Existing evaluations often collapse these stages into a single attack success rate, making it difficult to tell whether a model merely agreed with an attacker or actually produced observable harm. We introduce SafeClawBench, a staged benchmark for tool-using agent security with 600 controlled adversarial tasks across six attack families: direct and indirect prompt injection, tool-return injection, memory poisoning, memory extraction, and ambiguity-driven unsafe inference. SafeClawBench reports three separate endpoints: semantic attack acceptance, audit-visible harm evidence, and sandbox-observed tool/state harm. Evaluating five agent endpoints under four prompt-level policies, we find that these endpoints capture different failure modes. Without additional prompt protection, semantic failure rates vary widely across models, from 9.0% to 44.2%. Audited harm evidence is narrower than semantic failure, and under a separate executable protocol some matched task identities produce sandbox harm despite passing the Semantic Core call: in a 12,000-row matched analysis, 291 of 347 observed sandbox harms occur in rows that pass the semantic check. Prompt policies change endpoint outcomes, but their effects depend on both model and protocol. SafeClawBench provides a reproducible framework for comparing agent models and prompt-policy conditions without conflating textual compliance, evidence-supported harm, and executable state changes. The open-source dataset is available at https://huggingface.co/datasets/sairights/safeclawbench.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2605.29563

Planning with the Views via Scene Self-Exploration

作者:

Kangrui Wang ↗Linjie Li ↗Zhengyuan Yang ↗Shiqi Chen ↗Zihan Wang ↗Li Fei-Fei ↗Jiajun Wu ↗Leonidas Guibas ↗Lijuan Wang ↗Manling Li ↗

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and Data are at https://viewsuite.github.io.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.LG) 2026-06-18 DOI: arXiv:2509.22020

Task-Adaptive Parameter-Efficient Fine-Tuning for Weather Foundation Models

作者:

Shilei Cao ↗Hehai Lin ↗Jiashun Cheng ↗Yang Liu ↗Guowen Li ↗Xuehe Wang ↗Juepeng Zheng ↗Haoyuan Liang ↗Meng Jin ↗Chengwei Qin ↗Hong Cheng ↗Haohuan Fu ↗…

arXiv:2509.22020v2 Announce Type: replace Abstract: While recent advances in machine learning have equipped Weather Foundation Models (WFMs) with substantial generalization capabilities across diverse downstream tasks, the escalating computational requirements associated with their expanding scale increasingly hinder practical deployment. Current Parameter-Efficient Fine-Tuning (PEFT) methods, designed for vision or language tasks, fail to address the unique challenges of weather downstream tasks, such as variable heterogeneity, resolution diversity, and spatiotemporal coverage variations, leading to suboptimal performance when applied to WFMs. To bridge this gap, we introduce WeatherPEFT, a novel PEFT framework for WFMs incorporating two synergistic innovations. First, during the forward pass, Task-Adaptive Dynamic Prompting (TADP) dynamically injects the embedding weights within the encoder to the input tokens of the pre-trained backbone via internal and external pattern extraction, enabling context-aware feature recalibration for specific downstream tasks. Furthermore, during backpropagation, Stochastic Fisher-Guided Adaptive Selection (SFAS) not only leverages Fisher information to identify and update the most task-critical parameters, thereby preserving invariant pre-trained knowledge, but also introduces randomness to stabilize the selection. We demonstrate the effectiveness and efficiency of WeatherPEFT on three downstream tasks, where existing PEFT methods show significant gaps versus Full-Tuning, and WeatherPEFT achieves performance parity with Full-Tuning using fewer trainable parameters. The code of this work is available at https://github.com/ShileiCao/WeatherPEFT.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2603.18464

AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

作者:

Chengxuan Lu ↗Shukuan Wang ↗Yanjie Li ↗Yingying Fang ↗Huoyan Wang ↗Tian Zhang ↗Wei Liu ↗Shiji Jin ↗Fuyuan Qian ↗Peiming Li ↗Chao Xu ↗Baigui Sun ↗…

arXiv:2603.18464v3 Announce Type: replace Abstract: Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models is severely bottlenecked by synchronization barriers and the high cost of environment data acquisition. To overcome these challenges, we propose AcceRL, a distributed asynchronous RL framework that physically isolates environment rollouts, model inference, and gradient updates. By eliminating the cascading long-tail idle bubbles inherent in synchronous systems, AcceRL maximizes hardware utilization and ensures scalable throughput. Furthermore, AcceRL features a modular design that supports the integration of diverse, plug-and-play world models into its distributed pipeline. Extensive experiments demonstrate that the base framework achieves highly competitive performance across all four LIBERO[liu2023libero] task suites. Systematically, the asynchronous architecture delivers a $2.4\times$ throughput speedup over leading synchronous baselines. Algorithmically, by leveraging a world model pre-trained on 1,000 offline trajectories, AcceRL achieves up to a $200\times$ improvement in online sample efficiency on LIBERO-Spatial, establishing a robust framework that is both sample-efficient and time-efficient for embodied AI. Code is included in the supplementary material. Code is available at https://github.com/distanceLu/AcceRL.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19388

Beyond the GUI Paradigm: Do Mobile Agents Need the Phone Screen?

作者:

Li Gu ↗Zihuan Jiang ↗Linqiang Guo ↗Zhixiang Chi ↗Ziqiang Wang ↗Huan Liu ↗Yuanhao Yu ↗Tse-Hsun Chen ↗Yang Wang ↗

Recent advances in mobile agents are dominated by the GUI paradigm, in which agents perceive UI information and emit screen interactions. However, mobile platforms also expose a command-line interface (CLI) that provides direct access to device services and data. We argue CLI deserves first-class consideration alongside GUI. We evaluate three coding agents (Claude Code, Terminus-2, mini-swe-agent) across four model APIs on AndroidWorld and MobileWorld without any mobile-specific post-training, comparing against three reproducible GUI baselines (GUI-Owl-1.5-32B, MAI-UI, Qwen3-VL-32B). Claude Code (Opus 4.7) reaches 71.8\% and 51.9\%, outperforming every reproducible GUI baseline (69.3/68.1/57.8\% on AndroidWorld; 43.2/26.3/13.3\% on MobileWorld), while every other CLI configuration remains competitive. To establish the paradigm's ceiling, we provide oracle CLI solutions that reach 88.8\% on AndroidWorld (103/116 tasks CLI-solvable) and 86.3\% on MobileWorld (101/117 tasks CLI-solvable), indicating substantial room for future improvement. To cover everyday user intents beyond the GUI scope, we introduce the CLI-Advantage Task Suite, comprising 45 templates across five categories: bulk operations, multi-condition filtering, aggregation, cross-app workflows, and hidden device state. Every CLI agent outperforms every GUI baseline in all five categories, with substantially fewer steps per task (10.7 vs.\ 18.6). To support future research on mobile CLI agents, we will open-source agent implementations, oracle solutions, the CLI-Advantage suite, and evaluation infrastructure.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11792

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

作者:

Yuansheng Gao ↗Wenbin Xing ↗Jiahao Yuan ↗Kaiwen Zhou ↗Han Bao ↗Zonghui Wang ↗Wenzhi Chen ↗

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.13079

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

作者:

Jiaqi Luo ↗Jiarun Dai ↗Zhile Chen ↗Jia Xu ↗Weibing Wang ↗Yawen Duan ↗Brian Tse ↗Geng Hong ↗Xudong Pan ↗Yuan Zhang ↗Min Yang ↗

arXiv:2606.13079v1 Announce Type: cross Abstract: Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier~1 (one secure service) and Tier~2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2603.01761

Position: Modular Memory is the Key to Continual Learning Agents

作者:

arXiv:2603.01761v2 Announce Type: replace-cross Abstract: Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15197

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

作者:

Jiajun Li ↗Yu Ding ↗Shisi Guan ↗Ran Hou ↗Wanyuan Wang ↗

arXiv:2606.15197v1 Announce Type: cross Abstract: Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2605.27023

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

作者:

Yinan Liu ↗Wenjin Xu ↗Zhiyuan Zha ↗Xiaochun Yang ↗Bin Wang ↗

arXiv:2605.27023v2 Announce Type: replace Abstract: Knowledge graphs (KGs) have become the core backbone of numerous downstream tasks such as question answering and recommender systems. However, despite all this, KGs are often very incomplete. To perform zero-shot knowledge graph completion in unseen KGs, which have different relational vocabularies from those used for pre-training, KG foundation models (KGFMs) receive a wide range of attention. Existing KGFMs often perform training using random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often constructed with limited quality, providing weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples through the updated relation embeddings generated from the existing KGFM's relation encoder. To further adaptively align with the evolving capability of the KGFM during the training process, KMAS adjusts the ratio of hard negative triples dynamically throughout the whole training process: after a warmup phrase, it increases the ratio linearly and then decreases linearly. Extensive experiments are conducted over 44 data sets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2601.16093

SAMTok: Representing Any Mask with Two Words

作者:

Yikang Zhou ↗Tao Zhang ↗Dengxian Gong ↗Yuanzheng Wu ↗Ye Tian ↗Haochen Wang ↗Haobo Yuan ↗Jiacong Wang ↗Lu Qi ↗Hao Fei ↗Anran Wang ↗Zhuochen Wang ↗…

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CL) 2026-06-17 DOI: arXiv:2606.17579

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

作者:

Zhongyuan Wang ↗Pratyusha Vemuri ↗

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.08415

CoVEBench: Can Video Editing Models Handle Complex Instructions?

作者:

Jiangtao Wu ↗Jiaming Wang ↗Yiwen He ↗Yuanxing Zhang ↗Shihao Li ↗Dunyuan Liu ↗Xuedong Zhao ↗Jialu Chen ↗Zekun Moore Wang ↗Jiaheng Liu ↗

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.20487

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

作者:

Shu Yao ↗Yuhua Luo ↗Qian Long ↗Jingru Fan ↗Zhuoyuan Yu ↗Yuheng Wang ↗Lin Wu ↗Yufan Dang ↗Huatao Li ↗Chen Qian ↗

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API–CLI–GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.16494

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

作者:

Jieyuan Liu ↗Jianyang Gu ↗Shijie Chen ↗Jefferson Chen ↗Zhen Wang ↗

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2606.14409

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

作者:

He Zhang ↗Lingzhu Xiang ↗Haitao Lin ↗Zeyu Huang ↗Minghui Wang ↗Dingyan Zhong ↗Yubo Dong ↗Yihao Wu ↗Yongming Rao ↗Dongsheng Zhang ↗Wanjia He ↗Ling Chen ↗…

arXiv:2606.14409v1 Announce Type: cross Abstract: In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2510.02605

Towards CONUS-Wide ML-Augmented Conceptually-Interpretable Modeling of Catchment-Scale Precipitation-Storage-Runoff Dynamics

作者:

Yuan-Heng Wang ↗Yang Yang ↗Fabio Ciulla ↗Hoshin V. Gupta ↗Charuleka Varadharajan ↗

arXiv:2510.02605v2 Announce Type: replace Abstract: While many modern studies are dedicated to ML-based large-sample hydrologic modeling, these efforts have not necessarily translated into predictive improvements that are grounded in enhanced physical-conceptual understanding. Here, we report on a CONUS-wide large-sample study (spanning diverse hydro-geo-climatic conditions) using ML-augmented physically-interpretable catchment-scale models of varying complexity based in the Mass-Conserving Perceptron (MCP). Results were evaluated using attribute masks such as snow regime, forest cover, and climate zone. Our results indicate the importance of selecting model architectures of appropriate model complexity based on how process dominance varies with hydrological regime. Benchmark comparisons show that physically-interpretable mass-conserving MCP-based models can achieve performance comparable to data-based models based in the Long Short-Term Memory network (LSTM) architecture. Overall, this study highlights the potential of a theory-informed, physically grounded approach to large-sample hydrology, with emphasis on mechanistic understanding and the development of parsimonious and interpretable model architectures, thereby laying the foundation for future models of everywhere that architecturally encode information about spatially- and temporally-varying process dominance.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.12191

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

作者:

Jiachun Li ↗Zhuoran Jin ↗Tianyi Men ↗Yupu Hao ↗Kejian Zhu ↗Lingshuai Wang ↗Dongqi Huang ↗Longxiang Wang ↗Shengjia Hua ↗Lu Wang ↗Jinshan Gao ↗Hongbang Yuan ↗…

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current researches on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced, such as symbolic synthesis and neural synthesis. This paper also shows different environment evaluation methods in each paradigm. Thirdly, the corresponding environment applications from the perspective of agent-environment co-evolution are discussed. In specific, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. And three paradigms of environment evolution are identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. At last, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

阅读与讨论 → 访问原文 →

22.

arXiv (quant-ph) 2026-06-17 DOI: arXiv:2509.25847

Hybrid Acousto-Optical Double Dressing of a Two-Level System

作者:

Yuan Zhan ↗Zixuan Wang ↗Richard P. Mirin ↗Kevin L. Silverman ↗Shuo Sun ↗

arXiv:2509.25847v2 Announce Type: replace Abstract: We experimentally investigate resonance fluorescence from a two-level system in a novel configuration where a strong laser drives an optical Rabi oscillation while an acoustic field parametrically modulates the frequency of the two-level system. We observe emission spectra that deviate markedly from the standard Mollow triplet, including dynamical cancellation of the central peak. A doubly dressed state model incorporating hybridization among the emitter, optical field, and acoustic field captures these features. Guided by this model, we experimentally validate the condition for optimal cooling of acoustic phonons in an emitter-optomechanical system. These results reveal new regimes of strongly driven quantum nonlinear interactions.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19815

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

作者:

Jiechao Gao ↗Rohan Kumar Yadav ↗Yuangang Li ↗Yuandong Pan ↗Jie Wang ↗Ying Liu ↗Michael Lepech ↗

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.13003

The Illusion of Multi-Agent Advantage

作者:

Prathyusha Jwalapuram ↗Hehai Lin ↗Chuyuan Li ↗Fangkai Jiao ↗Sudong Wang ↗Yifei Ming ↗Zixuan Ke ↗Chengwei Qin ↗Giuseppe Carenini ↗Shafiq Joty ↗

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.16612

Beyond Artifacts: Towards Generalizable Synthetic Song Detection via Music-Intrinsic Features

作者:

Yan Han ↗Zhibin Wen ↗Yuan Wang ↗Shuangrun Shao ↗Xiaobing Li ↗Yang Xu ↗Wei Li ↗

arXiv:2606.16612v1 Announce Type: cross Abstract: The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, Global structure features, and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring lastest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络