Paper Plaza - AcademicHub

01.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2506.14990

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

Authors:

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as CPU-bound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks in a few hours on a single GPU. We find that long task sequences reveal failure modes that do not appear at smaller scales.


02.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13432

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Authors:

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/


03.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2605.21472

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Authors:

Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan

View-conditioned 3D generators such as SAM 3D, TRELLIS, and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://stream-3d.github.io/stream3d.github.io/.
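The constant-size memory described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the `EvidentialMemory` class, the per-frame score values, and the capacity of 3 are hypothetical, and a plain min-heap stands in for whatever selection rule Stream3D actually uses. The point is only that retaining the top-scoring frames keeps the footprint fixed regardless of stream length.

```python
import heapq


class EvidentialMemory:
    """Keep at most `capacity` frames, evicting the lowest-scoring one.

    Hypothetical stand-in for a fixed-size frame cache; the score is
    supplied by the caller and is not the paper's evidence score.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, frame_index)

    def update(self, frame_index, score):
        item = (score, frame_index)
        if len(self._heap) < self.capacity:
            # Memory not yet full: always admit the frame.
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # New frame beats the weakest cached frame: swap them.
            heapq.heapreplace(self._heap, item)

    def frames(self):
        """Return cached frame indices in temporal order."""
        return sorted(idx for _, idx in self._heap)


# Toy stream of six frames with made-up scores.
memory = EvidentialMemory(capacity=3)
scores = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8]
for i, s in enumerate(scores):
    memory.update(i, s)

kept = memory.frames()
print(kept)
```

Each update costs O(log capacity), so the per-frame overhead does not grow with the number of frames seen.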


04.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2606.13686

Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces

Authors:

Zijing Shi, Meng Fang, Ling Chen

As autonomous web agents are increasingly deployed to perform real-world tasks, ensuring their safety has become a critical concern. In this work, we study web agent behavior under realistic deceptive interfaces in the e-commerce domain. We introduce WebDecept, a lightweight and configurable plugin framework that enables controlled injection of deceptive interface patterns into existing web environments. Using WebDecept, we instantiate seven deceptive patterns commonly observed on the open web, including targeted advertisements, domain redirection, and shopping manipulation. By injecting these patterns into the frontend during task execution, we perform controlled evaluation of multiple multimodal web agents. Our results show that current web agents are highly susceptible to multiple classes of deceptive interfaces, and that prompt-based constraints are often insufficient to mitigate these failures. We further analyze how the design choices of deceptive patterns influence the success of such manipulations. These findings highlight safety challenges that should be addressed as web agents are scaled toward real-world deployment.


05.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15306

LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

Authors:

Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.


06.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2603.10444

The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Authors:

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fang Dong, Anrui Chen, Ruijun Huang, Xin Zhang, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, …

FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA's recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA's Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
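A toy sketch may help show why isolating a shared mean before blockwise quantization can matter. It is not the Averis implementation: a symmetric uniform quantizer with 15 levels stands in for FP4, the helper names are hypothetical, the activation values are made up, and keeping the mean in full precision is an assumption of this sketch. It only illustrates that a large common offset sets the block scale and flattens the small per-element signal, while one reduction and one elementwise subtraction restore resolution.

```python
def quantize_block(block, levels=7):
    """Symmetric uniform quantizer (stand-in for FP4, not real FP4).

    The scale is set by the block's largest magnitude, so one large
    value coarsens the grid for every element in the block.
    """
    scale = max(abs(v) for v in block) / levels
    if scale == 0.0:
        return list(block)
    return [round(v / scale) * scale for v in block]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Toy activations: a large shared mean (10.0) plus a small signal.
residual = [0.3, -0.2, 0.1, -0.4, 0.25, -0.05, 0.15, -0.15]
block = [10.0 + r for r in residual]

# Baseline: quantize the raw block directly.
direct = quantize_block(block)

# Mean-residual split: one reduction, one elementwise subtraction.
mean = sum(block) / len(block)
centered = [v - mean for v in block]
# Assumption: the mean itself is stored in full precision.
split = [mean + q for q in quantize_block(centered)]

err_direct = mean_abs_error(block, direct)
err_split = mean_abs_error(block, split)
print(err_direct, err_split)
```

In this toy block every raw value rounds to the same grid point, so the direct path loses the signal entirely, whereas the centered path quantizes values that span the grid.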


07.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11425

JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

Authors:

Ge Shi, Jun Yin, Donglin Xie, Fangyi Liu, Yucan Li, Menglin Liu

Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.


08.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.02800

Cosmos 3: Omnimodal World Models for Physical AI

Authors:

NVIDIA, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, …

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.


09.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11042

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Authors:

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, …

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.


10.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19348

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Authors:

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, …

We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; and (3) the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, the DeepSeek-V4 series is highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.


11.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.05405

Agents' Last Exam

Authors:

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, …

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.


12.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17540

TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks

Authors:

Mengda Xie, Yiling He, Meie Fang

Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.


13.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.18249

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Authors:

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
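The phrase "lookup-free bitwise quantization" can be made concrete with a generic sketch. This shows the common sign-based form of the idea, not UniAR's tokenizer: the function names, the 4-channel latent, and the bit ordering are hypothetical. Each channel is reduced to one bit, so a token id is obtained by packing bits rather than searching a codebook, and the implied vocabulary size doubles with every added channel.

```python
def bitwise_quantize(latent):
    """Map each channel to one bit by its sign; no codebook search."""
    return [1 if v > 0 else 0 for v in latent]


def bits_to_index(bits):
    """Pack bits into an integer token id (first channel = lowest bit)."""
    return sum(bit << i for i, bit in enumerate(bits))


def index_to_code(index, num_bits):
    """Recover the +1/-1 code vector from a token id."""
    return [1.0 if (index >> i) & 1 else -1.0 for i in range(num_bits)]


# Hypothetical 4-channel latent for one visual token.
latent = [0.7, -1.2, 0.1, -0.3]

bits = bitwise_quantize(latent)
token = bits_to_index(bits)
code = index_to_code(token, len(latent))
vocab_size = 2 ** len(latent)

print(bits, token, code, vocab_size)
```

Because the id and the code vector are interchangeable, a model can predict the bits of a token in parallel instead of choosing among `vocab_size` classes.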


14.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15079

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Authors:

Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, …

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.


15.

arXiv (math.PR) 2026-06-12 DOI: arXiv:2606.08683

Exact Fourier dimensions of dyadic Mandelbrot cascades under minimal integrability

Authors:

Yin Cai, Guozheng Cheng, Xiang Fang, Menghan Li, Hongdou Qu, Chengbo Xiao

We determine the Fourier dimension of dyadic Mandelbrot cascades under the minimal Kahane-Peyriere integrability condition. The interval theorem is proved in a vector-valued dyadic cascade model in which sibling weights may have arbitrary dependence. For every balanced energy-admissible vector law, almost surely on non-extinction, $\dim_F(\mu)=\dim_E(\mu)=\dim_2(\mu)=D_E(X)$. In the canonical scalar case, under $W \ge 0$, $\mathbb{E} W = 1$, $\mathbb{E}[W \log_2^+ W]$

