论文广场 - AcademicHub

01.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15598

Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning

作者:

Feng Lyu ↗Jinfeng Cen ↗Sijing Duan ↗Hao Wu ↗Shucheng Li ↗Weixu Zhang ↗Haolun Wu ↗

arXiv:2606.15598v1 Announce Type: new Abstract: Text-to-SQL aims to translate natural language questions into executable SQL queries over structured databases, enabling non-expert users to access data intuitively. While recent advances in large language models (LLMs) have shown promise in this task, existing LLM-based approaches often struggle to strike a balance between strong reasoning capabilities and robust generalization. To address these limitations, we propose CoTE-SQL to enhance the LLM-based text-to-SQL generation with three key innovations: (i) self-enhanced reasoning traces distilled from LLMs without human annotation, (ii) structured chain-of-thought (CoT) prompting with modular decomposition and examples retrieval, and (iii) error-aware revision based on SQL execution feedback. Extensive experiments on the Spider and Bird benchmarks demonstrate that CoTE-SQL achieves new state-of-the-art performance among methods built on open-source LLMs with comparable model sizes on Bird (53.39% EX / 59.02 VES) and strong results on Spider (79.60% EX / 77.19 VES), with especially significant gains on complex queries. Results highlight the effectiveness of combining self-enhancement, structured reasoning, and execution-time feedback within an LLM-based framework for text-to-SQL design.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2601.16093

SAMTok: Representing Any Mask with Two Words

作者:

Yikang Zhou ↗Tao Zhang ↗Dengxian Gong ↗Yuanzheng Wu ↗Ye Tian ↗Haochen Wang ↗Haobo Yuan ↗Jiacong Wang ↗Lu Qi ↗Hao Fei ↗Anran Wang ↗Zhuochen Wang ↗…

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2601.05106

Token-Level LLM Collaboration via FusionRoute

作者:

Nuoya Xiong ↗Yuhang Zhou ↗Hanqing Zeng ↗Zhaorun Chen ↗Furong Huang ↗Shuchao Bi ↗Lizhu Zhang ↗Zhuokai Zhao ↗

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11619

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

作者:

Zongwu Xie ↗Yifan Yang ↗Yonglong Zhang ↗Guanghu Xie ↗Yang Liu ↗Shuo Zhang ↗

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation.The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2606.18820

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

作者:

Jiaxi Liu ↗Aiping Yang ↗Yuhang Yang ↗Shuqi Zhang ↗Zewei Dong ↗Jiangming Yang ↗Xuebin Chen ↗

arXiv:2606.18820v1 Announce Type: cross Abstract: Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information–action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information–action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2601.19506

Bridging Information Asymmetry: A Hierarchical Framework for Blind Face Restoration with Reduced Uncertainty

作者:

Zhengjian Yao ↗Jiakui Hu ↗Kaiwen Li ↗Hangzhou He ↗Xinliang Zhang ↗Shuang Zeng ↗Lei Zhu ↗Yanye Lu ↗

Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative paradigms, while capable of synthesizing realistic facial details, remain limited by the under-constrained nature of blind restoration, where severely degraded inputs can be mapped to plausible yet identity-inconsistent outputs. To address this issue, we present Pref-Restore, a hierarchical framework for BFR with reduced restoration uncertainty. Our design is organized around three complementary principles: (1) Semantic Information Augmentation, where an auto-regressive semantic branch converts image and text cues into structured tokens that provide a stable high-level anchor; (2) Texture-level Fidelity Alignment, where the diffusion generator is trained under this anchor to recover identity-relevant details; and (3) Fidelity-constrained Preference Optimization, where a face-aware reward refines the diffusion trajectory while controlling the quality–fidelity trade-off. Extensive experiments on synthetic and real-world benchmarks show that Pref-Restore achieves state-of-the-art performance, with stronger identity-sensitive fidelity and lower restoration uncertainty across repeated sampling. Systematic ablations further attribute these gains to the proposed hierarchical design, showing the necessity of staged training, the robustness and quality dependence of the text pathway, and the benefit of fidelity-constrained preference optimization.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13368

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

作者:

Tao Hu ↗Jiaxin Ai ↗Licheng Wen ↗Xueheng Li ↗Shu Zou ↗Siqi Li ↗Nianchen Deng ↗Xinyu Cai ↗Hongbin Zhou ↗Pinlong Cai ↗Daocheng Fu ↗Yu Yang ↗…

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17546

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

作者:

Congjie Zheng ↗Chuanyi Xue ↗Bin Liang ↗Jun Yang ↗Changshui Zhang ↗

arXiv:2606.17546v1 Announce Type: new Abstract: Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13432

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

作者:

Jiwen Liu ↗Shujuan Li ↗Zhixue Fang ↗Xiaohan Li ↗Yan Zhou ↗Zijie Meng ↗Zhimin Zhang ↗Yawen Luo ↗Guoxin Zhang ↗Yu-Shen Liu ↗Pengfei Wan ↗

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on a million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

阅读与讨论 → 访问原文 →

10.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16721

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

作者:

Ke Liu ↗Mengxuan Li ↗Yanyi Bao ↗Tianyun Zhang ↗Chong Chu ↗Jiajun Bu ↗Haishuai Wang ↗

arXiv:2606.16721v1 Announce Type: new Abstract: Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception–dynamics–planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2603.03485

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

作者:

Haoran Lu ↗Shang Wu ↗Songling Liu ↗Jianshu Zhang ↗Maojiang Su ↗Guo Ye ↗Chenwei Xu ↗Lie Lu ↗Pranav Maneriker ↗Fan Du ↗Manling Li ↗Zhaoran Wang ↗…

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluation that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17446

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

作者:

Haoran Lu ↗Mutian Shen ↗Shuyang Yu ↗Yu Xiao ↗Songling Liu ↗Jianshu Zhang ↗Shang Wu ↗Yue Chen ↗Guo Ye ↗Jiayi Wang ↗Zhaoran Wang ↗Han Liu ↗…

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline using vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials Project page: https://tourmaline-caramel-169490.netlify.app.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17800

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

作者:

Lichen Bai ↗Tianhao Zhang ↗Shitong Shao ↗Dingwei Tan ↗Qiyu Zhong ↗Zhengpeng Xie ↗Haopeng Li ↗Qinghao Huang ↗Dandan Shen ↗Tengjiao Ji ↗Wei Wang ↗Peicheng Wu ↗…

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planing. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2606.14502

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

作者:

Yongheng Zhang ↗Ziang Liu ↗Jiaxuan Zhu ↗Shuai Wang ↗Xiangqi Chen ↗Haojing Huang ↗Jiayi Kuang ↗Siyu Chen ↗Ao Shen ↗Hao Wu ↗Qiufeng Wang ↗Qian-Wen Zhang ↗…

arXiv:2606.14502v1 Announce Type: new Abstract: Large Language Models (LLMs) are undergoing a fundamental transformation from conversational generators into integrated AI systems capable of reasoning, action, memory, and self-improvement. We conceptualize this transition as a shift from Chatbot to Digital Colleague: from conversational answers to persistent work. We organize this transition along two tightly coupled dimensions. First, at the cognitive core level, LLMs are advancing from Chatbot-era "fast thinking" systems driven by next-token prediction toward Thinking LLMs that leverage inference-time computation, Chain-of-Thought reasoning, reflection, process supervision, and reinforcement learning to support more deliberate and reliable cognition. Second, at the tool-augmented task execution level, LLMs are progressing from tool-calling Agents that invoke external resources in an ad hoc manner toward OpenClaw-style workstation systems (OpenClaw) equipped with persistent Workspaces, skills, verification loops, and governance. The "Workspace + Skill" paradigm makes episodic tool use colleague-like via state persistence, reusable procedures, task closure, and experience reuse. We examine data construction shifts from instruction-response pairs to State-Action-Observation trajectories and evaluation from static benchmarks to sandboxed, auditable, self-evolving AI ecosystems.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11200

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

作者:

Chenyang Yang ↗Shen Yan ↗Yibo Yang ↗Litao Hu ↗Yuchen Liu ↗Yuan Zeng ↗Hanchao Yu ↗Yinan Zhu ↗Sumedha Singla ↗Brian Vanover ↗Huijun Qian ↗Zihao Wang ↗…

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2602.13344

FireRed-Image-Edit-1.0 Technical Report

作者:

Super Intelligence Team ↗Changhao Qiao ↗Chao Hui ↗Chen Li ↗Cunzheng Wang ↗Dejia Song ↗Jiale Zhang ↗Jing Li ↗Qiang Xiang ↗Runqi Wang ↗Shuang Sun ↗Wei Zhu ↗…

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/ .

阅读与讨论 → 访问原文 →

17.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20515

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

作者:

Yalun Dai ↗Hao Li ↗Shulin Tian ↗Runmao Yao ↗Yuhao Dong ↗Fangzhou Hong ↗Zhaoxi Chen ↗Fangfu Liu ↗Baoliang Tian ↗Dingwen Zhang ↗Tao Wang ↗Kim-Hui Yap ↗…

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce \textsc{S-Agent}, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, \textsc{S-Agent} reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, \textsc{S-Agent} casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that \textsc{S-Agent} consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on \textsc{S-Agent}-generated spatial trajectories \textsc{S-300K} yields \textsc{S-Agent-8B}, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.11709

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

作者:

Leyi Pan ↗Shuchang Tao ↗Yunpeng Zhai ↗Lingzhe Zhang ↗Zhaoyang Liu ↗Bolin Ding ↗Aiwei Liu ↗Lijie Wen ↗

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2605.13909

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

作者:

Erica Zhang ↗Fangzhao Zhang ↗Aneesh Pappu ↗Batu El ↗Jose Blanchet ↗Susan Athey ↗Jiashuo Liu ↗James Zou ↗

arXiv:2605.13909v2 Announce Type: replace-cross Abstract: Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19992

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

作者:

Mugeng Liu ↗Shuoqi Li ↗Yixuan Zhang ↗Yun Ma ↗

arXiv:2606.19992v1 Announce Type: cross Abstract: In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4\% and client-side traffic by up to 96.1\%, with larger gains under higher network latency and workflow complexity.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15284

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

作者:

Chenyang He ↗Xinyi Shao ↗Shun Huang ↗Bosong Huang ↗Daoqiang Zhang ↗Ming Jing ↗Cheng Ding ↗

arXiv:2606.15284v1 Announce Type: cross Abstract: Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.17030

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

作者:

Jie Zhang ↗Xiaoyue Chen ↗Anzhe Chen ↗Chenxu Lv ↗Deqing Li ↗Gengze Zhou ↗Hang Yin ↗Haoqi Yuan ↗Haoyang Li ↗Jiahao Li ↗Jiazhao Zhang ↗Jingren Zhou ↗…

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2602.03300

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

作者:

Jingyi Zhang ↗Tianyi Lin ↗Huanjin Yao ↗Xiang Lan ↗Shunyu Liu ↗Jiaxing Huang ↗

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11042

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

作者:

Liya Zhu ↗Jingzhe Ding ↗Jian Zhang ↗Jianbo Xue ↗Shihao Liang ↗Ge Zhang ↗Yi Zhu ↗Duju Zeng ↗Xiang Gao ↗Qingshui Gu ↗Mailun Gao ↗Huimin Che ↗…

arXiv:2606.11042v2 Announce Type: replace Abstract: Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2602.22638

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

作者:

Zhiheng Song ↗Jingshuai Zhang ↗Chuan Qin ↗Chao Wang ↗Chao Chen ↗Longfei Xu ↗Kaikui Liu ↗Xiangxiang Chu ↗Hengshu Zhu ↗

arXiv:2602.22638v2 Announce Type: replace Abstract: Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络