论文广场 - AcademicHub

01.

arXiv (CS.AI) 2026-06-24 DOI: arXiv:2606.24112

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

作者:

Chenhao Dang ↗Dantong Zhu ↗Jun Yang ↗Conghui He ↗Weijia Li ↗

arXiv:2606.24112v1 Announce Type: new Abstract: Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text–image framing errors. Existing benchmarks and methods remain poorly matched to this setting: they usually isolate short captions, single images, binary labels, or one manipulation source, while agentic verification remains costly under realistic evidence search. We present ReMMD, a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection. ReMMD includes ReMMDBench, a real-world multimodal misinformation detection benchmark with 500 samples, 2,756 images, five monolingual languages, two cross-lingual settings, three text-length tiers, multi-image posts, five-way veracity labels, eight distortion labels, evidence provenance, and rationales. It also includes ReMMD-Agent, a persistent-memory verifier that decomposes posts into atomic points, builds a reusable evidence set, and predicts structured L1/L2/L3 outputs. Across proprietary systems, open LVLMs, MMD-Agent, and T2-Agent, ReMMD-Agent obtains the best five-way veracity performance, with 41.80% accuracy and 39.12% macro-F1 using GPT-5.2, while reducing cost by 17.5% relative to MMD-Agent and 79.9% relative to T2-Agent. The project is available at https://dang-ai.github.io/ReMMD.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15455

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

作者:

Suqin Yuan ↗Jinkun Chen ↗Jiyang Zheng ↗Muyang Li ↗Lei Feng ↗Dadong Wang ↗Tao Xiang ↗Tongliang Liu ↗Bo An ↗

arXiv:2606.15455v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of overtraining: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose Bayesian Boundary Gating (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17591

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

作者:

Yanwei Cui ↗Xing Zhang ↗Yulong Zhang ↗Li Shao ↗Xiaofeng Shi ↗Guanghui Wang ↗Peiyang He ↗

arXiv:2606.17591v1 Announce Type: new Abstract: Training-free verbal reinforcement learning enables LLM agents to learn from world feedback – objective signals such as dynamic task outcomes, market returns, or demand forecasts – by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma – outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance – and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture – rules, evidence, and skills – connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15052

PANDA: An LLM-Enhanced Performance-Driven Analog Design Framework Bridging Design Intent and Layout Generation

作者:

Haoyi Zhang ↗Weijian Fan ↗Xiaohan Gao ↗Bingyang Liu ↗Runsheng Wang ↗Yibo Lin ↗

arXiv:2606.15052v1 Announce Type: cross Abstract: Traditional design of analog circuits heavily relies on manual interventions across topology, sizing, and layout, with prior automation addressing stages in isolation. In this work, we propose PANDA, an LLM-enhanced framework that bridges high-level design intent to final layout by actively managing cross-stage dependencies through guided topology synthesis, substructure-aware sizing, and constraint-driven layout generation. This shifts automation from algorithm-centric execution to intent-centric co-design, reducing turnaround time from days or weeks to hours while improving design performance.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11092

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

作者:

Yichao Zhong ↗Yidan Lu ↗Yuhang Lu ↗Tianyang Tang ↗Haoguang Mai ↗Yixuan Pan ↗Tianyu Li ↗Li Chen ↗Jingbo Wang ↗Zhongyu Li ↗Peng Lu ↗Hongyang Li ↗…

arXiv:2606.11092v2 Announce Type: replace-cross Abstract: Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold and progressively shifts optimization towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo demonstrates free-kick shot error 48.6% lower and shoot velocity 2.96x than prior work baselines. In real world on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in free-kick and moving-ball cases, accordingly. And the post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: https://opendrivelab.com/RoboNaldo.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15315

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

作者:

Tingting Yang ↗Chenhao Xue ↗Jun Chen ↗

arXiv:2606.15315v1 Announce Type: new Abstract: Personalized public transit routing in public transit systems remains challenging due to the difficulty of capturing and integrating diverse user preferences into routing algorithms. This paper presents ChatPlanner, a novel framework that leverages Large Language Models (LLMs) to enable preference aware public transit routing. Our approach employs fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to extract routing parameters and interpret nuanced user preferences from natural language queries, subsequently integrating these preferences into the objective function of a public transit routing algorithm. This study designs preference aware datasets incorporating eight personas and five contexts to establish scoring standards for both fine-tuning and RAG. This work conducted three experiments to validate the solutions' feasibility, extraction of routing information and preferences, and solution set quality and completeness. Results demonstrate that ChatPlanner generates feasible solutions reliably. Fine-tuning enforces the required output structure and learns general preference patterns, while RAG provides query-specific context to resolve imprecise or conversational expressions and calibrate continuous scores. The combination of both achieves the highest accuracy in routing information extraction and user preference interpretation. Results based on selected case studies show that by capturing user preferences, ChatPlanner identifies valuable solutions across different dimensions that existing route planners overlook, generating more valuable route alternatives. This research establishes a new paradigm for integrating natural language understanding into transportation optimization.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.20122

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

作者:

Zhibang Yang ↗Xinke Jiang ↗Yuzhen Xiao ↗Ruizhe Zhang ↗Yue Fang ↗XinFei Wan ↗Zhengxing Song ↗Yuxuan Liu ↗Yuheng Huang ↗Xu Chu ↗Junfeng Zhao ↗Yasha Wang ↗…

arXiv:2606.20122v1 Announce Type: new Abstract: Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

阅读与讨论 → 访问原文 →

08.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.16199

Efficient Magic State Factory Via Transversal Non-Clifford Gate

作者:

I-Chi Chen ↗Hrushikesh Pramod Patil ↗Huiyang Zhou ↗Andrew Sornborger ↗

arXiv:2606.16199v1 Announce Type: new Abstract: Magic-state preparation is a central component of fault-tolerant quantum computing. Recent theoretical and experimental successes in code-switch-based magic-state preparation have underscored the promise of these methods for quantum error correction. Similarly, magic-state cultivation has likewise been demonstrated in both numerical and experimental settings. However, a thorough comparison between magic-state cultivation and code-switch-based magic-state factories is still missing. In this work, we carry out end-to-end simulations of magic-state preparation using code switching and compare its resource requirements and performance against magic-state cultivation. As part of this analysis, we develop a lattice-surgery protocol for transfer between the doubled color code and the rotated surface code. We extend the complete code-switching protocol to the $d=5$ doubled color code and perform the corresponding end-to-end simulations. Finally, we propose two fault-tolerant magic-state preparation protocols that combine phase-kickback checks with a transversal non-Clifford gate.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2510.18289

Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding

作者:

Yiyang Li ↗Weixiang Sun ↗Tianyi Ma ↗Kaiwen Shi ↗Zheyuan Zhang ↗Yanfang Ye ↗

Food assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2602.22629

CRAG: Can 3D Generative Models Help 3D Assembly?

作者:

Zeyu Jiang ↗Sihang Li ↗Siqi Tan ↗Chenyang Xu ↗Juexiao Zhang ↗Julia Galway-Witham ↗Xue Wang ↗Scott A. Williams ↗Radu Iovita ↗Chen Feng ↗Jing Zhang ↗

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: https://ai4ce.github.io/CRAG/

阅读与讨论 → 访问原文 →

11.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13376

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

作者:

Yang Zhou ↗Ziheng Wang ↗Yuqin Lu ↗Haofeng Liu ↗Jun Liang ↗Shengfeng He ↗Jing Li ↗

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360$^\circ$ panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8~FPS on a single NVIDIA RTX~4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2512.21201

Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

作者:

Yu He ↗Da Huang ↗Zhenyang Liu ↗Zixiao Gu ↗Qiang Sun ↗Guangnan Ye ↗Yanwei Fu ↗Yu-Gang Jiang ↗

Zero-shot object navigation (ZSON) requires robots to find target objects in unseen environments without task-specific fine-tuning or pre-built maps, a key capability for general-purpose service robots. Yet methods that perform well in simulation often degrade in cluttered real-world scenes with severe occlusion and latent hazards, where large unseen regions make single-scene inference brittle and unsafe. We propose Schrödinger's Navigator, a belief-aware framework that reasons at inference time over multiple trajectory-conditioned imagined 3D futures. Given candidate paths, a trajectory-conditioned 3D world model predicts hypothetical observations and maintains a superposition of plausible scene realizations rather than committing to one map. An adaptive occluder-aware sampler directs imagination to uncertainty-critical regions, while a Future-Aware Value Map (FAVM) aggregates imagined futures for robust, proactive action selection. Experiments in simulation and on a physical Go2 quadruped show that Schrödinger's Navigator outperforms strong ZSON baselines, improving hidden-target discovery and risk-aware waypoint selection in occlusion-heavy navigation scenarios. These results highlight imagined 3D futures as a scalable and generalizable strategy for zero-shot navigation in uncertain real-world environments.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17730

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

作者:

Zhexiao Xiong ↗Yizhi Song ↗Hao Kang ↗Qing Yan ↗Liming Jiang ↗Jenson Yang ↗Zhoujie Fu ↗Stathi Fotiadis ↗Angtian Wang ↗Zichuan Liu ↗Bo Liu ↗Yiding Yang ↗…

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19943

SIMBA: ABidirectional Retrieval Forward Simulation Framework for Modeling FY-4A GIIRS Hyperspectral Infrared Radiances Toward NWP Applications

作者:

Jingdong Shen ↗Fu Wang ↗Qifeng Lu ↗Hao Huang ↗Chunqiang Wu ↗Chi Yang ↗Xiaofang Liu ↗

arXiv:2606.19943v1 Announce Type: cross Abstract: Hyperspectral infrared observations are an important data source for numerical weather prediction (NWP) because they provide rich information on the vertical structure of atmospheric temperature and humidity. However, most existing deep learning methods mainly focus on one-way retrieval from radiances to atmospheric profiles, while the reverse radiance simulation process and the consistency between atmospheric state space and radiance observation space are insufficiently considered. In this study, we propose SIMBA, a unified bidirectional retrieval-forward simulation framework for FY-4A GIIRS hyperspectral infrared radiance modeling toward NWP applications. The framework jointly performs atmospheric profile retrieval and radiance reconstruction, introduces a cycle-consistency constraint to strengthen the coupling between the two processes, and employs a bidirectional Mamba state-space module to capture long-range dependencies along pressure levels. Using collocated FY-4A GIIRS observations and ERA5 reanalysis data, the proposed method is evaluated for temperature retrieval, specific humidity retrieval, long-wave radiance reconstruction, and medium-wave radiance reconstruction. Experimental results show that SIMBA outperforms several representative deep learning baselines across both retrieval and reconstruction tasks, while ablation experiments confirm the contribution of the bidirectional design and cycle-consistency mechanism. These results demonstrate that the proposed framework is effective for joint atmospheric profile retrieval and hyperspectral infrared radiance modeling, and suggest potential for future Jacobian-related analysis and NWP-oriented extensions.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2606.14580

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

作者:

Liancheng Gong ↗Zhiyang Wang ↗Yiwei Xu ↗Julia Mendelsohn ↗

Identifying persuasive rhetorical cues is critical across domains, from detecting information manipulation and improving AI safety to advancing public health communication. We propose Persuasion Index (PI), a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication, and one transparent implementation using 55 sub-features built from lexicons and rule-based detectors. The taxonomy is modular: individual detectors can be replaced while preserving the theoretical structure. By evaluating PI on four public datasets varying in domain, style, and outcome measures, we show that PI provides a shared feature space for interpreting rhetorical patterns associated with persuasion-related outcomes. Linear models show that PI features carry meaningful predictive signal while remaining computationally lightweight. Dimension-level analyses reveal recurring associations between PI dimensions and persuasion outcomes across datasets, while also highlighting topic- and stance-specific variation. We release PI as an open-source package and web interface for principled and auditable analysis of human and AI-mediated communication.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2507.05163

4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

作者:

Yutian Chen ↗Shi Guo ↗Tianshuo Yang ↗Lihe Ding ↗Xiuyuan Yu ↗Jinwei Gu ↗Tianfan Xue ↗

Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.14801

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

作者:

Yifan Ruan ↗Chenyang Cao ↗Andreas Burger ↗Ali Pesaranghader ↗Kaveh Kamali ↗Jaehong Kim ↗Nandita Vijaykumar ↗Alan Aspuru-Guzik ↗Igor Gilitschenski ↗Nicholas Rhinehart ↗

arXiv:2606.14801v1 Announce Type: cross Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2602.12279

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

作者:

Leon Liangyu Chen ↗Haoyu Ma ↗Zhipeng Fan ↗Ziqi Huang ↗Animesh Sinha ↗Xiaoliang Dai ↗Jialiang Wang ↗Zecheng He ↗Jianwei Yang ↗Chunyuan Li ↗Junzhe Sun ↗Chu Wang ↗…

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18644

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

作者:

Chen Zhao ↗Xiantao Hu ↗Song Wu ↗Qian Wang ↗Chen Wu ↗Rui Xie ↗Jian Yang ↗Ying Tai ↗

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In the paper, we explore the benefits of discrete wavelet transformation and propose a spiking pyramid wavelet-based model (SPWM) for high-efficient and low-energy target. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications of resource-limited devices.

阅读与讨论 → 访问原文 →

20.

arXiv (quant-ph) 2026-06-19 DOI: arXiv:2606.19430

Solving Nonequilibrium Dynamics via Influence Matrix Bootstrap: Floquet-PXP Model

作者:

Xiao-Yang Yang ↗He-Ran Wang ↗Zhong Wang ↗

arXiv:2606.19430v1 Announce Type: new Abstract: Studies of integrable systems have profoundly deepened the fundamental understanding of quantum many-body physics. While equilibrium properties such as ground states and thermodynamics can often be characterized efficiently, accurately characterizing nonequilibrium integrable dynamics remains a significant challenge. Here, we address this problem in the "Rule 201" quantum cellular automaton, an integrable Trotterization of the PXP Hamiltonian. Using the tensor-network approach of the influence matrix, we develop local conditions called generalized zipper conditions that allow exact solutions of local dynamics. We also introduce a numerical bootstrap method for solving influence matrices with finite but relatively large bond dimensions. This uncovers a rich landscape of nonequilibrium behavior exhibiting initial-state dependence. As an example, we investigate the fate of persistent oscillating dynamics under local non-integrable perturbations, and present analytical results for non-thermal relaxation constrained by conservation laws. We also obtain numerically exact results for entanglement growth across a broad class of initial states. Furthermore, from an information-theoretic perspective, we identify a refined structure of multitime correlations termed the hidden Markov order: the memory encoded in the dynamics separates into finite-length and long-range distributed components, which becomes transparent in an exact split-index matrix-product-state representation of the influence matrix. Our approach enables unified investigations of nonthermalizing and thermalizing regimes of nonequilibrium dynamics within a single analytically tractable model, and can be tested experimentally in state-of-the-art quantum simulators such as Rydberg atom arrays.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13515

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

作者:

Hanyang Yu ↗Haitao Lin ↗Jingbo Zhang ↗Wenyao Zhang ↗Chenghao Gu ↗Heng Li ↗Ping Tan ↗

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.13452

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

作者:

Chenggang Yang ↗Chengzhi Zhang ↗

Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16152

The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

作者:

Haolong Qian ↗Xianliang Yang ↗Yinuo ma ↗Lirong Che ↗Feng Lu ↗Ye Guo ↗Lei Song ↗Jiang Bian ↗Chun Yuan ↗

arXiv:2606.16152v1 Announce Type: new Abstract: Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce Style-Aligned Refinement, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2602.07429

Brep2Shape: Boundary and Shape Representation Alignment via Self-Supervised Transformers

作者:

Yuanxu Sun ↗Yuezhou Ma ↗Haixu Wu ↗Guanyang Zeng ↗Muye Chen ↗Jianmin Wang ↗Mingsheng Long ↗

arXiv:2602.07429v2 Announce Type: replace-cross Abstract: Boundary representation (B-rep) is the industry standard for computer-aided design (CAD). While deep learning shows promise in processing B-rep models, existing methods suffer from a representation gap: continuous approaches offer analytical precision but are visually abstract, whereas discrete methods provide intuitive clarity at the expense of geometric precision. To bridge this gap, we introduce Brep2Shape, a novel self-supervised pre-training method designed to align abstract boundary representations with intuitive shape representations. Our method employs a geometry-aware task where the model learns to predict dense spatial points from parametric Bézier control points, enabling the network to better understand physical manifolds derived from abstract coefficients. To enhance this alignment, we propose a Dual Transformer backbone with parallel streams that independently encode surface and curve tokens to capture their distinct geometric properties. Moreover, the topology attention is integrated to model the interdependencies between surfaces and curves, thereby maintaining topological consistency. Experimental results demonstrate that Brep2Shape offers significant scalability, achieving state-of-the-art accuracy and faster convergence across various downstream tasks.Code is available at this repository: https://github.com/thuml/Brep2Shape.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11200

Detecting AI-Generated Content on Social Media with Multi-modal Language Models

作者:

Chenyang Yang ↗Shen Yan ↗Yibo Yang ↗Litao Hu ↗Yuchen Liu ↗Yuan Zeng ↗Hanchao Yu ↗Yinan Zhu ↗Sumedha Singla ↗Brian Vanover ↗Huijun Qian ↗Zihao Wang ↗…

Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present our pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络