论文广场 - AcademicHub

01.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.14971

FastMix: Fast Data Mixture Optimization via Gradient Descent

作者:

Haoru Tan ↗Sitong Wu ↗Yanfeng Chen ↗Jun Xia ↗Ruobing Xie ↗Bin Xia ↗Xingwu Sun ↗Xiaojuan Qi ↗

arXiv:2606.14971v1 Announce Type: cross Abstract: While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code (https://github.com/hrtan/fastmix)

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2606.14694

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

作者:

Junlong Tong ↗Wenqi Xu ↗Yingqi Fan ↗Anhao Zhao ↗Xuan Lu ↗Yang Tan ↗Xiaoyu Shen ↗

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video stream, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think, and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15079

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

作者:

Ang Li ↗Ben Liu ↗Bin Han ↗Bin Hu ↗Bin Jing ↗Binbin Hu ↗Bing Li ↗Cai Chen ↗Caizhi Tang ↗Changxin Tian ↗Chao Huang ↗Chao Zhang ↗…

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

阅读与讨论 → 访问原文 →

04.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.16375

Scalable Graph State Generation with O(1) Local Feedforward in Quantum Networks

作者:

Xiaoyi Zheng ↗Chan-Tong Lam ↗Lin Chen ↗Zheng Xing ↗

arXiv:2606.16375v1 Announce Type: new Abstract: The development of quantum networks faces a key challenge: the contradiction between probabilistic long-range entanglement generation and finite coherence time. Existing routing protocols typically focus on global state computation or path optimization. As the network scales up, classical delays accumulate and exacerbate decoherence, leading to a decrease in entanglement fidelity. To reduce routing decision delays to levels far below the coherence time of qubits, we propose a protocol based on local measurement and classical feedforward. This protocol reduces the local decision complexity to amortized O(1) level, ensuring that the decision delay is always much smaller than the coherence time of qubits. We map this protocol onto a dual-species trapped-ion platform and perform hybrid simulations. The results show that the proposed protocol performs well in terms of both resource efficiency and time feasibility. Noise analysis indicates that readout fidelity is the main bottleneck of this protocol, but noise suppression can be achieved by employing an erasure transformation in the dual-species architecture, combined with spatial multiplexing and branch independence, thereby ensuring the generation of high-fidelity star subgraphs. This protocol provides a clear path to achieving high-fidelity star subgraphs. These subgraphs can serve as general modules, merging to construct arbitrary subgraphs, providing a feasible solution for future fault-tolerant distributed quantum computing.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.19849

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

作者:

Yang Tan ↗Junlong Tong ↗Linan Yue ↗Hao Wu ↗Pengfei Fang ↗Xiaoyu Shen ↗

Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2604.10242

MedVeriSeg: Teaching LISA-Like Medical Segmentation Models to Verify Query Validity Without Extra Training

作者:

Qinyue Tong ↗Xiaozhen Wang ↗Ziqian Lu ↗Jun Liu ↗Yunlong Yu ↗Zheming Lu ↗

Despite recent progress in text-prompt-based medical image segmentation, existing LISA-like MLLM-based methods typically generate masks regardless of whether the target specified in the query is present, leading to hallucinated segmentation. In this work, we propose MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. MedVeriSeg first quantifies the response quality between the [SEG] token and image features through a Similarity Response Quality Scoring Module. To further improve robustness, it employs a Lightweight Routed Multi-Agent Verification Module, which fuses quantitative score evidence with qualitative agent evidence to comprehensively verify the validity of the query. To support systematic evaluation, we construct MedVeriSeg-Bench, a benchmark designed for query verification in medical image segmentation. Experimental results demonstrate that MedVeriSeg effectively identifies false segmentation queries and reduces hallucinated segmentation, while maintaining a high acceptance rate for valid queries, thereby largely preserving the segmentation utility of LISA-like medical segmentation models.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17846

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

作者:

Haoqi Yuan ↗Zhixuan Liang ↗Anzhe Chen ↗Ye Wang ↗Haoyang Li ↗Pei Lin ↗Yiyang Huang ↗Zixing Lei ↗Tong Zhang ↗Jiazhao Zhang ↗Jie Zhang ↗Jingyang Fan ↗…

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2509.10005

TUNI: Unifying Pre-training and Fine-tuning with Modality-Aware Mutual Learning and Rectification for RGB-T Semantic Segmentation

作者:

Xiaodong Guo ↗Xianda Guo ↗Tong Liu ↗Zhihong Deng ↗Yanlun Peng ↗Xiang Li ↗Wujie Zhou ↗

RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing RGB-T segmentation frameworks suffer from suboptimal multi-modal feature extraction and fusion, unbalanced modality dependency, and inadequate utilization of thermal information. To address these challenges, we propose TUNI, a unified pre-training and fine-tuning framework for efficient and real-time RGB-T semantic segmentation. It pre-trains an RGB-T encoder that incorporates an RGB-T local module that selectively emphasizes salient consistent and distinct local features across modalities, thereby integrating cross-modal feature extraction and fusion in a unified manner. To alleviate the modality bias issue during RGB-T pre-training, modality-inverted contrastive mutual learning is introduced to enable knowledge exchange between two RGB-dominated and thermal-dominated encoders. In the fine-tuning phase, modality rectification learning fully exploits residual thermal information by focusing on correct yet divergent prediction regions between two modality-specific decoders. We further develop three TUNI variants, covering lightweight, balanced, and high-performance requirements. Extensive experiments on five RGB-T semantic segmentation datasets demonstrate that TUNI achieves superior accuracy, generalization, and compactness compared with 15 state-of-the-art models. The code is available at https://github.com/xiaodonguo/TUNI-v2.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.LG) 2026-06-18 DOI: arXiv:2412.16468

The Road to Artificial SuperIntelligence: A Comprehensive Survey of Superalignment

作者:

HyunJin Kim ↗DongHyun Ryu ↗Xiaoyuan Yi ↗Jing Yao ↗Jianxun Lian ↗Muhua Huang ↗Shitong Duan ↗JinYeong Bak ↗Xing Xie ↗

arXiv:2412.16468v4 Announce Type: replace Abstract: The emergence of large language models (LLMs) has sparked discussion on Artificial Superintelligence (ASI), a hypothetical AI system that surpasses human intelligence. Although ASI remains hypothetical and far beyond current AI capabilities, discussing its potential and exploring its feasibility and potential risks is critical for the development of future AI systems. The idea of superalignment originates from scalable oversight, which studies how to supervise increasingly capable AI systems when direct human supervision becomes insufficient. In this paper, we focus on the superalignment problem: "The process of supervising, controlling, and governing artificial superintelligence." We first review scalable oversight paradigms-Sandwiching, Self-Enhancement, and Weak-to-Strong Generalization – then analyze the limitations of current paradigms through the lens of possibility and impossibility, discuss key challenges, and propose pathways for the safe and continual improvement of future AI systems.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2602.12670

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

作者:

Xiangyi Li ↗Yimin Liu ↗Wenbo Chen ↗Bingran You ↗Zonglin Di ↗Yifeng He ↗Shenghan Zheng ↗Kyoung Whan Choe ↗Jiankai Sun ↗Shuyi Wang ↗Chujun Tao ↗Binxu Li ↗…

arXiv:2602.12670v4 Announce Type: replace Abstract: Agent Skills are structured packages of procedural knowledge that augment large language model (LLM) agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark whose current inventory contains 87 tasks across 8 domains paired with curated Skills and deterministic verifiers. Our latest aggregate evaluation runs the 87-task benchmark under matched no-Skills and curated-Skills conditions for 18 model-harness configurations. Curated Skills raise the average pass rate from 33.9% to 50.5% (+16.6 percentage points; 25.5% normalized gain), with configuration-level gains ranging from +4.1 to +25.7 pp. Focused Skills with at most three modules outperform larger or exhaustive bundles, and smaller models with Skills can match larger models without them. SkillsBench establishes paired evaluation as the foundation for rigorous measurement of Skill efficacy on agentic, expertise-heavy work.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2603.01131

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

作者:

Yuqi Zhan ↗Xinyue Wu ↗Tianyu Lin ↗Yutong Bao ↗Xiaoyu Wang ↗Weihao Cheng ↗Huangwei Chen ↗Feiwei Qin ↗Zhu Zhu ↗

arXiv:2603.01131v3 Announce Type: replace-cross Abstract: Clinical diagnosis is a gradual process of evidence integration, in which physicians move from symptoms and medical history to examinations, competing hypotheses, disease relations, and treatment decisions. Large language models have advanced medical text understanding and generation. Yet their clinical use remains limited by weak evidence grounding, opaque reasoning, and inconsistent links among differential diagnosis, final diagnosis, diagnostic basis, and treatment planning. We introduce MedCollab, a multi-agent framework for full-cycle clinical diagnosis and report generation. MedCollab coordinates specialist and examination agents according to patient records. It structures agent deliberation with an Issue-Based Information System (IBIS) protocol, so that each diagnostic position is supported by patient-specific evidence and medical knowledge. It also builds Hierarchical Disease Relation Chains (HDRC) to connect accepted hypotheses through progression, complication, and comorbidity relations. During multi-round deliberation, a verifier-guided consensus module evaluates evidence support, medical plausibility, and logical conflicts. It then adjusts agent contributions and filters unsupported reasoning. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms leading LLMs and medical multi-agent baselines in diagnostic accuracy, evidence consistency, and clinical reasoning quality. These results indicate that structured and auditable collaboration can produce more faithful and clinically coherent diagnostic reports.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.15129

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP–OCT Pretraining

作者:

Zhuo Deng ↗Ruiheng Zhang ↗Ziheng Zhang ↗Weihao Gao ↗Yitong Li ↗Qian Wang ↗Lei Shao ↗Jiaoyue Dong ↗Zhixi Zeng ↗Lijian Fang ↗Haibo Wang ↗Xiaobin Lin ↗…

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP–OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP–OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs.~0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2603.04592

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

作者:

Junlong Tong ↗Zilong Wang ↗YuJie Ren ↗Peiran Yin ↗Hao Wu ↗Wei Zhang ↗Xiaoyu Shen ↗

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19348

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

作者:

DeepSeek-AI ↗Anyi Xu ↗Bangcai Lin ↗Bing Xue ↗Bingxuan Wang ↗Bingzheng Xu ↗Bochao Wu ↗Bowei Zhang ↗Chaofan Lin ↗Chen Dong ↗Chenchen Ling ↗Chengda Lu ↗…

We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.02800

Cosmos 3: Omnimodal World Models for Physical AI

作者:

NVIDIA ↗Aditi ↗Niket Agarwal ↗Arslan Ali ↗Jon Allen ↗Martin Antolini ↗Adeline Aubame ↗Alisson Azzolini ↗Junjie Bai ↗Maciej Bala ↗Yogesh Balaji ↗Josh Bapst ↗…

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11670

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

作者:

Zijie Meng ↗Jiwen Liu ↗Yufei Liu ↗Chengzhuo Tong ↗Xiaoqiang Liu ↗Yuanxing Zhang ↗Yulong Xu ↗Pengfei Wan ↗

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17199

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

作者:

Anhao Zhao ↗Junlong Tong ↗Yingqi Fan ↗Ping Nie ↗Wenjie Li ↗Xiaoyu Shen ↗

arXiv:2606.17199v1 Announce Type: cross Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling fail as they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2605.02427

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

作者:

Tu Nguyen ↗Matthieu Zimmer ↗Rasul Tutunov ↗Xiaotong Ji ↗Haitham Bou Ammar ↗

arXiv:2605.02427v3 Announce Type: replace Abstract: A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. AMore broadly, APPS improves the accuracy–runtime trade-off of training-free decoding, further supporting the view that inference-time power approximation can recover gains often attributed to post-training.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2405.06995

Benchmarking Cross-Domain Audio-Visual Deception Detection

作者:

Xiaobao Guo ↗Zitong Yu ↗Nithish Muthuchamy Selvaraj ↗Bingquan Shen ↗Adams Wai-Kin Kong ↗Alex C. Kot ↗

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, that enables us to assess how well these methods generalize for use in real-world scenarios. We used widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further exploit the impacts using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain for multi-to-single domain generalization evaluation. We also propose an algorithm to enhance the generalization performance by maximizing the gradient inner products between modality encoders, named ``MM-IDGM". Furthermore, we proposed the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.15169

Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

作者:

Pengxiao Han ↗Changkun Ye ↗Yanshuo Wang ↗Jinguang Tong ↗Miaohua Zhang ↗Xuesong Li ↗Jie Hong ↗Lars Petersson ↗

Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. The extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.

阅读与讨论 → 访问原文 →

21.

Nature (Science) 2026-06-10 DOI: HASH:ab674c8e3911f8ef037f20b90aaac355

A 5.3-million-year-old deep-sea whale necropolis in the Diamantina Zone

作者:

Xiaotong Peng ↗

Whale falls are biodiversity oases at seabeds1–6, yet their record from the oceans has remained sparse and fragmentary6,7. Here we report the discovery of a vast whale necropolis in the Diamantina Zone (4,616- to 7,001-m depth), extending about 1,200 km along the sea floor of the southeastern Indian Ocean. This area has a deep and extensive accumulation comprising five modern natural whale-fall communities and 476 fossil cetaceans recorded. We show that carcasses host specialized communities dominated by brittle stars, bone-boring worms and chemosynthesis-based bivalves and that the fossil record in this area comprises both extant and extinct deep-diving beaked whales. Isotopic dating shows that whale falls in this region have occurred since at least 5.3 million years ago. These findings reshape the understanding of the limits and biogeography of whale-fall ecosystems and establish some deep sea floors as a fossil archive for tracing cetacean evolution over geological time. Researchers uncovered an enormous deep-sea accumulation of whale remains in the southeastern Indian Ocean, showing long-term, specialized ecosystems and an extensive fossil record that offers new insight into deep-ocean biodiversity and whale evolutionary history.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2602.09379

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

作者:

Shihao Xu ↗Tiancheng Zhou ↗Jiatong Ma ↗Yanli Ding ↗Yiming Yan ↗Ming Xiao ↗Guoyi Li ↗Haiyang Geng ↗Yunyun Han ↗Jianhua Chen ↗Yafeng Deng ↗

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.11926

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

作者:

Jiajie Jin ↗Yuyang Hu ↗Kai Qiu ↗Qi Dai ↗Chong Luo ↗Guanting Dong ↗Xiaoxi Li ↗Tong Zhao ↗Xiaolong Ma ↗Gongrui Zhang ↗Zhirong Wu ↗Bei Liu ↗…

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2606.18257

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

作者:

Xiaolong Wang ↗Zhe Zhao ↗Song Lai ↗Chaoli Zhang ↗Zijie Geng ↗Yu Tong ↗Ye Wei ↗Qingsong Wen ↗

arXiv:2606.18257v1 Announce Type: cross Abstract: While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom's Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human–AI evaluation protocol, we generate and analyze 20{,}700 questions across computer science, K–12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45\% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53\% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3's superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16456

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

作者:

Weiqiao Shan ↗Ruixiang Mao ↗Yuang Li ↗Yuhao Zhang ↗Yingfeng Luo ↗Tong Zheng ↗Chen Xu ↗Yucheng Qiao ↗Chunxiang Jin ↗Yi Yuan ↗Jingdong Chen ↗Tong Xiao ↗…

arXiv:2606.16456v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络