Paper Plaza - AcademicHub

01.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.04621

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

Authors:

Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi

We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

02.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.02800

Cosmos 3: Omnimodal World Models for Physical AI

Authors:

NVIDIA, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, …

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

03.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16337

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Authors:

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

04.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.12189

DynaTok: Token-Based 4D Reconstruction from Partial Point Clouds

Authors:

Weirong Chen, Keisuke Tateno, Hidenobu Matsuki, Michael Niemeyer, Daniel Cremers, Federico Tombari

We address 4D reconstruction from partial point cloud sequences, where depth-sensor observations are incomplete, unordered, and lack explicit temporal correspondences. This geometry-only setting is challenging due to missing observations and ambiguous dynamics. While recent progress has largely relied on image-based methods, existing point-based approaches typically focus on single objects, assume relatively complete inputs, or require explicit correspondences. To address these limitations, we propose DynaTok, a point-based framework for correspondence-free 4D reconstruction from partial point cloud sequences without images. DynaTok encodes frames into compact latent tokens, aggregates incomplete observations over time with a Transformer-based spatiotemporal encoder, and decouples geometry and motion through residual tokens in a unified model. A flow-matching decoder then reconstructs complete, temporally consistent 4D point-cloud sequences conditioned on the latent tokens. Experiments on object- and scene-level benchmarks demonstrate improved reconstruction quality and temporal coherence from partial point cloud observations. Project page: https://wrchen530.github.io/dynatok/.

05.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16939

Scalable Circuit Learning for Interpreting Large Language Models

Authors:

Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.

06.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20521

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Authors:

Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, …

Embodied foundation models are expected to benefit from data scaling like large language models, but face a much tighter data bottleneck. Teleoperated real-robot trajectories remain the dominant pretraining source due to their precise action supervision and embodiment alignment, yet their scalability is limited by high collection cost, acquisition difficulty, and low behavioral and environmental diversity. These limitations have sparked interest in egocentric human video as a scalable, substantially lower-cost, and more diverse alternative for embodied model pretraining. However, its effectiveness compared to teleoperated real-robot data remains underexplored. To address this question, we conduct a systematic study comparing egocentric human video and teleoperated real-robot trajectories as pretraining data sources for embodied foundation models, under fixed post-training and validation protocols. Surprisingly, we find that egocentric data, when processed through a carefully designed filtering and labeling pipeline, is not merely a viable substitute for model pretraining but can lead to superior performance. With the same amount of pretraining data, models pretrained on egocentric data achieve a 24% lower validation loss on real-robot action prediction, as well as 52.5% and 90% higher success rates on in-distribution and out-of-distribution real-robot task execution, respectively. This finding verifies a scalable paradigm for embodied foundation models: pretrain on egocentric human video to learn diverse world representations, then adapt with a small amount of labeled real-robot data for action-space alignment. We hope this study encourages broader exploration of egocentric data and offers guidance for data quality assessment before costly robot data collection.

07.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.12365

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

Authors:

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment, a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
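The idea of noise-dependent data usage described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the normalization of diffusion time to [0, 1], and the thresholds `t_low` and `t_high` are all assumptions chosen for illustration. Clean samples always contribute to the loss, while suboptimal samples contribute only at low or high diffusion times.

```python
from typing import Sequence


def uses_sample(is_suboptimal: bool, t: float,
                t_low: float = 0.2, t_high: float = 0.8) -> bool:
    """Decide whether a sample contributes to the loss at diffusion time t.

    t is assumed to lie in [0, 1]; t_low and t_high are hypothetical thresholds.
    """
    if not is_suboptimal:
        return True  # high-quality data is used at every diffusion time
    # Suboptimal data is used only at the low and high ends of the time range.
    return t <= t_low or t >= t_high


def masked_mean_loss(per_sample_loss: Sequence[float],
                     is_suboptimal: Sequence[bool],
                     times: Sequence[float],
                     t_low: float = 0.2, t_high: float = 0.8) -> float:
    """Average the per-sample losses over the samples that pass the mask."""
    kept = [
        loss
        for loss, sub, t in zip(per_sample_loss, is_suboptimal, times)
        if uses_sample(sub, t, t_low, t_high)
    ]
    return sum(kept) / len(kept) if kept else 0.0


# One clean sample and three suboptimal samples at different diffusion times.
batch_loss = masked_mean_loss(
    per_sample_loss=[1.0, 2.0, 3.0, 4.0],
    is_suboptimal=[False, True, True, True],
    times=[0.5, 0.5, 0.1, 0.9],
)
```

In this toy batch the suboptimal sample at t = 0.5 is dropped, so the loss averages the remaining three values.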

08.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.15127

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

Authors:

Xian Sun, Wei Gao, Yingshuo Wang, Lingdong Kong, Yanhang Li, Zhichao Fan, Zexin Zhuang, Wenlong Dong, Zhiyuan Zheng, Hrishikesh Paranjape, Abhishek Mandal, Johnny R. Zhang, …

Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit workflows may inspect traces for misleading or biased input. In such settings, two responses can receive the same final-answer score while differing in whether the trace explicitly flags injected biasing content. Accuracy-only evaluation collapses these cases. We study this gap as a measurement blind spot for responsible evaluation and introduce a minimal trace-level diagnostic with two axes: susceptibility (whether the bias breaks a previously correct answer) and acknowledgment (whether the trace contains a rubric-defined surface reference to the injected content). Across thousands of biased GSM8K trials, GPT-4o and Claude Sonnet 4 have similar susceptibility rates (1.3% vs. 1.2%) but substantially different acknowledgment rates (13.0% vs. 75.0%) under the same rubric.
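The two axes of this diagnostic can be sketched as simple rates over a set of trials. The `Trial` record, its field names, and the choice of denominators are assumptions for illustration only; the abstract does not say how the authors normalize either rate. Here susceptibility is computed over trials that were correct without the bias, and acknowledgment over all biased trials.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One biased trial (hypothetical record layout)."""
    correct_clean: bool   # answer was correct without the injected bias
    correct_biased: bool  # answer was correct with the injected bias
    acknowledged: bool    # trace references the injected content per the rubric


def susceptibility_rate(trials: Sequence[Trial]) -> float:
    """Fraction of previously correct answers that the bias breaks."""
    eligible = [t for t in trials if t.correct_clean]
    if not eligible:
        return 0.0
    return sum(not t.correct_biased for t in eligible) / len(eligible)


def acknowledgment_rate(trials: Sequence[Trial]) -> float:
    """Fraction of biased traces that reference the injected content."""
    if not trials:
        return 0.0
    return sum(t.acknowledged for t in trials) / len(trials)


trials = [
    Trial(correct_clean=True, correct_biased=True, acknowledged=True),
    Trial(correct_clean=True, correct_biased=False, acknowledged=False),
    Trial(correct_clean=True, correct_biased=True, acknowledged=False),
    Trial(correct_clean=False, correct_biased=False, acknowledged=True),
]
susceptibility = susceptibility_rate(trials)
acknowledgment = acknowledgment_rate(trials)
```

Keeping the two rates separate is the point of the diagnostic: two models can share a susceptibility rate while differing widely in acknowledgment.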

09.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.16124

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Authors:

Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

10.

arXiv (CS.LG) 2026-06-18 DOI: arXiv:2506.13196

KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction

Authors:

Han Liu, Keyan Ding, Peilin Chen, Yinwei Wei, Liqiang Nie, Dapeng Wu, Shiqi Wang

Accurate prediction of protein-ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking their valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain-specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine-grained joint embeddings for prediction. Experiments on two benchmark datasets across both in-domain and cross-domain scenarios demonstrate that KEPLA consistently outperforms state-of-the-art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.

11.

medRxiv (Medicine) 2026-06-15 DOI: HASH:09934c3d6b0d32a1f47cd32b933431d5

Excitation-Inhibition Balance in Schizophrenia Spectrum Disorders: EEG Criticality Reflects Frontal Metabolites and a Potential Compensatory Mechanism

Authors:

Hasanaj, Kallweit, M. S, Karsli, Meisinger, Boudriot, Roell, Melcher, Vural, Schulz, Klimas, Schmoelz, …

Background: The excitation-inhibition (E-I) balance is essential for normal brain functioning, while deviations from this balance have been implicated in several psychiatric disorders. However, the extent to which electroencephalography (EEG) and proton magnetic resonance spectroscopy (1H-MRS) E-I markers are altered in schizophrenia spectrum disorders (SSD), how they converge across modalities, and how they relate to cognitive performance and clinical symptoms remain insufficiently characterized. Methods: We recruited 111 healthy controls (HC) and 113 individuals with SSD. All participants underwent resting-state EEG and 1H-MRS. Metabolites were measured either in the anterior cingulate cortex (ACC; N_SSD = 63, N_HC = 58) or in the left dorsolateral prefrontal cortex (lDLPFC; N_SSD = 50, N_HC = 53), from which gamma-aminobutyric acid (GABA), glutamate + glutamine (Glx), and the Glx/GABA ratio were extracted. Extracted EEG E-I markers included oscillatory activity, aperiodic activity, functional E-I, microstates, multiscale entropy, and neuronal avalanche criticality. Results: MRS results showed no group differences in GABA, Glx, or the Glx/GABA ratio. In contrast, most EEG-derived E-I markers indicated increased cortical inhibition in SSD, including steeper aperiodic exponents, prolonged microstate durations, and greater prevalence of subcritical states. However, functional E-I showed a divergent pattern, suggesting balanced dynamics in SSD and relatively inhibition-weighted dynamics in HC. Across groups, higher ACC and lDLPFC GABA predicted a lower kappa index, whereas a higher lDLPFC Glx/GABA ratio was associated with a higher kappa index. In SSD, reduced avalanche criticality was associated with better cognition and less severe symptoms. Conclusion: Several EEG-derived E-I proxies, but not MRS measures, indicate an increased cortical inhibition in SSD. Criticality indices best capture frontal neurochemical metabolites and improvements in clinical symptoms, potentially reflecting inhibitory compensation mechanisms in SSD.

12.

medRxiv (Medicine) 2026-06-15 DOI: HASH:3d6062ab9384e2472bd9996f1f3c83ae

Long-read sequencing enables high-accuracy mitochondrial heteroplasmy detection in Parkinson's disease

Authors:

Lüth, Schaake, Much, Belyea, M. M, Seibler, Grünewald, May, Klein, Weissensteiner, Trinh

Background: Low-frequency heteroplasmic mitochondrial DNA (mtDNA) variants are associated with aging and neurological diseases, including Parkinson's disease (PD). Targeted deep mtDNA sequencing using PacBio HiFi long reads has the potential to resolve heteroplasmy across the full mitochondrial genome with high accuracy. Methods: To validate Vega PacBio sequencing for detecting mtDNA heteroplasmy, we analyzed four predefined mixtures of two mtDNA haplotypes. We generated a single long-range PCR amplicon covering the entire mitochondrial genome. These amplicons were mixed at predefined ratios (minor mixture haplotype component: 5%, 2%, 1%, and 0.1%). Variant calling was performed using Mutserve2, and accuracy was assessed by calculating the F1 score from comparisons between expected and detected variants. Full-length mtDNA PacBio sequencing was applied to investigate heteroplasmy across fibroblast passages derived from five LRRK2 p.Gly2019Ser variant carriers (n=3 affected with PD and n=2 unaffected carriers). Changes in mtDNA heteroplasmy level and variant load were assessed longitudinally using a linear mixed model. Results: The single-amplicon approach enabled full-length haplotype resolution without amplification bias associated with overlapping PCR strategies. The F1 score of the predefined mixtures was 1.0 for heteroplasmy levels between 5% and 1% and remained high (0.91) at 0.1%. We detected n=10/62 variants discordant with the Illumina reference at the 0.1% mixture, but sensitivity remained very high at 1.00 in that mixture. Detected minor variants closely matched expected heteroplasmy levels, with average variant levels of 0.057 (5%), 0.022 (2%), 0.011 (1%), and 0.001 (0.1%). Across twelve fibroblast passages, we observed fewer mtDNA heteroplasmic variants (β=-3.2, p=0.026). Increased heteroplasmic variant load over time was also associated with older age (β=1.50, p=0.001) and PD affection status (β=5.0, p=1.0 × 10⁻⁴) in LRRK2 variant carriers. Notably, we observed distinct patterns of heteroplasmic variants that either increased or decreased in heteroplasmy level across passages. Conclusion: PacBio HiFi sequencing, combined with a single-amplicon strategy, enables accurate full-length mtDNA heteroplasmy detection and longitudinal analysis, providing a valuable tool for studying mitochondrial variation and dynamics in disease.
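The F1 figure reported for the 0.1% mixture can be checked with a short calculation. The counts below are an assumed reading of the abstract, not values stated by the authors: the 10 discordant calls out of 62 detected variants are treated as false positives, and the reported sensitivity of 1.00 is taken to mean no false negatives.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Assumed counts for the 0.1% mixture: 62 detected, 10 discordant, none missed.
detected, discordant = 62, 10
tp = detected - discordant  # 52 concordant calls
f1_at_0_1_percent = f1_score(tp=tp, fp=discordant, fn=0)
```

Under these assumptions the score is 104/114, about 0.912, which agrees with the 0.91 quoted in the abstract.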

13.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.12708

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages

Authors:

Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture typological key features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.

14.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.12826

DIMOS: Disentangling Instance-level Moving Object Segmentation

Authors:

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

15.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11614

Information-Theoretic Decomposition for Multimodal Interaction Learning

Authors:

Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.

16.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17800

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

Authors:

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, …

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

17.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2602.14367

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Authors:

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, …

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

18.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.11209

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Authors:

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps – a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct
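The rollout-based process reward described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `continue_fn` (which samples a final answer from a reasoning prefix), `verify_fn` (which checks that answer), and `num_rollouts` are hypothetical names, and a toy deterministic sampler stands in for a real model.

```python
from typing import Callable, List, Sequence


def rollout_step_rewards(
    steps: Sequence[str],
    continue_fn: Callable[[List[str]], str],
    verify_fn: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Score each reasoning step by the empirical success rate of its rollouts.

    For the prefix ending at each step, sample num_rollouts continuations and
    use the fraction whose final answer verifies as that step's reward.
    """
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = list(steps[:i])
        successes = sum(
            1 for _ in range(num_rollouts) if verify_fn(continue_fn(prefix))
        )
        rewards.append(successes / num_rollouts)
    return rewards


# Toy stand-ins: a prefix ending in a "good" step always leads to the right answer.
def toy_continue(prefix: List[str]) -> str:
    return "42" if prefix[-1] == "good" else "0"


def toy_verify(answer: str) -> bool:
    return answer == "42"


step_rewards = rollout_step_rewards(["good", "bad"], toy_continue, toy_verify)
```

With a stochastic sampler the rewards fall between 0 and 1, which gives the dense, step-level credit assignment that an outcome-only reward lacks.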

19.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2604.18827

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

Authors:

Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling – even in the mouse visual cortex, a relatively simple system – models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

20.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2604.01463

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

Authors:

Keshav Shankar ↗Dan Ding ↗Wei Gao ↗

Physically Assistive Robots require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause substantial physical and cognitive fatigue for users with severe motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework. This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, occupational therapists confirmed the generated policies are safe and accurately reflect user preferences.

21.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19419

Playful Agentic Robot Learning

Authors:

Junyi Zhang ↗Jiaxin Ge ↗Hanjun Yoo ↗Letian Fu ↗Zihan Yang ↗Yaowei Liu ↗Raj Saravanan ↗Shaofeng Yin ↗Justin Yu ↗Dantong Niu ↗Zirui Wang ↗Roei Herzig ↗…

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

22.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.14961

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

Authors:

Juming Xiong ↗Weixin Liu ↗Kevin Guo ↗Congning Ni ↗Junchao Zhu ↗Chongyu Qu ↗Chao Yan ↗Katherine Brown ↗Avinash Baidya ↗Xiang Gao ↗Bradley Malin ↗Zhijun Yin ↗…

Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence–rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support, where the rubric assesses grounding, coherence, task match, and connection to the selected answer without revealing the gold answer to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence–rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers, but rationales that substantively support them.
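To make the idea of a joint reward concrete, here is a minimal sketch of how three reward signals could be combined and turned into group-relative advantages, the normalization step GRPO is generally described as using. The weighted-sum form, the weights, and the function names are assumptions for illustration only; the abstract does not specify how the terms are combined.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(
    correct: bool,
    answer_prob: float,
    rubric_score: float,
    w_correct: float = 1.0,   # hypothetical weight
    w_prob: float = 0.5,      # hypothetical weight
    w_rubric: float = 0.5,    # hypothetical weight
) -> float:
    """Assumed weighted sum of correctness, committed-answer probability,
    and a rubric score in [0, 1] for rationale support."""
    return w_correct * float(correct) + w_prob * answer_prob + w_rubric * rubric_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group
    (population standard deviation is used here as a simplifying choice)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same question (illustrative values).
group = [
    combined_reward(True, 0.9, 0.8),   # correct, confident, well supported
    combined_reward(True, 0.9, 0.2),   # correct, confident, weak rationale
    combined_reward(False, 0.9, 0.2),  # wrong but confident, weak rationale
    combined_reward(False, 0.3, 0.6),  # wrong, unsure, partly supported
]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Under these assumed weights, a correct answer with a weak rationale receives a lower advantage than an equally confident correct answer with a well-supported rationale, which is the kind of pressure a rationale-aware reward is meant to apply.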

23.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.20005

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Authors:

Guangda Liu ↗Yiquan Wang ↗Chengwei Li ↗Wenhao Chen ↗Jing Lin ↗Yiwu Yao ↗Danning Ke ↗Wenchao Ding ↗Jieru Zhao ↗

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
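One way to see why a one-pass KL reduction is possible: for a single query row with score vectors $s$ and $t$, $\mathrm{KL}(\mathrm{softmax}(s)\,\|\,\mathrm{softmax}(t)) = \sum_j p_j (s_j - t_j) - \mathrm{lse}(s) + \mathrm{lse}(t)$, and every term can be accumulated tile by tile with running maxima, as in online softmax. The sketch below is a plain-Python illustration of that identity, not the paper's kernel or its exact formulation; the function names and the tile layout are assumptions.

```python
import math
from typing import Iterable, List, Tuple


def streaming_softmax_kl(tiles: Iterable[Tuple[List[float], List[float]]]) -> float:
    """KL(softmax(s) || softmax(t)) for one query row, consuming key tiles
    one at a time and keeping only a constant number of running scalars."""
    m_s = m_t = -math.inf  # running maxima of s and t
    l_s = l_t = 0.0        # running sums of exp(s - m_s) and exp(t - m_t)
    acc = 0.0              # running sum of exp(s_j - m_s) * (s_j - t_j)
    for s_tile, t_tile in tiles:
        new_m_s = max(m_s, max(s_tile))
        new_m_t = max(m_t, max(t_tile))
        # Rescale old accumulators to the new maxima (exp(-inf) is 0.0).
        scale_s = math.exp(m_s - new_m_s)
        scale_t = math.exp(m_t - new_m_t)
        l_s *= scale_s
        acc *= scale_s
        l_t *= scale_t
        for s, t in zip(s_tile, t_tile):
            w = math.exp(s - new_m_s)
            l_s += w
            acc += w * (s - t)
            l_t += math.exp(t - new_m_t)
        m_s, m_t = new_m_s, new_m_t
    lse_s = m_s + math.log(l_s)
    lse_t = m_t + math.log(l_t)
    return acc / l_s - lse_s + lse_t


def reference_kl(s: List[float], t: List[float]) -> float:
    """Materialize both distributions first, then reduce (the baseline approach)."""
    def softmax(x: List[float]) -> List[float]:
        m = max(x)
        e = [math.exp(v - m) for v in x]
        z = sum(e)
        return [v / z for v in e]
    p, q = softmax(s), softmax(t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


s_row = [0.5, -1.0, 2.0, 0.1, 3.0, -0.7]
t_row = [0.2, 0.3, 1.5, -0.4, 2.0, 0.0]
tiled = [(s_row[i:i + 2], t_row[i:i + 2]) for i in range(0, len(s_row), 2)]
kl_stream = streaming_softmax_kl(tiled)
kl_ref = reference_kl(s_row, t_row)
print(abs(kl_stream - kl_ref) < 1e-9)
```

The streaming version holds five scalars per query row regardless of the number of keys, which is the intuition behind the reduction in extra memory from $O(N_QN_K)$ to $O(1)$; a real GPU kernel would additionally fuse the score computation and parallelize across rows.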

24.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.05405

Agents' Last Exam

Authors:

Yiyou Sun ↗Xinyang Han ↗Weichen Zhang ↗Yuanbo Pang ↗Tianyu Wang ↗Yuhan Cao ↗Yixiao Huang ↗Chris Duroiu ↗Haoyun Zhang ↗Jeffrey Lin ↗Weishu Zhang ↗Tyler Zeng ↗…

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

25.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2601.18692

A Pragmatic VLA Foundation Model

Authors:

Wei Wu ↗Fan Lu ↗Yunnan Wang ↗Shuai Yang ↗Shi Liu ↗Fangjing Wang ↗Qian Zhu ↗He Sun ↗Yong Wang ↗Shuailei Ma ↗Yiyu Ren ↗Kejia Zhang ↗…

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5$\times$ to 2.8$\times$ speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.
