论文广场 - AcademicHub

01.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.16154

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

作者:

Prasanth YSS ↗Zhichen Ren ↗Rasa Hosseinzadeh ↗Ilan Gofman ↗Yuqi Chen ↗Zhaoyan Liu ↗Guangwei Yu ↗Jesse C. Cresswell ↗Satya Krishna Gorti ↗

arXiv:2606.16154v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at https://github.com/layer6ai-labs/wapo.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2505.23823

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

作者:

Youngseung Jeon ↗Ziwen Li ↗Thomas Li ↗JiaSyuan Chang ↗Morteza Ziyadi ↗Xiang 'Anthony' Chen ↗

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2606.14357

No Accidental Software Agent First Canonical Code for Human Code Entropy Reduction and 30 to 500 times Lower Frontier Model Requirements

作者:

Jepson Taylor ↗

arXiv:2606.14357v1 Announce Type: cross Abstract: Frontier coding models may spend substantial capacity learning not only program behavior, but also accidental entropy in human repositories. Such repositories contain valuable signals: tests, incidents, migrations, edge cases, product judgment, and operational history. These signals are entangled with framework churn, naming drift, generated-source ambiguity, dependency rituals, CI dialects, weak proof routes, and human-oriented review customs. We propose agent-first canonical code, a proof-carrying substrate that rewrites routine product software into canonical behavior profiles, typed change algebra, proof lanes, constrained edit grammars, semantic patch cells, runtime negative memory, and proof-carrying change objects. The core hypothesis is that quotienting software by behavior equivalence under a declared oracle can collapse equivalent encodings into governed representatives with explicit evidence and proof obligations. The endpoint is amortized cost per verified correct change, including source, context, reasoning, tools, verification, security, provenance, review, failed loops, defects, and foundry cost under a common oracle. Reported reduction bands are hypotheses, not measured frontier results. The proposed limit is a No-Accident Horizon: removable accident decreases until residual novelty, evidence, governance, risk, and future optionality dominate. For supported routine-product distributions, this gives a defensible planning target near 100-fold all-in cost reduction, not a guarantee for all software. Preliminary QLoRA experiments on Qwen2.5-Coder-14B show that 64,088 canonical trajectories are learnable and suppress tested forbidden-language markers, but do not establish behavior preservation, scaling economics, or verified-change cost. The contribution is a falsifiable program centered on minimum functional description length and verified-change cost.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17115

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

作者:

Jingyu Hu ↗Giuseppe Tripodi ↗Reed Naidoo ↗Sarah F. McGough ↗Tapabrata Chakraborti ↗

arXiv:2606.17115v1 Announce Type: cross Abstract: Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.

阅读与讨论 → 访问原文 →

05.

arXiv (quant-ph) 2026-06-19 DOI: arXiv:2606.20503

General circuit mapping algorithm for neutral atom quantum computers

作者:

Neven Gentil ↗Lous S. Rianne ↗Aida Todri-Sanial ↗

arXiv:2606.20503v1 Announce Type: new Abstract: Neutral atom quantum computers (NAQC) are emerging as a promising, scalable quantum computing platform because of their long qubit coherence, flexible qubit arrangement, and multiqubit gate capabilities. However, circuit execution often requires physically moving qubits, making compilation a critical optimization challenge. We propose a circuit independent mathematical framework built on graph-theoretic combinatorial optimization that determines the minimal number of required qubit transfers. This model captures spatial constraints specific to NAQC platforms with zone-limited gate operations and multi-qubit gates. From this framework, we encode the qubit mapping problem as a nonlinear integer program and solve it using a genetic algorithm, enabling trade-offs between minimizing the total traveled distance and the number of parallel transfer operations. Compared to the state-of-the-art scalable compiler for zoned architectures, our approach consistently finds fewer transfers. Depending on the optimization focus, our method produces shorter traveled distances or fewer parallel transfer operations. This work provides both theoretical guaranties and a practical tool for efficient, architecture-aware quantum circuit compilation. As a result, practitioners can generate hardware-aware mappings that reduce movement-induced errors and better exploit atom transfer parallelism, directly improving execution efficiency on NAQC devices.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.LG) 2026-06-18 DOI: arXiv:2505.15215

Clustering and Pruning in Causal Data Fusion

作者:

Otto Tabell ↗Santtu Tikka ↗Juha Karvanen ↗

arXiv:2505.15215v3 Announce Type: replace-cross Abstract: Data fusion, the process of combining observational and experimental data, can enable the identification of causal effects that would otherwise remain non-identifiable. Although identification algorithms have been developed for specific scenarios, do-calculus remains the only general-purpose tool for causal data fusion, particularly when variables are present in some data sources but not others. However, approaches based on do-calculus may encounter computational challenges as the number of variables increases and the causal graph grows in complexity. Consequently, there exists a need to reduce the size of such models while preserving the essential features. For this purpose, we propose pruning (removing unnecessary variables) and clustering (combining variables) as preprocessing operations for causal data fusion. We generalize earlier results on a single data source and derive conditions for applying pruning and clustering in the case of multiple data sources. We give sufficient conditions for inferring the identifiability or non-identifiability of a causal effect in a larger graph based on a smaller graph and show how to obtain the corresponding identifying functional for identifiable causal effects. Examples from epidemiology and social science demonstrate the use of the results.

阅读与讨论 → 访问原文 →

07.

arXiv (quant-ph) 2026-06-15 DOI: arXiv:2606.13928

Compact graphs and quantum automorphisms

作者:

Pedro Baptista ↗Gabriel Coutinho ↗Chris Godsil ↗Simon Schmidt ↗

arXiv:2606.13928v1 Announce Type: new Abstract: Compact graphs are graphs for which the fractional automorphism polytope has no genuinely fractional vertices. This paper proposes a quantum analogue of this idea by evaluating the fundamental magic unitary of the quantum automorphism group on states, which we show to produce a closed convex set of doubly stochastic matrices sitting between the classical automorphism polytope and the full fractional automorphism polytope. Our main result is that the natural quantum analogue of compactness is classical, that is, a quantum compact graph is classically compact. We also relate this set to the quantum orbital algebra and obtain a hierarchy of classical and quantum compactness pseudo notions. The framework recovers familiar consequences of compactness through commutants and suggests quantum analogues of generous transitivity and distance-transitivity. We also isolate examples and open problems indicating where quantum symmetries may strictly refine the classical compactness theory.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.13768

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

作者:

Sharath Girish ↗Tsai-Shien Chen ↗Zhikang Dong ↗Mukesh Singhal ↗Hao Chen ↗Sergey Tulyakov ↗Aliaksandr Siarohin ↗

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2602.09329

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

作者:

Xueying Ding ↗Simon Kl\"uttermann ↗Haomin Wen ↗Yilong Chen ↗Leman Akoglu ↗

arXiv:2602.09329v3 Announce Type: replace Abstract: Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: We provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks containing 2,446 datasets combined are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.

阅读与讨论 → 访问原文 →

10.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.16336

Optimizing resource bounds in direct fidelity estimation

作者:

Netanel Barel ↗Lee Peleg ↗Yotam Kadish ↗Amit Ben Kish ↗Yotam Shapira ↗

arXiv:2606.16336v1 Announce Type: new Abstract: Direct fidelity estimation provides a way to estimate the fidelity between an experimentally prepared state and a desired pure target state without performing full tomography. Two influential formulations were introduced in 2011 by Flammia and Liu and by da Silva, Landon-Cardinal, and Poulin. In these protocols, the total estimation error is controlled through two distinct probabilistic steps: first, the fidelity is approximated using randomly sampled Pauli observables; second, each sampled expectation value is estimated from finitely many measurement outcomes. In this work we show that additional structural information about the noise can substantially sharpen the corresponding resource bounds. In particular, for some canonical channels the effective number of sampled Pauli settings can be reduced, leading to lower measurement cost both in the general pure-state setting and in the case of a stabilizer state. These results illustrate a broader point: worst-case confidence bounds in direct fidelity estimation can be significantly conservative when experimentally relevant structure is ignored. As a technical ingredient, we also revisit the allocation of the total accuracy and confidence budgets between the two probabilistic steps. Reformulating the analysis in terms of separate error parameters yields a constrained optimization problem whose solution lowers the average number of measurements in the general pure-state setting. Numerical simulations based on quantum circuits implemented in Qiskit illustrate both the improvement obtained under structured-noise assumptions and the conservativeness of the original worst-case bounds.

阅读与讨论 → 访问原文 →

11.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2603.28153

Arbitrarily Configurable Wavefunctions via Imaginary Gauge Phase Imprint in Non-Hermitian Lattices

作者:

Ji-Long Dong ↗Shi-Liang Zhu ↗Dan-Wei Zhang ↗

arXiv:2603.28153v2 Announce Type: replace-cross Abstract: We propose a general framework, termed the imaginary gauge phase imprint (IGPI), which enables engineering arbitrarily configurable wavefunctions with exact solutions and self-organization dynamics in any-dimensional non-Hermitian lattices under imaginary gauge fields. Using this method, we uncover a novel phase with exact critical wavefunctions, dubbed the skin critical phase (SCP), which is marked by unconventional localization, topological-skin, and dynamical characteristics. Furthermore, we validate the IGPI by imprinting and visualizing complex fractal states with Sierpinski-carpet and Koch-snowflake profiles, as well as exotic super-moire and 3D-moire states in regular lattices. Our work not only offers fresh insights into non-Hermitian critical and fractal physics, but also provides a rigorous paradigm for controlling and visualizing wavefunction patterns using the IGPI in engineered non-Hermitian systems.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18558

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

作者:

Jianing Zhang ↗Chenhao Zheng ↗Yajun Yang ↗Max Argus ↗Rustin Soraki ↗Winson Han ↗Taira Anderson ↗Chun-Liang Li ↗Shuo Liu ↗Jiafei Duan ↗Zhongzheng Ren ↗Jieyu Zhang ↗…

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

阅读与讨论 → 访问原文 →

13.

medRxiv (Medicine) 2026-06-12 DOI: HASH:899a0673fe5b3245d66ca5ccefd57248

Home-based binocular serious games in virtual reality to treat visual acuity and stereovision in residual amblyopia: AMBER study

作者:

Aurilia ↗Martin ↗N.-L ↗Simon-Martinez ↗Antoniou ↗M.-P ↗Bouthour ↗Bavelier ↗Backus ↗B. T ↗Dornbos ↗Blaha ↗…

Objectives: Amblyopia is a pediatric visual disorder traditionally treated by patching the fellow eye, though many patients retain residual amblyopia post-treatment. Increasing evidence suggests that visual plasticity allows treat-ment beyond the classical therapeutic window. AMBER evaluated the efficacy of binocular serious games in virtual reality (VR) in residual amblyopia. Methods and Analysis: The monocentric, prospective, randomized, crossover trial (reported as case series) includ-ed 14 anisometropic, strabismic, or mixed residual amblyopia patients (6-35 years; 5 children, 9 adults). Participants underwent two 2-month intervention phases: optical correction (standard care) and standard care plus VR games (2.5 h/week), each with a 2-month follow-up. Best-corrected visual acuity (BCVA), stereoacuity, and reading speed were assessed (5 timepoints) using the Sloan and Landolt charts, the Titmus, TNO, Lang II, Asteroid, and Mnread tests. Compliance and adverse events (AE) were recorded. Results: VR training improved BCVA in 10 amblyopic eyes (Landolt and Sloan), with more pronounced effects in anisometropic patients. Six patients showed improved stereoacuity (Titmus; 4x mixed, 1x anisometropic, 1x stra-bismic amblyopia), persistent only in children (1x strabismic, 1x mixed amblyopia). Four improvements were ob-served with TNO (1x), Lang II (1x), Asteroid (0x), and MNread (1x). Despite positive trends, when comparing re-sults of individual patients, between both eyes, and with standard treatment, consistency of improvements cannot be conclusively demonstrated. One non-severe AE (dizziness) was reported. Conclusions: Following individual cases, VR training improved BCVA and stereoacuity, particularly in children and patients with high compliance. However, considering the cohort as a whole, consistency of effects has to be confirmed in larger groups. Thus, the methodologically sophisticated AMBER study revealed differences in VR treatment efficacy between amblyopia types, children/adults, endpoints and tests, offering precious data for the design of meaningful future studies. It shows that neurovisual plasticity gauged by VR-games offers safe, engaging treatment options for residual amblyopia.

阅读与讨论 → 访问原文 →

14.

arXiv (math.PR) 2026-06-11 DOI: arXiv:2606.11845

Stochastic epidemic model with varying infectivity and waning immunity: the law of large numbers with unbounded infectivity

作者:

Rapha\"el Forien ↗\'Etienne Pardoux ↗

arXiv:2606.11845v1 Announce Type: new Abstract: We revisit the large population limit of our epidemic model with infection age dependent infectivity and progressive immunity waning, under the assumption that the supremum in $t$ of the random infectivity function has a finite expectation, while the previous proofs assumed that this supremum admits a deterministic upper bound.

阅读与讨论 → 访问原文 →

15.

arXiv (math.PR) 2026-06-16 DOI: arXiv:2606.16791

Riviera model with egoistical settlers

作者:

P. L. Krapivsky ↗

arXiv:2606.16791v1 Announce Type: cross Abstract: The Riviera model mimics a densifying settlement along the coastline. In the lattice version, houses are built sequentially in empty sites with the constraint that every newly built house has at least one empty neighboring site. The distribution of clusters of adjacent houses does not obey a closed set of evolutionary equations, but the void-cluster-void distribution does. We compute the latter and extract the cluster distribution from it. In the jammed state, when all voids have length one and the evolution ceases, the cluster distribution has a neat form and exhibits a factorial decay with the length of the cluster. To investigate finite systems, we employ a static approach directly treating jammed states. If the coastline is a finite segment, we determine the statistics of the number of empty sites in the jammed state (the average, variance, and higher cumulants). We also study a continuum version in which houses are built along the line so that each newly built house is sufficiently separated from at least one neighboring house.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.15565

If These Walls Could Talk: Critical Play with Large Language Models in Museums

作者:

Anders Sundnes L{\o}vlie ↗

arXiv:2606.15565v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly being used in museums to as role playing chatbots which let visitors talk to simulated versions of people and artefacts from the past. While such installations can be playful and engaging, they are also problematic because LLMs cannot be trusted to speak truthfully. I identify a fundamental dilemma for the use of LLMs in museum chatbots: LLMs cannot be trusted to tell the truth, and efforts to make them more reliable may ruin that which is attractive about the bots in the first place - their ability to engage in life-like conversation. In response, I propose designing for critical play with LLM-based bots: Designing for playful interactions with bots that are unreliable but still able to represent the past in an adequate and engaging manner - as fictional characters representing historical narratives, styles of discourse, diverse perspectives, humor and satire.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.16579

On the Entropy Formula for Real, Complex, and Quaternionic Deep Linear Networks

作者:

Luis Contreras ↗Marco Nahas ↗Tejas Kotwal ↗

arXiv:2606.16579v1 Announce Type: new Abstract: We extend the entropy formula of Menon and Yu for the real Deep Linear Network (DLN) to its complex and quaternionic analogues, obtaining a unified formula for DLNs over $\mathbb{R}$, $\mathbb{C}$, and $\mathbb{H}$.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2601.04061

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

作者:

Chubin Zhang ↗Jianan Wang ↗Zifeng Gao ↗Yue Su ↗Tianru Dai ↗Cai Zhou ↗Jiwen Lu ↗Yansong Tang ↗

Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18960

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

作者:

Zirui Zheng ↗Jiaqian Yu ↗Xiongfeng Peng ↗jun shi ↗Mingyi Li ↗Chao Zhang ↗Weiming Li ↗Dong Wang ↗Huchuan Lu ↗Xu Jia ↗

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.

阅读与讨论 → 访问原文 →

20.

bioRxiv (Bioinfo) 2026-06-11 DOI: HASH:87b356df08848d08b388a086615a88ae

A systematic imputation framework for sparse, multimodal space biology datasets: application to retinal imaging and omics from the RR9 mission

作者:

Nagesh ↗Sanders ↗Costes ↗S. V ↗Avci ↗Sigit ↗Agarwal ↗Haghighi ↗Batool ↗Karouia ↗Chander ↗A. M ↗…

Space biology experiments are expensive, logistically complex, and inherently limited in sample size, resulting in datasets that are frequently incomplete and highly heterogeneous (2). Missing data is a fundamental barrier to building reliable computational models of how the human body responds to spaceflight. This work introduces a systematic framework for addressing missing data through imputation. We developed a validated four-stage framework for imputation specifically designed to preserve biological signal needed for digital twin development, while quantifying trade-offs in downstream analyses. Using retinal imaging and omics data from the NASA RR9 mission as a case study (9), we demonstrate how to diagnose why data is missing(10), select and optimize appropriate imputation strategies (5,10), and rigorously evaluate whether imputed data remains biologically meaningful. A key finding of this work is that while imputation substantially improves the performance of predictive models, it can simultaneously obscure subtle biological patterns; a critical trade-off that researchers must understand before applying these methods (11). This framework provides practical, actionable guidance for space biologists and data scientists working with sparse, multimodal datasets in space biology, and represents a foundational step toward more complete and reliable data-driven models of human physiology in extreme environments.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19637

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

作者:

Priyanshi Garg ↗Ishita Rao ↗Jieqiong Ding ↗Amandalynne Paullada ↗

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14638

Improving Lunar Topography with Deep Learning Schrödinger Bridges

作者:

Matthew Repasky ↗Erwan Mazarico ↗Michael K. Barker ↗Stefano Bertone ↗Terence J. Sabaka ↗Yao Xie ↗

Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.10200

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

作者:

Ahmed Faizul Haque ↗S. M. Riaz Rahman Antu ↗Saif Ahmed ↗Asadullah Hil Galib ↗Souvik Pramanik ↗Mohammad Ashrafuzzaman Khan ↗Mohammad Abdul Qayum ↗Mohsin Sajjad ↗

An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses FCN as the generative network infrastructure and adds a depth-separable convolutional residual block to learn and retain more effective pixel and semantic information; an Inception module is added to increase the multi-scale perceptual field of the network and reduce the number of parameters in the network; and a multi-scale feature extraction module and a spatial attention residual block are added to combine the channel attention. The multi-scale module adds a multi-scale feature extraction module and a spatial attention residual block, which combine the channel attention mechanism and the residual block to achieve multi-scale feature extraction. The global discriminative network and the local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image by playing off each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaged logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. It is shown that the method in this study can be used for the restoration of micro-resistivity imaging log images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to ensure the smooth advancement of the subsequent interpretation of micro-resistivity imaging log images.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2412.21036

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

作者:

Shangyu Xing ↗Changhao Xiang ↗Yuteng Han ↗Yifan Yue ↗Zhen Wu ↗Xinyu Liu ↗Zhangtai Wu ↗Fei Zhao ↗Xinyu Dai ↗

Geometric shapes play important roles in both physical world and human cognition. While multimodal large language models (MLLMs) have made significant advancements in visual understanding, their abilities to recognize geometric shapes and their spatial relationships, which we term geometric perception, are not explicitly and systematically explored. To address this gap, we introduce GePBench, a novel benchmark specifically designed to assess the geometric perception capabilities of MLLMs. Our extensive evaluations reveal that even the current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. Furthermore, we show that models trained with GePBench data demonstrate considerable improvements on a wide range of downstream tasks, highlighting the critical role of geometric perception in enabling advanced multimodal applications. Our code and datasets are available at \href{https://github.com/Changhao-Xiang/GePBench}{https://github.com/Changhao-Xiang/GePBench}.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16515

Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

作者:

Swaminathan S K ↗Damiya Gondha ↗Theyanesh Eswaramoorthy Rajahkrishnan ↗Aritra Hazra ↗

arXiv:2606.16515v1 Announce Type: cross Abstract: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal – a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $\psi$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $\psi_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $\psi(s_t)$ to $\psi(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $\psi$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络