Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-17

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.

02.
medRxiv (Medicine) 2026-06-23

Changes in hierarchical brain dynamics of rumination following mindfulness-based cognitive therapy for depression

Major depressive disorder (MDD) is a leading cause of disability worldwide with risk of onset and recurrence linked to depressive ruminative thought patterns. Mindfulness-based cognitive therapy (MBCT) is an evidence-based treatment for depression that targets the ability to recognise, decenter, and disengage from ruminative thought patterns. Elucidating how MBCT impacts hierarchical brain organisation may be key to understanding the processes by which MBCT can modulate ruminative tendencies. In a randomised controlled functional magnetic resonance imaging (fMRI) trial on individuals with MDD (N=80) before and after MBCT in addition to treatment as usual (TAU), we investigated changes in hierarchical brain organisation during resting-state and rumination. We built whole-brain models to obtain generative connectivity (GEC) matrices per patient and quantified brain hierarchy by measuring the global directedness and regional trophic levels in each GEC, in which greater directedness reflects more directional information flow and less recurrence. Global directedness in MBCT+TAU compared to TAU increased during rumination, with no changes during resting-state. Furthermore, increased regional breadth of hierarchy during rumination was related to improvements in clinical and behavioural outcomes following MBCT+TAU. Increased brain hierarchy during rumination following mindfulness training may be consistent with a shift away from self-reinforcing negative mental loops towards more differentiated and less coupled cognitive and bodily cycles, supporting MBCT's ability to interrupt ruminative processes. Hierarchical brain dynamics may hold promise as a treatment-sensitive marker and a potential mechanism of therapeutic change in MBCT for depression.

03.
arXiv (CS.LG) 2026-06-25

Breaking Data Symmetry is Needed For Generalization in Feature Learning Kernels

arXiv:2604.00316v2 Announce Type: replace-cross Abstract: Grokking occurs when a model achieves high training accuracy but generalization to unseen test points happens long after that. This phenomenon was initially observed on a class of algebraic problems, such as learning modular arithmetic (Power et al., 2022). We study grokking on algebraic tasks in a class of feature learning kernels via the Recursive Feature Machine (RFM) algorithm (Radhakrishnan et al., 2024), which iteratively updates feature matrices through the Average Gradient Outer Product (AGOP) of an estimator in order to learn task-relevant features. Our main experimental finding is that generalization occurs only when a certain symmetry in the training set is broken. Furthermore, we empirically show that RFM generalizes by recovering the underlying invariance group action inherent in the data. We find that the learned feature matrices encode specific elements of the invariance group, explaining the dependence of generalization on symmetry.

04.
arXiv (CS.CV) 2026-06-16

Selective Synergistic Learning for Video Object-Centric Learning

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

05.
arXiv (CS.LG) 2026-06-15

Time Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining

arXiv:2605.26759v2 Announce Type: replace Abstract: Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer their causal discovery capabilities to new time series governed by diverse causal mechanisms. In this paper, we propose PTCD, a novel Pretraining framework for Time-series Causal Discovery, which improves cross-task generalization through context-conditioned modeling and transferable causal augmentation. To model complex temporal causal dependencies, PTCD employs a dual-scale iterative attention mechanism to capture window-level causal relationships, and a Gaussian mixture with a context-level routing mechanism to handle heterogeneous exogenous distributions. To further address distribution shifts across causal graphs, PTCD adopts a pretraining paradigm on synthetic datasets that integrates intervention-based learning and a causal mixup strategy, promoting stable causal discovery and stronger generalization. Extensive experiments on multiple real-world out-of-distribution (OOD) datasets demonstrate that PTCD excels in both causal discovery and root cause identification.

06.
arXiv (CS.LG) 2026-06-16

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

arXiv:2606.08583v2 Announce Type: replace Abstract: Deep learning on physiological time series is interpreted through domain-specific features – oscillatory rhythms in EEG, morphological complexes in ECG – yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32–0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.

07.
arXiv (quant-ph) 2026-06-12

Statistical Mechanics and Symmetries of Non-Abelian Anyon Proliferation: From Deformation to Decoherence

arXiv:2606.12527v1 Announce Type: new Abstract: Topological quantum computation relies on braiding non-Abelian anyons, but requires the underlying topological order to survive imperfect state preparation and environmental noise. We show that the instability of topological order to wavefunction deformations and to decoherence, with the latter probed by syndrome distributions, are generically captured by stat-mech models whose symmetries naturally expose the corrupting anyonic excitations. As an example, we combine this framework with Monte-Carlo simulations to resolve the stability of $D_4$ topological order under deformations and quantum channels that proliferate multiple non-Abelian anyon species that individually are unable to condense. We show that beyond a finite threshold, proliferation of two non-Abelian anyon species parasitically condenses a shared Abelian-anyon fusion outcome, destroying the topological order. Our symmetry-based approach sharply differentiates the resulting trivial phase from that obtained by condensing all Abelian charges; in other words, the trivial phase "remembers" which anyons condensed. This framework provides a first step into identifying the relevant symmetry for optimal decoders, conditioned on syndrome measurements, of non-Abelian topological order.

08.
arXiv (CS.AI) 2026-06-19

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

arXiv:2606.20208v1 Announce Type: new Abstract: Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.

09.
arXiv (CS.CL) 2026-06-25

Shared Doubt: Zero-Shot Cross-Lingual Confidence Estimation for Language Models

Confidence estimation (CE), i.e., quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features in open-ended question answering. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

10.
arXiv (CS.CV) 2026-06-15

RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.

11.
arXiv (CS.CL) 2026-06-12

Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

Retrieval augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer bearing evidence and worsen long context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.

12.
PLOS Medicine 2026-06-23

Comparisons of core component delivery in cardiac rehabilitation programs by country income classification and decade based on the 2025 Global Audit Update: A survey study

by Gabriela Lima de Melo Ghisi, Rachael P. Carson, Karam Turk Adawi, Rongjing Ding, Warner M. Mampuya, Mariya P. Jiandani, Jimena Martinez, Monserrat Cruz Rivero, Claudia V. Anchique, Dinah L. van Schalkwijk, Jonathan Gallagher, Buket Akinci, Dion Candelaria, Jirapa Champaiboon, Daniel F. Quesada-Chaves, Tone M. Norekvål, Iwona Szadkowska, Borut Jug, Evangelia Kouidi, Marta Supervia, Won-Seok Kim, Chamila Mettananda, Lilian Mbau, Gulsim T. Aimakova, Sherry L. Grace, on behalf of the ICCPR Global Cardiac Rehabilitation Audit Update Investigators Background Cardiovascular disease (CVD) remains a leading global health burden. Cardiac rehabilitation (CR) is essential to reducing morbidity and improving patient outcomes. Since the COVID-19 pandemic, CR delivery worldwide has evolved, yet these changes have not been systematically charactemkjrized. The objective of this study was to characterize globally: (1) the delivery of core CR components, including risk factors assessed, patient education practices, and program resources; (2) differences in these elements by country income classification and relative to the initial 2016 Global CR Audit. Methods and findings A cross-sectional Audit update was conducted. Program-level data were collected from May 1st to September 1st 2025 using a REDCap survey adapted from previous Audits. Eligible respondents were leads of phase II/post-discharge CR programs providing at least an initial assessment, structured aerobic exercise, and ≥1 additional core component. ICCPR associations and local leaders supported program identification. Main outcomes were core components delivered (10 assessed), risk factors assessed (14 assessed), patient education dose (hours/patient/program), and program resources (17 assessed). Generalized linear mixed models (GLMM) tested differences by income classification and (when applicable) changes since 2016. Of 7,025 programs identified globally, 1,505 (62% median country response rate) initiated a survey from 90/113 (80%) countries with CR. The median number of core components offered was 8/program (p25, p75 = 6, 10), with upper-middle income countries offering significantly more components overall (median = 9), and also high-income countries offering more than low-income countries (8 versus 6, p 

13.
arXiv (CS.LG) 2026-06-15

Code Correctness Signals in LLM Hidden States: Pre-Generation Probing and Repair Geometry

arXiv:2606.14530v1 Announce Type: new Abstract: Large language models encode rich information in their hidden states. This work asks whether code correctness is legible in the hidden states of Qwen3-4B-Instruct-2507, before it generates and as it repairs a failed attempt, studied on 444 LiveCodeBench tasks. It reports two findings connected by a single confound-control tool: residualization. First, the correctness of the model's first-attempt code is linearly decodable from the prompt-final hidden state, with a leakage-free held-out AUC of 0.931 +/- 0.008 across 50 outer splits. After the linear effect of prompt length is removed from each hidden state dimension, the probe still reaches 0.911 +/- 0.010, well above a prompt-length baseline of 0.754 +/- 0.014. Second, on 236 cleaned cases where the model attempts to repair a failed first attempt, the hidden state shift from the failing attempt to its repair carries a statistically detectable contrastive direction, significant on both a magnitude and a split-half test against label-shuffled nulls. This direction does not survive a conditional residualization against repair-context covariates that differ between successful and failed repairs, marking it as a correlate of repair success driven by the repair context rather than an isolated repair-comprehension feature. The probe layer is selected by nested cross-validation, and the same residualization approach that upholds the pre-generation correctness result overturns the repair-direction interpretation. The contribution is as much methodological as empirical: a diagnostic honest enough to report a negative result alongside a positive one.

14.
arXiv (CS.CL) 2026-06-19

Characterizing Narrative Content in Web-scale LLM Pretraining Data

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

15.
arXiv (CS.CV) 2026-06-15

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction – recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.

16.
arXiv (CS.AI) 2026-06-24

Social Structure Matters in 3D Human-Human Interaction Generation

arXiv:2606.24255v1 Announce Type: cross Abstract: Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can think by recovering phase decompositions and partner-aware roles, but cannot directly move, as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, Think with LLM, Move with Motion Skill. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.

17.
bioRxiv (Bioinfo) 2026-06-14

Structural Analysis of Prostate Cancer N-Glycans Using Graph-Based Structural Metrics

The N-linked glycans are structurally complex carbohydrate modifications that regulate protein folding, immune recognition, and cellular signaling, and their expression is extensively remodeled during cancer progression, making them promising biomarkers. In this study, prostate cancer-associated N-glycans from a range of relevant peer-reviewed studies were curated and digitized to develop a versatile computational framework that quantitatively encodes their spatial complexity across diverse biological systems. We invented two indices – the Distance & Connectivity Index (DCI) and the Position & Composition Index (PCI) – to capture the spatial information in N-glycans as layered architectures, enabling calculation of residue-level path lengths, branching structure, and compositional diversity. DCI summarizes glycan structure as both a scalar and matrix representation, while PCI does the same but also captures monosaccharide diversity, linkage heterogeneity, and cross-layer branching features. These metrics were computed with GlycoAssessor, an open-source platform that extracts information for the DCI and PCI from glycans drawn via Symbol Nomenclature for Glycans (SNFG) notation. Principal Component Analysis (PCA) was applied to evaluate whether glycans from prostate cancer tissues cluster distinctly in a disease-relevant manner. Results show that the spatial information in N-glycans: (1) increased in a multi-dimensional, non-linear manner, (2) objectively segregated structural themes, (3) could function as a potential prostate cancer biomarker that is distinct from mass-to-charge ratio and relative abundance, and (4) could objectively quantify novel subtype classifications of glycans associated with disease states and progression.

18.
arXiv (CS.CV) 2026-06-11

Spatially Selective Self-Training for Unsupervised Building Change Detection

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.

19.
arXiv (CS.CV) 2026-06-17

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluation that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

20.
arXiv (CS.CV) 2026-06-24

VistaRef: Boosting Visual Spatial Orientation Awareness for Pointing-to-Object Detection

Grounding deictic gestures in natural images is fundamental to AR and human-robot collaboration, providing a basis for seamless spatial interaction. While Transformer-based visual models have achieved significant progress in general object detection, their global attention mechanisms often neglect micro-geometric relationships, degrading orientation accuracy. In pointing tasks, this deficiency manifests as an inability to accurately capture the pointing ray implied by finger poses, which results in pointing drift and localization ambiguity when dealing with distant or densely packed objects. To address this, we propose VistaRef, a framework designed to explicitly enhance spatial orientation awareness. First, we develop the Local Hand Entity Modeling (LHEM) module, which incorporates hand-pose embeddings to strengthen the model's capability to capture subtle finger deviations. Second, drawing inspiration from multi-view geometry, we construct the Geometric Ray Modeling (GRM) module to transform implicit orientation information into explicit spatial geometric features, guiding feature aggregation and deep fusion via attention mechanisms. Furthermore, we introduce a novel Orientation-Consistent Alignment Loss (OCAL) to synergistically supervise hand presence and pointing consistency, ensuring that all architectural improvements collectively serve the core objective of spatial localization. Experimental results demonstrate that VistaRef significantly outperforms the baseline, achieving a 14-point absolute gain in grounding accuracy. Qualitative analysis further confirms that VistaRef effectively models the geometric correlation from hand to target, bridging the spatial perception gap inherent in traditional Transformers for complex scenarios. Code: https://github.com/lingli1724/VistaRef.

21.
medRxiv (Medicine) 2026-06-18

Biomedical Capacity, Governance, and Health Security: A Dominican Republic Research Analysis of Stakeholder Perspectives

The COVID-19 pandemic exposed critical vulnerabilities in globally concentrated biomedical supply chains and accelerated interest in nearshoring and hemispheric health-security strategies. The Dominican Republic, already the third-largest medical device exporter in Latin America, occupies a strategically significant but institutionally constrained position within this realignment. This study evaluates stakeholder perceptions of the principal opportunities and barriers affecting biomedical ecosystem development in the Dominican Republic, with particular attention to governance, workforce capacity, and value-chain upgrading pathways. Methods. A concurrent mixed-methods design was employed, integrating a cross-sectional electronic survey of 142 purposively sampled domain experts (administered September-December 2025) with a qualitative executive consultation with senior government and industry leaders. Survey analyses combined descriptive statistics, one-sample t-tests against the scale neutral midpoint, chi-square goodness-of-fit tests, Friedman non-parametric ranking, Spearman rank correlations, and exploratory linear and logistic multivariable regression. Qualitative responses were analyzed using a framework approach grounded in the Triple Helix model of innovation systems. Results. Perceived government support was significantly below neutral (mean = 2.67, SD = 1.12; p = 0.034). Workforce shortages (83.3%) and weak academia-industry collaboration (71.4%) were the most frequently endorsed barriers ({chi}2(5) = 18.7, p = 0.002). Regulatory modernization (88.1%) and workforce development (85.7%) ranked as the highest-priority policy levers (Friedman p = 0.005). Clinical trials and contract research organization services were the dominant sub-sector priority (76.2%, binomial p < 0.001). In multivariable analysis, perceived government support, talent availability, and confidence in IP protection jointly explained 46% of the variance in sector competitiveness (R2 = 0.46, p < 0.001). Strong majority support existed for a formal public-private biomedical coordination authority (73.8%, p < 0.001).Conclusion. Institutional credibility and advanced human capital–rather than geography or market access–are the perceived binding constraints on the Dominican Republics biomedical trajectory. Regulatory modernization, targeted workforce investment, and the establishment of a national biomedical coordination authority represent the highest-leverage interventions for positioning the country as a hemispheric hub for biomedical manufacturing, clinical research, and health security.

22.
arXiv (CS.CV) 2026-06-12

CRAG: Can 3D Generative Models Help 3D Assembly?

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: https://ai4ce.github.io/CRAG/

23.
arXiv (quant-ph) 2026-06-11

Measurement incompatibility and quantum steering via linear programming

arXiv:2506.03045v3 Announce Type: replace Abstract: The problem of deciding whether a set of quantum measurements is jointly measurable is known to be equivalent to determining whether a quantum assemblage is unsteerable. This problem can be formulated as a semidefinite program (SDP). However, the number of variables and constraints in such a formulation grows exponentially with the number of measurements, rendering it intractable for large measurement sets. In this work, we circumvent this problem by transforming the SDP into a hierarchy of linear programs that compute upper and lower bounds on the incompatibility robustness with a complexity that grows polynomially in the number of measurements. The hierarchy is guaranteed to converge and it can be applied to arbitrary measurements – including non-projective POVMs (Positive Operator-Valued Measures) – in arbitrary dimensions. While convergence becomes impractical in high dimensions, in the case of qubits our method reliably provides accurate upper and lower bounds for the incompatibility robustness of sets with several hundred measurements in a short time using a standard laptop. We also apply our methods to qutrits, obtaining non-trivial upper and lower bounds in scenarios that are otherwise intractable using the standard SDP approach, although such bounds are significantly looser than the ones obtained in the qubit case. Finally, we show how our methods can be used to construct local hidden state models for states (i.e., to prove that a state cannot lead to steering under any possible local measurements), or conversely, to certify that a given state exhibits steering; for two-qubit quantum states, our approach is comparable to, and in some cases outperforms, the current best methods.

24.
arXiv (CS.CV) 2026-06-24

MM-TRELLIS: Point-Cloud Guided Multi-Modal 3D Vehicle Generation in Autonomous Driving

Recovering realistic 3D vehicle models from autonomous driving scenes is crucial for synthesizing training data and building simulation environment. However, most existing vehicle generation methods fail to fully exploit multimodal sensors i.e. multi-view images and LiDAR point clouds) and rely on neural rendering based reconstruction, leading to low-quality mesh. Recently, native 3D generative models have made significant progress, yet they are not built for arbitrary multi-view inputs and often struggle with in-the-wild driving images. In this work, we present MM-TRELLIS, a multi-modal version of TRELLIS for in-the-wild 3D vehicle generation that integrates LiDAR and image sensors from autonomous driving datasets into native 3D generative models. Specifically, multi-view images are cycled as conditioning inputs, while LiDAR point clouds provide test-time guidance to ensure geometric accuracy and cross-view consistency. During denoising, we first align the guidance point cloud with the model priors, then enforce consistency between the generated geometry and the guidance point cloud. Finally, we introduce a voxel filtering strategy based on the opacity of 3D Gaussian Splatting to suppress floaters and produce clean meshes. Comprehensive experiments on Waymo dataset demonstrate our method outperforms existing methods in high-fidelity 3D vehicle generation. Code is available at https://github.com/HongliXiao/MM-TRELLIS.

25.
arXiv (CS.CV) 2026-06-16

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. This work we propose a Temporal Pyramid Adapter that utilize parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrated self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated cross multiple benchmark including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that Temporal Pyramid model obtained AUC of 99.24% and a EER of 3.87% on the PartialSpoof database, which is significantly outperforming the base model and several SOTA baseline such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that while spoofing artifact are independent from language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.