论文广场 - AcademicHub

01.

medRxiv (Medicine) 2026-06-23 DOI: HASH:ae6e54ff4a79aeb6b871813dbdf74fe4

Multidimensional motivation in aging: a person-centred framework spanning goal-directed behaviour, social reward and pleasure

作者:

Warren ↗S. L ↗Somasundaram ↗Horne ↗Robinson ↗G. A ↗Behenska ↗Nguyen ↗Goh ↗A. M. Y ↗Jeon ↗Y.-H ↗…

Motivational changes are determinants of healthy aging, social engagement, and functional independence, and may signal early neurodegenerative risk. Existing assessment approaches in aging typically treat motivation as a unitary construct. Here, we introduce MotDem, an age-appropriate measure of motivation co-designed with people living with dementia, carers, and clinicians. Across a broad adult lifespan sample (18-80 years), MotDem revealed a robust three-domain motivational architecture encompassing goal-directed behaviour, social reward, and pleasure, with a fourth satiety factor retained as exploratory. This structure was replicated in an independent older cohort (45-80 years) from a different national context. MotDem showed strong convergence with established measures of apathy and anhedonia, alongside more modest associations with depressive symptomatology. Together, these findings show that motivational aging is multifaceted and poorly captured by traditional unitary assessment. MotDem provides a multidimensional framework for measuring distinct motivational drivers of heterogeneous aging trajectories, with implications for resilience, wellbeing, and neurodegenerative risk.

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CV) 2026-06-24 DOI: arXiv:2606.24057

EPEdit: Redefining Image Editing with Generative AI and User-Centric Design

作者:

Hoang-Phuc Nguyen ↗Dinh-Khoi Vo ↗Trong-Le Do ↗Hai-Dang Nguyen ↗Tan-Cong Nguyen ↗Vinh-Tiep Nguyen ↗Tam V. Nguyen ↗Khanh-Duy Le ↗Minh-Triet Tran ↗Trung-Nghia Le ↗

The demand for image manipulation has seen a significant increase recently. Traditional tools like Photoshop and Capture One, while powerful, require considerable expertise to use effectively. Generative AI has introduced alternative platforms, such as Luminar Neo, Pixlr X, and Canva. However, many of these solutions, including resource-heavy models like Stable Diffusion, often require substantial retraining and fine-tuning, leading to high costs for users. To address these challenges, we introduce Efficient Photo Editor (EPEdit), an application that integrates a robust backend framework with a user-friendly front-end interface. EPEdit supports a wide range of creative image editing tasks, including image generation, object replacement, object removal, background modification, changes in object pose or perspective, region-specific editing, and thematic collection design, all guided by masks and prompts. Users can interact with the system through simple text commands or by marking areas for precise adjustments, making it accessible even to those without technical expertise. At its core, EPEdit leverages zero-shot image editing algorithms based on Stable Diffusion model, removing the need for additional fine-tuning. This approach enables efficient image manipulation and thematic collection creation. User evaluations for tasks of image editing, thematic design, and overall system performance demonstrate that EPEdit outperforms existing solutions, offering a user-friendly, cost-effective solution for comprehensive image editing.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20250

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

作者:

Duc T. Nguyen ↗Hoang-Long Nguyen ↗Thanh-Ha DO ↗Huy-Hieu Pham ↗

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

阅读与讨论 → 访问原文 →

04.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2507.19712

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

作者:

Ngoc Hung Nguyen ↗Nguyen Van Thieu ↗Quang-Trung Luu ↗Anh Tuan Nguyen ↗Senura Wanasekara ↗Nguyen Cong Luong ↗Fatemeh Kavehmadavani ↗Van-Dinh Nguyen ↗

arXiv:2507.19712v3 Announce Type: replace-cross Abstract: In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17399

The Discrete-Log Clock: How a Transformer Learns Modular Multiplication

作者:

Huu Danh Nguyen ↗

arXiv:2606.17399v1 Announce Type: cross Abstract: When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the multiplicative character transform, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a "Discrete-Log Clock" algorithm analogous to Nanda et al.'s Clock algorithm for addition. The methodology generalizes: matching the analysis basis to the algebraic structure of the task reveals interpretable structure where standard tools see noise.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.12706

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

作者:

Thach Nguyen ↗Danhua Guo ↗Tom Lampo ↗Fei Wu ↗Burhan Yaman ↗

Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19591

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

作者:

Vu Nguyen Nguyen Xuan ↗Huy Ngo Quang ↗

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

阅读与讨论 → 访问原文 →

08.

arXiv (quant-ph) 2026-06-19 DOI: arXiv:2606.19748

Variational Polaron Theory for Ground States of Strongly Coupled Light-Matter and Electron-Phonon Systems

作者:

Nguyen Thanh Phuc ↗

arXiv:2606.19748v1 Announce Type: cross Abstract: Strong light-matter and electron-phonon coupling generate ground states dressed by virtual bosonic excitations, making bare-state truncations and perturbative treatments unreliable in the ultrastrong-coupling regime. We introduce a nonperturbative variational ground-state framework based on a state-dependent polaron transformation, combined with a product-state ansatz and a second-order perturbative correction for residual matter-boson entanglement. We show that the optimized transformed frame becomes asymptotically decoupled at infinite coupling, because the leading linear coupling is canceled while off-diagonal matter transitions are suppressed by displaced-oscillator overlaps. The approach is asymptotically correct in both weak- and strong-coupling limits and remains accurate in the intermediate regime, where fixed polaron transformations are least reliable. Dicke-model benchmarks reproduce ground-state energies, fidelities, and the superradiant transition, with second-order energy errors below 0.2%. Holstein-model benchmarks yield errors below 0.5% and clarify how translational symmetry affects wave-function quality. This dressed-basis framework enables nonperturbative modeling of strongly coupled light-matter and electron-phonon systems.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16149

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

作者:

Minh-Ha Nguyen ↗Erica Gray ↗Chih-Ting Yang ↗Rizwan Hamid ↗Lingyao Li ↗Siyuan Ma ↗Thomas A. Cassini ↗Cathy Shyr ↗

arXiv:2606.16149v1 Announce Type: new Abstract: Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy, developed through human-AI collaboration and augmenting with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare disease (a prevalence below 1 in 1,000,000, with ultra-rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed in the following: on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15695

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

作者:

Thinh T. H. Nguyen ↗Khoa D. Doan ↗Binh T. Nguyen ↗Danh Le-Phuoc ↗Kok-Seng Wong ↗

arXiv:2606.15695v1 Announce Type: cross Abstract: Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.LG) 2026-06-19 DOI: arXiv:2606.19412

Spectral Retrieval-Augmented Time-Series Forecasting

作者:

Huu Hiep Nguyen ↗Minh Hoang Nguyen ↗Dung Nguyen ↗Hung Le ↗

arXiv:2606.19412v1 Announce Type: new Abstract: Time series forecasting leverages historical patterns to predict future values, but traditional methods face challenges when dealing with complex, non-stationary patterns that are difficult to memorize during training. Retrieval-augmented approaches have emerged as promising solutions by retrieving similar historical patterns to enhance predictions. However, existing retrieval methods suffer from two fundamental limitations: spectral blindness, which overlooks critical frequency-domain characteristics that capture underlying periodic structures, and temporal recency, which treats all historical data equally without emphasizing recent, more relevant patterns. In this paper, we propose SpecReTF, a novel retrieval method that addresses these issues by converting time series into windowed frequency representations, measuring similarity with a combined metric that captures both amplitude and phase information. To balance recency and historical context, we apply an exponential moving average weighting scheme that emphasizes recent windows. Extensive experiments on benchmark datasets demonstrate that SpecReTF outperforms time-domain retrieval methods, achieving superior forecasting accuracy across diverse, non-stationary time series.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18553

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

作者:

Minh-Loi Nguyen ↗Xuan-Vu Le ↗Long-Bao Nguyen ↗Hoang-Bach Ngo ↗Trung-Nghia Le ↗

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content–visual, visual–visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.16103

SceneCraft: Interactive System for Image Editing via Scene Graph

作者:

Duc-Manh Phan ↗Ngoc-Dai Tran ↗Duy-Khang Do ↗Tam V. Nguyen ↗Minh-Triet Tran ↗Trung-Nghia Le ↗

Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2606.19152

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

作者:

Zongmin Zhang ↗Yuyang Lou ↗Bowen Zhang ↗Junwu Chen ↗Ryo Kuroki ↗Xuan Vu Nguyen ↗Edvin Fako ↗Lixue Cheng ↗Philippe Schwaller ↗

arXiv:2606.19152v1 Announce Type: cross Abstract: Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively – an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2601.03792

VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

作者:

Huynh Trung Kiet ↗Dao Sy Duy Minh ↗Nguyen Dinh Ha Duong ↗Le Hoang Minh Huy ↗Long Nguyen ↗Dien Dinh ↗

Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.CL) 2026-06-24 DOI: arXiv:2604.08558

WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models

作者:

Hanna Lee ↗Tan Dat Nguyen ↗Jaehoon Kang ↗Kyuhong Shim ↗

Recent decoder-only autoregressive text-to-speech (AR-TTS) models produce high-fidelity speech, but their memory and compute costs scale quadratically with sequence length due to full self-attention. In this paper, we propose WAND, Windowed Attention and Knowledge Distillation, a framework that adapts pretrained AR-TTS models to operate with constant computational and memory complexity. WAND separates the attention mechanism into two: persistent global attention over conditioning tokens and local sliding-window attention over generated tokens. To stabilize fine-tuning, we employ a curriculum learning strategy that progressively tightens the attention window. We further utilize knowledge distillation from a full-attention teacher to recover high-fidelity synthesis quality with high data efficiency. Evaluated on three modern AR-TTS models, WAND preserves the original quality while achieving up to 66.2% KV cache memory reduction and length-invariant, near-constant per-step latency.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2509.09631

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

作者:

Ngoc-Son Nguyen ↗Thanh V. T. Tran ↗Hieu-Nghia Huynh-Nguyen ↗Truong-Son Hy ↗Van Nguyen ↗

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.LG) 2026-06-25 DOI: arXiv:2606.25800

ROAD-VLA: Robust Online Adaptation via Self-Distillation for Vision-Language-Action Models

作者:

Kejing Wang ↗Toan Nguyen ↗Minh Hoang Nguyen ↗Simon Khan ↗Flora D. Salim ↗

arXiv:2606.25800v1 Announce Type: new Abstract: Effective online adaptation of vision-language-action (VLA) models remains challenging, as sparse rewards provide weak supervision for high-dimensional autoregressive action policies. Although self-distillation can in principle provide denser training signals, we find that text-based privileged teachers conditioned on demonstrations, retrieved experiences, or high-level plans are ineffective for VLA adaptation, exposing a modality gap between symbolic guidance and low-level robot actions. We propose ROAD-VLA, an advantage-guided self-distillation framework that constructs a proximal teacher directly in action space by perturbing action-token logits with calibrated advantage estimates. This converts sparse rewards into dense token-level supervision while keeping the teacher close to the current policy. We further derive a policy-improvement lower bound under calibrated advantages and accurate teacher matching. Across seven robotic manipulation environments with in-distribution and out-of-distribution shifts, ROADVLA outperforms PPO in nearly all settings, demonstrating robust online VLA adaptation.

阅读与讨论 → 访问原文 →

19.

arXiv (quant-ph) 2026-06-24 DOI: arXiv:2601.15559

Spectator-transition crosstalk in a spin-3/2 silicon vacancy qudit in silicon carbide revealed by broadband Ramsey interferometry

作者:

Jun-Jae Choi ↗Seung-Jae Hwang ↗Seoyoung Paik ↗Juhwan Kim ↗Jawad Ul-Hassan ↗Nguyen Tien Son ↗Hiroshi Abe ↗Takeshi Ohshima ↗Jaekwon Suk ↗Hyeon-Ho Jeong ↗Dong-Hee Kim ↗Sang-Yun Lee ↗…

arXiv:2601.15559v3 Announce Type: replace Abstract: Color center spins in 4H-SiC offer a rare combination of wafer-scale materials maturity with long spin coherence and chip-level photonics, making them promising building blocks for scalable quantum technologies. In particular, the silicon vacancy hosts an S=3/2 ground state, a native qudit that enables compact encodings and subspace-selective control, but also introduces spectator transitions: short, detuned pulses can coherently drive non-addressed level pairs and create crosstalk. Here we use broadband Ramsey interferometry to reveal and quantify such spectator-transition crosstalk. Experimentally, the Ramsey Fourier spectra display multiple lines beyond the addressed single-quantum transition. Analytically, we map each line to a pairwise energy difference between qudit levels of the rotating-frame Hamiltonian and assign its weight via compact amplitudes set by the prepared state and the microwave pulse parameters, predicting a deterministic six-branch structure. Numerical time-domain propagation with the experimental sampling reproduces the detuning map, and the measured peak positions coincide with the analytic branch lines without frequency fitting. Together these results provide a practical, spectator-aware framework for multilevel control in the silicon vacancy qudit. The approach offers clear guidance to suppress crosstalk or, conversely, to exploit spectator lines, for example as additional constraints for in situ pulse calibration and for phase-sensitive quantum state and process estimation.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13427

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

作者:

Hoang-Nguyen Cao ↗Le-Hoang Bui ↗Dinh-Khoi Vo ↗Minh-Triet Tran ↗Trung-Nghia Le ↗

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts that describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20199

Evaluation of Image Matching for Art Skills Assessment

作者:

Asaad Alghamdi ↗Michael Poor ↗Trung-Nghia Le ↗Tam V. Nguyen ↗

While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining one's skill in the art of drawing requires proper comprehensive assessment. In this paper, we propose a method to measure drawing skill by by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby resolving the tedious and overwhelming traditional process. Using computer vision applications, determining image similarity involves identifying the level of similarities in an image with a reference image. We have implemented and analyzed the SIFT feature and Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based key point matching provides a more effective means of detecting drawing skills.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.15092

High-Dimensional Random Projection for Activation Steering in Language Models

作者:

Minh-Hieu Pham ↗Bach Do ↗Laziz Abdullaev ↗Tan Minh Nguyen ↗Khoat Than ↗

arXiv:2606.15092v1 Announce Type: new Abstract: Activation steering has emerged as a key methodology for controlling the behavior of large language models (LLMs). Existing difference-in-means based methods, however, are fundamentally limited: they capture only mean differences between class activations and fail to recover discriminative signals that naturally exist in the nonlinear feature subspace under the superposition hypothesis. Motivated by that, we propose High-Dimensional Random-projection for Activation Steering (HiDRA), a training-free approach that integrates seamlessly with existing activation steering methods. By performing activation addition in the projected high-dimensional space, HiDRA can provably capture a better discriminative structure beyond the reach of linear methods. Experiments across diverse LLM families and benchmarks demonstrate that HiDRA consistently outperforms baseline counterparts, achieving stronger behavioral control without significant computational overhead.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.15784

Bayesian Networks with Latent Time Embedding for Stage-Aware Causal Modeling of Alzheimer's Disease Progression

作者:

Nguyen Linh Dan Le ↗

arXiv:2606.15784v1 Announce Type: new Abstract: Alzheimer's disease (AD) progression is often described through the amyloid-tau-neurodegeneration, or AT(N), cascade. However, most longitudinal models represent this cascade either as a fixed sequence of biomarkers or as a black-box forecasting task. This makes it difficult to determine when biologically guided biomarker relationships influence future regional pathology. In this study, we introduce Bayesian Networks with Latent Time Embedding (BN-LTE), a Bayesian structural framework for stage-aware modeling of AD progression. BN-LTE estimates disease pseudotime from baseline biomarker profiles and constrains directed dependencies according to biologically plausible AT(N) ordering. Posterior spline-varying structural equations are then used to link initial multimodal measurements with future annualized regional tau-PET change. Across repeated subject-disjoint evaluations using ADNI data, BN-LTE shows strong spatial reconstruction of tau progression compared with the included forecasting baselines. Beyond spatial reconstruction, BN-LTE recovers posterior stage-varying AT(N)-constrained effects and identifies a mid-pseudotime window of amyloid sensitivity. This window is supported by model-implied g-formula contrasts, root-adjusted AIPW, mechanism-sensitive ablations, and robustness analyses across spline and prior specifications. Overall, these findings position BN-LTE as a Bayesian structural framework for forecasting tau progression while examining stage-dependent AT(N)-cascade mechanisms in observational longitudinal neuroimaging data. Our code is available at https://github.com/danleneurocom/BN-LTE.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.12673

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

作者:

Phan Nguyen ↗Dat Cao ↗Hien Chu ↗Khue Hoang ↗

arXiv:2606.12673v1 Announce Type: cross Abstract: Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2606.19749

Benchmarking Agentic Review Systems

作者:

Dang Nguyen ↗Wanqing Hao ↗Yanai Elazar ↗Chenhao Tan ↗

A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络