Academic Intelligence · Curated Daily

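Explore the Frontiers of Global Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar, and use large language models to automatically generate cross-disciplinary literature analysis briefings.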

01.
arXiv (CS.LG) 2026-06-12

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

arXiv:2305.08175v5. Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal; however, it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.
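For readers unfamiliar with noisy marginals, the sketch below shows the simplest possible baseline: count the records in each cell of a marginal and add independent Gaussian noise. It is not ResidualPlanner, which optimizes how noise is distributed across many marginals; the function name `noisy_marginal`, the toy records, and the choice of `sigma` are assumptions made for illustration, and calibrating `sigma` to a privacy budget is left out.

```python
import random
from collections import Counter
from itertools import product


def noisy_marginal(records, attrs, domains, sigma, rng):
    """Return the marginal over `attrs` with independent Gaussian noise per cell.

    Hypothetical helper for illustration only; sigma is not calibrated
    to any privacy guarantee here.
    """
    # Exact counts for every observed combination of attribute values.
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    # Enumerate the full domain so that empty cells also receive noise.
    cells = product(*(domains[a] for a in attrs))
    return {cell: counts.get(cell, 0) + rng.gauss(0.0, sigma) for cell in cells}


records = [
    {"race": "A", "age": "young"},
    {"race": "A", "age": "old"},
    {"race": "B", "age": "old"},
]
domains = {"race": ["A", "B"], "age": ["young", "old"]}

exact = noisy_marginal(records, ["race"], domains, 0.0, random.Random(0))
noisy = noisy_marginal(records, ["race", "age"], domains, 1.0, random.Random(0))
```

Because the noise has mean zero, each noisy cell is an unbiased estimate of the true count, which is the property the abstract attributes to matrix mechanisms.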

02.
bioRxiv (Bioinfo) 2026-06-17

AMaNITA: an end-to-end workflow for native tRNA nanopore sequencing data analysis

Transfer RNA (tRNA) molecules serve as essential adapters during protein translation. While direct RNA sequencing (DRS) via Oxford Nanopore Technologies has emerged as a powerful platform for systematic tRNAome profiling, we currently lack a simple and robust statistical framework for nanopore tRNA data analyses. Here, we address this gap by developing AMaNITA (Abundance, Modifications, and Nanopore Intensity Toolbox Application), an end-to-end bioinformatic workflow that enables simplified, robust, and scalable analyses of nanopore native tRNA sequencing datasets. AMaNITA streamlines the entire analytical trajectory: from upstream processing (basecalling, mapping, filtering, batch effect correction) to downstream assessment of differential tRNA abundance and modification stoichiometry. The workflow generates an interactive HTML report for data exploration and analysis, allowing the user to download the source data files and resulting plots. AMaNITA can be executed using Singularity from the command line, without requiring installation of dependencies.

03.
arXiv (CS.CL) 2026-06-11

Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 18.15x higher throughput, 7.15x faster time-to-first-token and 6.03x memory reduction compared to standard PyTorch inference on Apple Silicon, with comparable or higher throughput speedups of up to 95.81x on Intel and AMD processors.
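The claim that ternary weights remove the need for multiplication can be illustrated with a scalar sketch. This is not the Litespark-Inference SIMD kernel; `ternary_matvec` is a hypothetical name, and the code only shows why weights in {-1, 0, +1} reduce a matrix-vector product to additions and subtractions.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector
    using only additions and subtractions."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.0]
y = ternary_matvec(W, x)
```

A production kernel would pack the weights into a few bits each and process many of them per instruction; the branchy loop here is only meant to make the arithmetic visible.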

04.
arXiv (CS.CV) 2026-06-16

Training-Free Adversarial Robustness in Computational MRI

Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

05.
arXiv (CS.LG) 2026-06-16

Whole-Brain Connectomic Graph Model Enables Whole-Body Locomotion Control in Fruit Fly

arXiv:2602.17997v3. Animals perform coordinated whole-body movements under the control of neural systems shaped by brain-wide connectivity. The mapping of the whole-brain neural connections, or the connectomes, provides a natural graph for modeling sensorimotor information flow, yet its potential as a neural controller for embodied agents remains largely unexplored. Here, we introduce the Fly-connectomic Graph Model, which directly instantiates the whole-brain connectome of an adult Drosophila as a graph-structured neural controller for movements of a simulated biomechanical fruit fly via deep reinforcement learning. We achieve stable performance across diverse locomotion tasks, as well as better sample efficiency compared to both graph and non-graph baselines. Our results demonstrate a biologically informed way towards effective control policy design by translating whole-brain wiring principles into actionable architectural priors, while also improving the interpretability through dynamic information flow. This work also highlights the potential to bridge neuromechanics with embodied intelligence by providing a computational platform for investigating the sensorimotor transformation underlying animal behavior and a paradigm to advance the development of more nature-aligned intelligent systems.

06.
arXiv (CS.CL) 2026-06-24

RoPE-Aware Bit Allocation for KV-Cache Quantization

Existing low-bit KV-cache quantizers often treat each cached key as a flat vector. Under RoPE, however, a key's contribution to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should receive more bits. We introduce Block-GTQ, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE(TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths by marginal gain. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a ten-model diagnostic panel, cutting per-layer MAE by 32-80% at 2 and 3 b/dim K-only quantization and winning all 367/367 layer comparisons against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the LongBench-EN average from 36.87 to 53.31. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, without an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16's 54.2/37.9, whereas uniform TQ-MSE collapses to 0.0/0.0. We further implement a packed-cache serving path. On a single H800 GPU with Qwen2.5-3B-Instruct, packed K3V3 achieves 3.24x KV-cache compression with fp16-comparable quality, runs 1.34x faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K and 512K where fp16 OOMs. Code is available at https://github.com/JIA-Lab-research/blockgtq.
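Greedy allocation of integer bit widths by marginal gain can be sketched as follows. The distortion model is an assumption and not the paper's energy score: a block with energy `e` quantized at `b` bits is taken to have error proportional to `e * 4**(-b)`, the usual high-resolution approximation. The function name and the toy energies are likewise hypothetical.

```python
import heapq


def greedy_bit_allocation(energies, total_bits, max_bits=8):
    """Give out `total_bits` one at a time, each to the block whose
    modeled distortion drops the most (assumed model: e * 4**-b)."""
    bits = [0] * len(energies)

    def gain(i):
        # Distortion reduction from raising block i from bits[i] to bits[i] + 1.
        return energies[i] * (4.0 ** -bits[i] - 4.0 ** -(bits[i] + 1))

    # Max-heap via negated gains; ties fall back to the lower block index.
    heap = [(-gain(i), i) for i in range(len(energies))]
    heapq.heapify(heap)

    remaining = total_bits
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < max_bits:
            heapq.heappush(heap, (-gain(i), i))
    return bits


energies = [16.0, 4.0, 1.0, 1.0]   # toy per-block energy scores
bits = greedy_bit_allocation(energies, total_bits=8)
```

With an average budget of 2 bits per block, the high-energy blocks end up with more bits than the low-energy ones, which is the qualitative behavior the abstract describes.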

07.
arXiv (math.PR) 2026-06-12

The censored stochastic six-vertex model and parabolic Kazhdan–Lusztig $R$-polynomials

arXiv:2606.12670v1. We introduce a censored version of the stochastic six-vertex model. We show that for parameters $b_1 < b_2$, this model started from the initial condition $\mathbf{1}_{x>0}$ is stochastically dominated at any time by the blocking measure. This is a partial analog of the censoring inequality for monotone spin systems. In particular, this result allows us to control the behavior of second-class particles. The proof uses parabolic Kazhdan–Lusztig $R$-polynomials, whose appearance is explained using a connection between the stochastic six-vertex model and the Iwahori–Hecke algebras of symmetric groups. Furthermore, we find an intertwining relation for this process using normalized parabolic Kazhdan–Lusztig $R$-polynomials as an intertwining kernel.

08.
arXiv (CS.AI) 2026-06-24

Sensing Intelligence as a Trainable Metamaterial Property

arXiv:2605.23967v2. In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they are transduced into neural signals. In engineered systems, this processing burden is placed largely on electronics and computation, while the mechanical body is usually designed only for strength and stability. Here, we present sensing intelligence as a trainable property of the body. We show that the geometry of a metamaterial can be optimized to reshape external stimuli into internal signals that are easier for a neural network to interpret. Rather than hand-designing this physical preprocessing, we let the neural network train its own body for sensing by backpropagating the sensing loss to the body's design parameters through differentiable simulation. Across numerical and experimental sensing scenarios, the optimized body improves sensing accuracy by up to fivefold or reduces the number of required electronic sensors by nearly an order of magnitude.

09.
arXiv (CS.LG) 2026-06-16

GPT-Based Fast Simulation of CLAS12 Detector Hits via Conditional Autoregressive Generation

arXiv:2606.16035v1. Modern particle physics experiments have demonstrated an increasing need for fast, high-fidelity detector simulation as detector components have improved and subsequent computational requirements approach the limits of available resources. Recently, deep generative models have emerged as a promising alternative to traditional Monte-Carlo methods, with recent works drawing inspiration from large language models (LLMs) and self-supervised next-token prediction methods. In this work, we present an application of a GPT-style autoregressive transformer as a fast surrogate model for the calorimeter inside the CLAS12 experiment at the Thomas Jefferson National Accelerator Facility. The model is conditioned on incident momentum and generates realistic detector hits autoregressively across all nine calorimeter layers as sequences of strip, ADC, and TDC tokens. We demonstrate that the model faithfully reproduces hit multiplicity, spatial distributions, energy deposits, and the energy-momentum response of the electromagnetic calorimeter. The generator achieves inference rates exceeding 700 events per second on a single GPU, providing a substantial speedup over traditional Geant4-based simulations while maintaining physics fidelity essential for high-luminosity experimental programs.

10.
arXiv (CS.LG) 2026-06-11

Capacity-Constrained Online Convex Optimization with Delayed Feedback

arXiv:2606.11711v1. Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel "delayed and weighted" OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = \Omega(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + \sigma_{\max}/C)$, where $\sigma_{\max}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < \sigma_{\max}$, while remaining sublinear.
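The idea of randomizing tracking decisions and importance-weighting the observations can be illustrated in isolation. The sketch below is not the paper's scheduler and ignores delays and the capacity bookkeeping entirely; it only assumes each round is tracked independently with a fixed probability `track_prob` (a hypothetical parameter) and shows that rescaling tracked feedback by `1 / track_prob` keeps the estimate unbiased.

```python
import random


def importance_weighted(values, track_prob, rng):
    """Keep each value with probability `track_prob`; rescale kept values
    by 1 / track_prob and report 0.0 for rounds whose feedback is lost."""
    estimates = []
    for v in values:
        if rng.random() < track_prob:
            estimates.append(v / track_prob)   # tracked round: reweighted feedback
        else:
            estimates.append(0.0)              # untracked round: feedback is lost
    return estimates


rng = random.Random(0)
true_values = [1.0] * 20000                     # toy feedback, one value per round
estimates = importance_weighted(true_values, 0.25, rng)
mean_estimate = sum(estimates) / len(estimates)
```

Only about a quarter of the rounds are tracked, yet the average of the reweighted estimates stays close to the true average; the cost is higher variance, which is the kind of trade-off a regret bound under limited capacity has to account for.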

11.
arXiv (CS.CV) 2026-06-15

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

12.
arXiv (CS.AI) 2026-06-15

An Agentic Retrieval Framework for Autonomous Context-Aware Data Quality Assessment

arXiv:2606.13692v1. Data quality assessment is a critical prerequisite for effective data analytics and data-driven decision-making, yet it remains a challenging task due to the inherently context-dependent nature of data quality. Existing approaches often rely on static rules or manual assessment strategies, limiting their adaptability to diverse usage scenarios and constraining automation at scale. Recent advances in artificial intelligence, particularly large language models, offer new opportunities for automating data quality assessment, but raise concerns related to reliability, grounding, and execution safety. In this paper, we propose a unified agentic-retrieval framework for autonomous context-aware data quality assessment. The framework interprets natural-language descriptions of intended data usage, derives context-aware assessment strategies, and generates executable validation logic through a multi-agent workflow. To ensure operational reliability, the framework introduces a feasibility validation stage that evaluates the realism and executability of generated assessment specifications before execution, enabling iterative refinement when necessary. Accepted validation logic is executed deterministically to guarantee reproducible and auditable results. We implement the proposed framework as an end-to-end prototype and evaluate it across multiple usage scenarios applied to the same dataset. The results demonstrate that assessment outcomes adapt meaningfully to different intended uses, while feasibility-gated execution reduces unrealistic or non-executable rule generation. The proposed approach provides a practical foundation for deploying autonomous yet controlled data quality assessment in modern data-driven environments.

13.
arXiv (CS.CV) 2026-06-16

Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

Despite great advances, finding accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations and topology appearance. Minimal path models have exhibited their strong ability in addressing image segmentation tasks. However, the performance of minimal paths-based segmentation approaches is heavily influenced by model initialization, hence limiting their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes the major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. Firstly, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the possibility of accurately covering the objective region by these proposals. Secondly, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.

14.
arXiv (CS.CV) 2026-06-19

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.

15.
arXiv (CS.AI) 2026-06-18

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

arXiv:2606.18293v1. Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed "vibe coding." It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

16.
arXiv (CS.CV) 2026-06-19

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

Creating 3D visual illusions, a single 3D mesh that reveals entirely different semantics from various viewing angles, is a fascinating but tough challenge. Existing optimization-based methods are slow and can produce oversaturated colors. In contrast, naive stitching approaches fail to produce geometrically coherent objects. This results in visible unnatural seams and semantic leaks. In this paper, we present a fast and training-free framework for generating text-driven 3D visual illusions. Our approach decouples the generation into two stages. First, we propose a cross-space dual-branch denoising process. This process dynamically decodes 3D latents into voxel space for CLIP-guided orientation alignment and Signed Distance Field (SDF) blending, which ensures seamless geometric fusion. Second, we introduce a view-conditioned texture synthesis module that projects and aggregates view-specific 2D diffusion priors onto the fused geometry. Extensive experiments demonstrate that our method generates highly realistic, dual-semantic 3D illusions in just 3-5 minutes. It significantly outperforms existing methods in geometric integrity, semantic recognizability, and efficiency. Project page: https://siang1105.github.io/JanusMesh.github.io/

17.
arXiv (CS.AI) 2026-06-18

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

arXiv:2606.18882v1. This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.
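Maximum mean discrepancy (MMD), the alignment criterion named in this abstract, can be computed in a few lines. The sketch below uses the standard biased estimator of squared MMD with a Gaussian (RBF) kernel; the kernel choice, the bandwidth `gamma`, and the toy feature vectors are assumptions, since the abstract does not specify them.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||a - b||^2) between two feature vectors."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_k(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Toy "source" and "target" features; the target is shifted away from the source.
source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
target = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

same_domain = mmd2(source, source)
shifted_domain = mmd2(source, target)
```

In a domain adaptation setup this quantity would typically be added to the regression loss on intermediate features, so that minimizing it pulls the source and target feature distributions together.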

18.
arXiv (math.PR) 2026-06-25

Gaussian rigidity for infinite exchangeable sequences

arXiv:2606.25976v1. We prove a Gaussian rigidity theorem for infinite exchangeable sequences of real-valued random variables: the joint Gaussianity of a single pair of entries already forces the entire sequence to be a Gaussian process. This settles a conjecture raised by Newman (2026). The main analytic ingredient in the proof is Hardy's uncertainty principle. We also obtain a finite-dimensional vector-valued extension.

19.
arXiv (CS.CV) 2026-06-25

VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks

Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0% on Wan 2.7 and 74.8% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2% and the harmfulness score by 73.4% on average, while maintaining the model's utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.

20.
arXiv (CS.CV) 2026-06-16

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present LentiAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, LentiAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, LentiAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

21.
arXiv (CS.CV) 2026-06-25

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over density based on 3D scene complexity, yielding more faithful Gaussians, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks demonstrate that VolSplat achieves state-of-the-art performance, while producing more plausible and view-consistent results. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.

22.
arXiv (CS.CV) 2026-06-19

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low, mid, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.

23.
arXiv (CS.CV) 2026-06-24

DivRL: Disentangled Self-Similarity Rewards for Diverse Subject-Driven Generation

Subject-driven image generation faces an "Identity-Diversity Paradox", where strong identity preservation often leads to rigid and low-diversity outputs. We propose a post-training framework called DivRL that jointly optimizes identity consistency and structural diversity by leveraging disentangled visual features from a robust similarity model. Specifically, we introduce a Negative Self-Similarity Measure (nSSM) to quantify structural diversity, and Visual Semantic Matching (VSM) to evaluate identity consistency. We propose an "Explore-and-Suppress" strategy that treats VSM as a gated constraint: the model freely explores structurally diverse configurations, and only samples that violate the identity threshold are penalized via a quadratic hinge loss. This converts identity preservation from a competing objective into a feasibility constraint, allowing nSSM and VSM to improve jointly. Experiments demonstrate that our method pushes the model to generate images that are both consistent and diverse, improving structural diversity while maintaining comparable identity consistency.
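The gated constraint described here can be written down as a one-line reward. The sketch below is an assumption about the general form only: `diversity` stands in for an nSSM-style score, `identity` for a VSM-style score, and `threshold` and `penalty_weight` are hypothetical values, none of which come from the paper.

```python
def gated_reward(diversity, identity, threshold=0.8, penalty_weight=10.0):
    """Reward diversity freely; penalize identity only below the threshold,
    using a quadratic hinge max(0, threshold - identity) ** 2."""
    violation = max(0.0, threshold - identity)
    return diversity - penalty_weight * violation ** 2


# Identity above the threshold: the constraint is inactive, reward equals diversity.
feasible = gated_reward(diversity=0.5, identity=0.9)

# Identity below the threshold: the quadratic hinge subtracts a penalty.
infeasible = gated_reward(diversity=0.5, identity=0.6)
```

Because the penalty is exactly zero once the threshold is met, further increases in identity similarity earn nothing, which is what turns identity preservation into a feasibility constraint instead of a competing objective.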

24.
arXiv (CS.CV) 2026-06-19

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.

25.
arXiv (quant-ph) 2026-06-16

Ultracold atomic lattice systems for simulating topological phases: A review

arXiv:2606.16598v1. Owing to rapid recent progress, ultracold atomic lattice systems for simulating topological phases are now at a pivotal stage, evolving from established paradigms into increasingly versatile and programmable quantum simulators. In this review, we survey recent experimental advances across four major classes of platforms: optical lattices, including optical lattices with laser-assisted tunneling and optical Raman lattices; synthetic lattices in momentum or internal-state space; Floquet-engineered lattices; and optical tweezer arrays, all of which offer distinct capabilities for realizing and probing topological matter. For each class, we highlight representative experimental breakthroughs, the topological models that have been realized, and the advanced detection and characterization techniques employed, emphasizing how these complementary approaches collectively expand the frontier of quantum simulation. We also discuss emerging directions in strongly correlated and nonequilibrium topological phases, and conclude with an outlook on future prospects.