Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-17

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
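
The LogDet-based subset selection step mentioned above can be illustrated with a generic greedy sketch. This is not RAVA's actual procedure: the kernel construction (relevance-weighted cosine similarity), the function name `greedy_logdet_select`, and the toy inputs are all assumptions made for illustration. The idea shown is only that maximizing the log-determinant of a kernel submatrix favors items that are individually relevant yet mutually non-redundant.

```python
import numpy as np

def greedy_logdet_select(features, relevance, k, eps=1e-6):
    """Greedily pick up to k items that maximize log det of a kernel submatrix.

    features:  (n, d) array of candidate embeddings (hypothetical inputs).
    relevance: (n,) array of non-negative scores, e.g. viewpoint agreement.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Kernel L_ij = r_i * cos(f_i, f_j) * r_j: large diagonal means relevant,
    # large off-diagonal means redundant.
    kernel = relevance[:, None] * (feats @ feats.T) * relevance[None, :]
    selected = []
    for _ in range(k):
        best_idx, best_val = None, -np.inf
        for i in range(len(feats)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = kernel[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_val:
                best_idx, best_val = i, logdet
        if best_idx is None:
            break
        selected.append(best_idx)
    return selected

# Toy example: items 0 and 1 are near-duplicates, item 2 is complementary.
toy_features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
toy_relevance = np.array([1.0, 0.95, 0.9])
chosen = greedy_logdet_select(toy_features, toy_relevance, k=2)
```

In the toy example the second pick skips the near-duplicate candidate even though it has the higher relevance score, because adding it would make the kernel submatrix nearly singular.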

02.
arXiv (quant-ph) 2026-06-16

Exactly Solvable Quantum Model with Spin-Dependent Coulomb Interaction

arXiv:2501.05103v5 Announce Type: replace Abstract: In this work, we report an exactly solvable quantum model featuring a spin-dependent Coulomb interaction, described by the spin vector potential \(\vec{\mathcal{A}} = k (\vec{r} \times \vec{S}) / r^2\) together with a Coulomb-type scalar potential \(\varphi = \kappa / r\). The model is governed by the Schrödinger-type Hamiltonian \(\mathcal{H}_S = \vec{\Pi}^2 / (2M) + q \varphi\) in nonrelativistic quantum mechanics and by the Dirac-type Hamiltonian \(\mathcal{H}_D = c \vec{\alpha} \cdot \vec{\Pi} + \beta M c^2 + q \varphi\) in relativistic quantum mechanics, where \(\vec{\Pi} = \vec{p} - (q/c)\vec{\mathcal{A}}\) is the canonical momentum. We demonstrate two main results: (i) Just as the Coulomb-type scalar potential \(\mathcal{S}_{\mathrm{Maxwell}} = \{\vec{\mathcal{A}} = 0,\ \varphi = \kappa / r\}\) is a local exact solution of Maxwell's equations on \(r \neq 0\), the gauge potential \(\mathcal{S}_{\mathrm{YM}} = \{\vec{\mathcal{A}} = k (\vec{r} \times \vec{S}) / r^2,\ \varphi = \kappa / r\}\) constitutes a local exact solution of the Yang–Mills equations on the punctured region \(r \neq 0\). (ii) Both Hamiltonians \(\mathcal{H}_S\) and \(\mathcal{H}_D\) can be solved exactly in the presence of this spin-dependent Coulomb interaction. The resulting energy spectra are derived, and they naturally reduce to those of the ordinary hydrogen atom when the spin-dependent terms are neglected. Finally, we clarify the quantization conditions and the fixed-background interpretation of the model.

03.
arXiv (CS.CL) 2026-06-19

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models (LLMs). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test-time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level (RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM-based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

04.
arXiv (quant-ph) 2026-06-15

Emission of time-ordered photon pairs from a coherently-driven Kerr microcavity

arXiv:2601.06468v2 Announce Type: replace-cross Abstract: Weakly-interacting many-body systems possess remarkable quantum properties that are essential components of quantum technologies, and constitute a topic of fundamental interest. Here we show that in a solid-state nonlinear microcavity embedding discrete modes of exciton-dressed photons, we can isolate a single eigenmode of quantum fluctuations from the much brighter coherent fraction of the field. In this regime, we perform frequency- and time-resolved correlation measurements between photons on the red and blue side of the fluctuations spectrum. When the average number of fluctuation quanta is smaller than one, we observe the formation of large pairwise time-ordered correlations: red photon first and blue photon second. We show that this peculiar time-ordering correlation emerges spontaneously from the interplay between frequency-resolved detection and the non-trivial internal quantum structure of the elementary fluctuations.

05.
arXiv (CS.LG) 2026-06-12

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

arXiv:2505.20076v4 Announce Type: replace Abstract: Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

06.
arXiv (math.PR) 2026-06-18

The FBSDE approach to sine-Gordon up to $6\pi$

arXiv:2401.13648v3 Announce Type: replace-cross Abstract: We develop a stochastic analysis of the sine-Gordon Euclidean quantum field $(\cos (\beta \varphi))_2$ on the full space up to the second threshold, i.e. for $\beta^2 < 6 \pi$. The basis of our method is a forward-backward stochastic differential equation (FBSDE) for a decomposition $(X_t)_{t \geqslant 0}$ of the interacting Euclidean field $X_{\infty}$ along a scale parameter $t \geqslant 0$. This FBSDE describes the optimiser of the stochastic control representation of the Euclidean QFT introduced by Barashkov and one of the authors. We show that the FBSDE provides a description of the interacting field without cut-offs and that it can be used effectively to study the sine-Gordon measure to obtain results about large deviations, integrability, decay of correlations for local observables, singularity with respect to the free field, Osterwalder-Schrader axioms and other properties.

07.
arXiv (math.PR) 2026-06-12

Interference Queueing Networks: A Replica Mean-Field Approach in the Symmetric Setting

arXiv:2606.13264v1 Announce Type: new Abstract: We propose a model for evaluating the performance of wireless communication networks beyond the ubiquitous full-buffer assumption, under which every transmitter is always active. The network is represented by N interacting queues arranged on a torus, with homogeneous arrival rate and service rates depending on the activity of neighboring interferers. More precisely, each queue is associated with a transmitter-receiver pair, and its service rate is given by the Shannon capacity, which depends on the corresponding Signal-to-Interference-plus-Noise Ratio (SINR). Since interfering transmitters only emit when their queue is non-empty, the SINR and hence the service rate improves when neighboring queues are empty. We derive the stability region of the system, together with approximations of its stationary distribution and its exponential rate of convergence to stationarity. These approximations are obtained via a replica mean-field limit, for which we establish propagation of chaos and long-time behavior results.
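
The coupling between queues described above (service rate given by the Shannon capacity at the current SINR, with interferers counted only when their queue is non-empty) can be sketched as follows. The function name, the use of received powers as inputs, and the unit noise power and unit bandwidth are assumptions for illustration; the paper's torus geometry and path-loss model are not reproduced here.

```python
import math

def service_rate(signal_power, interferer_powers, queue_lengths, noise_power=1.0):
    """Shannon rate log2(1 + SINR) per unit bandwidth.

    Only interferers whose queue is non-empty contribute to the interference
    sum, so the rate increases when neighboring queues drain.
    """
    interference = sum(
        power for power, queue in zip(interferer_powers, queue_lengths) if queue > 0
    )
    sinr = signal_power / (noise_power + interference)
    return math.log2(1.0 + sinr)

# Two neighbors with equal received power: both idle versus both busy.
rate_idle_neighbors = service_rate(3.0, [1.0, 1.0], [0, 0])
rate_busy_neighbors = service_rate(3.0, [1.0, 1.0], [4, 2])
```

With both neighbors idle the SINR is 3 and the rate is 2 bits per channel use; with both busy the SINR drops to 1 and the rate halves.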

08.
bioRxiv (Bioinfo) 2026-06-13

PertDiffBench: Benchmarking Diffusion Models for Single-Cell Perturbation Response Prediction

Diffusion models are increasingly used to predict transcriptional responses to perturbations, but whether they improve on simpler generative and representation-based baselines remains unclear. Existing evaluations often do not separate the effects of model architecture, input representation, biological context and metric choice, making it difficult to determine where diffusion-based methods are useful. Here we introduce PertDiffBench, a standardized benchmark for diffusion-based transcriptomic perturbation prediction across single-cell and bulk RNA-seq datasets. PertDiffBench evaluates diffusion-based models across three complementary evaluation settings: standard prediction in known single-cell contexts and bulk perturbation conditions, generalization to unseen cell types, species, drugs and intermediate time points, and stress tests of feature dimensionality, input representation, noise type and gene ordering. Across these settings, diffusion models did not show a consistent advantage. scGen remained a strong baseline in common prediction tasks, whereas scDiffusion was the most competitive diffusion-based method in several generalization settings. Temporal imputation showed a different pattern, with a simple DDPM operating directly in expression space outperforming more specialized models. Stress tests showed that performance was model dependent and sensitive to feature dimensionality, encoder choice, noise type and gene ordering. Pretrained encoders did not consistently improve performance, with the classical scVI representation slightly exceeding STATE in seen-condition and unseen-cell-type settings. These results indicate that diffusion-model performance in perturbation response prediction depends strongly on task design and representation choice. PertDiffBench provides a practical framework for evaluating these models under biologically varied and stress-tested conditions.

09.
arXiv (CS.AI) 2026-06-11

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

arXiv:2606.11869v1 Announce Type: cross Abstract: Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and in production. We present it as a transferable practice, independent of any language or framework.

10.
arXiv (CS.LG) 2026-06-15

When Language Representations Interact: Separability and Cross-Lingual Effects in LLMs

arXiv:2606.14347v1 Announce Type: new Abstract: Large language models exhibit strong multilingual capabilities; however, their internal representations are difficult to interpret. Understanding how language representations interact is important for ensuring reliable behavior in multilingual systems. Recent work has shown that causal-geometric structure can explain how certain concepts are encoded as approximately linear and separable directions, but whether this framework extends to multilingual models, where language identity is correlated and hierarchical, is underexplored. We apply causal-geometric analysis to multilingual LLMs, studying 28 bilingual contrasts across three models, allowing us to analyze when languages behave as approximately independent factors and when structured dependencies persist. We find evidence that language concepts admit stable linear representations that are largely separable under a covariance-adjusted (causal) inner product, with structured deviations reflecting linguistic similarity. Moreover, languages within the same family (such as Germanic or Romance) exhibit a simplex-like geometric structure, suggesting hierarchical organization. These results extend causal-geometric interpretability to multilingual settings and provide insight into how separability and similarity may exist in multilingual LLM representations, motivating interpretability analyses that diagnose when and how structured dependencies between concepts can be anticipated. This has implications for trustworthy deployment, as residual structure between languages may lead to unintended cross-lingual effects when models are monitored or intervened upon.

11.
arXiv (CS.AI) 2026-06-17

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

arXiv:2606.17416v1 Announce Type: cross Abstract: Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual training, embeddings often encode language cues with speaker identity, causing speakers to form language-specific clusters. We propose L-Proto, a language-aware episodic prototypical training strategy that constructs language-consistent episodes. By sampling speakers from a single language per episode, L-Proto reduces language-driven variation during training and encourages embeddings to focus more directly on speaker identity. Experiments on the TidyVoice Challenge benchmark demonstrate consistent performance improvements over conventional fine-tuning and random episodic sampling across multiple backbone architectures.
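
The episode construction described above (all speakers in an episode drawn from a single language) can be sketched as a small sampler. Only the sampling step is shown, not the prototypical loss. The data layout `(speaker_id, language, utterance_id)`, the function name, and the eligibility rule are assumptions for illustration rather than details taken from the paper.

```python
import random
from collections import defaultdict

def sample_language_consistent_episode(utterances, n_speakers, n_utts, rng):
    """Sample one episode whose speakers and utterances share a single language.

    utterances: iterable of (speaker_id, language, utterance_id) tuples.
    Returns (language, {speaker_id: [utterance_id, ...]}).
    """
    by_lang = defaultdict(lambda: defaultdict(list))
    for speaker, language, utt in utterances:
        by_lang[language][speaker].append(utt)

    # Keep only languages that have enough speakers with enough utterances.
    eligible = {}
    for language, speaker_map in by_lang.items():
        speakers = [s for s, utts in speaker_map.items() if len(utts) >= n_utts]
        if len(speakers) >= n_speakers:
            eligible[language] = sorted(speakers)
    if not eligible:
        raise ValueError("no language has enough speakers for an episode")

    language = rng.choice(sorted(eligible))
    speakers = rng.sample(eligible[language], n_speakers)
    episode = {s: rng.sample(by_lang[language][s], n_utts) for s in speakers}
    return language, episode

# Toy corpus: speaker s1 appears in two languages, but only "en" is eligible.
toy_utterances = [
    ("s1", "en", "a"), ("s1", "en", "b"),
    ("s2", "en", "c"), ("s2", "en", "d"),
    ("s3", "de", "e"), ("s3", "de", "f"),
    ("s1", "de", "g"),
]
episode_lang, episode = sample_language_consistent_episode(
    toy_utterances, n_speakers=2, n_utts=2, rng=random.Random(0)
)
```

Because utterances are grouped by language before speakers are drawn, a multilingual speaker contributes only recordings in the episode's language.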

12.
arXiv (quant-ph) 2026-06-19

Charge-Conjugation Violation and Population Asymmetry in Bipartite Fermionic Lattices

arXiv:2606.06138v2 Announce Type: replace-cross Abstract: Charge conjugation violation (CCV) is a central concept in particle physics and also appears for quasiparticles in quantum many-body systems, where it typically relies on an external symmetry breaking imposed on the underlying system. An open question is how an intrinsic CCV mechanism could emerge and what its macroscopic consequences would be. We establish sublattice kinks in bipartite fermionic lattices as a concrete setup showing intrinsic CCV. The intrinsic CCV of the sublattice kink is based on the graph-topological nature of the underlying Hamiltonian, with no explicit symmetry breaking taking place. It leads to a population asymmetry of different configurations and imprints a hidden leaf-like structure in the eigenenergy spectrum. The population asymmetry also leads to an imbalanced sublattice-kink production triggered by the vacuum instability in the quench dynamics. Our work demonstrates the graph topology as the microscopic origin of intrinsic CCV, with the population asymmetry as the macroscopic consequence, and the proposed setup is highly amenable to experimental implementation via cold-atom quantum simulators.

13.
arXiv (CS.LG) 2026-06-12

Estimating Individualized Treatment Effects in Acute Ischemic Stroke with Causal Transformation Models (TRAM-DAG): A Multi-Centre Observational Study with External RCT Validation

arXiv:2606.12623v1 Announce Type: cross Abstract: Personalized medicine in acute ischemic stroke requires moving beyond average treatment effects (ATE) to individualized treatment effect (ITE) estimates to support treatment decisions. In acute ischemic stroke, mechanical thrombectomy has been shown to be more effective on average than lysis in randomized controlled trials (RCTs), such as the MR CLEAN study. We aim to identify which individual patients benefit most from mechanical thrombectomy compared to lysis. The outcome of interest is the modified Rankin Scale (mRS) at three months, an ordinal measure of functional disability (0: no symptoms, 6: death). We demonstrate that causal transformation models on directed acyclic graphs (TRAM-DAG) can be used for ITE estimation after being fitted on observational MAGIC multi-center stroke patient data. To ensure comparability with the MR CLEAN population, which we use for validation, we train the TRAM-DAG on a MAGIC sub-population with NIHSS at admission >= 6, corresponding to one inclusion criterion of MR CLEAN. The fitted model is then used to estimate ITEs for stroke patients in the MR CLEAN population. While these ITE estimates cannot be confirmed experimentally, we show that their average is consistent with the trial's reported ATE. Furthermore, the ITE estimates correctly rank trial patients by their observed frequency of a good outcome (mRS at three months

14.
arXiv (CS.AI) 2026-06-11

A Physics-Inspired Optimizer: Velocity Regularized Adam

arXiv:2505.13196v3 Announce Type: replace-cross Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic nonconvex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers including AdamW. We benchmark on various tasks such as image classification, language modeling, and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
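
The mechanism described above, a global learning rate that shrinks when the velocity is large combined with Adam's per-parameter scaling, can be sketched as a single update step. The specific damping form `lr / (1 + min(beta3 * ||m||^2, cap))`, the hyperparameter names `beta3` and `cap`, and their default values are assumptions made for illustration; the abstract does not state the exact rule, so this should not be read as the authors' algorithm.

```python
import numpy as np

def velocity_damped_adam_step(param, grad, state, lr=1e-3, beta1=0.9,
                              beta2=0.999, beta3=1.0, cap=19.0, eps=1e-8):
    """One Adam-style step with a velocity-dependent global learning rate.

    state holds the step count "t" and the moment estimates "m" and "v".
    Returns the updated parameters and the effective learning rate used.
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])

    # Assumed damping: squared norm of the momentum buffer acts as "velocity";
    # the cap bounds how far the learning rate can shrink.
    speed_sq = float(np.sum(state["m"] ** 2))
    lr_eff = lr / (1.0 + min(beta3 * speed_sq, cap))

    new_param = param - lr_eff * m_hat / (np.sqrt(v_hat) + eps)
    return new_param, lr_eff

def fresh_state(dim):
    """Zero-initialized optimizer state for a parameter vector of size dim."""
    return {"t": 0, "m": np.zeros(dim), "v": np.zeros(dim)}

_, lr_small_grad = velocity_damped_adam_step(
    np.zeros(2), np.full(2, 1e-4), fresh_state(2))
_, lr_large_grad = velocity_damped_adam_step(
    np.zeros(2), np.full(2, 1e3), fresh_state(2))
```

With a tiny gradient the effective learning rate stays essentially at its base value, while a very large gradient drives it down to `lr / (1 + cap)`.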

15.
arXiv (CS.AI) 2026-06-19

Beyond Accuracy: Measuring Logical Compliance of Predictive Models

arXiv:2606.20208v1 Announce Type: new Abstract: Machine learning models are predominantly evaluated through predictive performance metrics such as ranking quality, prediction error, or classification accuracy. While these metrics effectively quantify how closely predictions match the ground truth, they do not assess whether model outputs respect predefined logical or domain-specific constraints. In high-stakes applications, including healthcare, finance, and autonomous systems, logical consistency can be as critical as predictive accuracy, yet no standard metric captures this dimension. We introduce the Rule Violation Score (RVS), a complementary evaluation metric that quantifies the extent to which a predictive model respects a given set of logical rules, independently of predictive accuracy. RVS treats hard rules (strict constraints) and soft rules (statistical regularities) differently, can be evaluated on any dataset and on any predictive model expressed over a relational vocabulary, and can be computed using SQL queries that are automatically generated for Horn rules. Beyond evaluating models, RVS can also evaluate the logical consistency of training datasets and help identify poorly defined rules. We evaluate RVS on three benchmarks covering knowledge graph link prediction and relational regression, including rule-based, embedding-based, and neuro-symbolic predictive models. Our results demonstrate that two models achieving comparable predictive accuracy can exhibit substantially different levels of logical compliance, revealing differences in model behavior that standard metrics fail to capture.
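
The claim that rule compliance can be checked with SQL queries over a relational vocabulary can be illustrated with a toy Horn rule, `parent(x, y) AND parent(y, z) -> grandparent(x, z)`. The tables, the data, and the simple violation fraction computed here are assumptions for illustration; the paper's RVS definition, including its distinct treatment of hard and soft rules, is not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent(x TEXT, y TEXT);
CREATE TABLE grandparent(x TEXT, z TEXT);  -- facts predicted by some model
INSERT INTO parent VALUES ('ann', 'bob'), ('bob', 'cat'), ('bob', 'dan');
INSERT INTO grandparent VALUES ('ann', 'cat');
""")

# Groundings of the rule body: pairs (x, z) linked through some y.
body_sql = """
SELECT DISTINCT p1.x AS x, p2.y AS z
FROM parent AS p1 JOIN parent AS p2 ON p1.y = p2.x
"""

# A grounding violates the rule when the body holds but the head is absent.
violations_sql = f"""
SELECT COUNT(*) FROM ({body_sql}) AS b
WHERE NOT EXISTS (
    SELECT 1 FROM grandparent AS g WHERE g.x = b.x AND g.z = b.z
)
"""

n_body = conn.execute(f"SELECT COUNT(*) FROM ({body_sql})").fetchone()[0]
n_violations = conn.execute(violations_sql).fetchone()[0]
violation_rate = n_violations / n_body if n_body else 0.0
conn.close()
```

Here the body has two groundings, (ann, cat) and (ann, dan), and only the second lacks a matching head fact, so half of the groundings violate the rule.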

16.
arXiv (CS.LG) 2026-06-11

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

arXiv:2606.12054v1 Announce Type: new Abstract: Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.
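
One standard identity that makes per-example weight noise compatible with batched computation is the following: if each example sees its own perturbation E with i.i.d. N(0, sigma^2) entries, then x(W + E) has the same distribution as xW + sigma * ||x|| * eps with eps standard normal in the output dimension. Whether this is exactly the identity the paper uses is an assumption; the sketch below, with hypothetical names, only shows the isotropic case.

```python
import numpy as np

def noisy_linear(x, weight, sigma, rng):
    """Linear layer with an independent isotropic weight perturbation per example.

    x:      (batch, d_in) inputs
    weight: (d_in, d_out) shared weights
    Equivalent in distribution to using weight + E_b for each example b, where
    E_b has i.i.d. N(0, sigma^2) entries, without materializing any E_b.
    """
    clean = x @ weight
    # x_b @ E_b is Gaussian with standard deviation sigma * ||x_b|| per output.
    scale = sigma * np.linalg.norm(x, axis=1, keepdims=True)
    return clean + scale * rng.standard_normal(clean.shape)

rng = np.random.default_rng(0)
x_batch = np.tile(np.array([[3.0, 4.0]]), (20000, 1))  # every row has norm 5
zero_weight = np.zeros((2, 3))
noisy_out = noisy_linear(x_batch, zero_weight, sigma=0.1, rng=rng)
empirical_std = float(noisy_out.std())
```

The noise is sampled in the output space at a cost of one random draw per output unit per example, instead of one full weight-shaped draw per example.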

17.
arXiv (CS.CV) 2026-06-16

Towards Global AI-Driven Cervical Cancer Screening

The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private data set of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy data sets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.

18.
arXiv (quant-ph) 2026-06-11

Fisher geometry reshapes the effect of incompatibility in multiparameter quantum estimation

arXiv:2606.11343v1 Announce Type: new Abstract: Multiparameter quantum estimation faces two fundamental obstacles: sloppiness, i.e., anisotropy of the quantum Fisher information matrix (QFIM) that renders some parameter directions insensitive, and incompatibility, the non-commutativity of optimal measurements for different parameters. The trade-off bound $C_T$ captures their joint impact on precision, but it has remained unclear how the distribution of incompatibility across parameter planes affects its overall cost. Here we separate the total amount of incompatibility from its location. We introduce a dimensionless quantity $G_n^{(F)}$ that measures the alignment between the incompatibility distribution and the eigenvalues of the QFIM, and show how the Frobenius scale of the incompatibility contribution factorizes. We obtain a bound and prove the incompatibility cost lies between this bound and a rank-dependent multiple thereof. We also prove that at fixed sloppiness, or equivalently fixed Fisher volume, concentrating incompatibility into a single parameter plane reduces the optimized trade-off cost because the Fisher geometry can then be reshaped to allocate more Fisher area to that plane. A qutrit $SU(2)$ encoding numerically confirms that states with larger incompatibility strength can nevertheless incur a smaller cost if the matching factor $G$ is sufficiently small. Our results establish that the distribution of incompatibility relative to the Fisher eigenbasis is a central diagnostic for multiparameter estimation, beyond the total incompatibility strength.

19.
arXiv (CS.CL) 2026-06-15

FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Large language models are known to produce hallucinations (factually incorrect or fabricated information), which poses significant challenges for many natural language processing applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or non-verifiable facts, making the use of a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains a challenging task for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.

20.
arXiv (CS.CV) 2026-06-16

Variational Deep Unfolding with Mamba-Based Nonlocal Modeling for Underwater Image Enhancement

Underwater imaging plays a crucial role in ocean engineering, although captured data often suffer from poor visibility and color distortion. To address these challenges, we propose a model-based deep unfolding network for underwater image enhancement that integrates variational modeling into a learnable architecture. The framework is guided by a variational formulation based on a dehazing decomposition, incorporating a multiplicative residual component to absorb remaining artifacts and a nonlocal gradient-type constraint to preserve structural details and enhance edge sharpness. We provide a theoretical analysis establishing the existence of a solution for the associated minimization problem. The proposed unfolding method incorporates Mamba layers to efficiently capture self-similarities in the scene. In addition, we introduce a proximal trajectory loss that enforces consistency between the unfolding stages and the iterations of an ideal restoration regularizer. Experimental results demonstrate that the proposed unfolding approach achieves improved visual quality and competitive quantitative performance compared with recent state-of-the-art methods. The source code will be available at https://github.com/MIA-UIB/Variational-Unfolding-Mamba-Underwater-Enhancement.

21.
arXiv (CS.CV) 2026-06-16

LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

22.
arXiv (CS.CV) 2026-06-18

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$\lambda$ Mix Interaction ($L\lambda MI$) block. Comprising Local-$\lambda$ and Interactive-$\lambda$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2\% of the parameters (0.22B vs. 11.9B) while delivering a $>15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.
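
The parameter claim above can be checked with a few lines of arithmetic. This is only a sanity check on the two figures quoted in the abstract (0.22B and 11.9B); the helper name `param_fraction` is hypothetical and not part of the Moebius code.

```python
def param_fraction(small_b, large_b):
    """Return the small model's parameter count as a percentage of the large one."""
    return 100.0 * small_b / large_b

# Figures quoted in the abstract, in billions of parameters
pct = param_fraction(0.22, 11.9)
ratio = 11.9 / 0.22

print(f"{pct:.2f}% of the parameters, about {ratio:.0f}x smaller")
```

The result, roughly 1.85%, is consistent with the stated "less than 2%" and corresponds to a model about 54 times smaller by parameter count.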

23.
arXiv (CS.CV) 2026-06-11

Bridging the Modality Gap in Forensic Image Retrieval

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
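
As a rough illustration of the fusion step described above, the sketch below ranks gallery items by a weighted sum of image-based and text-based cosine similarity. It is a minimal sketch under stated assumptions, not the authors' pipeline: the dictionary keys `"img"` and `"txt"`, the function names, and the weight `alpha` are all hypothetical, and the abstract does not say how the two scores are actually combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fused_ranking(query, gallery, alpha=0.5):
    """Rank gallery indices by a weighted sum of image and text similarity.

    `query` and each gallery item are dicts with hypothetical keys "img"
    and "txt" holding precomputed embeddings. `alpha` is an assumed weight
    on the image score; the abstract does not specify one.
    """
    scores = []
    for idx, item in enumerate(gallery):
        s_img = cosine(query["img"], item["img"])
        s_txt = cosine(query["txt"], item["txt"])
        scores.append((alpha * s_img + (1 - alpha) * s_txt, idx))
    # Highest fused score first
    return [idx for _, idx in sorted(scores, key=lambda p: -p[0])]

# Toy example: item 0 matches visually, item 1 matches mostly by description
query = {"img": [1.0, 0.0], "txt": [1.0, 0.0]}
gallery = [
    {"img": [1.0, 0.0], "txt": [0.0, 1.0]},
    {"img": [0.6, 0.8], "txt": [1.0, 0.0]},
]
balanced = fused_ranking(query, gallery, alpha=0.5)
image_only = fused_ranking(query, gallery, alpha=1.0)
```

In this toy case the image-only ranking prefers item 0, while the balanced fusion prefers item 1 because its text description matches the query, which is the kind of effect one would expect when visual evidence is weak or noisy.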

24.
arXiv (CS.CV) 2026-06-18

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of approximately 8 degrees, supporting reliable plaque burden quantification.
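
To make concrete how delineation errors propagate to plaque burden, the sketch below computes a Dice score and the standard plaque-burden ratio (plaque area, i.e. EEM area minus lumen area, divided by EEM area). It is an illustrative sketch only: the function names and the toy areas are assumptions, and none of it is taken from the GeoCat implementation.

```python
def dice(pred, truth):
    """Dice coefficient for two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match
    return 2.0 * inter / total if total else 1.0

def plaque_burden(eem_area, lumen_area):
    """Plaque burden in percent: (EEM area - lumen area) / EEM area."""
    if eem_area <= 0:
        raise ValueError("EEM area must be positive")
    return 100.0 * (eem_area - lumen_area) / eem_area

# Toy areas in mm^2 (made up for illustration, not from the paper)
reference = plaque_burden(10.0, 4.0)
# Same vessel with the lumen area overestimated by 10%
perturbed = plaque_burden(10.0, 4.4)
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

With these made-up numbers, a 10% overestimate of lumen area shifts plaque burden from 60% to about 56%, which illustrates why a good overlap score alone may not guarantee accurate clinical measurements.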

25.
arXiv (CS.AI) 2026-06-16

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

arXiv:2511.05522v4. Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods such as ray tracing are, however, computationally demanding and ill-suited to modeling dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S, over 100x faster than GPU-accelerated ray-tracing-based radio maps. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrates near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.