Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-17

Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting

A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields typically require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth 'one-ring coordinate' blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces on various topologies and geometric complexity, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.

02.
arXiv (CS.AI) 2026-06-16

Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course

arXiv:2606.16842v1 Announce Type: cross Abstract: Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

03.
arXiv (CS.AI) 2026-06-16

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

arXiv:2606.15278v1 Announce Type: cross Abstract: Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

04.
arXiv (CS.AI) 2026-06-17

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

arXiv:2606.17668v1 Announce Type: cross Abstract: Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurate forecast of the outcomes of a MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, our ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics derived molecular datasets. Our results indicate that ASTEROID achieved not only a higher level of accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced computational cost of conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.

05.
arXiv (CS.LG) 2026-06-16

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

arXiv:2606.05878v2 Announce Type: replace Abstract: Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder–regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

06.
arXiv (CS.AI) 2026-06-11

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

arXiv:2604.20348v2 Announce Type: replace-cross Abstract: Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.

08.
arXiv (quant-ph) 2026-06-12

Theoretical Study for Generating Optical GKP State via a Single-Photon-Added Squeezed Vacuum

arXiv:2606.12467v1 Announce Type: new Abstract: A theoretical framework is developed to analyze the generation of the optical GKP state using a single-photon-added squeezed vacuum. This state, defined by the squeezing parameter $r$, is injected into a 50:50 beam splitter, and the optical GKP state is obtained through conditional measurement at one output port. The single-photon-added squeezed vacuum is especially prominent in this context because it provides a simpler and more experimentally accessible ingredient than Schrodinger cat states, while conditional measurement ensures projection onto a state that closely approximates the finite-energy GKP form. Fidelity is employed to quantify this closeness, and the analysis demonstrates that the scheme achieves a maximum fidelity of 85% at a squeezing level of $3.76 \ dB$. This performance surpasses approaches based on squeezed optical odd Schrodinger cat states, underscoring the single-photon-added squeezed vacuum as a practical and effective pathway toward fault-tolerant photonic quantum computing.

09.
arXiv (CS.AI) 2026-06-19

Latent Confounded Causal Discovery via Lie Bracket Geometry

arXiv:2606.19610v1 Announce Type: cross Abstract: Recent work on Kan-Do-Calculus (KDC) has established that the boundary between passive observation and active intervention in causal inference is a category-theoretic bi-adjunction, with interventions modeled by left Kan extensions and conditioning by right Kan extensions. This paper introduces two causal discovery algorithms under latent confounding, building on the information-geometric and categorical consequences of KDC. In smooth statistical settings, Radon-Nikodym derivatives between observational and interventional measures induce local causal vector fields; failures of these fields to close under Lie brackets become computable Frobenius residuals, which we interpret as witnesses of failed visible integrability and possible latent or unmodeled structure. Our first algorithm, BRIDGE (Bracket Residuals for Interventional Discovery and Geometric Estimation), combines an interventional density or Radon-Nikodym-ratio engine with a geometric screen that proposes a high-recall family of admissible arrows, identifies non-closing visible pairs as latent-obstruction candidates, and passes the reduced family to downstream score-based or differentiable discovery routines. The second algorithmic contribution, Spectral Kan-Do Flow Matching (SKFM), learns amortized intervention fields and factors latent curvature spectrally, exposing the direct Lie-space endpoint toward which BRIDGE points. A detailed set of experiments show that both algorithms are capable of discovering causal models with latent confounders while collapsing the super-exponential space of possible DAGs by many orders of magnitude. This paper introduces a new paradigm in causal discovery, where latent structure is inferred directly from the geometry of intervention-induced flows.

10.
arXiv (CS.LG) 2026-06-16

Multiscale Hypersonic Boundary Layer Reconstruction via Spectral Binning and Subdomain-wise Conditional Diffusion

arXiv:2606.15023v1 Announce Type: cross Abstract: We propose a multiscale probabilistic reconstruction framework for hypersonic Couette flow, where near-wall states are inferred from limited top-wall observations using conditional diffusion model. The boundary layer is divided into overlapping wall-normal subdomains, and a single height- and Mach-conditioned Elucidating Diffusion Model (EDM) is trained jointly for M=6,7,8 to sample velocity, density, pressure, and temperature fields conditioned on a top-wall boundary slice. A soft overlap inpainting strategy assembles subdomain predictions into full-volume reconstructions while maintaining inter-subdomain continuity and small-scale variability. To improve the spectral fidelity of the generated fields, we introduce a novel bounded binned spectral power (BSP) loss that preserves high-wavenumber content while remaining numerically stable across the diffusion noise schedule. Validation against direct numerical simulation data shows that the model recovers instantaneous structures, spectra, statistical profiles, correlations, and wall quantities across all training Mach numbers, while providing spatially structured uncertainty estimates. The reconstructed Mach-conditioned profiles also collapse under the Trettel-Larsson transformation, indicating consistency with compressibility scaling. These results establish the domain decomposed conditional diffusion model with a bounded binned spectral loss as an effective probabilistic surrogate for near-wall reconstruction in hypersonic wall-bounded turbulence.

11.
arXiv (CS.LG) 2026-06-17

Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

arXiv:2606.17413v1 Announce Type: new Abstract: Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA's Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Due to the difficulties of ground-truth data for real satellite observations, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational "full-physics" solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.

12.
medRxiv (Medicine) 2026-06-12

Design, Implementation, and Evaluation of a Shadowing Program for Medical Students in the Basic Sciences Phase

Introduction Shadowing, as an educational method based on active observation, can foster a realistic understanding of professional roles and enhance the communication skills of medical students. This study aimed to design, implement, and evaluate a shadowing program for basic sciences medical students. Methods This development study was conducted based on the ADDIE model in five phases. The study population consisted of 799 medical students in semesters 2 to 5. The stages included Analysis (determining needs through literature review and expert panels), Design (specifying learning environments and evaluation methods), Development (preparing guides and educational tools), Implementation (within the Medical Ethics course), and Evaluation (using questionnaires and reflection forms). Findings This study aimed to design and evaluate an educational shadowing program based on the ADDIE model. In the Analysis phase, the profiles of 799 students and learning objectives were determined. In the Design phase, a structured program for four types of shadowing was designed. In the Development phase, all guides and educational tools were prepared. In the Implementation phase, the program was carried out with complete coverage and adherence to ethical considerations. Finally, the program evaluation showed that "Motivation to become a good physician" (3.75-3.95) and "Enhancing empathy" (3.50-3.94) received the highest scores, while "Increasing understanding of the basic science-clinical connection" (2.53-2.89) and "Willingness to attend on holidays" (1.87-2.31) received the lowest scores. Conclusion The findings indicate that implementing the shadowing program is an effective method for strengthening the professional attitudes and academic motivation of medical students. However, the program did not significantly improve students perception of the basic science-clinical connection, indicating a need for curricular refinement. The continuation and extension of this program to other levels and fields of medical sciences are recommended.

13.
arXiv (quant-ph) 2026-06-17

The Standard Model, The Exceptional Jordan Algebra, and Triality

作者:

arXiv:2006.16265v2 Announce Type: replace-cross Abstract: Jordan, Wigner and von Neumann classified the possible algebras of quantum mechanical observables, and found they fell into 4 "ordinary" families, plus one remarkable outlier: the exceptional Jordan algebra. We point out an intriguing relationship between the complexification of this algebra and the standard model of particle physics, its minimal left-right-symmetric $SU(3)\times SU(2)_{L}\times SU(2)_{R}\times U(1)$ extension, and $Spin(10)$ unification. This suggests a geometric interpretation, where a single generation of standard model fermions is described by the tangent space $(\mathbb{C}\otimes\mathbb{O})^{2}$ of the complex octonionic projective plane, and the existence of three generations is related to $SO(8)$ triality.

14.
Nature (Science) 2026-06-17

Fast formation to reinforce lithium-rich cathodes

作者:

Formation in lithium-ion battery manufacturing typically involves low-rate charge–discharge cycles to establish stable electrode–electrolyte interfaces—a time-consuming process1–4. Here, our findings on lithium-rich layered oxide cathodes challenge the necessity of conventional formation, which can even shorten battery lifespan. Fast formation, on the other hand, reduces production cost and enhances capacity and stability. Multiscale synchrotron-based techniques show that residual lithium ions after the initial charge are critical for subsequent structural evolution and cycling performance. Deep lithium de-intercalation causes severe structural degradation and capacity loss due to the inherently fragile lithium-deficient matrix. By contrast, the residual lithium ions from fast formation enhance reversibility through a self-pinning effect, preventing pernicious lattice deformation and reinforcing the ion-storage framework. Adjusting the initial charge current density from 0.2 C to 2 C improves reversible capacity by 20% and extends cycle life by more than 36%. This approach can also be extended to other electrode systems, providing insights for more-efficient battery production. Fast formation in lithium-ion batteries outperforms conventional slow formation, lowering costs and improving battery capacity, stability and cycle life, offering broader application to electrode systems.

15.
arXiv (CS.AI) 2026-06-12

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

arXiv:2606.13258v1 Announce Type: new Abstract: Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

16.
arXiv (CS.CV) 2026-06-16

KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, balancing these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze knowledge distribution of Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with a shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while simultaneously assigning smaller update magnitudes to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.

17.
arXiv (CS.CL) 2026-06-18

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

18.
Nature (Science) 2026-06-17

Reimagining machine vision with optical computing

作者: 未知作者

A general-purpose artificial-intelligence vision system for use in image-sensing devices has been developed by embedding fundamentals of core computer-vision operations into a light-manipulating planar material called an optical metasurface. A prototype enables accurate, real-time perception and processing across diverse tasks, suggesting that this could be a solution for rapid, low-energy, on-device vision intelligence. A specialized ‘metasurface’ can preprocess incoming scene information on image-generating devices.

19.
arXiv (CS.LG) 2026-06-12

Strategic PAC Learnability via Geometric Definability

arXiv:2605.13426v3 Announce Type: replace Abstract: Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.

20.
arXiv (quant-ph) 2026-06-19

Measuring Rényi entropy with an Echo Protocol

arXiv:2504.05237v3 Announce Type: replace Abstract: We present efficient and practical protocols to measure the second Rényi entropy, whose exponential is known as the purity. Our approach is based on expressing the purity in terms of transition probabilities generated by an echo-type forward-backward evolution sequence, making it applicable to quantum many-body systems. Notably, our approach does not rely on random-noise averaging, a feature that can be extended to protocols to measure out-of-time-order correlation functions, as we demonstrate. By way of example, we show that our protocols can be practically implemented in superconducting qubit-based platforms, as well as in cavity-QED trapped ultra-cold gases.

21.
arXiv (quant-ph) 2026-06-16

Complete entanglement detection using polynomial invariants

arXiv:2606.16712v1 Announce Type: new Abstract: Existing methods for deciding whether a bipartite quantum state is separable or entangled typically fall into one of two categories: they are either complete but require access to an explicit density matrix followed by numerical optimization, or they can be evaluated directly by measuring the quantum system but are incomplete, in the sense that they cannot detect all forms of entanglement. In this work, we overcome both limitations in a unified framework. First, we bypass numerical optimization by deriving separability criteria in the form of universal bounds on tensor powers of separable states. We prove that these bounds are complete: every entangled state violates them for sufficiently large tensor powers. Second, we explicitly construct a corresponding complete family of nonlinear entanglement witnesses, which can detect all forms of entanglement without requiring an explicit density matrix. The witnesses we construct are moreover basis-independent, in the sense that they are invariant under conjugation by local unitaries. Altogether, our results expand the toolbox for entanglement detection in arbitrary local dimensions in a manifestly invariant way.

22.
bioRxiv (Bioinfo) 2026-06-12

PHI-Reason: evidence-grounded species-level phage-host prediction from structured biological text profiles

Phage–host interaction (PHI) prediction is a fundamental problem in microbiology with applications in microbial ecology and microbiome engineering. Existing computational approaches typically convert phage and host information into numerical representations derived from sequence similarity, protein content, genome composition or reference databases, then score candidate hosts or train host-prediction models. Although effective, such representations often make it difficult to inspect which biological evidence supports a prediction. Here, we present PHI-Reason, a species-level PHI prediction framework that reformulates host prediction as constrained biological text reasoning. Instead of embedding phages and hosts directly as numerical vectors, PHI-Reason converts heterogeneous PHI-related evidence from phage genomes, host genomes, functional annotations, homology searches and biological metadata into modular natural-language profiles. A frozen large language model then performs species-level candidate-host ranking or pairwise PHI assessment by integrating the supplied evidence at inference time. Across species-level benchmarks, PHI-Reason achieved competitive host-prediction performance and recovered complementary correct assignments relative to established sequence- and reference-based methods. Its explicit profile design enabled systematic evidence perturbation and rationale-grounding analyses, showing that predictions depend on coherent multi-source biological evidence and that hallucination risk from unsupported or incomplete profiles can be made operationally measurable. These results position PHI-Reason as a constrained evidence-integration framework for species-level PHI prediction. Rather than replacing sequence-based predictors, it provides an interpretable layer that shows how far explicit biological evidence can support host inference, and where that evidence falls short.

23.
arXiv (CS.CV) 2026-06-16

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

24.
arXiv (CS.AI) 2026-06-16

Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

arXiv:2606.15623v1 Announce Type: cross Abstract: Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons ($O(n^2)$). While sorting-based methods have reduced this burden to $O(n\log n)$, they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a question prioritizer to identify which comparisons genuinely require human judgment. The proposed Surprise-Guided MergeSort (SGS) framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer – combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy – to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall's $\tau{\times}100$ improvements of $+6$ to $+12$ over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.

25.
arXiv (CS.CV) 2026-06-18

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garment during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.