Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

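AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.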

01.
arXiv (CS.CV) 2026-06-16

CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Although this is a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures and supports plug-and-play integration of newer models. Through this rigorous and fair protocol, we establish comprehensive state-of-the-art (SoTA) baselines across all dimensions to guide future research. Our results also uncover several limitations of current generative models: first, even SoTA models struggle with long-tailed medical distributions; second, models pose high privacy risks regardless of fidelity quality; and third, while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing on these results, we propose concrete research directions to advance the field. The code is available at https://github.com/Raman1121/CheXGenBench

02.
arXiv (CS.AI) 2026-06-18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv:2606.18322v1 · Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal-steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. In the safety-critical refusal-steering setting in particular, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift at 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
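To make "clamping a feature", "reconstruction residual", and "encoder-orthogonal update" concrete, here is a minimal NumPy sketch of a ReLU SAE acting on a single residual vector. The weights are random, all names (`W_enc`, `clamp_feature`, `encoder_orthogonal`, and so on) are hypothetical, and passing the SAE error term through unchanged during a clamp is an assumption of this sketch, not a description of the authors' code. Only the single-layer case is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64

# Hypothetical SAE parameters (random, for illustration only).
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

def encode(x):
    """ReLU SAE encoder: feature activations for residual vector x."""
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(f):
    """Linear SAE decoder: reconstruct a residual vector from features f."""
    return W_dec @ f + b_dec

def clamp_feature(x, k, value=0.0):
    """Set feature k to `value` and rebuild the residual vector.

    The reconstruction residual x - decode(encode(x)), i.e. the part the
    SAE does not explain, is assumed here to pass through unchanged.
    """
    f = encode(x)
    error = x - decode(f)
    f_clamped = f.copy()
    f_clamped[k] = value
    return decode(f_clamped) + error

def encoder_orthogonal(delta, k):
    """Remove the component of delta along encoder row k.

    The projected perturbation leaves feature k's pre-activation, and hence
    its activation, unchanged.
    """
    w = W_enc[k]
    return delta - (w @ delta) / (w @ w) * w
```

Any perturbation passed through `encoder_orthogonal` leaves the targeted feature's value untouched, which is the sense in which a recovered behavior cannot come from simply undoing the clamp. The cross-layer variant, which the abstract says uses a feature-map Jacobian instead, is not shown.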

03.
arXiv (CS.LG) 2026-06-16

The Data Manifold under the Microscope

arXiv:2606.15760v1 · A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies, namely assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a $\beta$-VAE, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.
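Why dense, axis-aligned sampling makes finite differences attractive can be seen in the simplest case: a one-dimensional curve sampled on a uniform parameter grid. The sketch below estimates curvature with central differences using the dimension-free formula κ = sqrt(|x'|²|x''|² − (x'·x'')²) / |x'|³. It is a toy stand-in, not the paper's estimators, which also target reach and volume on higher-dimensional image manifolds.

```python
import numpy as np

def curve_curvature(points, h):
    """Curvature of a curve sampled at uniform parameter spacing h.

    points: array of shape (n, D) holding x(t_i) with t_i = t_0 + i * h.
    Uses second-order central differences; returns values at interior samples.
    """
    d1 = (points[2:] - points[:-2]) / (2 * h)                    # x'(t)
    d2 = (points[2:] - 2 * points[1:-1] + points[:-2]) / h**2     # x''(t)
    n1 = np.sum(d1 * d1, axis=1)
    n2 = np.sum(d2 * d2, axis=1)
    dot = np.sum(d1 * d2, axis=1)
    # Valid in any ambient dimension D (no cross product needed).
    return np.sqrt(np.maximum(n1 * n2 - dot**2, 0.0)) / n1**1.5

# A circle of radius 2 embedded in 5-D space: the true curvature is 0.5.
t = np.linspace(0, 2 * np.pi, 2001)
h = t[1] - t[0]
circle = np.zeros((t.size, 5))
circle[:, 0], circle[:, 1] = 2 * np.cos(t), 2 * np.sin(t)
kappa = curve_curvature(circle, h)
```

With this grid the discretization error is of order h², so the estimate sits very close to 1/r everywhere; on real data the difficulty lies in obtaining such a controlled grid, which is what the repurposed datasets provide.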

04.
arXiv (quant-ph) 2026-06-12

Beyond the Unruh vacuum: multi-time correlations in black hole collapse and evaporation

arXiv:2606.13383v1 · The black hole information paradox originates from the thermal character of Hawking radiation, which appears to erase information about the collapsing matter. However, thermality constrains only observables defined at a single time and leaves the structure of temporal quantum correlations largely unexplored. Here we show that multi-time quantum-field correlations provide a concrete mechanism for the survival of pre-collapse information in black hole evaporation. Using a two-dimensional model of gravitational collapse and evaporation, we demonstrate that late-time multi-time correlations are not fully reproduced by the Unruh vacuum. In particular, they contain a contribution that depends explicitly on parameters characterizing the pre-collapse state, despite the thermal character of the asymptotic radiation. Our results identify measurable multi-time correlations as carriers of information in Hawking radiation and suggest that formulations of the black hole information paradox based solely on single-time observables are incomplete.

05.
arXiv (CS.CL) 2026-06-16

Context Compression Is Not One Thing: Readable Symbolic Re-expression vs. Coherent Summary at Matched Budget

We study context compression for multi-hop question answering with small language models. We propose Telegraph English, a readable symbolic format that rewrites retrieved passages into structured entity-relation statements, preserving reasoning evidence at lower token cost. In controlled experiments on MuSiQue, TwoWiki, and HotpotQA, Telegraph English outperforms three matched-budget compression baselines (character-level deletion, truncation, and random sub-sampling) on every dataset, with gains of 13 to 20 F1 percentage points. It also outperforms a coherent prose summary produced by the same encoder on the hardest dataset. A pre-registered depth-interaction hypothesis is null: the advantage does not grow with reasoning depth within datasets. We interpret these results as evidence that readable symbolic re-expression preserves entity content more densely than either natural language or coherent summarization at matched token budget.
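The abstract does not say which F1 variant it reports; multi-hop QA benchmarks commonly use SQuAD-style token-overlap F1 between the predicted and gold answers. Assuming that metric, a minimal sketch looks like this (the normalization follows the usual lowercase, strip-punctuation, drop-articles convention).

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level F1 is then the mean over questions, usually scaled to 0 to 100, so a gain of 13 percentage points is an absolute difference in that mean (for example 40.0 to 53.0), not a relative one.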

06.
arXiv (quant-ph) 2026-06-19

Operational Tube-Sector Theory of Quantum State Distinguishability Under Generalized Symmetries

arXiv:2606.19678v1 · A variational principle for quantum-state distinguishability is established in many-body systems with generalized symmetries, including noninvertible cases described by fusion categories. Standard fidelity and symmetry-resolved diagnostics emerge as coarse-grained limits of a more refined operational structure. When symmetry actions terminate at entanglement cuts, distinguishability is governed by boundary tube algebras within a symmetry-constrained measurement resource theory. The physically admissible instruments are characterized by complete positivity, entanglement-cut locality, boundary-module covariance, and sequential stability. The resulting optimal measurement structure is uniquely fixed by the center of the boundary tube algebra, $\mathcal{A}_{\mathrm{phys}} = Z\!\left(\mathrm{Tube}_{\mathcal{C}}(\mathcal{M}_A)\right)$, whose primitive idempotents define tube-sector probabilities that refine fidelity-based and symmetry-resolved descriptions. The associated tube positive-operator-valued measures (POVMs) are extremal and yield optimal one-shot hypothesis-testing distinguishability under symmetry constraints. The construction is universal across fusion categories and independent of microscopic realization.

07.
arXiv (CS.CV) 2026-06-17

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Group Relative Policy Optimization (GRPO) has emerged as an essential technique for aligning video diffusion models with human preferences, but it faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU-days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but this fundamentally compromises optimization, causing severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges with two components: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; and temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on models from 1.3B to 14B parameters validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
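GRPO scores each sample by normalizing its reward against the other samples in its group. One plausible reading of "iso-temporal grouping" is that every member of a group shares both the prompt and the sampled training timestep, so rewards are never compared across timesteps of different difficulty. The sketch below implements that reading; keying groups by `(prompt, timestep)` and the `eps` value are assumptions, and temporal gradient rectification is not shown.

```python
import numpy as np
from collections import defaultdict

def group_relative_advantages(prompt_ids, timesteps, rewards, eps=1e-6):
    """GRPO-style advantages normalized within (prompt, timestep) groups.

    Grouping on the shared timestep as well as the prompt is a hypothetical
    reading of iso-temporal grouping: each sample is compared only with
    samples of the same prompt trained at the same timestep.
    """
    rewards = np.asarray(rewards, dtype=float)
    groups = defaultdict(list)
    for i, key in enumerate(zip(prompt_ids, timesteps)):
        groups[key].append(i)
    adv = np.empty_like(rewards)
    for idx in groups.values():
        r = rewards[idx]
        # Zero-mean, unit-variance within the group; eps guards equal rewards.
        adv[idx] = (r - r.mean()) / (r.std() + eps)
    return adv
```

A group whose members all receive the same reward contributes zero advantage, so it carries no learning signal in this formulation.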

08.
medRxiv (Medicine) 2026-06-17

Long-term mortality and cause-specific death after non-cardiac chest pain: a multicentre cohort study of 160,245 patients in China

Background: Non-cardiac chest pain (NCCP) is commonly regarded as a low-risk condition. However, long-term mortality, cause-specific death, and high-risk subgroup characteristics remain poorly defined.

Methods: In this multicentre registry-linked cohort study, we linked the Chest Pain Center Registry from 101 hospitals in Hunan, China, with the Mortality and Cause of Death Registry. Adults diagnosed with NCCP from Jan 1, 2017, to Dec 31, 2021, were included. We assessed 3-year all-cause, cardiovascular, and non-cardiovascular mortality using Cox, restricted cubic spline, and Fine-Gray models.

Findings: Among 160,245 patients, 4674 deaths occurred within 3 years (2.9%). Mortality increased sharply after age 60.5 years. Age ≥60.5 years (adjusted hazard ratio [aHR] 7.49 [95% CI 6.89-8.14]), rural residence (time-varying aHR 1.46 [1.35-1.57] in year 1 and 1.66 [1.46-1.89] in years 1-3), and male sex (aHR 1.47 [1.38-1.57]) independently predicted death. Three-year mortality ranged from 0.3% in younger urban women to 8.4% in older rural men. Cardiovascular diseases accounted for 56.4% of deaths among older patients, whereas other non-cardiovascular causes (22.8%) and malignancy (20.8%) were the largest categories among younger decedents.

Interpretation: NCCP is not uniformly benign. Age, rural residence, and sex identify patients who could benefit from risk-stratified follow-up, with cardiovascular prevention prioritised for older rural men and broader non-cardiovascular assessment considered for younger patients.
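The crude 3-year mortality quoted in the Findings follows directly from the two counts:

```latex
\frac{4674}{160\,245} \approx 0.0292 \approx 2.9\%
```

The adjusted hazard ratios, by contrast, come from the multivariable models and cannot be reproduced from these totals alone.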

09.
arXiv (quant-ph) 2026-06-16

Noise-induced shallow circuits and absence of barren plateaus

arXiv:2403.13927v3 · Motivated by realistic hardware considerations of the pre-fault-tolerant era, we comprehensively study the impact of uncorrected noise on quantum circuits. We first show that in the task of estimating observable expectation values any noise truncates most quantum circuits to effectively logarithmic depth. We then prove that quantum circuits under any non-unital noise do not exhibit barren plateaus for cost functions composed of local observables. However, by using the effective shallowness, we also design an efficient classical algorithm to estimate observable expectation values within any constant additive accuracy, with high probability over the choice of the circuit, in any circuit architecture. Taken together, our results establish that, unless we carefully engineer quantum circuits to take advantage of the noise, noisy quantum circuits are unlikely to offer an advantage over shallow ones for algorithms that output observable expectation value estimates, such as many variational quantum machine learning proposals.
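A back-of-the-envelope model conveys where "effectively logarithmic depth" comes from. Suppose, as a simplification and not as the paper's theorem (which covers arbitrary noise), that each noisy layer damps the influence of everything before it by a factor (1 − p), as global depolarizing noise with rate p would. Layers more than L steps before the measurement then contribute at most (1 − p)^L, and requiring this to be at most ε gives L ≥ ln(1/ε) / (−ln(1 − p)) ≈ ln(1/ε) / p. The helper name below is hypothetical.

```python
import math

def effective_depth(p, eps):
    """Smallest L with (1 - p) ** L <= eps in a toy damping-per-layer model.

    p: noise rate per layer (0 < p < 1); eps: target additive accuracy.
    In this toy model, layers further back than L change the output by at
    most eps, however deep the full circuit is.
    """
    return math.ceil(math.log(1 / eps) / -math.log1p(-p))
```

For p = 0.01, tightening the accuracy from ε = 10⁻³ to ε = 10⁻⁶ only roughly doubles the number of layers that matter (688 versus 1375), illustrating the logarithmic dependence on 1/ε.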

10.
arXiv (CS.AI) 2026-06-16

Agentic Framework for Deep Learning workload migration via In-Context Learning

arXiv:2606.15994v1 · Translating deep learning models from PyTorch's flexible, object-oriented design to JAX's functional, stateless setup is usually a manual and error-prone task. Automated migration is challenging because Large Language Models (LLMs) struggle with strict and dynamic API alignment and are prone to mistakes in exacting operations. We propose a fully autonomous system that combines In-Context Learning (ICL) with oracle-driven self-debugging. First, we curate an ICL context that serves as a strict reference for idiomatic JAX styling and test-case generation. Second, instead of relying on the LLM to deduce mathematical outputs, we run the source PyTorch modules to obtain their actual dynamic tensor states, creating an immutable execution oracle. We then use an autonomous agentic loop to synthesize tests based on the oracle data. The test cases are executed repeatedly, and each traceback is sent back to the LLM for self-correction. Ablations show that combining ICL references with oracle grounding and self-debugging greatly outperforms pure instructional and basic agentic baselines, without adding excessive computational overhead. Our lightweight pipeline achieves 91% numerical equivalence on neural modules (compared to 9% for the baseline and 27% for instruction + self-debugging), providing a highly reliable, scalable blueprint for cross-framework migration. This has been validated across several state-of-the-art models, including SAM (Segment Anything), T5, and Code Whisper, among others, with high numerical equivalence. Code: https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode
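The oracle idea can be isolated from any particular framework: run the source implementation once, freeze its outputs, and judge every candidate translation against those frozen values, returning a traceback or mismatch report as feedback. The sketch below uses plain NumPy functions as stand-ins for the PyTorch module and its JAX translation; all function names and the tolerances are hypothetical, and the LLM call that would consume the feedback is omitted.

```python
import traceback
import numpy as np

def record_oracle(reference_fn, inputs):
    """Run the source implementation once and freeze its outputs."""
    return [(x, np.asarray(reference_fn(x)).copy()) for x in inputs]

def check_against_oracle(candidate_fn, oracle, rtol=1e-5, atol=1e-6):
    """Return None if the candidate matches the oracle, else feedback text."""
    for i, (x, expected) in enumerate(oracle):
        try:
            got = np.asarray(candidate_fn(x))
        except Exception:
            return f"case {i}: exception\n{traceback.format_exc()}"
        if got.shape != expected.shape:
            return f"case {i}: shape {got.shape} != expected {expected.shape}"
        if not np.allclose(got, expected, rtol=rtol, atol=atol):
            err = np.max(np.abs(got - expected))
            return f"case {i}: max abs error {err:.3e}"
    return None

# Stand-ins: a "source" layer norm and a subtly wrong translation.
def reference_layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)              # biased variance
    return (x - mu) / np.sqrt(var + eps)

def buggy_layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, ddof=1, keepdims=True)      # bug: unbiased variance
    return (x - mu) / np.sqrt(var + eps)

inputs = [np.random.default_rng(s).normal(size=(4, 8)) for s in range(3)]
oracle = record_oracle(reference_layer_norm, inputs)
```

In a loop, a non-`None` result would be appended to the model's context and the candidate regenerated; the oracle itself never changes, which is what keeps the check independent of the LLM's own beliefs about the math.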

11.
arXiv (CS.CV) 2026-06-19

Does Head Pose Correction Improve Biometric Facial Recognition?

Biometric facial recognition models often show significant drops in accuracy when processing real-world images, which are frequently characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale forensic evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

12.
arXiv (CS.LG) 2026-06-17

Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds

arXiv:2606.18218v1 · We study finite-horizon queue peaks in generalized switches, a standard stochastic-network model in which many queues share constrained service resources. Arrivals may be dependent, time-varying, and adapted to the past; the standing load condition is uniform interior slack, meaning the conditional mean arrival vector stays in a fixed contraction of the capacity region. We show that this slack reshapes the finite-time peak law for drift-minimizing scheduling policies such as MaxWeight. The square-root envelope that is sharp without slack persists only up to a geometry-dependent threshold; beyond that threshold, the running maximum grows only logarithmically with the horizon, both with high probability and in expectation. The mechanism is self-normalization: in the current queue direction, the projected fluctuation scale is normalized by the stabilizing drift scale. This removes capacity geometry from the logarithmic coefficient, while geometry remains in the threshold. Matching lower bounds show that both the logarithmic term and a geometric threshold are unavoidable. When finite-time state-space collapse is available, the threshold can be sharpened using local bottleneck geometry. For generalized input-queued switches, we obtain finite-time peak bounds with tight logarithmic coefficients. Simulations illustrate the two-phase envelope, local geometric refinements, and variance-sensitive improvements predicted by the theory.
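MaxWeight, the drift-minimizing policy named in the abstract, serves the feasible schedule s that maximizes Σᵢ qᵢ sᵢ. The sketch below simulates it on a 2×2 input-queued switch with four virtual output queues, whose feasible schedules are the two perfect matchings, and records the running maximum of the largest queue. The i.i.d. Bernoulli arrivals at rate 0.4 per queue (load 0.8 per port, leaving interior slack) and the horizon are illustrative assumptions; the paper's setting allows dependent, adapted arrivals, and this toy makes no claim about its constants.

```python
import numpy as np

# VOQ order: (in0->out0), (in0->out1), (in1->out0), (in1->out1).
# Feasible service vectors of a 2x2 switch: the two perfect matchings.
SCHEDULES = np.array([[1, 0, 0, 1],
                      [0, 1, 1, 0]])

def maxweight(queues):
    """Pick the schedule maximizing sum_i q_i * s_i (ties go to the first)."""
    return SCHEDULES[np.argmax(SCHEDULES @ queues)]

def simulate(rates, horizon, seed=0):
    """Running maximum of the largest queue under MaxWeight scheduling."""
    rng = np.random.default_rng(seed)
    q = np.zeros(4, dtype=int)
    running_max = np.empty(horizon, dtype=int)
    peak = 0
    for t in range(horizon):
        s = maxweight(q)
        # Serve first, then add this slot's Bernoulli arrivals.
        q = np.maximum(q - s, 0) + (rng.random(4) < rates)
        peak = max(peak, int(q.max()))
        running_max[t] = peak
    return running_max

rm = simulate([0.4] * 4, 10_000)
```

Comparing `rm` across horizons and loads is one way to look at how the peak grows empirically; the logarithmic-versus-square-root distinction in the paper concerns how this running maximum scales with the horizon.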

13.
arXiv (CS.CL) 2026-06-17

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.
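One simple way to make word error rate script-aware is to let each reference position hold a set of acceptable orthographic variants and to count a substitution only when the hypothesis word matches none of them. The sketch below does this inside a standard Levenshtein alignment. It is an illustration of the idea, not necessarily MultiClin's exact protocol, and the Latin/Hangul variant pair in the example is hypothetical.

```python
def multiscript_wer(reference, hypothesis):
    """Word error rate where each reference slot is a set of valid spellings.

    reference: list of sets of acceptable forms, one set per reference word.
    hypothesis: list of recognized words.
    """
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if hypothesis[j - 1] in reference[i - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match or substitution
    return dp[n][m] / max(n, 1)

hyp = ["patient", "takes", "아스피린"]
multi_ref = [{"patient"}, {"takes"}, {"aspirin", "아스피린"}]
single_ref = [{"patient"}, {"takes"}, {"aspirin"}]
```

Scoring `hyp` against `single_ref` counts the Hangul spelling as an error (WER 1/3), whereas `multi_ref` accepts it (WER 0), which is the kind of underestimation the abstract describes.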

14.
arXiv (CS.AI) 2026-06-17

Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management

arXiv:2606.17203v1 · Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over shared software artifacts, errors and low-confidence decisions made by upstream agents propagate to downstream stages, producing orphaned requirements, contradictory links, and compliance gaps that pose significant risks in safety-critical domains. We propose a trust-aware coordination framework where a shared knowledge graph serves as both centralized semantic memory and a coordination surface through which agents assess and build upon each other's contributions using calibrated confidence scores. Our approach introduces a two-stage traceability link prediction pipeline combining embedding-based retrieval with LLM-based multi-criteria analysis, a traceability seeding mechanism that enables comparison between derivation-time and validation-time confidence, and a consistency protocol governing pipeline interactions through confidence threshold gating, confidence divergence detection, and conflict resolution. We evaluate on an automotive software engineering case study measuring link prediction calibration, protocol effectiveness, threshold sensitivity, and the impact of traceability seeding. Ablation studies confirm that confidence calibration is essential for effective pipeline coordination.
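Threshold gating and divergence detection can be pictured as a small decision rule applied to each proposed trace link, using the derivation-time and validation-time confidences that traceability seeding makes comparable. The sketch below is hypothetical throughout: the `TraceLink` fields, the three thresholds, and the rule that large divergence forces human review are assumptions, and the paper's conflict-resolution step is not modeled.

```python
from dataclasses import dataclass

@dataclass
class TraceLink:
    source: str
    target: str
    derivation_conf: float   # confidence when the link was seeded or derived
    validation_conf: float   # confidence assigned by the validating agent

def gate(link, accept=0.8, reject=0.3, max_divergence=0.25):
    """Return 'accept', 'reject', or 'review' for a proposed trace link.

    Hypothetical policy: a large disagreement between derivation-time and
    validation-time confidence is flagged before any threshold is applied.
    """
    if abs(link.derivation_conf - link.validation_conf) > max_divergence:
        return "review"
    if link.validation_conf >= accept:
        return "accept"
    if link.validation_conf < reject:
        return "reject"
    return "review"
```

Checking divergence first reflects the concern in the abstract: a downstream agent that confidently contradicts an upstream one is exactly the case that should not be silently accepted or rejected.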

15.
arXiv (CS.AI) 2026-06-18

A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors

arXiv:2606.19026v1 · Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe their performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three forecast variables, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

16.
arXiv (quant-ph) 2026-06-11

Wigner Cat Phases: A finely tunable system for exploring the transition to quantum chaos

arXiv:2512.22169v4 · A quantum mechanical setting consisting of a frozen qubit composed with a fully thermalized chaotic system of N states is proposed, with potential relevance to quantum control. Observing the states of the composed system and selectively retaining them leads to novel localization in the subsystem. At a tuning parameter of 1.0, implying no selection, the system exhibits Wigner-Dyson level spacing statistics, indicative of quantum chaos. As the tuning parameter is reduced and selection occurs at a cutoff, the nearest-neighbor level spacing distribution develops heavier tails, a signature of suppressed spectral mixing and the emergence of non-thermal dynamics. In these regimes, the eigendensity develops a pronounced "cat-ears" structure, reflecting the formation of spatially localized bimodal eigenstates. These topological features persist without transitioning to Poisson statistics, indicating a transition from quantum chaos to a novel, non-thermal many-body localized (MBL) regime, referred to as Wigner Cat Phases. The proposed mixed random matrix ensemble offers a practical probe for sustaining this novel quantum localization setting. Results from our rigorous spectral statistics analysis show how "cat-ears" form in spectral densities depending on the degree of selection or disorder, and indicate that gap ratio statistics must be used with caution in detecting the full integrable limit due to the possibility of heavy-tailed Wigner-Dyson distributions.
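The gap-ratio statistic mentioned at the end of the abstract is r_n = min(s_n, s_{n+1}) / max(s_n, s_{n+1}), where s_n are consecutive level spacings; it needs no spectral unfolding. Its mean is about 0.5307 for the Gaussian orthogonal ensemble (Wigner-Dyson) and 2 ln 2 − 1 ≈ 0.3863 for uncorrelated (Poisson) levels. The sketch below computes it for a sampled GOE matrix and for independent uniform levels; it does not reproduce the paper's mixed ensemble, and the matrix size and seed are arbitrary.

```python
import numpy as np

def mean_gap_ratio(levels):
    """Mean of r_n = min(s_n, s_{n+1}) / max(s_n, s_{n+1}) over sorted levels."""
    s = np.diff(np.sort(levels))
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()

rng = np.random.default_rng(1)
n = 1200
a = rng.normal(size=(n, n))
goe = (a + a.T) / 2                        # real symmetric matrix from the GOE
goe_levels = np.linalg.eigvalsh(goe)
bulk = goe_levels[n // 4: 3 * n // 4]      # central half, away from the edges
poisson_levels = rng.uniform(size=n)       # uncorrelated levels

r_goe = mean_gap_ratio(bulk)               # expected near 0.53 (Wigner-Dyson)
r_poisson = mean_gap_ratio(poisson_levels) # expected near 0.386 (Poisson)
```

Because ⟨r⟩ summarizes only nearest-neighbor correlations, a heavy-tailed spacing distribution can shift it without reaching the Poisson value, which is consistent with the abstract's caution about using it to detect the integrable limit.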

17.
arXiv (math.PR) 2026-06-15

On the Poisson Follower Model

arXiv:2309.04864v5 · We introduce a stochastic geometry dynamics inspired by opinion dynamics that captures the essence of modern asymmetric social networks with leaders and followers. Points in the Euclidean space represent opinions, and the leader of an agent is the one with the closest opinion. In this dynamics, each follower updates its opinion by halving the distance to its leader. We demonstrate that this simple dynamics and its iterations exhibit several interesting purely geometric phenomena related to the evolution of leadership and opinion clusters, which resemble those observed in social networks. We also show that when the initial opinions are randomly distributed as a stationary Poisson point process, the spatial frequency of each of these phenomena can be expressed through an integral geometry formula involving semi-algebraic domains. Finally, we analyze numerically the limiting behavior of this follower dynamics. In the Poisson case, the agents fall into two categories: ultimate followers, who continue updating their opinions indefinitely, and ultimate leaders, who adopt a fixed opinion after a finite time. Spatial discrete event simulations support all our findings.
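The update rule can be written in a few lines: each agent finds its nearest other opinion (its leader) and moves halfway toward it. The sketch assumes that all agents update simultaneously from the pre-update positions and that mutual nearest neighbours both move (so they meet at their midpoint); these are reading choices, not details confirmed by the abstract. A helper samples initial opinions from a homogeneous Poisson process on a square window.

```python
import numpy as np

def follower_step(x):
    """One synchronous update: each point moves halfway to its nearest neighbour.

    x: array of shape (n, d) of opinions. Ties are broken by lowest index.
    """
    diff = x[:, None, :] - x[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)          # an agent is not its own leader
    leader = np.argmin(dist, axis=1)
    return x + 0.5 * (x[leader] - x)

def poisson_points(intensity, side, rng):
    """Homogeneous Poisson point process on the square [0, side]^2."""
    count = rng.poisson(intensity * side**2)
    return rng.uniform(0, side, size=(count, 2))

pts = poisson_points(50, 2.0, np.random.default_rng(0))
pts_next = follower_step(pts)
```

Under these assumptions a mutual pair merges in one step and then stays fixed, while an agent following it halves its distance each step without ever arriving, which gives one concrete picture of the "ultimate leaders" and "ultimate followers" in the abstract.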

18.
arXiv (CS.LG) 2026-06-16

How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs

arXiv:2602.05779v2 · The Edge-of-Chaos (EoC) theory developed for the random initialization of deep networks allows more efficient training by both preserving information in the initial outputs of the network and minimising exploding or vanishing gradients through characterisation of the intermediate layers as Gaussian processes. This EoC theory provides formulae for the choice of the initialisation distribution variances of the weights and biases. For activations which are approximately linear around the origin, the EoC theory typically encourages the Gaussian process variance to converge towards zero with increasing depth. Here we consider the less-studied setting of highly sparsity-inducing activations, where a large region of values near the origin is set to zero. In this setting we prove a new phenomenon whereby initialisations leading to larger fixed Gaussian processes are beneficial to training stability. This theory informs a new, yet simple, initialisation strategy that allows training DNNs and CNNs with as much as 90% sparsity in the hidden layers.
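In the mean-field analysis behind EoC theory, the pre-activation variance evolves layer to layer by the map q_{l+1} = σ_w² E_{z∼N(0,1)}[φ(√q_l z)²] + σ_b², whose fixed point q* the abstract refers to. The sketch evaluates this map with Gauss-Hermite quadrature and iterates it to q*. Soft thresholding, which zeroes the band |x| ≤ τ, stands in as one example of a sparsity-inducing activation; the paper's specific activations and the parameter values here are assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Probabilists' Gauss-Hermite rule integrates f(z) * exp(-z^2 / 2).
NODES, WEIGHTS = hermegauss(101)
WEIGHTS = WEIGHTS / np.sqrt(2 * np.pi)      # now sum(WEIGHTS * f) ~ E[f(z)]

def variance_map(q, phi, sigma_w2, sigma_b2):
    """q_{l+1} = sigma_w^2 * E[phi(sqrt(q) z)^2] + sigma_b^2, z ~ N(0, 1)."""
    return sigma_w2 * np.sum(WEIGHTS * phi(np.sqrt(q) * NODES) ** 2) + sigma_b2

def fixed_point(phi, sigma_w2, sigma_b2, q0=1.0, iters=500):
    """Iterate the variance map from q0 toward its fixed point q*."""
    q = q0
    for _ in range(iters):
        q = variance_map(q, phi, sigma_w2, sigma_b2)
    return q

def relu(x):
    return np.maximum(x, 0.0)

def soft_threshold(x, tau=1.0):
    """Zeroes the band |x| <= tau: a simple sparsity-inducing activation."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

q_star = fixed_point(soft_threshold, sigma_w2=1.0, sigma_b2=0.5)
q_star_larger_bias = fixed_point(soft_threshold, sigma_w2=1.0, sigma_b2=1.0)
```

As a sanity check, ReLU with σ_w² = 2 and σ_b² = 0 (He initialization) makes the map the identity, preserving any q. Raising σ_b² moves the soft-threshold fixed point up, which is the kind of knob an initialization targeting larger fixed-point variances would turn.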

19.
arXiv (CS.LG) 2026-06-17

INI-VPINN: A Variational Physics-Informed Neural Network with Implicit Neumann and Interface Handling for Multi-Material Domains with Geometric Singularities

arXiv:2606.18032v1 · We propose a new weak-form Physics-Informed Neural Network (PINN) approach, named INI-VPINN, which naturally incorporates Neumann boundary and interface conditions into the variational formulation, removing the need for additional loss terms or multiple subdomain networks. The framework employs compactly supported weighting functions and integration by parts to impose flux and continuity constraints implicitly, thereby ensuring physical consistency across material boundaries. The proposed method is tested on Poisson and Laplace problems with sharp interfaces and complex geometries. Results show that, compared with several other PINN-based formulations, INI-VPINN consistently achieves higher accuracy and smoother, faster convergence. The proposed framework provides a general approach for solving multi-material problems with complex geometries and mixed Neumann-Dirichlet boundary conditions using neural networks. The implementation is publicly available in a GitHub repository.
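Why integration by parts lets Neumann and interface flux conditions enter "implicitly" can be seen on a model diffusion problem with piecewise conductivity k = k_m on two materials Ω₁, Ω₂ separated by an interface Γ_I, with Dirichlet data on Γ_D and Neumann flux g on Γ_N. This is the standard Galerkin argument, assumed here as background; the paper's specific compactly supported weighting functions are not reproduced.

```latex
% Strong form on each material subdomain \Omega_m (m = 1, 2):
-\nabla\cdot(k_m \nabla u) = f \ \text{in } \Omega_m,
\qquad u = u_D \ \text{on } \Gamma_D,
\qquad k\,\partial_n u = g \ \text{on } \Gamma_N .

% Multiply by a test function v with v = 0 on \Gamma_D, integrate over each
% \Omega_m by parts, and sum. On \Gamma_I, n points from \Omega_1 into \Omega_2
% and [\![ k\,\partial_n u ]\!] = k_1 \nabla u_1 \cdot n - k_2 \nabla u_2 \cdot n:
\sum_{m=1}^{2} \int_{\Omega_m} k_m \nabla u \cdot \nabla v \, dx
  - \int_{\Gamma_N} g\, v \, ds
  - \int_{\Gamma_I} [\![ k\,\partial_n u ]\!]\, v \, ds
  = \int_{\Omega} f\, v \, dx .

% With flux continuity [\![ k\,\partial_n u ]\!] = 0, a single weak form remains:
\int_{\Omega} k \nabla u \cdot \nabla v \, dx
  = \int_{\Omega} f\, v \, dx + \int_{\Gamma_N} g\, v \, ds .
```

Read in reverse, requiring the last identity for every admissible v (continuous across Γ_I and zero on Γ_D) enforces both the Neumann condition and flux continuity weakly, which is why no extra loss terms are needed for them; only the Dirichlet condition still has to be imposed separately, for example through the trial function.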

20.
arXiv (CS.AI) 2026-06-15

Patcher: Post-Hoc Patching of Backdoored Large Language Models

arXiv:2606.02995v2 · Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.
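The abstract does not specify its adaptive clustering, so the sketch below swaps in a simple, clearly different stand-in to show the shape of the step: given one saliency score per prompt token, find the optimal one-dimensional two-cluster split (exact 2-means on sorted scores) and return the tokens in the high-saliency cluster as trigger candidates. The gradient computation that would produce the scores is omitted, and the example scores are made up.

```python
import numpy as np

def split_high_saliency(scores):
    """Token indices in the high cluster of an optimal 1-D two-means split.

    Tries every cut of the sorted scores and keeps the one with the smallest
    total within-cluster sum of squares. Requires at least two scores.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    s = scores[order]
    best_cut, best_cost = 1, np.inf
    for cut in range(1, len(s)):
        lo, hi = s[:cut], s[cut:]
        cost = ((lo - lo.mean()) ** 2).sum() + ((hi - hi.mean()) ** 2).sum()
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return np.sort(order[best_cut:])

# Made-up per-token saliency scores: tokens 2 and 4 stand out.
example_scores = [0.10, 0.12, 0.95, 0.09, 0.90, 0.11]
```

An exhaustive split is cheap here because prompts are short; a data-driven cut like this avoids fixing a saliency threshold in advance, which is presumably part of what "adaptive" buys in the real method.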

21.
arXiv (CS.CV) 2026-06-16

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present K-Prism, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) semantic priors learned from annotated datasets, (ii) in-context knowledge from few-shot reference examples, and (iii) interactive feedback from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining what to segment and 2-D dense prompts indicating where to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.
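"Dynamically routed through a Mixture-of-Experts decoder" usually means a learned gate scores every expert for each input and only the top-k experts run, with their outputs mixed by softmax weights. The sketch below shows generic top-k routing for a single embedding; the expert count, k, dimensions, and random linear experts are hypothetical, and K-Prism's actual decoder design is not described in the abstract.

```python
import numpy as np

def moe_route(x, gate_w, experts, k=2):
    """Route embedding x through the top-k experts chosen by a linear gate.

    gate_w: (num_experts, d) gating matrix; experts: list of callables d -> d.
    Returns the gate-weighted sum of the selected experts' outputs, the
    selected expert indices, and their mixing weights.
    """
    logits = gate_w @ x
    top = np.argsort(logits)[-k:]                # indices of the k largest logits
    w = np.exp(logits[top] - logits[top].max())
    w = w / w.sum()                              # softmax over selected experts only
    out = sum(wi * experts[i](x) for wi, i in zip(w, top))
    return out, top, w

rng = np.random.default_rng(0)
d, num_experts = 8, 4
gate_w = rng.normal(size=(num_experts, d))
# Each hypothetical expert is a fixed random linear map (bound via default arg).
experts = [lambda v, M=rng.normal(size=(d, d)): M @ v for _ in range(num_experts)]
x = rng.normal(size=d)
out, top, w = moe_route(x, gate_w, experts, k=2)
```

Because only k experts execute per input, different prompt types can end up specializing different experts without changing the decoder's architecture, which matches the flexibility the abstract emphasizes.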

23.
arXiv (CS.CV) 2026-06-11

Semantic search for 100M+ galaxy images using AI-generated captions

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at https://github.com/NolanKoblischke/AION-Search
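Contrastive alignment of an image encoder with caption embeddings is commonly done with a CLIP-style symmetric InfoNCE loss, after which search reduces to cosine similarity between a text-query embedding and every image embedding. The sketch below assumes that standard setup; the loss form, temperature, and `top_k` default are assumptions rather than AION-Search's documented settings, and the encoders are replaced by random vectors.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th image should match the i-th caption."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()   # text -> image
    return (loss_i2t + loss_t2i) / 2

def search(query_emb, image_embs, top_k=100):
    """Indices of the top_k images by cosine similarity to a text query."""
    sims = l2_normalize(image_embs) @ (query_emb / np.linalg.norm(query_emb))
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 16))
txt_aligned = img + 0.01 * rng.normal(size=img.shape)
txt_shuffled = rng.permutation(txt_aligned)    # breaks the image-caption pairing
```

Training drives matched pairs toward the low-loss regime of `txt_aligned` and away from that of `txt_shuffled`; at query time only `search` is needed, which is what makes the approach scale to archives of 100M+ images once embeddings are precomputed.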

24.
arXiv (CS.CV) 2026-06-11

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success–cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. The project page is at jadee-dao.github.io/direct/.
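
DIRECT's router is described only at the level of allocating compute per prompt from scene context. A minimal sketch of one way such a router could trade off success against cost follows, assuming a per-prompt success estimate for each configuration is already available (how DIRECT produces those estimates is not shown) and a linear cost penalty. `ComputeConfig`, `route`, the configuration names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class ComputeConfig:
    """One hypothetical test-time compute setting (e.g. model size, CoT depth, memory length)."""
    name: str
    cost: float  # assumed average cost per call, e.g. latency in seconds


def route(
    predicted_success: Dict[str, float],
    configs: Sequence[ComputeConfig],
    cost_weight: float,
) -> ComputeConfig:
    """Pick the config with the best predicted success minus a linear cost penalty.

    `predicted_success` maps config names to success estimates for the current
    prompt; a real router would obtain these from scene context (not shown).
    """
    return max(configs, key=lambda c: predicted_success[c.name] - cost_weight * c.cost)


if __name__ == "__main__":
    configs = [ComputeConfig("small-no-cot", 1.0), ComputeConfig("large-long-cot", 4.0)]
    easy_scene = {"small-no-cot": 0.90, "large-long-cot": 0.92}  # hypothetical estimates
    hard_scene = {"small-no-cot": 0.30, "large-long-cot": 0.80}
    print(route(easy_scene, configs, cost_weight=0.05).name)  # prints small-no-cot
    print(route(hard_scene, configs, cost_weight=0.05).name)  # prints large-long-cot
```

Raising `cost_weight` shifts choices toward cheaper configurations; sweeping it gives a family of routers at different points on the success-cost trade-off, which is one way to compare against always using a single fixed model.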

25.
arXiv (quant-ph) 2026-06-12

Stable, bidirectional electro-optic transduction in thin film lithium tantalate

arXiv:2606.12726v1. Efficient and stable microwave-optical transduction is a key enabling technology for distributed superconducting quantum computing and heterogeneous quantum networks. Electro-optic transducers based on thin-film lithium niobate (TFLN) have shown strong promise, but demonstrations to date have been limited by factors such as low-frequency bias drift, low efficiency, fabrication complexity, and limited scalability. Here we demonstrate the first integrated electro-optic microwave-optical transducers realized in thin-film lithium tantalate (TFLT), a material platform offering Pockels nonlinearity comparable to TFLN together with improved bias stability and high-power handling. We fabricate superconducting microwave resonators coupled to tunable photonic-molecule optical resonators using wafer-scale deep-ultraviolet lithography, enabling high-throughput production of hundreds of devices per wafer. Across six devices we observe coherent bidirectional conversion between C-band optical photons and 4.9–5.5 GHz microwave photons, with measured on-chip efficiencies and inferred single-photon coupling rates g₀/2π ≈ 1 kHz consistent with theory. Continuous operation over multiple days is achieved using a static bias field with minimal feedback, a major operational advantage. We further characterize optical loss statistics, microwave resonator performance, and optically induced added noise under pulsed pumping, finding less than one added photon for 100-microsecond pulses at the highest measured efficiencies. These results establish TFLT as a scalable and robust electro-optic platform for future quantum interconnects and modular quantum processors.
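
Coupling rates are conventionally quoted as g₀/2π in ordinary frequency units. Converting the approximate value from the abstract to angular frequency is a one-line calculation, shown here for readers outside the field:

```latex
g_0 = 2\pi \times \frac{g_0}{2\pi} \approx 2\pi \times 1\,\mathrm{kHz} \approx 6.3 \times 10^{3}\,\mathrm{rad\,s^{-1}}
```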