Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (quant-ph) 2026-06-16

Witnessing Spin-Orbital Entanglement using Resonant Inelastic X-Ray Scattering

arXiv:2512.06718v2 Announce Type: replace Abstract: Entanglement plays a central role in quantum technologies, yet its characterization and control in materials remain challenging. Recent developments in spectrum-based entanglement witnesses have enabled new strategies for quantifying many-body entanglement in macroscopic materials. Here, we develop a protocol for detecting spin-orbital entanglement using experiment-accessible resonant inelastic x-ray scattering (RIXS). Central to our approach is the construction of a Hermitian generator from experimentally measurable spectra, which allows us to compute the quantum Fisher information (QFI) available in spin–orbital systems. The resulting QFI provides upper bounds for $k$-producible states and thus serves as a robust witness of spin-orbital entanglement. To account for realistic experimental limitations, we further extend our framework to include relaxed QFI bounds applicable to measurements lacking full polarization resolution.

02.
arXiv (CS.CV) 2026-06-16

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be available once accepted in link.

03.
arXiv (CS.CL) 2026-06-17

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost (\$28 for GPT-5.2 Batch API vs. \$316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

04.
arXiv (CS.AI) 2026-06-18

From Specification to Execution: AI Assisted Scientific Workflow Management

arXiv:2606.18425v1 Announce Type: cross Abstract: Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

05.
medRxiv (Medicine) 2026-06-12

Home-based binocular serious games in virtual reality to treat visual acuity and stereovision in residual amblyopia: AMBER study

Objectives: Amblyopia is a pediatric visual disorder traditionally treated by patching the fellow eye, though many patients retain residual amblyopia post-treatment. Increasing evidence suggests that visual plasticity allows treat-ment beyond the classical therapeutic window. AMBER evaluated the efficacy of binocular serious games in virtual reality (VR) in residual amblyopia. Methods and Analysis: The monocentric, prospective, randomized, crossover trial (reported as case series) includ-ed 14 anisometropic, strabismic, or mixed residual amblyopia patients (6-35 years; 5 children, 9 adults). Participants underwent two 2-month intervention phases: optical correction (standard care) and standard care plus VR games (2.5 h/week), each with a 2-month follow-up. Best-corrected visual acuity (BCVA), stereoacuity, and reading speed were assessed (5 timepoints) using the Sloan and Landolt charts, the Titmus, TNO, Lang II, Asteroid, and Mnread tests. Compliance and adverse events (AE) were recorded. Results: VR training improved BCVA in 10 amblyopic eyes (Landolt and Sloan), with more pronounced effects in anisometropic patients. Six patients showed improved stereoacuity (Titmus; 4x mixed, 1x anisometropic, 1x stra-bismic amblyopia), persistent only in children (1x strabismic, 1x mixed amblyopia). Four improvements were ob-served with TNO (1x), Lang II (1x), Asteroid (0x), and MNread (1x). Despite positive trends, when comparing re-sults of individual patients, between both eyes, and with standard treatment, consistency of improvements cannot be conclusively demonstrated. One non-severe AE (dizziness) was reported. Conclusions: Following individual cases, VR training improved BCVA and stereoacuity, particularly in children and patients with high compliance. However, considering the cohort as a whole, consistency of effects has to be confirmed in larger groups. Thus, the methodologically sophisticated AMBER study revealed differences in VR treatment efficacy between amblyopia types, children/adults, endpoints and tests, offering precious data for the design of meaningful future studies. It shows that neurovisual plasticity gauged by VR-games offers safe, engaging treatment options for residual amblyopia.

06.
arXiv (CS.CV) 2026-06-18

RUB: Evaluating Residual Knowledge in Unlearned Models

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

07.
arXiv (CS.CV) 2026-06-15

MUSE: Agentic 3D Scene Authoring via Memory-Grounded Incremental Requirement Satisfaction

Text-driven 3D scene generation is a promising technique for digital content creation, embodied AI simulation, and interactive design, yet practical workflows often require refining, extending, or correcting existing scenes while preserving non-target content. Existing methods can produce realistic and structurally plausible scenes, but they generally lack editability with requirement-level state tracking, so part-level failures often lead to full-scene regeneration or manual intervention. To tackle this challenge, we formulate controllable 3D scene authoring as incremental requirement satisfaction, unifying construction and editing. In this paper, we present MUSE, a memory-grounded multi-agent framework in which an Architect compiles instructions into structured requirements, a Sculptor executes local scene operations, and an Inspector verifies each step while updating Working, Scene, and Skill Memory. To evaluate requirement-level controllability and preservation-aware editing, we introduce AuthorBench, offering 145 constrained construction cases and a 1,584-case preservation-aware editing pool paired with external structured checks. On full construction cases, MUSE improves All-Goal success from 37.9 to 80.7 and surface-constraint fulfillment from 35.0 to 92.6 over the strongest baseline. On a stratified 240-case editing test split, MUSE achieves 49.6 All-Goal success, 99.9 preservation rate, and only 0.6 unintended change rate. Beyond automated metrics, human evaluations on compared local-editing baselines support stronger alignment with user intent, and downstream navigation-proxy tests indicate stronger spatial stability. Combined with ablations validating our memory designs, these results establish MUSE as an effective framework for controllable 3D scene authoring.

08.
arXiv (CS.CL) 2026-06-16

A Unified Definition of Hallucination: It's The World Model, Stupid!

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

09.
arXiv (CS.CV) 2026-06-16

Interpolation between Convolution and Attention via K-Nearest Neighbors

作者:

The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.

10.
arXiv (CS.AI) 2026-06-11

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

arXiv:2606.12252v1 Announce Type: cross Abstract: Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

11.
arXiv (CS.LG) 2026-06-19

How to sketch a learning algorithm

作者:

arXiv:2604.07328v3 Announce Type: replace Abstract: How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are only $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.

12.
bioRxiv (Bioinfo) 2026-06-11

EditorForge: An Active-Site-Aware Framework for Inverse-Folding-Based Protein Redesign

Inverse-folding models can rapidly generate protein sequences compatible with a supplied backbone, but unconstrained redesign is poorly suited to enzyme and genome-editor-associated domains, where catalytic, substrate-proximal, and conserved structural regions must remain protected. In this paper, we present EditorForge, a modular constraint-and-audit suite for editor-domain protein redesign that wraps fixed-backbone inverse folding with explicit design masks, fixed-position enforcement, active-site-proximity auditing, active-site-shielded regeneration, and downstream structural quality control. Using full-length Moloney murine leukemia virus reverse transcriptase structure 4MH8 (MMLV RT 4MH8) as a demonstration target, EditorForge first restricted redesign to a bounded 25-position envelope while fixing 428 residues. An initial audit detected active-site-proximal failure modes despite fixed-position integrity. Later, the Active Site Shield module then removed five unsafe design positions, replaced them with lower-contact alternatives, and regenerated candidates under stricter constraints. Post Shield Audit evaluated 24 regenerated candidates, all of which satisfied the hard sequence/mask and active-site-shield constraints. For the eight candidates that were selected or returned for structure-prediction/refolding quality control. Enhanced RefoldQC found that all 8 evaluated predicted structures passed the computational structure-QC screen. That said, the selected 8 candidates passed the computational structure-QC screen, with global C RMSD values of 1.2061–1.5555~[A], active-site C RMSD values of 0.4098–1.8397~[A], mutation-neighborhood C RMSD values of 1.3155-1.6848~[A], and average pLDDT-like confidence values of 94.87-95.11. In short, EditorForge provides a reproducible triage layer that converts general inverse-folding output into constrained and editor-specific candidate sets for downstream structural and biological review on top of existing structural prediction tools.

13.
arXiv (CS.CV) 2026-06-16

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP–OCT Pretraining

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP–OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP–OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs.~0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

14.
medRxiv (Medicine) 2026-06-11

Corticospinal tract risk modifies motor recovery after minimally invasive surgery for intracerebral hemorrhage: a secondary analysis of MISTIE-III

Objective: Outcome after surgical hematoma evacuation for intracerebral hemorrhage (ICH) depends on hematoma location. As corticospinal tract (CST) integrity affects motor recovery after stroke, we hypothesized that CST integrity drives heterogeneity in surgical outcomes and investigated this in a secondary analysis of MISTIE-III participants. Methods: Risk of CST injury was categorized into four levels, based on the interaction between the CST, the hematoma, and perihematomal edema (PHE) on automatically segmented stability CT: no risk, PHE infiltration, hematoma infiltration, and complete interruption of the CST. Associations with outcome were tested using multivariable linear regression for motor National Institutes of Health Stroke Scale (NIHSS) at day 180 and ordinal regression for modified Rankin Scale (mRS) at day 365, introducing an interaction term between CST risk and treatment group. Results: Day 180 motor NIHSS was significantly lower for 'no risk' ({beta}:-3.77, [95% confidence interval [CI]: -5.8 to -1.70], p=0.0003) and 'PHE infiltration' ({beta}:-2.3, [95%CI: -3.5 to -1.1]; p=0.0002) vs. 'complete interruption'. Surgery was associated with lower Day 180 motor NIHSS in participants with hematoma infiltration ({beta}:-2.07, [95%CI: -3.8 to -0.4], p=0.016). Compared to complete interruption, 'no risk' (adjusted odds ratio [aOR]:0.27, [95%CI: 0.10 to 0.74], p=0.01) and 'PHE infiltration' (aOR:0.41, [95%CI: 0.23 to 0.74]; p=0.003) were associated with lower odds of unfavorable day 365 mRS. Surgery was associated with lower mRS in participants with no risk (aOR:0.23, [95%CI: 0.05 to 0.97, p=0.045). Interpretation: Increasing CST risk is associated with worse motor recovery (day 180) and disability (day 365). CST risk modifies the effect of the MISTIE-III procedure on motor recovery and disability.

15.
arXiv (CS.AI) 2026-06-12

Understanding the Rejection of Fixes Generated by Agentic Pull Requests – Insights from the AIDev Dataset

arXiv:2606.13468v1 Announce Type: cross Abstract: AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41\% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

16.
arXiv (quant-ph) 2026-06-11

Partitioned Iterative Quantum Scheduling of Satellites for Urgent Disaster Response: Case study of Wildfire

arXiv:2606.12310v1 Announce Type: new Abstract: The standard in Earth-observation tasks today is having near real-time access to surface images in response to changing conditions. For instance, as urban environments interface more with wildlands and wildfires become less predictable, their tracking with satellite resources becomes essential. This requires the coordination of increasingly large constellations of satellites, giving rise to challenging computational problems. With wildfire detection and tracking as a backdrop, we investigate the power of special purpose and novel computing paradigms to tackle the ensuing satellite scheduling problems, making a compelling case for quantum algorithms. We bring quantum scheduling algorithms closer to implementation by examining both the emerging iterative quantum algorithm framework, which comes with analytic guarantees compared to some classical algorithms, and distributed quantum computing methods whose relevance is on the rise as utility-scale problems begin to get solved with quantum computers. Drawing strength from several computing fronts, we develop a distributed/parallelization scheme in conjunction with the quantum algorithm design and apply these techniques to real-world datasets for wildfire detection. While our quantum subprocesses are currently too small to see significant quantum advantage, our results validate the utility of these techniques, and continue forging the path toward distributed quantum computing.

17.
arXiv (CS.CV) 2026-06-12

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.

18.
arXiv (CS.LG) 2026-06-17

Broadcast Product: Redefining Shape-aligned Element-wise Multiplication and Beyond

arXiv:2409.17502v2 Announce Type: replace Abstract: Broadcast operations are widely used in scientific computing libraries, yet their mathematical formulation is often implicit and inconsistently represented in machine learning literature. This problem frequently leads to invalid equations when element-wise products are written despite mismatched tensor shapes. In this paper, we formalize such operations by introducing the broadcast product $\boxdot$, which explicitly extends the Hadamard product through shape-aligned element duplication. We provide a rigorous definition of the broadcast product, analyze its algebraic properties, and show how it can be expressed using standard linear algebra. Building on this framework, we formulate least-squares problems and sketch a proof-of-concept broadcast decomposition. As a preliminary illustration, we show that the formalism enables a new family of decompositions with distinct structural properties from conventional tensor decompositions. This work establishes a mathematical foundation for broadcast-aware tensor operations, connecting practical implementations with rigorous tensor analysis.

19.
arXiv (quant-ph) 2026-06-17

$\mathcal{PT}$-Symmetric Spin–Boson Model with a Continuous Bosonic Spectrum: Exceptional Points and Dynamics

arXiv:2512.20277v2 Announce Type: replace Abstract: This work studies a $\mathcal{PT}$-symmetric non-Hermitian spin–boson model, consisting of a non-Hermitian two-level system coupled to a continuous bosonic bath. The static properties of the system are analyzed through a projection method derived from the displacement operator. We find that only a single exceptional point (EP) emerges, in contrast to non-Hermitian spin–boson models with finite modes, which typically exhibit multiple EPs. Notably, only a single real eigenvalue is found before the EP, which differs markedly from typical non-Hermitian systems where a pair of real eigenvalues precedes the EP. The time evolution of observables is further investigated via the Dirac–Frenkel time-dependent variational principle. Compared to its Hermitian counterpart, the non-Hermitian model exhibits distinct dynamical signatures, most notably the emergence of oscillations with periodic amplified amplitude. In the $\mathcal{PT}$-unbroken phase, the system exhibits sustained oscillatory dynamics with suppressed decoherence, whereas in the $\mathcal{PT}$-broken phase, additional dissipative channels accelerate decoherence and drive rapid convergence toward a stable steady state. These results shed light on how $\mathcal{PT}$ symmetry protects coherent light–matter interactions in non-Hermitian quantum systems.

20.
arXiv (CS.CV) 2026-06-12

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

21.
arXiv (CS.CL) 2026-06-16

Understanding the Behaviors of Environment-aware Information Retrieval

Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance. In this work, we present the first systematic analysis of how LLMs can learn to adapt their query formulation strategies for different retrievers via reinforcement learning (RL). Our empirical study reveals that RL effectively teaches an LLM to tailor its queries to specific retriever characteristics. We discover that different retrievers exhibit surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), suggesting strategies learned for one retriever ineffective for another. We further show that performance can be enhanced by incorporating retriever-specific human guidance and by scaling model size. To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. Our work provides the first empirical evidence and actionable insights for building truly retriever-aware RAG systems. Code and resources are available at https://github.com/LCO-Embedding/Envs-aware-Information-Retrieval.

22.
medRxiv (Medicine) 2026-06-17

LLM-Driven Extraction of NI-RADS and Imaging Tumor Characteristics to Enhance Oropharyngeal Cancer Survivorship Surveillance

Abstract Purpose Radiologic surveillance is essential for oropharyngeal cancer (OPC) survivors, guiding recurrence detection and follow-up strategies. The Neck Imaging Reporting and Data System provides a standardized framework for post-treatment risk reporting at both the primary tumor site (pNI-RADs) and cervical lymph nodes (nNI-RADS). Comprehensive surveillance additionally requires assessment of disease status, including the primary tumor, nodal involvement, and distant metastases. These clinical results are often embedded as unstructured data within free-text radiology reports. We hypothesized that a large language model (LLM) can reliably extract NI-RADS score criteria and summarize key imaging features from unstructured radiology text, achieving high concordance with expert review. Methods Previously untreated OPC patients who received definitive cancer therapy were identified. Eligible imaging reports included post-treatment head and neck CT, MRI, or FDG PET/CT scans containing narrative and impression text. Examinations lacking narrative or impression text, containing pre-existing NI-RADS annotations, or involving non-surveillance imaging modalities were excluded. A total of 200 reports were randomly selected from 7,076 eligible examinations for manual abstraction using a three-reviewer consensus framework to establish a reference dataset. Using the Palantir Foundry Pipeline Builder, a GPT-5-based LLM was deployed to extract pNI-RADS and nNI-RADS scores, and key imaging features of disease status from these reports. Performance was evaluated using exact agreement and F1-based metrics. Results Agreement for no evidence of disease (score of 1) was 93.3% (126/135; F1 = 0.94) and 90.3% (130/144; F1 = 0.93) for pNI-RADS and nNI-RADS, respectively. For NI-RADS [≥]2, exact category agreement was 73.1% (38/52; macro-F1 = 0.75) for pNI-RADS and 64.3% (27/42; macro-F1 = 0.56) for nNI-RADS. Quadratic weighted {kappa} was 0.81 and 0.59, respectively. For post-treatment disease surveillance variables, agreement was 94.9% (149/157; F1 = 0.87) for primary tumor presence, 89.1% (164/184; F1 = 0.87) for nodal disease presence, and 94.7% (126/133; F1 = 0.70) for distant metastasis detection. Specificity was high across disease-status variables (0.95-0.99), with negative predictive values of 0.95 for primary tumor, 0.87 for nodal disease, and 0.99 for distant metastasis. Conclusions Our LLM-based information retrieval and classification approach for radiographic treatment response from unstructured, multidimensional imaging reports achieved high performance for disease exclusion and moderate performance for detecting suspected residual and/or new disease. This pipeline supports scalable and standardized surveillance data capture for longitudinal monitoring, clinical analytics, and survivorship research in head and neck oncology.

24.
arXiv (CS.LG) 2026-06-19

Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying

arXiv:2606.20167v1 Announce Type: new Abstract: Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation - the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.

25.
arXiv (CS.CV) 2026-06-16

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering within a single framework. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.