Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-16

UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

arXiv:2606.15890v1 Announce Type: new Abstract: Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, posing significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO$_2$, NO$_2$, PM${2.5}$, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 state-of-the-art representative MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators spanning environment and subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our codes and datasets are accessible via https://github.com/axin1301/UrbanWell-Benchmark.

02.
arXiv (CS.CL) 2026-06-16

Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling

Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters. (2) Furthermore, the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform, demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days by 0.04%.

03.
medRxiv (Medicine) 2026-06-10

Exploratory Assessment of Pulsed-Wave Doppler Representations of Lung Sounds Using Deep Learning: An In-Vitro Phantom Study

The increasing availability of portable ultrasound systems motivates exploration of novel approaches to respiratory signal assessment. In this in-vitro study, we investigate whether pulsed-wave (PW) Doppler ultrasound can capture structured spectral patterns from replayed lung sound recordings. Digitized respiratory sounds were replayed through a tissue-mimicking ultrasound phantom, generating 1,478 PW Doppler spectral images from recordings associated with healthy subjects and several externally labeled disease categories. Exploratory classification experiments using a ResNet-18 architecture demonstrated that these Doppler representations contain learnable differences under controlled conditions. These findings motivate further investigation into PW Doppler as a potential representation of respiratory acoustics.

04.
arXiv (CS.AI) 2026-06-12

Reconstructing Template-Memorized Images from Natural Prompts

arXiv:2507.07947v4 Announce Type: replace-cross Abstract: Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt ``blue Unisex T-Shirt'' generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

05.
arXiv (CS.LG) 2026-06-11

Program Evaluation with Remotely Sensed Outcomes

arXiv:2411.10959v5 Announce Type: replace-cross Abstract: We study causal inference in experiments and quasi-experiments, where the economic outcome is imperfectly measured by a remotely sensed variable. The remotely sensed variable is low-cost, scalable, and predictive of the economic outcome in observational data; examples include satellite imagery and mobile phone activity. We model the remotely sensed variable as post-outcome: variation in the economic outcome causes variation in the remotely sensed variable. For example, changes in environmental quality cause changes in satellite imagery, not vice versa. Under this assumption, we propose a formula to nonparametrically identify the causal parameter by combining experimental and observational data. We develop a method for n^{-1/2} inference that is robust to misspecification and that does not restrict the algorithms used to process remotely sensed variables.

06.
arXiv (CS.AI) 2026-06-11

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

arXiv:2606.11814v1 Announce Type: cross Abstract: Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance $z$, the real off-diagonal component $c$, and the imaginary off-diagonal component $s$. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.

07.
arXiv (CS.AI) 2026-06-16

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

arXiv:2606.16210v1 Announce Type: new Abstract: Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

08.
arXiv (CS.LG) 2026-06-15

Provably Safe, Yet Scalable Reinforcement Learning

arXiv:2606.14536v1 Announce Type: new Abstract: Safe reinforcement learning (RL) aims to learn policies that optimize rewards while satisfying constraints. Predominant approaches rely on soft-constrained policy optimization, which has achieved empirical success but does not provide formal safety guarantees for the learned policy. In contrast, methods with strict guarantees typically rely on explicit certificate functions, whose construction requires the direct synthesis and verification of control-invariant sets, a process that scales poorly with state dimension and often yields overly conservative behavior. In this paper, we present the Provably Safe, yet Scalable RL (PS2-RL) framework, a novel two-phase architecture for learning provably safe policies in a scalable manner, designed to overcome the key bottlenecks of prior methods. Rather than explicitly computing invariant sets, PS2-RL leverages a learned backup policy to forward-integrate the system dynamics, generating an implicit control-invariant set online. In the first phase, the backup policy is trained with our proposed safe-arrival value function, which characterizes the optimal backup policy for invariant-set construction. In the second phase, an RL policy is trained end-to-end through a differentiable projection layer that strictly enforces the safety guarantees induced by the learned backup policy. By maximizing the volume of the implicit control-invariant set in the first phase, the resulting PS2 policy from the second phase is performant and scalable, while maintaining provable safety. Crucially, PS2-RL imposes no restrictions on the underlying RL algorithm and can be plugged into any existing training pipeline. We establish theoretical guarantees for the proposed framework and evaluate it on robotic control tasks with state dimensions up to 10, a regime in which prior provably safe RL methods struggle or become impractical.

09.
arXiv (quant-ph) 2026-06-12

Where a Quantum Reservoir Works: A Transferable Operating Band

arXiv:2606.13284v1 Announce Type: new Abstract: In quantum reservoir computing, a fixed quantum system transforms an input signal, while learning reduces to training a simple linear readout on its measured outputs. Since the quantum dynamics themselves are never optimized, the method is well suited to today's hardware. Yet these dynamics must still be chosen carefully, because their settings remain fixed throughout training and inference. It therefore remains an open question where, in its control space, a fixed quantum system learns well. We address this question for a dissipative reservoir by mapping performance over three central physical controls: the strength of the input drive, the coupling between neighboring qubits, and the rate of dissipation. Good performance concentrates in a single, well-defined operating region of this control space. This region transfers across tasks and reservoir initializations, and the same memory-defined regime persists under architectural changes. It is also mechanistically grounded, since it disappears whenever any of the mechanisms that create it is removed. Finally, the region can be located cheaply before any task is run, using a simple memory diagnostic.

10.
arXiv (CS.LG) 2026-06-11

Higher-Order Token Interactions via Quantum Attention

arXiv:2606.11673v1 Announce Type: cross Abstract: Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce Quantum Higher-Order Attention (QHA), a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp=o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $\Omega(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically – while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k\le6$ from disjoint inputs, whereas the larger classical attention head collapses past order~2; consistent with theory, the size of the advantage tracks the target's Fourier degree - largest for parity and shrinking when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains - genetic epistasis, learning-parity-with-noise, and graph triangle detection - reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.

11.
arXiv (CS.LG) 2026-06-18

Robust and Interpretable Adaptation of Equivariant Materials Foundation Models via Sparsity-promoting Fine-tuning

arXiv:2606.18691v1 Announce Type: new Abstract: Pre-trained materials foundation models, or machine learning interatomic potentials, leverage general physicochemical knowledge to effectively approximate potential energy surfaces. However, they often require domain-specific calibration due to physicochemical diversity as well as mismatches between practical computational settings and those used in constructing the pre-training data. To address this, we propose a sparsity-promoting fine-tuning method that selectively updates model parameters by exploiting the structural properties of E(3)-equivariant materials foundation models. On energy and force prediction tasks across molecular and crystalline benchmarks, our method matches or surpasses full fine-tuning and equivariant low-rank adaptation while updating only $\sim$3~\% of parameters, and in some cases as little as $\sim$0.5~\%. Beyond energy and force calibration, we further demonstrate task generalizability by applying our method to magnetic moment prediction and magnetism-aware total energy modeling. Finally, analysis of sparsity patterns reveals physically interpretable signatures, such as enhanced $d$-orbital contributions in transition metal systems. Overall, our results establish sparsity-promoting fine-tuning as a flexible and interpretable method for domain specialization of equivariant materials foundation models.

12.
arXiv (CS.LG) 2026-06-11

Probabilistic Salary Prediction with Graph Attention Networks and a Mixture Density Network

arXiv:2606.11663v1 Announce Type: cross Abstract: Accurate salary prediction is critical for bridging the information gap between employers and job seekers in modern labor markets. Existing approaches predominantly yield a single point estimate and treat job attributes such as location, occupation, and industry as independent categorical features, ignoring both the inherent uncertainty and multi-modality of real-world compensation data and the rich hierarchical and semantic-similarity relationships that govern pay norms. In this paper we propose GAT-MDN, a unified framework that addresses both limitations simultaneously. For each of the three attribute domains we construct a domain-specific graph whose edges encode (i) hierarchical parent-child containment and (ii) weighted similarity links derived from a pre-trained Sentence-Transformer. Parallel Graph Attention Networks (GATs) with edge-feature-aware attention learn rich, context-sensitive node representations from these multi-relational graphs. A priority-based hierarchical selection module then assembles a composite feature vector that gracefully handles missing or coarse attributes, and a Mixture Density Network (MDN) head maps this vector to the parameters of a Gaussian Mixture Model (GMM), yielding a full conditional salary distribution. Extensive experiments on a real-world Dutch job-posting dataset of over 1 million records demonstrate that GAT-MDN significantly outperforms a non-graph MLP-MDN baseline in both Negative Log-Likelihood (NLL) and Mean Squared Error (MSE).

13.
arXiv (CS.CL) 2026-06-18

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

14.
arXiv (CS.CV) 2026-06-18

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.

15.
arXiv (CS.AI) 2026-06-15

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

arXiv:2606.14239v1 Announce Type: new Abstract: Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards – signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.

16.
arXiv (CS.LG) 2026-06-19

Alternating Direction Method of Multipliers for Nonlinear Matrix Decompositions

arXiv:2512.17473v3 Announce Type: replace-cross Abstract: We present an algorithm based on the alternating direction method of multipliers (ADMM) for solving nonlinear matrix decompositions (NMD). Given an input matrix $X \in \mathbb{R}^{m \times n}$ and a factorization rank $r \ll \min(m, n)$, NMD seeks matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$ such that $X \approx f(WH)$, where $f$ is an element-wise nonlinear function. We evaluate our method on several representative nonlinear models: the rectified linear unit activation $f(x) = \max(0, x)$, suitable for nonnegative sparse data approximation, the component-wise square $f(x) = x^2$, applicable to probabilistic circuit representation, and the MinMax transform $f(x) = \min(b, \max(a, x))$, relevant for recommender systems. The proposed framework flexibly supports diverse loss functions, including least squares, $\ell_1$ norm, and the Kullback-Leibler divergence, and can be readily extended to other nonlinearities and metrics. We illustrate the applicability, efficiency, and adaptability of the approach on real-world datasets, highlighting its potential for a broad range of applications.

17.
arXiv (CS.LG) 2026-06-11

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

arXiv:2606.11674v1 Announce Type: cross Abstract: We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.

18.
arXiv (CS.AI) 2026-06-15

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

arXiv:2606.01730v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.

19.
arXiv (CS.CV) 2026-06-16

Interpolation between Convolution and Attention via K-Nearest Neighbors

Authors:

The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.

20.
arXiv (quant-ph) 2026-06-11

A quantum implementation of high-order power method for estimating geometric entanglement of pure states

arXiv:2405.19134v3 Announce Type: replace Abstract: Entanglement is one of the fundamental properties of a quantum state and is a crucial differentiator between classical and quantum computation. There are many ways to define entanglement and its measure, depending on the problem or application under consideration. Each of these measures may be computed or approximated by multiple methods. However, hardly any of these methods can be run on near-term quantum hardware. This work presents a quantum adaptation of the iterative high-order power method for estimating the geometric measure of entanglement of multi-qubit pure states using rank-1 tensor approximation. This method is executable on early fault-tolerant (hybrid) quantum hardware and does not depend on quantum memory. We simulate this algorithm and mitigate the effects of noise on the results of the computation using a theoretical model based on a known mitigation approach, which assumes a global depolarising noise channel.

21.
arXiv (CS.CV) 2026-06-19

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

Composed image retrieval retrieves a target image using a composed query of a reference image and a modified text description. In the fashion domain, this task requires understanding subtle attribute variations such as color, pattern, and texture. However, existing approaches face limitations due to scarce annotated data and simplistic negative sampling. We propose a novel framework that integrates a multi-modal large language model (LLaVA) to generate attribute-aware triplets and introduces a two-stage fine-tuning strategy to enhance contrastive learning. We leverage pretrained vision-language models, such as CLIP-ViT/B32, to generate and concatenate sentence-level prompts with the relative caption and to scale the number of negatives using static representations. Experimental results demonstrate enhanced compositional reasoning and improved fine-grained retrieval behavior, underscoring the feasibility and potential of the proposed framework for fashion retrieval.

22.
arXiv (CS.AI) 2026-06-16

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

arXiv:2603.18897v2 Announce Type: replace-cross Abstract: LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.

23.
arXiv (CS.CL) 2026-06-16

ArFake: A Robust Framework for Multi-Dialect Arabic Speech Spoofing Detection Benchmark

With the rise of generative text-to-speech models, distinguishing between real and synthetic speech has become challenging, especially for Arabic that have received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, we aimed to guide the construction of our final dataset either by merging audios from multiple models or by selecting the best-performing model, we conducted an evaluation pipeline that included training classifiers using two approaches: modern embedding-based methods combined with classifier heads; classical machine learning algorithms applied to MFCC features; and the RawNet2 architecture. The pipeline further incorporated the calculation of Mean Opinion Score based on human ratings, as well as processing both original and synthesized datasets through an Automatic Speech Recognition model to measure the Word Error Rate. Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS for dataset creation may limit generalizability.

24.
arXiv (CS.CV) 2026-06-19

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1–100x, pseudo-labels across thresholds 0.90–0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

25.
arXiv (CS.AI) 2026-06-16

Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

arXiv:2606.15563v1 Announce Type: new Abstract: AI systems increasingly delegate decisions to specialized models, evaluators, tools, and supervisory controllers. The central AI problem is no longer only model accuracy, but uncertainty-aware governance: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated AI system can sustain, and when human intervention becomes necessary. We propose the Minimum Sufficient Oversight Principle (MSO), a variational principle for principled autonomy delegation: minimize governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, we prove a capacity theorem for stationary symbolwise review policies, derive a local first-order approximation relating workflow complexity to quality degradation, and give a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems. A companion Python package is available at https://github.com/crbazevedo/delegation-lab.