Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-12

Visual enhancement and 3D representation for underwater scenes: a review

Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. Finally, we conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

02.
arXiv (CS.CV) 2026-06-18

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

03.
arXiv (CS.CV) 2026-06-16

Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

OKLCH – the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space – is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued – a threshold switch that fires only at an achromatic endpoint – so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+\sigma^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $\sigma\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable – older engines, image and video pipelines, or GPU shaders.

05.
arXiv (CS.AI) 2026-06-17

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

arXiv:2606.17887v1 Announce Type: cross Abstract: Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is also further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems

06.
arXiv (CS.CV) 2026-06-16

Context-Aware RL for Agentic and Multimodal LLMs

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query–answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query–context–answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

07.
arXiv (CS.AI) 2026-06-16

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

arXiv:2606.15251v1 Announce Type: cross Abstract: Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

08.
arXiv (CS.LG) 2026-06-19

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

arXiv:2606.19408v1 Announce Type: new Abstract: Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.

09.
medRxiv (Medicine) 2026-06-22

Genetic modifiers of psychiatric, motor, and cognitive symptoms in Huntington's disease

The Enroll HD natural history platform provides rich longitudinal phenotypes enabling genome wide analyses across diverse clinical domains. Psychiatric symptoms are a major source of morbidity in Huntington's disease (HD), yet the genetic architecture underlying their onset is poorly understood. We analyzed ~18,000 people with HD (PwHD) to define genetic determinants of ages at psychiatric, motor, and cognitive symptom onset, and HD diagnosis. GWAS meta analysis recapitulated 11 established modifiers of motor onset and identified a novel locus spanning RAB3B/ZFYVE9 associated with age at violent/aggressive behavior onset. Exome wide analyses in Enroll HD participants implicated rare variants in FAN1, PMS1, POLD1, and HTT. Several HD modifiers of motor and cognitive symptom onset (MSH3, FAN1, HTT) also influenced psychiatric symptom onset, whereas PMS1 and POLD1 showed significant association with motor symptom onset. Psychiatric polygenic scores predicted psychiatric symptom onset, revealing a hybrid architecture combining psychiatric liability in general population with HD- or repeat expansion disease (RED) specific pathways.

10.
arXiv (CS.LG) 2026-06-12

Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

arXiv:2606.12559v1 Announce Type: cross Abstract: The ensemble Kalman filter (EnKF) is widely adopted for sequential data assimilation, but fails for solutions with discontinuities, such as shocks in compressible flows. Uncertainty in shock location induces multimodal ensemble statistics that violate the Gaussian assumptions underlying the EnKF, producing large-scale spurious oscillations in the analysis state. We introduce a feature-preserving latent-EnKF that performs the ensemble update in a learned low-dimensional latent space, where shock and flow features admit a smooth manifold representation, thereby preserving sharp features during EnKF analysis. The updated latent state is mapped back to physical state through a shared decoder for all ensemble members. The algorithm eliminates the member-specific ordered training and positivity flooring used in prior approaches. Numerical experiments on a Sod shock tube and Mach 2 shock interaction with a 2D cylinder, using sparse and noisy observations, show accurate feature recovery of shocks and contact discontinuities without spurious oscillations.

11.
arXiv (CS.AI) 2026-06-18

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

arXiv:2605.22142v2 Announce Type: replace-cross Abstract: Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.

12.
arXiv (CS.CL) 2026-06-19

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

13.
arXiv (quant-ph) 2026-06-17

Full-state information-disturbance tradeoff for direction estimation with antiparallel spin-coherent pairs

arXiv:2606.18040v1 Announce Type: new Abstract: We determine the optimal information–disturbance tradeoff for estimating an unknown spatial direction encoded in two antiparallel spins. Rotational covariance reduces the optimization over all instruments to a finite-dimensional Choi problem: a positive seed operator obeys one trace constraint for each irreducible sector of the input representation, while both the directional score and the operation fidelity are linear functionals of this seed. For two antiparallel spin-$1/2$ particles, whose physical representation decomposes as $0\oplus1$, we derive the two-multiplier dual problem and characterize the optimal instrument from the kernel vectors of the dual slack operator. The optimal operation is a covariant filter with scalar–vector coherence and is generally not a convex interpolation between the identity channel and a measure-and-reprepare strategy. At maximum information we recover the Gisin–Popescu score, but the least disturbing output state is optimized independently, giving a smaller disturbance than both the parallel-spin benchmark and antiparallel measure-and-reprepare. We also formulate the parallel benchmark and, as a central extension of the method, treat antiparallel spin-coherent states of arbitrary spin $j$. In this case the signal coherently occupies all sectors $\ell=0,\ldots,2j$ of $j\otimes j$, the endpoint information is governed by nearest-neighbor sector coherences, and the endpoint disturbance is obtained from an explicit finite block-diagonal eigenvalue problem.

14.
arXiv (CS.CL) 2026-06-12

Epistemic Constitutionalism Or: how to avoid coherence bias

作者:

Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.

15.
arXiv (CS.CV) 2026-06-17

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

16.
arXiv (CS.LG) 2026-06-18

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

17.
arXiv (quant-ph) 2026-06-16

Adiabatically-induced Kawaguchi geometry and jerk in quantum-classical systems

arXiv:2606.16037v1 Announce Type: new Abstract: Adiabatically eliminating the quantum degrees of freedom in a mixed quantum-classical system produces an effective force in the classical equation of motion. The elimination can be made to any order in the adiabatic parameter, generating a series of higher order forces. By applying a sequence of near-identity unitary transformations to the quantum state, we derive a hierarchy of increasingly accurate effective actions for the classical variables. The third order Euler-Lagrange equation is non-Newtonian as the force depends on the jerk, the third order time derivative of position. We find that the third order terms induce a special kind of Kawaguchi geometry on the space of classical variables. This geometry is characterized by an almost symplectic structure and a differential line element that depends on the acceleration in addition to the velocity. Our results can be used to efficiently capture higher order nonadiabatic effects in molecular dynamics simulations.

18.
arXiv (CS.AI) 2026-06-19

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

arXiv:2606.20502v1 Announce Type: cross Abstract: Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable–patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

19.
arXiv (quant-ph) 2026-06-17

Pulse-optimised circuit elements for scalable and noise-resilient quantum chemistry

arXiv:2606.17357v1 Announce Type: new Abstract: Useful chemistry calculations on near-term quantum processors are hindered by current algorithmic runtimes. We develop a methodology to significantly reduce these runtimes. Typically, variational quantum eigensolver (VQE) algorithms are implemented as sequences of primitive gates. Our methodology instead relies on gradient-ascent pulse engineering to construct hardware-tailored pulses for the direct implementation of VQEs. As problem sizes increase, it quickly becomes intractable to optimise a pulse that implements an entire VQE ansatz circuit. However, leading VQEs are constructed in a modular fashion. A problem-tailored VQE is assembled from parameterised circuit elements that simulate hopping between two or four electronic spin orbitals. We show that these circuit elements can be implemented more efficiently using hardware-tailored pulses. We numerically demonstrate our methodology on a silicon spin-qubit quantum processor. We find that common circuit elements, known as single- and double-qubit excitations, can be implemented in less than 289 ns and 927 ns, respectively. Compared with conventional gate-based implementations, our pulse-accelerated qubit excitations provide a scalable approach for faster and therefore more noise-robust quantum chemistry simulations by reducing VQE runtimes by up to a factor of 15.3.

20.
arXiv (CS.AI) 2026-06-12

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

arXiv:2602.10132v3 Announce Type: replace-cross Abstract: Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensors readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced under https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline.

21.
arXiv (CS.CL) 2026-06-17

A Framework for Evaluating Agentic Skills at Scale

Agent skills – structured, reusable knowledge artifacts that augment LLM agent capabilities – have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.

22.
arXiv (CS.LG) 2026-06-15

PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation

arXiv:2508.18166v5 Announce Type: replace-cross Abstract: Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel – unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.

23.
arXiv (CS.LG) 2026-06-17

Amortizing Maximum Inner Product Search with Learned Support Functions

arXiv:2603.08001v2 Announce Type: replace Abstract: Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of a vector taken within a database (the keys) that best aligns with a given query. We propose amortized MIPS: a regression-based approach that trains neural networks to directly predict MIPS solutions, amortizing the cost of repeatedly solving MIPS for queries drawn from a known distribution over a fixed key database. Our key insight is that the MIPS value function is the support function of the set of keys, a well-studied convex function whose gradient yields the optimal key. This motivates two complementary amortized models: SupportNet, an input-convex neural network trained to regress the support function, and KeyNet, a vector-valued network that directly regresses the optimal key. SupportNet can serve as a cluster router, steering queries toward relevant database partitions, while KeyNet can be used as a drop-in replacement for the original query, fed directly to off-the-shelf indexing pipelines. Our experiments on the BEIR benchmark show that, for document embeddings, learned \SupportNet{}s and \KeyNet{}s significantly improve IVF match rates when accounting for compute effort, whether measured in FLOPs, number of probes, or wall-clock time. Our code is available at: https://github.com/apple/ml-amips.

24.
arXiv (quant-ph) 2026-06-19

Passive-User Bell-State Loop-Back Key Establishment without Quantum Detectors at the User Nodes

arXiv:2606.19551v1 Announce Type: new Abstract: We propose and analyze a Bell-state extension of the Loop-Back quantum key distribution architecture for secret-key establishment between two passive users that do not require quantum transmitters or quantum detectors. In the proposed setting, a single active station, Alice, provides the entangled-state infrastructure, retains one qubit of an initially prepared Bell pair, and sends the traveling subsystem through two passive users, denoted by $B_1$ and $B_2$. Each passive user applies a local Pauli operation to the same traveling subsystem, so that the operation observed by Alice is only the effective composition $U_{\mathrm{eff}}=U_2U_1$. After the subsystem returns, Alice performs a Bell-state measurement and, using her private knowledge of the initial Bell state, deterministically identifies the effective Pauli operation. However, the individual factors $U_1$ and $U_2$ remain algebraically hidden from Alice whenever the local choices are uniformly and independently selected. The public effective operation acts as a parity-like constraint: each passive user can infer the operation applied by the other from its own private choice, while the active station learns only the global composition. This construction transfers the essential distributed-transformation mechanism of passive-user Loop-Back QKD to the entangled-state regime. Unlike single-qubit passive-user schemes, whose useful events are intrinsically post-selected, the Bell-state version is limited primarily by the success probability of the Bell-state measurement. We discuss the algebraic structure of the protocol, its interpretation as an infrastructure-assisted mediated key-establishment mechanism, and the physical assumptions required to protect passive Pauli modulators against active injection or Trojan-horse-type attacks.

25.
arXiv (CS.LG) 2026-06-17

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

arXiv:2606.18106v1 Announce Type: new Abstract: This paper explores the problem of finding the minimum zero-forcing set on undirected graphs and proposes an adapted machine-learning framework to solve the problem. The minimum zero-forcing set problem is a graph coloring problem where the color of an initial set of nodes propagates throughout a network. The set of nodes is zero-forcing if it forces all uncolored nodes to change color under the constraint of the color-change rule. There are several applications to this problem across different domains such as network science, network control, and designing logical circuits. Finding the minimum zero-forcing set is shown to be NP-hard. We propose a reinforcement learning framework, SD-ZFS, that adapts the S2V-DQN architecture to the ZFS problem. We train several models on this adapted framework and analyze the performance across graph datasets that have varying structures. We evaluate how the models trained on the framework generalize, scale, and transfer to different network types. The results demonstrate the effectiveness of the framework when compared against the optimal solution and greedy heuristic. We provide further insight into how the ZFS problem can be solved through machine-learning and the influence of network structure on the problem.