Paper Plaza - AcademicHub

01.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.20014

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Authors:

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and "Flat" RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4% vs 51.5% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.

02.

arXiv (CS.AI) 2026-06-11 DOI: arXiv:2606.11901

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

Authors:

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at https://duobench.github.io/

03.

arXiv (math.PR) 2026-06-19 DOI: arXiv:2511.08288

The central heat trace on large compact classical groups

Authors:

Thibaut Lemoine, Mylène Maïda

We study the large-$N$ asymptotics of the central trace of the heat kernel on compact classical groups. For every classical family $G_N\subset \mathrm{GL}_N(\mathbb{C})$, we prove a full large-$N$ asymptotic expansion, using a highest weights/partitions correspondence adapted to the large-rank regime, under which the eigenvalues of the Laplace–Beltrami operator stabilize as observables in the algebra of shifted symmetric functions. Then, we prove a random surface representation of the trace in terms of ramified coverings of the torus. We provide two independent applications: an explicit large-rank counting law for the Casimir spectrum, with exponential Hardy–Ramanujan-type growth in contrast with the polynomial behavior of Weyl's law at fixed rank, and a rigorous probabilistic formulation of the Yang–Mills/Hurwitz duality on a two-dimensional torus initiated by Gross and Taylor, completing a previous work of the authors. We also extend this duality to a Yang–Mills/Gromov–Witten duality by expressing the coefficients of the central heat trace as explicit functionals of the generating function of Gromov–Witten invariants.

04.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2606.13982

Adaptive Nucleus Truncation for Long-Form Reasoning

Authors:

Ousmane Amadou Dia

Sampling plays an important role in long-form language-model reasoning. Over thousands of decoding steps, small changes in the candidate token set can compound into different reasoning trajectories, stability profiles, and final answers. Existing truncation methods such as top-$p$, min-$p$, and fixed top-$n\sigma$ sampling improve over unrestricted sampling, but they rely on fixed thresholds that cannot adapt to changes in entropy, task difficulty, training stage, or generation budget. We introduce Adaptive Nucleus Truncation Sampling (ANTS), which extends top-$n\sigma$ sampling from a fixed decoding rule into an adaptive rollout-control mechanism for long-form generation. ANTS selects standardized neighborhoods around the maximum logit before temperature scaling, adapts the truncation width using an entropy-conditioned controller, and retains a no-truncation fallback arm to stabilize training when truncation becomes unsafe. On a 33B-total / 4B-active sparse Mixture-of-Experts reasoning model, ANTS improves average performance over percentage-based benchmarks by +1.9, +3.8, and +5.2 points at 8K, 16K, and 32K generation budgets, respectively. The strongest gains appear on instruction following and mathematical reasoning, with IFBench improving by more than 10 points at 32K and AIME 2025 improving by 7 points. Code generation reveals an important budget interaction. On Codeforces, ANTS trails the baseline at 8K, but reverses this gap and substantially improves ELO at 16K and 32K. These results suggest that sampler design should be treated not just as a decoding hyperparameter, but as part of how we stabilize and scale long-budget reasoning.
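
For readers unfamiliar with the fixed top-$n\sigma$ rule that the abstract takes as its starting point, the following is a minimal sketch under stated assumptions: a token is kept when its raw logit lies within $n$ standard deviations of the maximum logit, and temperature is applied only afterwards. It does not reproduce ANTS; the entropy-conditioned controller and the no-truncation fallback arm are not shown. The function names and the default `n=1.0` are illustrative, not taken from the paper.

```python
import math
import random


def top_n_sigma_mask(logits, n=1.0):
    """Return indices of tokens kept by a fixed top-n-sigma rule (illustrative)."""
    mean = sum(logits) / len(logits)
    # Population standard deviation of the raw (pre-temperature) logits.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * sigma
    # The maximum-logit token always satisfies the threshold, so the set is never empty.
    return [i for i, x in enumerate(logits) if x >= threshold]


def sample_top_n_sigma(logits, n=1.0, temperature=1.0, rng=random):
    """Sample one token index from the truncated, temperature-scaled distribution."""
    kept = top_n_sigma_mask(logits, n)
    scaled = [logits[i] / temperature for i in kept]  # assumes temperature > 0
    peak = max(scaled)
    # Unnormalized softmax weights; subtracting the peak avoids overflow.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    example_logits = [10.0, 9.5, 2.0, 1.0, 0.0]
    print(top_n_sigma_mask(example_logits, n=1.0))
    print(sample_top_n_sigma(example_logits, n=1.0, temperature=0.7))
```

Because the mask is computed on raw logits, changing the temperature alters the sampling probabilities among kept tokens but not which tokens are kept.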

05.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15038

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Authors:

Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.
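
The concordance index reported above is the standard ranking metric for time-to-event models. As a hedged illustration, the sketch below computes Harrell's C-index in its simplest form: over all comparable pairs (the subject with the shorter time had an observed event), it counts how often the model assigned that subject the higher risk, with tied risks counting half. The abstract does not state which C-index variant the authors use, so this variant, the function name, and the toy data are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index (simple O(n^2) form, illustrative).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third subject is censored at time 6.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.7, 0.8, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 1.5-5.4% improvements.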

06.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20044

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Authors:

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.

07.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16694

Adaptive inference and function vectors in deep transformers

Authors:

Ravin Raj, Gautam Reddy

Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

08.

Nature (Science) 2026-06-09 DOI: HASH:a7c077bc54808d10f98c9344c49fe3af

AI technology must serve human cognitive development, not the other way around

Authors:

Jianjun Wu

Letter to the Editor

09.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2606.16759

Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

Authors:

Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız

We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional RKHS rewards, we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.

10.

arXiv (CS.LG) 2026-06-19 DOI: arXiv:2606.20357

On the Variance of Temporal Difference Learning and its Reduction Using Control Variates

Authors:

Hsiao-Ru Pan, Bernhard Schölkopf

We analyze the variance of temporal difference (TD) learning using the phased setting with tabular representation, and show that one of the mechanisms behind its ability to reduce variance is by effectively aggregating over a larger number of independent trajectories. Based on this insight, we demonstrate that (1) the variance of TD is asymptotically bounded from above by Monte Carlo (MC) estimators, and (2) shorter-horizon updates incur less variance for a fixed number of samples. Beyond TD, we show that Direct Advantage Estimation (DAE), a method for estimating the advantage function, can be seen as a type of regression-adjusted control variate, which achieves a tighter bound on the variance compared to TD in the large-sample limit. Finally, we numerically illustrate the behaviors of these estimators with carefully designed environments.
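
To make the term "regression-adjusted control variate" concrete, here is a minimal generic sketch, not the DAE estimator itself. Given samples `xs` of the quantity of interest and paired samples `ys` of an auxiliary variable whose true mean is known, each sample is adjusted by `c * (y - y_mean)`, where `c` is the regression slope of `xs` on `ys` estimated from the same data. All names and the toy numbers are illustrative assumptions.

```python
from statistics import mean, variance


def control_variate_adjust(xs, ys, y_mean):
    """Return control-variate-adjusted samples (generic, illustrative).

    xs     : samples of the target quantity
    ys     : paired samples of the control variate
    y_mean : known true mean of the control variate
    """
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance, using the same n - 1 normalization as statistics.variance.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    c = cov / variance(ys)  # regression slope of xs on ys
    return [x - c * (y - y_mean) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [1.0, 2.1, 2.9, 4.2, 5.0]  # noisy targets, strongly correlated with ys
    ys = [1.0, 2.0, 3.0, 4.0, 5.0]  # control variate with known mean 3.0
    adjusted = control_variate_adjust(xs, ys, y_mean=3.0)
    print(variance(xs), variance(adjusted))
```

The stronger the correlation between the target and the control variate, the larger the variance reduction. Estimating `c` from the same samples can introduce a small bias in finite samples, which is one reason large-sample statements are the natural setting for results of this kind.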

11.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.06113

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Authors:

Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.

12.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.15230

Benchmarking Quantum Extreme Learning based on Gaussian Boson Sampling

Authors:

Daniel Montesinos, Gian Luca Giorgi, Roberta Zambrini

Reservoir models offer a hardware-efficient learning paradigm for noisy intermediate-scale quantum devices by exploiting untrained quantum dynamics as a fixed feature map and restricting optimization to a simple classical readout layer. We propose a quantum extreme learning machine implemented using Gaussian boson sampling (GBS) and an encoding strategy that achieves high classification accuracy while reducing optical resource requirements. Classical inputs are jointly encoded in the squeezing parameters and in the interferometer unitary, enabling sampling-based, highly nonlinear feature maps while leveraging large-scale GBS output statistics, which are conjectured to be classically intractable. We systematically compare multiple families of quantum features accessible in the same setup and find that photon-number sampling probabilities provide the best performance, consistent with their higher effective feature dimensionality. Finally, we benchmark against classical nonlinear baselines and analyse robustness under noisy scenarios, showing competitive performance with fewer trainable parameters and indicating practical promise for near-term photonic implementations.

13.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18943

Physics-IQ Verified

Authors:

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
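
Kendall's $\tau$, used above to quantify how much the model ranking changes between benchmark versions, can be sketched as follows. This is the tie-free tau-a form; the abstract does not say which variant the authors computed, and the two six-model rankings in the example are made up purely for illustration.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both rankings order items i and j the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical ranks of six models under an original and a revised benchmark.
    original = [1, 2, 3, 4, 5, 6]
    revised = [2, 1, 3, 5, 4, 6]
    print(kendall_tau(original, revised))
```

A value of 1 means the two rankings agree on every pair of models, 0 means agreement and disagreement balance out, and -1 means the ranking is fully reversed.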

14.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.15083

REGRID-QAOA: A Resource-Efficient Graph-Reduced Hybrid QAOA Framework for Physics-Constrained Power System Islanding

Authors:

Yuqi Jiang, Yuqi Zhang, Zhiding Liang, Qiang Guan, Yan Li, Ganesh Kumar Venayagamoorthy

Quantum computing has rapidly emerged as a powerful paradigm for tackling computationally demanding problems. In particular, quantum optimization shows strong promise for hard combinatorial problems in power systems, where increasing distributed energy penetration heightens the need for intentional islanding to maintain grid reliability and resilience. However, power system islanding is an NP-hard combinatorial optimization problem that becomes computationally prohibitive for classical solvers as network size grows, motivating the use of quantum computing as a promising alternative pipeline. This study develops a resource-efficient hybrid QAOA islanding framework that brings physics-constrained power-system partitioning into the quantum optimization workflow. The framework combines coherency-informed graph reduction, physics-aware constraint modeling, and structured post-processing to efficiently convert shallow-circuit QAOA samples into high-quality feasible islanding decisions without deep circuits or large shot budgets. The proposed framework is validated on the standard IEEE benchmark systems (9-, 14-, 24-, 30-, 39-, and 57-bus), demonstrating that the hybrid workflow achieves Gurobi-optimal solution quality with a clear quantum resource advantage over vanilla QAOA, while the resulting islanding solutions satisfy all physical feasibility requirements after network separation. This study establishes QAOA-based islanding as a viable quantum approach for critical infrastructure, with structured post-processing as the key enabler of quantum resource efficiency.

15.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16010

Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

Authors:

Raghu Anantharangachar

Large language models have achieved impressive performance on reasoning tasks spanning mathematics, science, programming, and commonsense inference. Despite these advances, their reasoning processes remain largely latent, making them difficult to interpret, verify, replay, debug, and transfer across domains. Existing approaches such as chain-of-thought, tree-of-thoughts, graph-of-thoughts, and tool-augmented reasoning expose intermediate reasoning artifacts but typically lack explicit execution semantics, formal state representations, and verifiable reasoning structures. We introduce Theorem-Grounded Execution Ontologies (TGEO), a framework that models reasoning as an executable state-transition process rather than a sequence of generated tokens. Given an input problem, TGEO identifies relevant theorem families, binds the problem to a domain ontology, discovers semantic objects, instantiates states and operators, constructs predicates and contracts, and synthesizes an executable reasoning graph. The resulting graph provides an interpretable, replayable, and auditable representation of reasoning in which every state transition, operator application, and validation step is explicitly represented. TGEO integrates five architectural components: (1) theorem-grounded reasoning priors, (2) executable ontologies, (3) operator-mediated state transitions, (4) predicate and contract-based execution validation, and (5) architectural auditing and failure localization. We evaluate TGEO on theorem-intensive reasoning tasks derived from mathematical benchmark domains and a curated Golden Execution Suite. Our findings demonstrate the value of executable reasoning representations for interpretable, verifiable, and reproducible AI reasoning systems.

16.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2603.26551

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

Authors:

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer that further improves on its baseline's speed on both edge and desktop GPUs. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
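
As background for the MAC metric the abstract criticizes, the sketch below counts MACs for a 2D convolution with the usual formula (output positions × output channels × input channels per group × kernel area) and compares a standard 3×3 convolution with a depthwise-separable replacement. The layer sizes and function name are illustrative assumptions and are not taken from LowFormer.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


if __name__ == "__main__":
    # Hypothetical layer: 56x56 output, 64 -> 64 channels.
    standard = conv2d_macs(56, 56, 64, 64, kernel=3)
    depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
    pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)
    separable = depthwise + pointwise
    print(standard, separable, round(standard / separable, 2))
```

On this layer the separable variant needs roughly 7.9 times fewer MACs. The abstract's argument is that such a reduction does not reliably translate into a proportional reduction in measured execution time, especially on edge devices, which is why the authors contrast MAC counts with actual latency.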

17.

arXiv (CS.CV) 2026-06-18 DOI: arXiv:2606.18886

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

Authors:

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

18.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2605.11378

An Empirical Study of Automating Agent Evaluation

Authors:

Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, …

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.

19.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2606.18897

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

Authors:

Jiangnan Xia, Xuansheng Wu, Yu Yang, Xin Wang, Ninghao Liu

Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high-information-density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. The retrieved set contains personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.

20.

arXiv (quant-ph) 2026-06-11 DOI: arXiv:2606.12216

Time-Frequency Grid States for Reconstruction and Correction of Channel-Induced Distortion in Entangled Photons

Authors:

Siang-Yun Liu, Bo-Ren Huang, Zhi-Xuan Zen, Yen-Hung Chen, Pin-Ju Tsai

Characterization of time-frequency (TF) quantum states requires reliable reconstruction of their TF distributions. However, imperfect transmission or measurement channels can distort reconstructed joint spectral intensities (JSIs), especially when the underlying perturbation mechanism is unknown. Here, we experimentally demonstrate a reconstruction and correction framework that uses a TF grid state as an intrinsic frequency-domain reference. By analyzing the displacement of the grid points, a Gaussian process regression model is employed to reconstruct a correction mapping for the nonlinear coordinate deformation without assuming a prior physical model of the distortion. The learned mapping reduces the residual coordinate deviation of the TF grid state by approximately a factor of 11 and, when applied to an independent frequency-entangled test state, improves the Gaussian-shape fidelity from 76.2% to 90.0%. These results establish TF grid states as practical metrological resources for diagnosing and correcting distortions in TF quantum systems, providing a pathway toward distortion-resilient quantum communication and high-dimensional quantum information processing.


21.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.08781

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

Authors:

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further show that the Anti-Dilution Gate is the key component for mitigating propagation-induced foreground dilution and improving stroke preservation.


22.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14631

SED: Lightweight Saliency prediction for Event-based data via Distillation

Authors:

Romaric Mazna, Jean Martinet, Michele Magno

Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic and underperform on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network trained through knowledge distillation and built on a Depthwise Spatio-Temporal Block (DSTconv), a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.
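
To see why factorizing a 3D depthwise separable convolution saves parameters, the following sketch counts weights for three layer designs. The abstract does not give DSTconv's exact layout, so the split into a spatial depthwise step, a temporal depthwise step, and a pointwise step is an assumption, as are the channel counts and kernel sizes. Biases are ignored.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights in a standard 3D convolution."""
    return c_in * c_out * kt * kh * kw

def depthwise_separable_3d_params(c_in, c_out, kt, kh, kw):
    """Depthwise 3D conv (one kt x kh x kw filter per channel) plus 1x1x1 pointwise conv."""
    return c_in * kt * kh * kw + c_in * c_out

def factorized_depthwise_params(c_in, c_out, kt, kh, kw):
    """Assumed factorization: depthwise spatial (1 x kh x kw), depthwise temporal
    (kt x 1 x 1), then 1x1x1 pointwise conv."""
    return c_in * kh * kw + c_in * kt + c_in * c_out

# Hypothetical layer shape chosen only for illustration.
c_in, c_out, k = 64, 64, 3
full = conv3d_params(c_in, c_out, k, k, k)
separable = depthwise_separable_3d_params(c_in, c_out, k, k, k)
factorized = factorized_depthwise_params(c_in, c_out, k, k, k)

print(f"standard 3D conv:          {full}")
print(f"depthwise separable 3D:    {separable}")
print(f"factorized depthwise (2+1): {factorized}")
```

In this toy setting the pointwise step dominates the cost of both separable variants, so the extra saving from the factorization grows with kernel size rather than with channel count.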


23.

Nature (Science) 2026-06-09 DOI: HASH:c8863ebdd163f9371729135d5c4faf32

Let’s talk about biomedical research kits

Authors:

Rao M. Uppu

Although undoubtedly helpful in many ways, experimental assay kits risk undermining the fundamentals of science. How can we course-correct?


24.

arXiv (math.PR) 2026-06-16 DOI: arXiv:2606.15842

A small noise approximation for Muller's Ratchet

Authors:

Carola Sophia Heinzel, Peter Pfaffelhuber, Anton Wakolbinger

We consider an infinite system of SDEs with Fleming-Viot noise indexed by $k=0,1,2,\dots$, whose parameters $\alpha$, $\lambda$, and $\nu$ are the (deleterious) selection coefficient, the (uni-directional) mutation rate, and a quantity which determines the size of the system's fluctuations. The system's unique weak solution $X(t) = (X_k(t))_{k=0,1,2,\dots}$ models what is known in population genetics as Muller's ratchet. Here, $X_k(t)$ stands for the frequency of individuals carrying $k$ deleterious mutations. Since the mutation process is uni-directional, $t\mapsto \inf\{k: X_k(t)> 0\}$ is non-decreasing for almost every path of $X$, and we refer to an increase as a click of Muller's ratchet. A long-standing question concerns the clicking rate of Muller's ratchet. Using Duhamel's principle for semigroups, we give a partial answer by approximating $E\big(\sum_{k=1}^\infty kX_k(t)\big)$ and $E\big(X_0(t)\big)$ up to $O(1/\nu^2)$ for fixed $\alpha$, $\lambda$ and $t>0$. Our results suggest that $\psi:=\nu \alpha e^{-\lambda/\alpha}$ is a crucial quantity also when the mutation/selection ratio $\theta = \lambda/\alpha$ is moderately large: for large $\nu \alpha$, clicking of the ratchet on the time scale $\frac 1\alpha \log \theta$ becomes rare as soon as $\psi$ becomes large.


25.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15186

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Authors:

Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, …

Text-to-audio (TTA) generation has made significant strides, yet precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving the original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance, providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/

