Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-17

Stable and Steerable Sparse Autoencoders with Weight Regularization

arXiv:2603.04198v2 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we studied weight regularization by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, it dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increased the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability.

02.
arXiv (CS.CL) 2026-06-12

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain invariant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion on these attributes requires explicit measurement criteria; otherwise the interpretation is left to the representation. We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions. This is regardless of the experimenter's viewpoint on the subject, or whether the outcome shows existence or non-existence. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that Age of Empires II is functionally- and Turing-complete.

03.
arXiv (CS.AI) 2026-06-19

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

arXiv:2603.04219v2 Announce Type: replace-cross Abstract: We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.

04.
medRxiv (Medicine) 2026-06-15

Semantic Embeddings and the Peripheral Transcriptome in Ischemic Stroke: Connecting Molecular Signatures to NANDA-I Diagnoses

Objective: To construct and evaluate, in an exploratory manner, a pathophysiologic rationale link- ing biological pathways derived from the peripheral transcriptome in ischemic stroke (IS) to nursing diagnoses in the NANDA-I 2024-2026 taxonomy, while emphasizing that this association is not di- rect, deterministic, or automatically inferable from textual similarity with large language models (LLMs). Methods: A computational study was conducted using public secondary data from the Gene Ex- pression Omnibus series GSE16561, which includes 63 peripheral blood samples: 39 from indi- viduals with IS and 24 from healthy controls. The pipeline integrated transcriptomic analysis and functional enrichment, semantic mapping through ClinicalBERT embeddings, and mechanistic and clinical-conceptual judgment using Claude Sonnet 4.6 as a judge. The judgment stage was treated as the central interpretive layer, designed to mediate the transcriptome, pathophysiology, functional manifestation, and NANDA-I diagnosis. Results: The analysis identified a bimodal transcriptomic pattern, with activation of pathways re- lated to innate immunity and suppression of pathways related to adaptive immunity. Semantic map- ping generated 158 pathway-diagnosis pairs. The Spearman correlation between cosine similarity and the mechanistic score was negative and statistically significant (rho = -0.243; p = 2.09e-03), but weak in magnitude. This effect size indicates that semantic similarity explained less than 6% of the variance in mechanistic plausibility, reinforcing the insufficiency of embeddings as a stand- alone criterion. Of the 158 pairs, 14 were classified as high concordance, 8 as moderate, and 136 as divergent. Conclusion: The main value of this study lies in demonstrating that translating biological pathways into nursing diagnoses requires pathophysiologic, functional, and clinical-conceptual mediation. The prioritized pairs represent mechanistically plausible hypotheses for future research, without implying causality, direct clinical confirmation, or immediate care recommendations.

05.
bioRxiv (Bioinfo) 2026-06-12

PHI-Reason: evidence-grounded species-level phage-host prediction from structured biological text profiles

Phage–host interaction (PHI) prediction is a fundamental problem in microbiology with applications in microbial ecology and microbiome engineering. Existing computational approaches typically convert phage and host information into numerical representations derived from sequence similarity, protein content, genome composition or reference databases, then score candidate hosts or train host-prediction models. Although effective, such representations often make it difficult to inspect which biological evidence supports a prediction. Here, we present PHI-Reason, a species-level PHI prediction framework that reformulates host prediction as constrained biological text reasoning. Instead of embedding phages and hosts directly as numerical vectors, PHI-Reason converts heterogeneous PHI-related evidence from phage genomes, host genomes, functional annotations, homology searches and biological metadata into modular natural-language profiles. A frozen large language model then performs species-level candidate-host ranking or pairwise PHI assessment by integrating the supplied evidence at inference time. Across species-level benchmarks, PHI-Reason achieved competitive host-prediction performance and recovered complementary correct assignments relative to established sequence- and reference-based methods. Its explicit profile design enabled systematic evidence perturbation and rationale-grounding analyses, showing that predictions depend on coherent multi-source biological evidence and that hallucination risk from unsupported or incomplete profiles can be made operationally measurable. These results position PHI-Reason as a constrained evidence-integration framework for species-level PHI prediction. Rather than replacing sequence-based predictors, it provides an interpretable layer that shows how far explicit biological evidence can support host inference, and where that evidence falls short.

06.
arXiv (CS.CV) 2026-06-16

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

07.
arXiv (quant-ph) 2026-06-11

Fast Adiabatic Quantum Gates via Hyperfine Intermediate States

arXiv:2606.11655v1 Announce Type: new Abstract: The appeal of adiabatic quantum computing lies in its intrinsic robustness against various technical imperfections, making it attractive for many quantum information applications. However, it faces a fundamental challenge: accelerating the adiabatic operations while preserving adiabaticity within the qubit coherence time. In this article, we propose an electromagnetically induced transparency-based adiabatic CNOT gate protocol which harnesses atomic hyperfine intermediate states (HISs) to speed up the adiabatic evolution. The HISs, naturally-existed in two-photon transitions, often need to be suppressed due to their significant decay errors. In contrast, this paper introduces a novel method that utilizes appropriately chosen HISs not only to enhance the adiabaticity in STAY pathway but also to accelerate the population transfer in TRANSFER pathway. Through pulse optimization, we achieve adiabatic gate fidelities exceeding 0.9991 within 0.3903 {\mu}s in realistic Cs atomic setups. To demonstrate the generality of protocol we further assess the impact of decays from multiple HIS and extend our model to arbitrary number of states, providing a practical route toward fast and robust adiabatic quantum gates in Rydberg-atom platforms.

08.
arXiv (CS.LG) 2026-06-17

Instrumental and Proximal Causal Inference with Gaussian Processes

arXiv:2603.02159v2 Announce Type: replace-cross Abstract: Instrumental variable (IV) and proximal causal learning (Proxy) methods are central frameworks for causal inference in the presence of unobserved confounding. Despite substantial methodological advances, existing approaches rarely provide reliable epistemic uncertainty (EU) quantification. We address this gap through a Deconditional Gaussian Process (DGP) framework for uncertainty-aware causal learning. Our formulation recovers popular kernel estimators as the posterior mean, ensuring predictive precision, while the posterior variance yields principled and well-calibrated EU. Moreover, the probabilistic structure enables systematic model selection via marginal log-likelihood optimization. Empirical results demonstrate strong predictive performance alongside informative EU quantification, evaluated via empirical coverage frequencies and decision-aware accuracy rejection curves. Together, our approach provides a unified, practical solution for causal inference under unobserved confounding with reliable uncertainty.

09.
arXiv (math.PR) 2026-06-18

Phase transitions for contact processes on sparse random graphs via metastability and local limits

arXiv:2505.22471v2 Announce Type: replace Abstract: We propose a new perspective on the asymptotic regimes of fast and slow extinction in the contact process on locally converging sequences of sparse finite graphs. We characterise the phase boundary by the existence of a metastable density, which makes the study of the phase transition particularly amenable to local-convergence techniques. We use this approach to derive general conditions for the coincidence of the critical threshold with the survival/extinction threshold in the local limit. We further argue that the correct time scale to separate fast extinction from slow extinction in sparse graphs is, in general, the exponential scale, by showing that fast extinction may occur on stretched exponential time scales in sparse scale-free spatial networks. Together with {the results of} Nam, Nguyen and Sly (Trans.\ Am.\ Math.\ Soc.\ 375, 2022), our methods can be applied to deduce that the fast/slow threshold in sparse configuration models coincides with the survival/extinction threshold on the limiting Galton-Watson tree.

10.
arXiv (quant-ph) 2026-06-12

Proper and improper mixed states serve as different prior beliefs for quantum state retrodiction

arXiv:2502.10030v2 Announce Type: replace Abstract: A mixed quantum state can be taken as capturing an unspecified form of ignorance; or as describing the lack of knowledge about the true pure state of the system ("proper mixture"); or as arising from entanglement with another system that has been disregarded ("improper mixture"). These different views yield identical density matrices and therefore identical predictions for future measurements. But when used as prior beliefs for inferring the past state from later observations ("retrodiction"), they lead to different updated beliefs. This is a purely quantum feature of Bayesian agency. Based on this observation, we establish a framework for retrodicting on any quantum belief and we prove a necessary and sufficient condition for the equivalence of beliefs. We also illustrate how these differences have operational consequences in quantum state recovery.

11.
arXiv (CS.CL) 2026-06-11

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.

12.
arXiv (math.PR) 2026-06-11

On multidimensional infinite dihedral group extensions of Gibbs Markov maps

arXiv:2601.08961v2 Announce Type: replace-cross Abstract: We obtain a local central limit theorem for cocycles associated with a class of non abelian and non compact group extensions of Gibbs Markov maps. This class consists of multidimensional infinite dihedral groups. Unlike in the set up of the random walks on groups, we cannot use the convolution of measures on the group and instead we resort to an approach based on irreducible representations. Depending on the dimension of the group, we obtain either mixing, and thus ergodicity, or dissipativity. Also, we obtain the asymptotics of the first return time of the group extension to the origin.

13.
arXiv (CS.LG) 2026-06-17

Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv:2602.17894v2 Announce Type: replace-cross Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities - for example, health markers, demographics, or political affiliations - and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size - the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.

14.
arXiv (CS.CV) 2026-06-16

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

15.
arXiv (CS.CL) 2026-06-16

AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.

16.
arXiv (CS.CV) 2026-06-12

On Pitfalls of $RemOve-And-Retrain$: Data Processing Inequality Perspective

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.

17.
arXiv (math.PR) 2026-06-17

Cutoff for asymmetric shelf shuffle

arXiv:2606.18039v1 Announce Type: new Abstract: A mechanical shuffler consists of $m$ shelves. A deck of $n$ cards, arranged in increasing order, is dealt from the bottom sequentially. Each card is assigned a shelf uniformly at random and placed on the top (bottom) of the existing pile with probability $p$ ($1-p$) independently. We refer to this as asymmetric shelf-shuffle. We find the law $\nu_{n, m}^{(p)}$ of the permutation induced by the asymmetric shelf-shuffle and show that the pair consisting of the number of descents and the number of valleys is a sufficient statistic. This generalizes a result of Diaconis, Fulman, and Holmes (Ann. Appl. Prob., 2013) corresponding to the case $p=1/2$. For $p=1/2$, Chen and Ottolini (ECP, 2025) established the cutoff in the total variation distance near $\lfloor n^{5/4}\rfloor$. We establish the cutoff for the asymmetric shelf shuffle. Let $\nu_n$ be the uniform measure on the set of all permutations $S_n$ of $\{1, \ldots, n\}$. For a fixed $p\neq 1/2$ and $c>0$, we show that \[\operatorname{TV}\left(\nu_{n, \lfloor cn^{3/2}\rfloor }^{(p)}, \nu_n\right)=1-2\Phi\left(-\frac{|2p-1|}{4\sqrt{3}c}\right)+O_{c, p}(n^{-1/2})\;.\] We also establish the cutoff in the separation distance near $m\approx n^{2}$ and in the relative entropy near $m=n^{3/2}$. In both cases, we also obtain the cutoff profile explicitly.

18.
arXiv (CS.CL) 2026-06-11

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start protection.This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

19.
arXiv (CS.CL) 2026-06-19

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple : the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction – ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction – over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

20.
arXiv (CS.CL) 2026-06-15

An Empirical Study of Automating Agent Evaluation

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.

21.
arXiv (CS.CV) 2026-06-16

Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.

22.
arXiv (CS.CV) 2026-06-19

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises \num{33000} tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans \qtyproduct{128 x 128}{\m} with systematically aligned vectorized footprints aligned to point clouds. The dataset includes a \num{3000} tiles cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at \url{https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint}.

23.
arXiv (CS.CL) 2026-06-18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

24.
arXiv (quant-ph) 2026-06-11

Mixed-State Topological Order under Coherent Noise

arXiv:2411.03441v2 Announce Type: replace Abstract: Mixed-state phases of matter under local decoherence have recently garnered significant attention due to the ubiquitous presence of noise in current quantum processors. One of the key issues is understanding how topological quantum memory is affected by realistic coherent noise, such as random rotation noise and amplitude-damping noise. In this work, we investigate the intrinsic error threshold of the two-dimensional toric code (TC), a paradigmatic topological quantum memory, under these types of coherent noise by employing both analytical and numerical methods based on the doubled-Hilbert-space formalism. A connection between the mixed-state phase of the decohered TC and a non-Hermitian Ashkin-Teller-type statistical-mechanics model is established, and the mixed-state phase diagrams under the coherent noise are obtained. We find remarkable stability of mixed-state topological order under random rotation noise with axes near the $Y$-axis of qubits. We also identify intriguing extended critical regions at the phase boundaries, highlighting a connection with non-Hermitian physics. We argue that these phase boundaries provide upper bounds for the intrinsic error threshold, beyond which quantum error correction becomes impossible. We complement these findings by estimating the error thresholds for random rotation noise under standard quantum error correction, thereby providing lower bounds on the intrinsic error threshold.

25.
arXiv (CS.AI) 2026-06-12

Divination by Prompt: LLM-Mediated Xuanxue on Chinese Social Media

arXiv:2606.12418v1 Announce Type: cross Abstract: The rapid proliferation of large language models (LLMs) has produced a striking cultural practice: using conversational AI for divination. This paper offers one of the first systematic studies of LLM-mediated divination in the context of Xuanxue, an internet-native umbrella term for mystical and spiritual practices on Chinese social media. Using a mixed-methods design, we analyze 23000+ posts and comments from Xiaohongshu and conduct 32 semi-structured interviews with users and professional diviners. Users primarily consult LLMs about pragmatic concerns - romantic relationships, careers, exams, and in-game gacha draws - via two intersecting pathways: trend-driven curiosity enabled by viral visibility and zero-cost access, and event-driven anxiety under conditions of uncertainty. A defining feature is collaborative prompt refinement, which turns users into active prompt engineers. Among commenters expressing a clear stance, perceived efficacy skews positive, with "accuracy" often justified through biographical fit and retrospective confirmation, consistent with Barnum and confirmation bias. Users also develop verification practices such as repeated trials and cross-model comparison. Professional diviners, by contrast, portray LLMs as lacking the "spiritual power" required for genuine divination, reflecting both ontological commitments and economic boundary-work. We also show how participants navigate tensions between scientific and metaphysical frames when interpreting AI-generated readings. Situating these findings in anthropological and cognitive-evolutionary theories of divination, we argue that LLM divination preserves core functions of traditional practice while introducing scalability, repeatability, and prompt-driven co-production that reshape how divinatory authority is constructed and evaluated.