Academic Intelligence · Curated Daily

Explore the Frontiers of Global Academic Research

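AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefs across intersecting fields.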

01.
bioRxiv (Bioinfo) 2026-06-22

Benchmarking cell type annotation in spatial transcriptomics: resolving cellular hierarchies, biological fidelity, and dynamic cell states

Spatial transcriptomics enables the quantification of gene expression within its native tissue context, providing unprecedented insight into tissue architecture, cellular ecosystems, and local cell-cell interactions at regional and single-cell resolution. Accurate cell type annotation is a critical prerequisite for interpreting these data and is often the first and most essential step in downstream analysis. Despite rapid advances in computational methods, cell type annotation remains challenging and frequently requires extensive expert-driven manual curation based on marker-gene expression, spatial context, and prior biological knowledge. While early approaches relied primarily on transcriptional similarity, newer methods increasingly incorporate spatial information, histological features, and multimodal data to improve annotation accuracy. Nevertheless, reliable annotation remains difficult when biological interpretation requires fine-grained subtype resolution, particularly for platforms with limited gene panels, tissues undergoing dynamic cellular state transitions, and studies in which reference and query datasets differ substantially in biological context or technical modality. Here, we present a systematic benchmark of 20 state-of-the-art cell type annotation methods across four spatial transcriptomics datasets spanning diverse technologies, experimental conditions, cell numbers, and gene panel sizes. Importantly, all benchmark datasets contain expert-curated cell type labels, including well-resolved cell populations and subtype annotations, providing high-quality biological ground truth for evaluation. The benchmark encompasses both reference-based and reference-free methods representing a broad range of computational frameworks. 
Performance was assessed using conventional classification metrics, including accuracy and F1-based measures, together with structure-aware metrics that evaluate both cell-level annotation accuracy and preservation of higher-order biological organization. Across datasets, annotation performance varied substantially according to tissue context, reference-query similarity, and annotation granularity. Fine-grained subtype annotation and recovery of rare cell populations remained challenging for many methods, particularly in datasets capturing injury, repair, developmental, and regenerative processes characterized by continuous cellular state transitions. Notably, high classification accuracy did not necessarily correspond to preservation of global cellular relationships or biologically coherent downstream pathway and gene-set enrichment analyses. Overall, scANVI, Seurat, and TACCO consistently ranked among the top-performing methods, although their relative advantages were context dependent. Together, our results provide a comprehensive assessment of current annotation strategies for spatial transcriptomics and offer practical guidance for selecting methods that best align with specific biological questions, dataset characteristics, and analytical priorities.

02.
arXiv (CS.CV) 2026-06-19

Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance

With 890,000 annual new cases globally, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using a deformable registration approach based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was $11.11 \pm 4.07$ mm using rigid registration, which decreased to $8.20 \pm 2.68$ mm (26.19% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to $5.62 \pm 2.28$ mm, a 49.41% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.
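The two reduction percentages quoted above follow directly from the reported mean target registration errors. The short check below recomputes them; the function and constant names are illustrative only and do not come from the paper.

```python
def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return (baseline - improved) / baseline * 100.0


# Mean target registration errors (mm) as reported in the abstract.
RIGID_TRE_MM = 11.11
DEFORMABLE_TRE_MM = 8.20   # deformable, no contour constraint
CONTOUR_TRE_MM = 5.62      # deformable, with contour constraint

no_contour = percent_reduction(RIGID_TRE_MM, DEFORMABLE_TRE_MM)
with_contour = percent_reduction(RIGID_TRE_MM, CONTOUR_TRE_MM)

print(f"deformable vs rigid:          {no_contour:.2f}% reduction")
print(f"contour-constrained vs rigid: {with_contour:.2f}% reduction")
```

Both percentages are relative to the rigid baseline, so they are not additive: the contour constraint accounts for the difference between the two figures.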

03.
medRxiv (Medicine) 2026-06-17

Brain age gap correlates with DTI-derived microstructural abnormalities in multiple sclerosis.

Background: Brain age gap (BAG) is increased in multiple sclerosis (MS), but whether it reflects microstructural pathology beyond conventional atrophy remains unclear. Objective: To test whether BAG is elevated in MS and correlates with conventional and diffusion tensor imaging (DTI) abnormalities relative to healthy controls. Methods: A case-control study of 43 people with MS and 18 healthy controls was performed. BAG was estimated from T1-weighted MRI using brainageR. Controls were used as MRI reference distributions. MRI values were expressed as deviation z-scores and correlated with BAG within MS. Conventional MRI and DTI domains were analysed using age/sex-adjusted partial correlations with domain-wise Benjamini-Hochberg FDR correction, where appropriate. Results: BAG was higher in MS than controls (4.79 vs -2.58 years; p

04.
arXiv (CS.CL) 2026-06-17

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system harness – a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the harness; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual harness components makes the end-to-end system score difficult to iterate on.

05.
arXiv (CS.CV) 2026-06-17

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

06.
arXiv (CS.AI) 2026-06-17

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

arXiv:2606.17767v1 Announce Type: cross Abstract: Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

07.
arXiv (quant-ph) 2026-06-12

The table maker's quantum search

arXiv:2601.13306v2 Announce Type: replace Abstract: We show that quantum search can be used to compute the hardness to round an elementary function, that is, to determine the minimum working precision required to compute the values of an elementary function correctly rounded to a target precision of $n$ digits for all possible precision-$n$ floating-point inputs in a given interval. For elementary functions $f$ related to the exponential function, quantum search takes time $\tilde O(2^{n/2} \log (1/\delta))$ to return, with probability $1-\delta$, the hardness to round $f$ over all $n$-bit floating-point inputs in a given binade. For periodic elementary functions in large binades, standalone quantum search yields an asymptotic speedup over the best known classical algorithms and heuristics. We then estimate the resources required for a fault-tolerant implementation of the proposed algorithm for the $\sin$ and $\cos$ functions in double precision. We find that, although the algorithm can in principle compete with the fastest known practical method for computing the hardness to round over all binades in the format, it requires qubit coherence times that are unrealistically long for present technology.
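To give a feel for the $2^{n/2}$ scaling mentioned above, here is a rough sketch based on the textbook Grover estimate of about $(\pi/4)\sqrt{N/M}$ iterations for $N$ candidates and $M$ marked items. This is not the paper's algorithm: it ignores the $\log(1/\delta)$ factor and the polylogarithmic terms hidden in $\tilde O$, and the comparison with plain enumeration of all $2^n$ inputs is a naive baseline assumed here for illustration.

```python
import math


def grover_iterations(n_bits: int, n_marked: int = 1) -> int:
    """Textbook estimate of Grover iterations over 2**n_bits candidates.

    Assumes n_marked solutions among the candidates (an illustrative
    simplification, not a statement about the paper's oracle).
    """
    search_space = 2 ** n_bits
    return math.floor((math.pi / 4) * math.sqrt(search_space / n_marked))


# Compare naive enumeration (2**n evaluations) with the Grover estimate.
for n in (8, 16, 24):
    print(f"n={n:2d}  enumeration={2 ** n:>10d}  grover~{grover_iterations(n):>6d}")
```

Each additional input bit doubles the enumeration cost but multiplies the Grover estimate by only about $\sqrt{2}$.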

08.
arXiv (CS.LG) 2026-06-16

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

arXiv:2606.14945v1 Announce Type: new Abstract: The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring $O(n)$ token cost per iteration and $O(n^{2})$ total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90% fewer tokens (2,492 vs. 24,465). On code optimization, the stateful agent consumes 52% fewer tokens (627K vs. 1,275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at $O(n)$ cost per iteration, while the stateful agent operates within a fixed-size conversation window at $O(1)$ cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.
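The $O(n)$ versus $O(1)$ per-iteration argument can be illustrated with a toy cost model. The sketch below assumes every iteration produces an observation of the same size and that the stateful agent reads a fixed-size window; the token counts are hypothetical and are not taken from the paper's benchmarks.

```python
def stateless_total_tokens(n_iters: int, tokens_per_obs: int) -> int:
    """Total input tokens when iteration k re-reads all k observations so far."""
    return sum(k * tokens_per_obs for k in range(1, n_iters + 1))


def stateful_total_tokens(n_iters: int, window_tokens: int) -> int:
    """Total input tokens when every iteration reads a fixed-size window."""
    return n_iters * window_tokens


# Hypothetical sizes, chosen only to show the growth rates.
TOKENS_PER_OBS = 100
WINDOW_TOKENS = 300

for n in (15, 40, 80):
    stateless = stateless_total_tokens(n, TOKENS_PER_OBS)
    stateful = stateful_total_tokens(n, WINDOW_TOKENS)
    print(f"n={n:3d}  stateless={stateless:>7d}  stateful={stateful:>6d}")
```

Under this model the stateless total equals `tokens_per_obs * n * (n + 1) / 2`, which is quadratic in the number of iterations, while the stateful total grows linearly.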

09.
arXiv (CS.CV) 2026-06-16

CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

Vision-language models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this direction has not yet been studied. Although promising, such an approach lacks explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory performance in multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network using class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is implemented through two training stages. (i) Class-specific consistency constrained VLM pretraining stage: we propose token- and pixel-level attention losses to optimize the consistency between class prompts and spine units, forcing the textual class prompt to be closely coupled with the target spine unit in the semantic space. (ii) Class Prompt driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive each class-specific binary segmentation map for the unlabeled spine image and integrate them into a unified multi-class segmentation map, improving the quality of the spine pseudo label generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves superior spine segmentation performance, with a Dice score of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.

10.
arXiv (CS.AI) 2026-06-19

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

arXiv:2606.20520v1 Announce Type: cross Abstract: Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.

11.
arXiv (CS.AI) 2026-06-17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv:2606.17199v1 Announce Type: cross Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
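The relationship between the Box-Cox power transformation and the log-ratio can be checked numerically. The sketch below uses the textbook Box-Cox form $(r^\alpha - 1)/\alpha$ applied to a probability ratio $r$; whether PowerOPD uses exactly this form, and how it treats large ratios, is not stated in the abstract, so treat the function as an assumption for illustration.

```python
import math


def box_cox(ratio: float, alpha: float) -> float:
    """Textbook Box-Cox transform of a positive ratio.

    Returns log(ratio) when alpha == 0, which is the limit of
    (ratio**alpha - 1) / alpha as alpha approaches 0.
    """
    if ratio <= 0.0:
        raise ValueError("ratio must be positive")
    if alpha == 0.0:
        return math.log(ratio)
    # expm1 keeps precision when alpha * log(ratio) is tiny.
    return math.expm1(alpha * math.log(ratio)) / alpha


# A very small ratio makes the log reward blow up, while the
# power-transformed reward stays above -1/alpha.
for ratio in (1e-2, 1e-10, 1e-30):
    print(
        f"ratio={ratio:.0e}  log={math.log(ratio):9.2f}  "
        f"alpha=0.5 -> {box_cox(ratio, 0.5):7.4f}  "
        f"alpha=1.0 -> {box_cox(ratio, 1.0):7.4f}"
    )
```

For any alpha > 0 this transform is bounded below by $-1/\alpha$ and keeps the sign of the log-ratio: it is negative for ratios below 1, zero at 1, and positive above 1.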

12.
arXiv (CS.CV) 2026-06-24

Performance and Interpretability of Convolutional, Transformer, and Hybrid Deep Learning Models in Colorectal Histology Classification

Deep learning has become an important tool in computational pathology, enabling automated analysis of histopathological images. While convolutional neural networks (CNNs) have traditionally dominated this field, transformer-based and hybrid architectures have recently demonstrated promising performance. However, comprehensive comparisons of these approaches for colorectal histopathology remain limited. This study evaluated twelve ImageNet-pretrained CNN, transformer, and hybrid architectures using the Kather colorectal histopathology dataset containing 5,000 image tiles from eight tissue classes. All models were trained using a standardized transfer-learning and fine-tuning protocol and assessed using multiple performance metrics, including accuracy, precision, sensitivity, specificity, F1-score, ROC-AUC, Cohen's kappa, and Matthews correlation coefficient. All evaluated models achieved high classification performance, with accuracies ranging from 93.2% to 97.1%. EVA-02 achieved the highest overall performance (97.1% accuracy, 97.0% F1-score), closely followed by ViT-B/16. Among CNNs, ResNet34 and ConvNeXt-Tiny demonstrated highly competitive performance, achieving accuracies of 96.4% and 96.3%, respectively. Transformer architectures generally produced the strongest results across evaluation metrics, although the performance gap between the best transformer and CNN models was relatively small. Per-class analysis showed consistently strong classification performance across all tissue categories, with Complex Stroma representing the most challenging class. Overall, transformer-based architectures achieved the highest predictive performance, whereas modern CNNs provided a favorable balance between accuracy and model complexity. These findings provide a comprehensive benchmark of major deep learning paradigms for colorectal histopathology classification.

13.
arXiv (quant-ph) 2026-06-19

Entanglement structure of the dynamical phases in the sub-Ohmic spin-boson model

arXiv:2606.20313v1 Announce Type: new Abstract: The sub-Ohmic spin-boson model exhibits three distinct dynamical regimes in its spin population dynamics, classified as coherent, incoherent, and pseudo-coherent. Whether these regimes correspond to distinct spin-bath entanglement structures remains an open question. Here we address this using tree tensor network states with projector-splitting time evolution (TTN-TDVP-PS), scanning a broad grid in the sub-Ohmic $(s, \alpha)$ plane. We find that the spin entanglement entropy $S_\mathrm{spin}(t)$ reaches a stationary plateau on a timescale shorter than the polarization relaxation, enabling construction of a stationary entropy landscape from the stationary value $S_\mathrm{stable}$. Within this scalar entropy landscape, the entropy ridge broadly follows the population-based phase boundary at small $s$, but does not reproduce the two-branch structure at large $s$. The ridge remains single-valued within the incoherent region rather than separately tracking both population-based transitions. The Bloch-sphere representation provides a geometric interpretation of this behavior. The entropy plateau corresponds to trajectories settling onto constant-radius shells, with the ridge marking the parameters of smallest stationary Bloch radius. Mode-resolved bath entanglement shows that low-frequency modes dominate the environmental entropy scale and that coherent dynamics enhance bath-mode correlations beyond direct spin–mode correlations. These results establish the stationary spin entanglement entropy as a physically informative observable that complements population-based classifications of dissipative quantum dynamics.

14.
arXiv (CS.AI) 2026-06-17

FacProcessTwin: An LLM-Based System for Process Twin Development

arXiv:2606.17666v1 Announce Type: cross Abstract: Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, developing a process twin is costly. It requires accurately modelling the entire production process: its process steps, the equipment and product-specific settings each step uses, and its process variations. The resulting model must then be bound to live operational data. We present FacProcessTwin, a system that leverages a large language model (LLM) to reduce this development time, building a process twin from a plant's process documentation and natural-language input from an operator. FacProcessTwin generates this complete process model and then automatically binds its process steps to live operational data. The generated model and its data bindings are rendered as an interactive process diagram through which manufacturing personnel can monitor and correct the system's autonomous decisions, such as resolving uncertainty at safety-critical binding steps. We evaluate FacProcessTwin through a real-world case study of an Australian food manufacturer, covering 16 production process flows that span chilled, frozen, and aseptic shelf-stable product categories and include process variations within the same product. The results show that FacProcessTwin generates these process models accurately (a mean F1 of 95.2% against ground truth) and builds each twin in roughly a sixth of the manual time. Its human-in-the-loop governance then keeps the safety-critical bindings correct: at ambiguous tags where a single-pass baseline silently mis-binds 75.0% of the time, FacProcessTwin defers to the operator and mis-binds none.

15.
arXiv (CS.CV) 2026-06-12

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.

16.
arXiv (CS.CV) 2026-06-11

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as a raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.

17.
arXiv (CS.CV) 2026-06-15

$\mu_0$: A Scalable 3D Interaction-Trace World Model

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.

18.
arXiv (CS.AI) 2026-06-15

Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value

arXiv:2510.01663v2 Announce Type: replace-cross Abstract: For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov–Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN's architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose ShapKAN, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach improves KAN's interpretability advantages, facilitating deployment in resource-constrained environments.
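For readers unfamiliar with Shapley value attribution, the sketch below computes exact Shapley values for a tiny cooperative game by enumerating coalitions. It is a generic illustration, not ShapKAN: the three "nodes", their weights, and the interaction bonus are hypothetical, and the abstract does not say how ShapKAN defines or approximates its value function.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Probability weight of a coalition of this size preceding p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


# Hypothetical game: each node adds its own weight, and nodes "a" and "b"
# earn an extra bonus of 1.0 only when both are present.
WEIGHTS = {"a": 1.0, "b": 2.0, "c": 3.0}


def toy_value(coalition: FrozenSet[str]) -> float:
    bonus = 1.0 if {"a", "b"} <= coalition else 0.0
    return sum(WEIGHTS[p] for p in coalition) + bonus


phi = shapley_values(list(WEIGHTS), toy_value)
for node, score in phi.items():
    print(f"{node}: {score:.3f}")
```

The interaction bonus is split evenly between the two nodes that create it, and the attributions sum to the value of the full coalition. Exact enumeration is only practical for a handful of players.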

19.
arXiv (CS.AI) 2026-06-16

TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting

arXiv:2606.16173v1 Announce Type: new Abstract: High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.

20.
arXiv (CS.CV) 2026-06-18

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts, but three challenges remain: autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.

21.
arXiv (CS.AI) 2026-06-16

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

arXiv:2602.07883v4 Announce Type: replace Abstract: LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Such rigidity forces a trade-off between domain-specific performance and cross-task generalization: strong priors and compact tool spaces aid specialization but weaken transfer, while task-agnostic workflows and broad action spaces expand coverage but dilute guidance. Existing pre-execution optimization, planner-worker orchestration, and configuration patching fall short of resolving this tension, as they decouple adaptation from execution, causing information loss, fragmented optimization, and ambiguous credit assignment. We propose ToolSelf, a tool-driven runtime self-reconfiguration paradigm that abstracts configuration updates as a standardized tool interface and unifies execution and adaptation within one policy's action space. The execution agent can dynamically update sub-goals, strategies, toolboxes, context, and context-management modes based on task progress and feedback. We further introduce Configuration-Aware Two-stage Training (CAT), which combines rejection sampling fine-tuning with trajectory-level KTO reinforcement learning to internalize self-reconfiguration. Across diverse benchmarks, zero-shot ToolSelf rivals task-specialized agents; after CAT training, ToolSelf gains 28.8 points over the static-configuration baseline on average, illuminating a path toward emergent adaptivity that obviates manually injected guidance. The code is available at https://github.com/lian-tian-mo-zun/ToolSelf.

22.
arXiv (quant-ph) 2026-06-15

Interpreting Bohm-like quantum potentials in "Computing quantum waves exactly from classical action"

arXiv:2605.20443v3 Announce Type: replace Abstract: The recent posting arXiv:2605.02621 [14], commenting on the article rspa.2025.0413 [7], argues that the proof of Lemma 3.1 in [7] is missing the spatial derivative of the density, which would lead to a Bohm-like quantum potential. This technical note shows why the propagated density is independent of space in the Feynman propagator construction of Lemma 3.1. This is done by extending the proof of Lemma 3.1 explicitly with Bohm-like quantum potential terms along the stationary action paths, and then showing that these terms are exactly zero. In [7], this property can also be verified directly on most examples (double slit, Aharonov-Bohm, potential well, harmonic oscillator, tunneling, EPR, QED), as well as in the derivations of the Pauli, Dirac, and Maxwell equations. For more general nonlinear actions, a time rescaling may be required to guarantee this space independence along stationary paths. In the hydrogen atom example, this time rescaling can be computed in closed form. In contrast to the general wave of the Madelung solution [9], Lemma 3.1 of [7] is defined first for a propagator, and a general wave is then constructed in a second step. Recall that a propagator is a specific quantum wave, which is initialized at $t=0$ with a Dirac impulse at a given initial position or momentum. In turn, a general wave is constructed in a second step by superposing a distribution of initial conditions using the propagator. This key difference is why the Bohm-like quantum potential terms disappear in the construction [7] (specifically, in the first step) while the Bohm potential in the Madelung analysis does not. This fundamental difference is also consistent with the fact that the wave construction in [7] extends naturally to relativistic contexts, while Bohmian non-locality notoriously prevents such extensions. Keywords - Response to arXiv:2605.02621, in relation to rspa.2025.0413

23.
arXiv (CS.LG) 2026-06-16

Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives

arXiv:2606.16511v1 Announce Type: new Abstract: Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.
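
To make the contrast between mean-based and tail-aware metrics concrete, here is a minimal sketch of one simple empirical form of conditional value-at-risk: the average of the worst fraction of scores. The estimator, the function name `tail_mean`, and the two score lists are assumptions made for illustration; the paper's pre-registered protocol and its tail-index estimation are not reproduced here.

```python
import math


def tail_mean(losses, tail_fraction):
    """Empirical CVaR-style statistic: mean of the largest `tail_fraction` of losses.

    This is one common convention; other definitions interpolate at the quantile.
    """
    if not 0 < tail_fraction <= 1:
        raise ValueError("tail_fraction must be in (0, 1]")
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil(tail_fraction * len(ordered)))
    return sum(ordered[:k]) / k


# Two hypothetical score sets with the same mean but different tails.
flat = [1.0] * 10
spiky = [0.5] * 9 + [5.5]

mean_flat = sum(flat) / len(flat)
mean_spiky = sum(spiky) / len(spiky)
cvar_flat = tail_mean(flat, 0.1)
cvar_spiky = tail_mean(spiky, 0.1)
print(mean_flat, mean_spiky, cvar_flat, cvar_spiky)
```

The two sets are indistinguishable by their mean, while the worst-10% average separates them. This statistic measures tail magnitude only; it says nothing about the tail-index (shape) parameter whose added value the abstract questions.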

24.
arXiv (math.PR) 2026-06-17

Limit theorems for random Dirichlet series with summation over primes, with an application to Rademacher random multiplicative functions

arXiv:2508.15032v2 Announce Type: replace Abstract: It is shown that two conjectures put forward in the recent article Iksanov and Kostohryz (2025) are true. Namely, we prove a functional central limit theorem (FCLT) and a law of the iterated logarithm (LIL) for a random Dirichlet series $\sum_p \frac{\eta_p}{p^{1/2+s}}$ as $s\to 0+$, where $\eta_1$, $\eta_2,\ldots$ are independent identically distributed random variables with zero mean and finite variance, and $\sum_p$ denotes the summation over the prime numbers. As a consequence, an FCLT and an LIL are obtained for $\log \sum_{n\geq 1} \frac{f(n)}{n^{1/2+s}}$ as $s\to 0+$, where $f$ is a Rademacher random multiplicative function.

25.
arXiv (quant-ph) 2026-06-16

Entangled states are typically incomparable

arXiv:2406.03335v2 Announce Type: replace Abstract: Consider a bipartite quantum system, where Alice and Bob jointly possess a pure state $|\psi\rangle$. Using local quantum operations on their respective subsystems, and unlimited classical communication, Alice and Bob may be able to transform $|\psi\rangle$ into another state $|\phi\rangle$. Famously, Nielsen's theorem [Phys. Rev. Lett., 1999] provides a necessary and sufficient algebraic criterion for such a transformation to be possible (namely, the local spectrum of $|\phi\rangle$ should majorise the local spectrum of $|\psi\rangle$). In the paper where Nielsen proved this theorem, he conjectured that in the limit of large dimensionality, for almost all pairs of states $|\psi\rangle, |\phi\rangle$ (according to the natural unitary invariant measure) such a transformation is not possible. That is to say, typical pairs of quantum states $|\psi\rangle, |\phi\rangle$ are entangled in fundamentally different ways, that cannot be converted to each other via local operations and classical communication. Via Nielsen's theorem, this conjecture can be equivalently stated as a conjecture about majorisation of spectra of random matrices from the so-called trace-normalised complex Wishart-Laguerre ensemble. Concretely, let $X$ and $Y$ be independent $n \times m$ random matrices whose entries are i.i.d. standard complex Gaussians; then Nielsen's conjecture says that the probability that the spectrum of $X X^\dagger / \operatorname{tr}(X X^\dagger)$ majorises the spectrum of $Y Y^\dagger / \operatorname{tr}(Y Y^\dagger)$ tends to zero as both $n$ and $m$ grow large. We prove this conjecture, and we also confirm some related predictions of Cunden, Facchi, Florio and Gramegna [J. Phys. A., 2020; Phys. Rev. A., 2021].
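
The majorisation criterion quoted above can be checked directly on small examples. The sketch below is a minimal illustration, assuming each state is represented only by its local spectrum (a list of non-negative numbers summing to 1); the function names `majorises` and `locc_convertible` are hypothetical, and no random-matrix sampling is attempted.

```python
from itertools import accumulate


def majorises(x, y, tol=1e-12):
    """Return True if vector x majorises vector y.

    Both vectors are sorted in decreasing order (padded with zeros to equal
    length); x majorises y when every partial sum of x is at least the
    corresponding partial sum of y and the totals agree.
    """
    n = max(len(x), len(y))
    xs = sorted(x, reverse=True) + [0.0] * (n - len(x))
    ys = sorted(y, reverse=True) + [0.0] * (n - len(y))
    if abs(sum(xs) - sum(ys)) > tol:
        return False
    return all(a >= b - tol for a, b in zip(accumulate(xs), accumulate(ys)))


def locc_convertible(spec_psi, spec_phi):
    """Criterion as stated in the abstract: psi -> phi is possible
    when the local spectrum of phi majorises that of psi."""
    return majorises(spec_phi, spec_psi)


# A maximally entangled two-qubit spectrum can reach a product-state spectrum,
# but not the other way round.
bell, product = [0.5, 0.5], [1.0, 0.0]
# A pair where neither spectrum majorises the other (incomparable).
a, b = [0.5, 0.25, 0.25], [0.4, 0.4, 0.2]

print(locc_convertible(bell, product), locc_convertible(product, bell))
print(locc_convertible(a, b), locc_convertible(b, a))
```

The second pair is incomparable in both directions, which is the situation the abstract says becomes typical as the dimensions grow.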