Academic Intelligence · Curated Daily

Explore the Global Frontier of Academic Research

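AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefings across intersecting fields.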

01.
arXiv (CS.LG) 2026-06-18

Decomposing Prediction Mechanisms for In-Context Recall

arXiv:2507.01414v2 Announce Type: replace Abstract: We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study whether the transformer models can recall the state of a sequence previously seen in their context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continue to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism phenomenon (manifesting as separate phase transitions) is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task and observed a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

02.
arXiv (math.PR) 2026-06-17

Optional Stopping for Superhedging Supermartingales

arXiv:2606.17452v1 Announce Type: new Abstract: Superhedging supermartingales, introduced by the authors in previous work, are non-probabilistic processes defined via subadditive outer integrals that carry a purely financial interpretation in terms of superhedging cost. Building on the Leinert-König theory of non-lattice integration, the present paper establishes several results that are classical in probability theory but whose non-probabilistic proofs require fundamentally new arguments: (i) a tower inequality for the conditional outer integral \overline{\sigma}_j applied at stopping times, reducing to equality when the integrand is conditionally integrable; (ii) three versions of Doob's optional stopping theorem, organised by the class of supermartingale and the range of the stopping times; and (iii) Dubins' upcrossing inequality in both finite- and infinite-time horizons. A key structural result, property (K)-a.e., identifies conditions under which the two superhedging operators \overline{\sigma}_j and \overline{I}_j coincide on non-negative functions, extending the scope of all preceding results to the positive operator \overline{I}_j. None of the proofs invoke classical measure-theoretic tools; in particular, (classical) integrability and measurability are not assumed. The analogues of classical stochastic results acquire a purely financial interpretation and, in this way, gain depth and generality by providing a context that is independent of any a priori probabilistic structure.

03.
arXiv (CS.CV) 2026-06-18

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method consisting of two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

04.
PLOS Computational Biology 2026-06-01

BeetleAtlas 2: An enhanced Tribolium castaneum web resource for tissue and developmental transcriptomics allowing refinement of gene predictions

by David P. Leader, Muhammad T. Naseem, Janina L. Rinke, Kenneth Veland Halberg

BeetleAtlas is an online resource for tissue- and stage-specific transcriptomics in the red flour beetle, Tribolium castaneum. On updating from the original Tcas5.2 genome assembly to the more recent improved icTriCast1.1 genome assembly it became evident that there were major discrepancies between the gene models of the two genome annotations in use: the OGS3 and the NCBI gene sets. As neither was clearly superior we implemented a new design in BeetleAtlas 2 (beetleatlas.org) comprising two parallel ‘modes’: one incorporating results using the NCBI gene models and a second incorporating those using the OGS3 gene models. This allows direct comparison where equivalent gene models exist: 50–57% of cases. To aid resolution of discrepancies between the two gene model sets and verification of results, gene models are linked to a custom visualization of RNA-seq read coverage of the genome in the UCSC Genome Browser. This displays reads from 22 tissues and life stages superimposed on the icTriCast1.1 genome assembly. Reference tracks show the NCBI gene models, the OGS3 gene models after translation of their coordinates from the Tcas5.2 assembly, and 1050 discontinued NCBI gene models from the previous assembly after a similar transfer of coordinates. We document various situations in which distinct patterns of expression of the tissues can be used to confirm and extend correlations between the two gene sets, resolve discrepancies between them, make corrections and identify putative genes or exons absent from the current gene sets. BeetleAtlas 2 allows those involved in Tribolium research to avoid the pitfalls inherent in incorrect gene models when planning experiments on specific genes and interpreting the results. It also demonstrates how BeetleAtlas 2 might play an important role in establishing a revised gene set for Tribolium castaneum in the future.

05.
arXiv (CS.AI) 2026-06-15

The Shrinking Lifespan of LLMs in Science

arXiv:2604.07530v2 Announce Type: replace-cross Abstract: Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We introduce time-to-peak and lifespan as measures of model obsolescence and use them to characterize the scientific adoption trajectories of 62 LLMs across more than 108k citing papers (2019-2025), separating active adoption from background citation to recover per-model trajectories that citation counts cannot resolve. We find that a model's longevity is shaped more by when it was released than by its characteristics: release year predicts time-to-peak and lifespan more strongly than architecture, openness, or scale. LLM adoption follows an inverted-U curve (rising after release, peaking, and then declining), but this pattern is rapidly compressing. Each successive release year is associated with a 27% shorter time-to-peak and a 23% shorter lifespan ($p < 0.001$), robust to minimum-age thresholds and controls for model size. These adoption-side dynamics are invisible to scaling laws and suggest that specialization on any single model may be a depreciating investment, with costs falling on reproducibility and migration.
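Editor's illustration (not from the paper): to get a feel for what "27% shorter time-to-peak and 23% shorter lifespan per release year" implies over several years, the sketch below compounds the reduction multiplicatively. The multiplicative form, the function name `projected_duration`, and the normalised baseline of 1.0 are assumptions made here for illustration; the abstract does not state how the per-year effect is modelled.

```python
def projected_duration(baseline: float, yearly_reduction: float, years_later: int) -> float:
    """Duration for a model released `years_later` years after the baseline cohort.

    Assumes (hypothetically) that the per-year reduction compounds
    multiplicatively: each later release year scales the duration by
    (1 - yearly_reduction).
    """
    if not 0.0 <= yearly_reduction < 1.0:
        raise ValueError("yearly_reduction must be in [0, 1)")
    if years_later < 0:
        raise ValueError("years_later must be non-negative")
    return baseline * (1.0 - yearly_reduction) ** years_later


if __name__ == "__main__":
    # Baseline normalised to 1.0, so results read as fractions of the baseline.
    for k in range(4):
        peak = projected_duration(1.0, 0.27, k)
        life = projected_duration(1.0, 0.23, k)
        print(f"+{k} years: time-to-peak x{peak:.3f}, lifespan x{life:.3f}")
```

Under this assumption, a model released three years after the baseline cohort would reach its peak in roughly 39% of the baseline time and have roughly 46% of the baseline lifespan.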

06.
arXiv (CS.LG) 2026-06-25

SC-TauPath: A Structural Connectivity Attribution Framework for Mapping Tau Propagation Pathways in Alzheimer's Disease

arXiv:2606.04066v2 Announce Type: replace-cross Abstract: Understanding how structural connections are associated with tau propagation in Alzheimer's disease (AD) remains a central open question, yet existing computational models either rely heavily on biophysical assumptions or lack neurobiologically interpretable pathway maps. We present SC-TauPath, a structural connectivity (SC) attribution framework that maps tau propagation pathways from in vivo neuroimaging data. SC-TauPath combines a Network Diffusion Model (NDM)-augmented multilayer perceptron with gradient × input attribution to score each SC edge's contribution to tau prediction, then translates these attribution scores into multi-scale pathway maps (backbone edges, high-traffic routes, and hub ROIs). Applied to 234 ADNI participants with paired DTI SC and 18F-Flortaucipir PET, SC-TauPath achieves strong cross-validated tau prediction and yields attribution-based pathway maps consistent with established Braak staging anatomy, demonstrating that SC encodes spatially specific information about regional tau distribution in AD.

07.
arXiv (CS.CL) 2026-06-16

Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents

Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown great promise for human interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offered by AI providers, which introduces a critical but underexplored supply chain threat: backdoor attacks. In this work, we first unveil that MLLM-powered GUI agents naturally expose multiple interaction-level triggers, such as historical steps, environment states, and task progress. Based on this observation, we introduce AgentGhost, an effective and stealthy framework for red-teaming backdoor attacks. Specifically, we first construct composite triggers by combining goal and interaction levels, allowing GUI agents to unintentionally activate backdoors while ensuring task utility. Then, we formulate backdoor injection as a Min-Max optimization problem that uses supervised contrastive learning to maximize the feature difference across sample classes in the representation space, improving flexibility of the backdoor. Meanwhile, it adopts supervised fine-tuning to minimize the discrepancy between backdoor and clean behavior generation, enhancing effectiveness and utility. Extensive evaluations of various agent models on two established mobile benchmarks show that AgentGhost is effective and generic, with attack accuracy that reaches 99.7% on three attack objectives, and shows stealthiness with only 1% utility degradation. Furthermore, we tailor a defense method against AgentGhost that reduces the attack accuracy to 22.1%. Our code is available at anonymous.

08.
arXiv (CS.CL) 2026-06-19

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

Benchmark infrastructure for personally identifiable information (PII) detection remains limited: existing corpora cover few entity types, use ad hoc generation conditions, and do not show which surface conditions cause detector failures. We present REDACT, a systematically controlled multilingual PII benchmark with 13,427 records, 324,078 entity annotations, 51 entity types, 4,127 surface-form patterns, and 25 languages across 9 scripts. A strength-2 covering-array sampler controls nine generation axes: domain, format, difficulty, length, density, code-switching, language, adjacency, and co-occurrence. Three entity-level metadata fields (disclosure status, disclosure form, and a GDPR-aligned sensitivity tier) enable stratified evaluation beyond aggregate or per-type F1. From the full benchmark, we evaluate five detectors (Presidio, GLiNER, the OpenAI Privacy Filter, GPT-4.1, and Claude Sonnet 4.6) on a locked, language-stratified sample of 1,000 records. Aggregate F1 masks an architecture-dependent failure structure: the rule-based detector performs poorly on the highest-stakes data, including HIGH-sensitivity categories (recall 0.07) and non-verbatim disclosure forms, while the LLM detectors remain more robust, with the HIGH tier as their strongest sensitivity slice. A three-model reference-free LLM-as-judge assessment corroborates that sensitivity-tier assignment is the task's hardest axis. We release the benchmark, schema, prompts, and stratified evaluation harness.
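Editor's illustration (not the REDACT sampler): a strength-2 covering array is a set of configurations in which every pair of values from any two axes appears together at least once. The greedy sketch below shows the idea on three small axes. The axis values and the function names `pairs_of` and `pairwise_sample` are hypothetical, and the exhaustive candidate enumeration is only practical for toy inputs, not for nine real generation axes.

```python
from itertools import combinations, product


def pairs_of(row):
    """All (axis_i, value_i, axis_j, value_j) pairs that one row covers."""
    return {(i, row[i], j, row[j]) for i, j in combinations(range(len(row)), 2)}


def pairwise_sample(axes):
    """Greedy strength-2 covering array over the given axes.

    Enumerates the full Cartesian product as candidates, then repeatedly
    picks the candidate covering the most still-uncovered value pairs.
    """
    candidates = list(product(*axes))
    uncovered = set().union(*(pairs_of(r) for r in candidates))
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda r: len(pairs_of(r) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical axis values, chosen only for the demo.
    axes = [
        ["medical", "legal", "finance"],  # domain
        ["email", "chat", "form"],        # format
        ["easy", "hard"],                 # difficulty
    ]
    rows = pairwise_sample(axes)
    print(f"{len(rows)} rows instead of {3 * 3 * 2} exhaustive combinations")
    for row in rows:
        print(row)
```

The appeal of pairwise coverage is that the number of rows grows far more slowly than the full Cartesian product while still exercising every two-way interaction between axes.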

09.
arXiv (quant-ph) 2026-06-15

Who can compete with quantum computers? Lecture notes on quantum inspired tensor networks computational techniques

arXiv:2601.03035v2 Announce Type: replace Abstract: This is a set of lectures on tensor networks with a strong emphasis on the core algorithms involving Matrix Product States (MPS) and Matrix Product Operators (MPO). Compared to other presentations, particular care has been given to disentangle aspects of tensor networks from the quantum many-body problem: MPO/MPS algorithms are presented as a way to deal with linear algebra on extremely (exponentially) large matrices and vectors, regardless of any particular application. The lectures include well-known algorithms to find eigenvectors of MPOs (the celebrated DMRG), solve linear problems, and recent learning algorithms that allow one to map a known function into an MPS (the Tensor Cross Interpolation, or TCI, algorithm). The lectures end with a discussion of how to represent functions and perform calculus with tensor networks using the "quantics" representation. They include the detailed analytical construction of important MPOs such as those for differentiation, indefinite integration, convolution, and the quantum Fourier transform. Three concrete applications are discussed in detail: the simulation of a quantum computer (either exactly or with compression), the simulation of a quantum annealer, and techniques to solve partial differential equations (e.g. Poisson, diffusion, or Gross-Pitaevskii) within the "quantics" representation. The lectures have been designed to be accessible to a first-year PhD student and include detailed proofs of all statements.

10.
arXiv (CS.CL) 2026-06-12

Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty

Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

11.
arXiv (CS.CL) 2026-06-16

Understanding Scam Trends and Rail Paths from Reddit Self-Disclosure Narratives

Online scam behavior is inherently multi-stage, and the lifecycle includes temporally ordered rails and events rather than isolated signals. Existing works analyze characteristics of scam types and rails, but they do not track scam trends across years. Moreover, the work on the relations between rails is hampered due to the lack of open-source datasets with annotations and coverage of different scam types. To address these gaps, we build a dataset to analyze the yearly trend of scam characteristics and rail paths using Reddit self-disclosure narratives from 2023 to 2025. We collect 21,304 posts from scam-related subreddits with at least one rail among identity, communication, platform, and payment for trend analysis by heuristic annotation. Then, we label 1,800 posts containing explicit or recoverable scam chains by an LLM-assisted method for scam path analysis. The method is evaluated with human annotation. Lastly, we run a topic model on the comments of the posts to analyze the community support behavior. The results reveal that scam processes are predominantly multi-rail. Across years, different scam types and rail components dominate. Different scam types vary systematically in path complexity. Reddit support behaviors have become more detailed over time. This work supports synthetic scam chain data simulation and AI-related scam risk assessment, though findings may not generalise to other platforms.

12.
arXiv (CS.CL) 2026-06-11

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics (Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness) that require no learned parameters. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$–$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, Platonic validity: the spectral signal tracks logical coherence rather than compiler acceptance; proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, architectural determinism: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$–$6.6\%$, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.

13.
arXiv (CS.AI) 2026-06-11

Power Term Polynomial Algebra for Boolean Logic

arXiv:2603.13854v2 Announce Type: replace-cross Abstract: We introduce power term polynomial algebra, a representation language for Boolean formulae designed to bridge conjunctive normal form (CNF) and algebraic normal form (ANF). The language is motivated by the tiling mismatch between these representations: direct CNF-ANF conversion may cause exponential blowup unless formulas are decomposed into smaller fragments, typically through auxiliary variables and side constraints. In contrast, our framework addresses this mismatch within the representation itself, compactly encoding structured families of monomials while representing CNF clauses directly, thereby avoiding auxiliary variables and constraints at the abstraction level. We formalize the language through power terms and power term polynomials, define their semantics, and show that they admit algebraic operations corresponding to Boolean polynomial addition and multiplication. We prove several key properties of the language: disjunctive clauses admit compact canonical representations; power terms support local shortening and expansion rewrite rules; and products of atomic terms can be systematically rewritten within the language. Together, these results yield a symbolic calculus that enables direct manipulation of formulas without expanding them into ordinary ANF. The resulting framework provides a new intermediate representation and rewriting calculus that bridges clause-based and algebraic reasoning and suggests new directions for structure-aware CNF-ANF conversion and hybrid reasoning methods.

14.
arXiv (CS.LG) 2026-06-16

Bridging data-driven priors via the score function for posterior sampling – Comparative review and experimental study

arXiv:2606.14800v1 Announce Type: cross Abstract: This paper reviews how a diverse set of popular data-driven priors commonly used in Bayesian inverse problems can be unified through their respective score functions. By framing these priors under this common perspective, we show that they can benefit from a straightforward and effective integration into a recently proposed sampling algorithm. The applicability of this common framework is illustrated by considering several data-driven priors, namely regularization-by-denoising, normalizing flow-based priors, score-based generative models, and convex-ridge regularizers. For these four particular priors, the performance of the method is evaluated when conducting image inpainting and single image super-resolution. These results, as well as those obtained when restoring real images acquired in a geological context, demonstrate the efficiency of the method. This unified framework proves versatile enough to handle any posterior distribution defined by a broad class of score function-based priors, beyond the specific cases considered in this paper.

15.
arXiv (CS.CV) 2026-06-12

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.

16.
arXiv (CS.AI) 2026-06-17

Constitutional On-Policy Safe Distillation

arXiv:2606.03089v2 Announce Type: replace-cross Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety–helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

17.
arXiv (CS.CV) 2026-06-25

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

Vision-language models (VLMs) are increasingly deployed in consumer, medical, financial, and enterprise applications. This broad deployment expands the safety surface: risks can arise from multimodal question answering, assistant responses, and cross-modal composition, while moderation policies may vary across products, regions, and deployment stages. Most existing guardrails either rely on fixed taxonomies or target only a narrow set of interaction settings, which limits their adaptability when safety rules change at deployment time. We present SingGuard, a policy-adaptive multimodal guardrail model family for safety assessment in multimodal conversations. SingGuard treats the active policy as a runtime input: given natural-language rules, it checks the target content against the active policy rule by rule and predicts both the safety label and the triggered rule. To balance efficiency and interpretability, SingGuard supports fast, hybrid, and slow inference regimes along a fast-to-slow reasoning spectrum, ranging from direct safety judgments to policy-grounded deliberation. We further optimize this behavior with fast–slow decoupled reinforcement learning. We also introduce SingGuard-Bench, a multimodal guardrail benchmark with 56,340 examples spanning 80+ fine-grained risk types across multimodal QA, adversarial attack, and dynamic-rule evaluation settings, including cross-modal joint-risk cases where each modality is harmless in isolation but their composition implies unsafe intent. Across six benchmark families (35 datasets), SingGuard achieves state-of-the-art average F1 in every family. Dynamic-rule evaluation further shows improved policy-following accuracy from 0.6465 to 0.7415 under runtime policy shifts. Our code is available at https://github.com/inclusionAI/Sing-Guard.

18.
arXiv (CS.CV) 2026-06-25

Cage-based Texture Transfer with Geometric Filtering

Real-time texture transfer expands the creative horizon for interactive applications, enabling seamless detail projection in scenarios that range from digital character cosmetics to procedural automotive texturing. Yet, its practical application is governed by inherent trade-offs between processing speed and suppression of artifacts. Low-latency transfer methods frequently fail to suppress artifacts, and robust alternatives rely on large-scale models that are costly in training and memory. Our proposed method bridges the gap between efficiency and robustness by using a cage-based geometric filtering method to identify Non-Cosmetic Zones (NCZs) for artifact suppression. While other models are resource-intensive and require multiple days of training on manually annotated datasets, we are able to successfully suppress artifacts and achieve immediate deployment on consumer-grade hardware. Our framework achieved highly efficient runtimes of ~70ms on mobile devices for a ~4.8k triangle mesh.

19.
arXiv (CS.AI) 2026-06-12

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

arXiv:2606.12983v1 Announce Type: new Abstract: Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

20.
arXiv (CS.CL) 2026-06-15

An Empirical Study of Automating Agent Evaluation

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
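Editor's illustration (not the authors' implementation): the abstract describes Eval@1 as measuring whether generated evaluation code both executes and yields meaningful results on the first run. A minimal sketch of such a metric is shown below. The `FirstRun` record, its field names, and the idea that "meaningful" arrives as a precomputed boolean are assumptions made here; the abstract does not say how meaningfulness is judged.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FirstRun:
    """Outcome of the first run of one generated evaluation (hypothetical schema)."""
    executed: bool    # the evaluation code ran without crashing
    meaningful: bool  # its results were judged meaningful (judging method assumed external)


def eval_at_1(runs: Sequence[FirstRun]) -> float:
    """Fraction of first runs that both executed and produced meaningful results."""
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(1 for r in runs if r.executed and r.meaningful)
    return passed / len(runs)


if __name__ == "__main__":
    runs = [
        FirstRun(executed=True, meaningful=True),
        FirstRun(executed=True, meaningful=False),   # ran, but output was not useful
        FirstRun(executed=False, meaningful=False),  # crashed on first run
        FirstRun(executed=True, meaningful=True),
    ]
    print(f"Eval@1 = {eval_at_1(runs):.2f}")
```

Requiring both conditions is what separates this from a plain execution success rate: a run that executes but reports nothing useful does not count.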

21.
arXiv (quant-ph) 2026-06-25

Recursive QLSTM with Dynamic Variational Quantum Circuit Adaptation

arXiv:2606.24932v1 Announce Type: new Abstract: Recent advances in quantum computing and machine learning have motivated the development of quantum models for sequential data processing. In this paper, we propose a Recursive Quantum Long Short-Term Memory model, or Recursive QLSTM, which extends QLSTM through metacore-based recursive constructions. We numerically test the model under different input sequence lengths, metacore designs, and recursive rules, and identify the best-performing architecture among these variants. For this selected model, we further provide theoretical arguments explaining why its recursive structure improves temporal information propagation and enhances learning performance. Our results suggest that Recursive QLSTM offers a flexible and effective framework for quantum recurrent learning over input time series of various lengths.

22.
arXiv (CS.LG) 2026-06-11

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

arXiv:2601.08136v2 Announce Type: replace Abstract: Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.

23.
medRxiv (Medicine) 2026-06-24

Breaking The Pain-Stiffness Cycle- Supraclavicular Catheter Facilitated Rehabilitation Of Post-Surgical Elbow stiffness- A Retrospective Observational Study

ABSTRACT Background: Post-traumatic elbow stiffness is a recognised complication following orthopaedic trauma surgery, occurring in 10-15% of trauma patients sustaining injuries. Pain remains the primary barrier to physiotherapy compliance, with surgical arthrolysis carrying recurrence rates of up to 34%. The supraclavicular brachial plexus block, referred to as the 'spinal of the arm', provides anaesthesia and analgesia to the entire upper limb below the shoulder. A structured non-surgical approach combining continuous catheter analgesia with timed rehabilitation was identified as an unmet need in this patient group. Methods: A single-centre retrospective observational study was conducted on data of patients treated for post-surgical upper limb stiffness between January 2022 and April 2026. Of 30 patients identified, 28 with elbow involvement formed the primary analysis group following exclusion of 2 patients with isolated wrist stiffness and complex regional pain syndrome. Ultrasound-guided supraclavicular brachial plexus catheters were inserted using the Contiplex system. Patients received 0.5% Bupivacaine (10-15 ml) for initial blockade, followed by daily top-up doses of 0.2% Ropivacaine (20 ml) given 30 minutes prior to structured physiotherapy and CPM sessions for up to 5 days. The primary outcome was change in arc of elbow motion in degrees, measured by the attending orthopaedic consultant using standard goniometry. Results: Complete pre- and post-intervention data were available for all 28 patients. Mean pre-intervention arc of elbow motion was 39.1° (SD ±23.2°), improving to 104.2° (SD ±30.0°) post-intervention. Mean improvement was 65.1° (SD ±30.6°); 95% CI 53.8° to 76.4°; range 10°-140°; paired t-test t=-11.27, p
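The summary statistics above can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions: it uses only the rounded values reported in the abstract (n = 28, mean improvement 65.1°, SD 30.6°), and the normal critical value 1.96 is an assumption, since the abstract does not state which critical value was used for the interval.

```python
import math

# Rounded summary values as reported in the abstract.
n = 28             # patients in the primary analysis group
mean_gain = 65.1   # mean improvement in arc of motion, degrees
sd_gain = 30.6     # standard deviation of the improvement, degrees

# Standard error of the mean paired difference.
se = sd_gain / math.sqrt(n)

# Magnitude of the paired t statistic (mean difference over its standard error).
t_stat = mean_gain / se

# 95% interval using the normal critical value 1.96 (an assumption, see lead-in).
z_crit = 1.96
ci_low = mean_gain - z_crit * se
ci_high = mean_gain + z_crit * se

print(f"SE = {se:.2f}, |t| = {t_stat:.2f}, 95% CI = {ci_low:.1f} to {ci_high:.1f}")
```

With these rounded inputs the t statistic comes out near 11.26, close in magnitude to the reported 11.27, and the interval rounds to 53.8° to 76.4°. A Student t critical value with 27 degrees of freedom would give a slightly wider interval.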

24.
medRxiv (Medicine) 2026-06-22

REPRODUCIBILITY OF 7T MRI MEASUREMENTS OF THE SUSCEPTIBILITY AND VOLUME OF HIPPOCAMPAL SUBFIELDS

PURPOSE: The UK7T travelling head dataset was used to characterise the reproducibility of 7T measurements of the susceptibility of the hippocampal subfields, focusing on the Cornu Ammonis (CA1, CA2 and CA3), dentate gyrus (DG), subiculum (SUB), tail of the hippocampus (TAIL) and entorhinal cortex (ERC). METHODS: Susceptibility maps were created from whole-brain 3D single-echo GRE data (TE=20 ms; 0.7 mm isotropic resolution) using Multi-Scale Dipole Inversion. Automatic Segmentation of Hippocampal Subfields (ASHS) was applied to high-resolution T1- and T2-weighted images for segmentation. The mean magnetic susceptibility and volume of hippocampal subfields were evaluated in 50 data sets, comprising 5 repeat acquisitions on 10 healthy participants (age 32 ± 6 years; 3 female). RESULTS: Averaging over subjects, susceptibility values spanned an 18 ppb range over the hippocampus (ranging from -13.3 ppb in DG to 4.7 ppb in ERC). Susceptibility values in the larger hippocampal subfields showed a consistent pattern of variation across subjects, being generally more positive in ERC and SUB than in CA1 and more positive in CA1 than in DG and TAIL. The standard deviation of subfield susceptibilities over subjects ranged from 8.2 ppb in the TAIL to 1.7 ppb in CA1, and the average standard deviation across repeated measurements, which ranged from 1.7 to 4 ppb, was less than half of the inter-participant standard deviation in all subfields. Susceptibility values in the smaller subfields (CA2 and CA3) were more variable, but ICC(2,k) values for all subfields were >0.82. CONCLUSION: The reported data characterise the variation and reproducibility of hippocampal subfield susceptibility measurements at 7T.

25.
arXiv (CS.AI) 2026-06-16

Controlled Dynamics Attractor Transformer

arXiv:2606.15207v1 Announce Type: cross Abstract: Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel, associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish their controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.