Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-24

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

arXiv:2606.24064v1 Announce Type: new Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation encourages memorization of instance-specific steps rather than acquisition of transferable problem-solving skills, limiting generalization to novel problems. We propose Strategy-Guided Policy Optimization (SGPO), which replaces instance-level trajectory imitation with reusable strategy distillation. SGPO extracts structured strategy descriptions from strong-model responses and, for each problem, constructs both autonomous and strategy-guided trajectories to enable direct comparison of the model's behavior with and without strategic guidance. The framework then addresses two key questions. For how to distill, a token-level forward-KL objective selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints ensuring stability. For when to distill, adaptive instance-level weighting strengthens guidance when autonomous exploration falls short and reduces it as the model's own competence grows. Experiments on four mathematical benchmarks across two model families show that SGPO consistently outperforms SFT, on-policy RL, and hybrid-policy baselines, improving the average score by 2.2 points over the strongest baseline on Qwen2.5-7B-Instruct. Analysis reveals that the forward-KL objective provides an inherently selective distillation signal that outperforms direct trajectory imitation, and that strategy distillation exhibits complementary scaling with base model capability.

02.
arXiv (CS.CL) 2026-06-17

Evidence of Layered Positional and Directional Constraints in the Voynich Manuscript: Implications for Cipher-Like Structure

The Voynich Manuscript (VMS) exhibits a script of uncertain origin whose grapheme sequences have resisted linguistic analysis. We present a systematic analysis of its grapheme sequences, revealing two complementary structural layers: a character-level right-to-left optimization in word-internal sequences and a left-to-right dependency at word boundaries, a directional dissociation not observed in any of our four comparison languages (English, French, Hebrew, Arabic). We further evaluate two classes of structured generator against a four-signature joint criterion: a parametric slot-based generator and a Cardan grille implementing Rugg's (2004) gibberish hypothesis. Across their full tested parameter spaces, neither class reproduces all four signatures simultaneously. While these results do not rule out generator classes we have not tested, they provide the first quantitative benchmarks against which any future generative or cryptanalytic model of the VMS can be evaluated, and they suggest that the VMS exhibits cipher-like structural constraints that are difficult to reproduce from simple positional or frequency-based mechanisms alone.

03.
arXiv (CS.AI) 2026-06-15

Rethinking Backdoor Adversarial Unlearning through the Lens of Catastrophic Forgetting in Continual Learning

arXiv:2606.14078v1 Announce Type: cross Abstract: Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More concerningly, prevailing safety tuning strategies tend to provide only superficial safety protection, as they fall short of completely eliminating the backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and further derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, BI-BAU is extended to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method exhibits general applicability across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate the backdoor effects from a backdoor model.

04.
arXiv (CS.AI) 2026-06-16

Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition

arXiv:2606.15835v1 Announce Type: cross Abstract: Diffusion models have achieved impressive empirical success in generative tasks, and their convergence theory is now relatively well understood. Motivated by privacy and scalability, recent decentralized diffusion architectures replace a single global velocity field with multiple local experts and a routing mechanism, yielding a sampling dynamics with stochastic expert switching that falls outside standard diffusion convergence analyses. In this work, We study a decentralized diffusion framework with stochastic velocity fields and ODE-based sampling. We establish a convergence guarantee in Wasserstein-2 distance, showing that the distribution of the $N$-step discretization converges to the analytical solution at rate $\mathcal{O}(N^{-1/2}+\varepsilon)$ in $W_2$, where $\varepsilon$ captures the neural approximation errors. To our knowledge, this is the first $W_2$ convergence result for decentralized diffusion models with an ODE-based sampling scheme.

05.
arXiv (CS.CL) 2026-06-18

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.

06.
arXiv (CS.CL) 2026-06-19

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

Large language models (LLMs) increasingly review and revise text, including their own. A documented self-preference bias (models favoring their own generations when acting as judges) raises the question of whether models also resist valid corrections to their own writing. We test this in a setting where "valid" is decided not by another model but by a deterministic verifier: instruction-following revision on IFEval. A model writes a draft; the official IFEval checker confirms the draft violates a constraint and that a candidate edit fixes it; the model then accepts or rejects that edit either as the genuine in-context author or as a fresh model that sees the draft neutrally. Across four mid-tier model families and 85 author-versus-fresh comparisons, we find no detectable self-preference: authors reject verified-good fixes to their own drafts at essentially the same rate as fresh models judging the same drafts (gap -5.1 pp, 95% CI [-12.9, +2.7]). A self-skepticism hint from a smaller pilot did not replicate at scale. The one robust observation is qualitative: when authors do reject a verified-good fix, 97% of their stated reasons are flaw-catching rather than preference, that is, about the character of rejections, not an elevated rate. Effects smaller than ~13 pp cannot be excluded at this sample size.

07.
arXiv (CS.CV) 2026-06-17

SPHINX: First Explain, Then Explore

Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy's weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Beyond blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, and the uncertainty of the decisions. Given the interpretable evidence extracted from the policy's own decision process, we use a vision language model to rationalize and criticize failure modes of the current policy. These critics are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can highlight an interpretable account of policy failures while other adversarial scene generation cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.

08.
arXiv (CS.LG) 2026-06-11

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

arXiv:2606.11652v1 Announce Type: new Abstract: This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model's attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.

09.
arXiv (CS.AI) 2026-06-18

Surrogate Benchmarks for Model Merging Optimization

arXiv:2509.02555v2 Announce Type: replace-cross Abstract: Model merging techniques aim to integrate the abilities of multiple models into a single model. Most model merging techniques have hyperparameters, and their setting affects the performance of the merged model. Because several existing works show that tuning hyperparameters in model merging can enhance the merging outcome, developing hyperparameter optimization algorithms for model merging is a promising direction. However, its optimization process is computationally expensive, particularly in merging LLMs. In this work, we develop surrogate benchmarks for optimization of the merging hyperparameters to realize algorithm development and performance comparison at low cost. We define two search spaces and collect data samples to construct surrogate models to predict the performance of a merged model from a hyperparameter. We demonstrate that our benchmarks can predict the performance of merged models well and simulate optimization algorithm behaviors.

10.
arXiv (CS.LG) 2026-06-16

Time-Varying Audio Effect Modeling by End-to-End Adversarial Training

arXiv:2512.15313v2 Announce Type: replace-cross Abstract: Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike time-invariant effects, training models on devices with internal modulation typically requires the recording or extraction of control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, without requiring a modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method's ability to capture time-varying dynamics in a fully black-box context.

11.
PLOS Computational Biology 2026-06-18

scMagnifier: Resolving fine-grained cell subtypes via GRN-informed perturbations and consensus clustering

Authors:

by Zhenhui He, Dong Kangning Resolving fine-grained cell subtypes in single-cell RNA sequencing (scRNA-seq) data remains challenging, as their subtle transcriptional differences are often obscured by technical noise and data sparsity. Here, we present scMagnifier, a consensus clustering framework that leverages gene regulatory network (GRN)-informed in silico perturbations to amplify subtle transcriptional differences and uncover latent cell subpopulations. scMagnifier perturbs candidate transcription factors (TFs), propagates perturbation effects through cluster-specific GRNs to simulate post-perturbation expression profiles, and integrates clustering results across multiple perturbations into stable subtype assignments. Additionally, scMagnifier introduces regulatory perturbation consensus UMAP (rpcUMAP), a perturbation-aware visualization that provides clearer separation between cell subtypes and guides the selection of the optimal number of clusters. In both single-batch and multi-batch benchmarks, scMagnifier consistently improves the resolution and accuracy of fine-grained cell type identification. Notably, when integrated with spatial clustering methods such as STAGATE, scMagnifier is compatible with spatial transcriptomics workflows and effectively reveals tumor cell subtypes and their spatial organization in ovarian cancer.

12.
medRxiv (Medicine) 2026-06-16

Risk beliefs, intensive digital information and demand for a new preventative health product in public clinics: Evidence from an experiment in Zimbabwe.

Demand for preventative health care is weak in low-income settings. In a field experiment in a low-income, high-risk setting, we evaluated whether demand for a new bio-medical preventative health product, offered free at public health clinics, responds to digital feedback-based intensive information on health risks and benefits of prevention along with a clinic referral enabling access to the product. In our sample of women aged 18-24 years, we find a large correction in risk beliefs sustained six months after the intervention. Against a background of very low baseline usage, within six months we find a 5.8 percentage point increase in take up of the prevention method, a level of uptake which is very large relative to the control group. Reassuringly, there is no meaningful difference in up-take amongst baseline high- risk and low-risk individuals.

13.
arXiv (CS.AI) 2026-06-12

Automated reproducibility assessments in the social and behavioral sciences using large language models

arXiv:2606.13670v1 Announce Type: new Abstract: Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.

14.
arXiv (math.PR) 2026-06-11

An Information-Theoretic Analysis of Threshold Group Testing

arXiv:2606.11353v1 Announce Type: cross Abstract: We study the Threshold Group Testing (TGT) problem in the noiseless and non-adaptive setting, where the objective is to exactly recover a sparse binary vector from pooled tests, using as few tests as possible. In TGT, each test applied to a subset of items returns a positive outcome if the number of 1's (defective items) in that subset meets or exceeds a specified threshold, and has a negative outcome otherwise. We investigate how the complexity of TGT compares to that of Classical Group Testing (CGT), corresponding to the special case of the threshold equal to one, and analyse the impact of increasing the threshold on the required number of tests. Our main contribution is the derivation of a sharp information-theoretic phase transition at $c_{\mathrm{inf}}^{\mathrm{TGT}}k\log(n/k)$ (non-adaptive) tests for TGT within the constant-column test design. The threshold constant $c_{\mathrm{inf}}^{\mathrm{TGT}}$ is expressed as a function of the prevalence of defectives and the threshold value. Our upper bound is derived under an analytic assumption, and we verify that this assumption is satisfied for a threshold value of 2. The value of $c_{\mathrm{inf}}^{\mathrm{TGT}}$ reveals that TGT on the constant-column design has the same information-theoretic behaviour as CGT in the low-prevalence regime. Yet, strikingly, at higher prevalences, the threshold leads to a significant reduction in the number of tests. On the other hand, we provide evidence that when the asymptotic proportion of defective items is positive, TGT actually becomes strictly harder than CGT (excluding trivial reductions).

15.
arXiv (CS.CL) 2026-06-11

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or individual construction methods suffer from linear scaling limits, thereby hindering scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that conceptualizes verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, they can be automatically fused into a new verifiable environment, enabling recursive composition. RACES is implemented with 300 individual environments and defines a set of composition operators (\textsc{SEQUENTIAL}, \textsc{PARALLEL}, \textsc{SORT}, and \textsc{SELECT}) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and boosts Qwen3-14B performance from 58.8 to 61.1 on six benchmarks, which are unseen during the construction of training environments. Moreover, RACES achieves performance comparable to training on 300 individual environments using only 50 base environments, demonstrating significant efficiency in environment utilization.

16.
arXiv (CS.CL) 2026-06-19

CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models

We present CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. CombEval represents each problem as a typed Cofola specification over entities, combinatorial objects, object dependencies, and constraints, enabling controlled generation of natural-language counting problems with exact solver-verified answers. Unlike static collections, CombEval supports systematic variation of object type, entity scale, constraint count, and reasoning depth. We evaluate 11 LLMs under direct and code-augmented settings and find that models remain brittle on ordered objects, indistinguishable elements, relatively positional constraints, and nested object dependencies. Error analysis further identifies failures in constraint interpretation and counting principles. CombEval provides a diagnostic testbed for studying when and why LLMs fail at combinatorial reasoning. The code and generated benchmark suites are publicly available at \url{https://github.com/YuxuZhou-CN/combination-problem-generation}.

17.
arXiv (quant-ph) 2026-06-12

Quantum metrology via partial quantum error correction

arXiv:2605.08341v2 Announce Type: replace Abstract: We introduce a method for error-corrected quantum metrology where only partial quantum error correction (QEC) is needed to suppress local noise and maintain the probe states' super-standard-quantum-limit (super-SQL) sensing performance. This stands in contrast to the existing QEC-assisted sensing schemes in Phys. Rev. Lett. 112, 080801 (2014) and Phys. Rev. Lett. 112, 150802 (2014), where a probe state is encoded into the logical subspace of a quantum code and error correction involves measurements on all checks of the code. Here, we encode the probe states into superpositions of energetically different states of the underlying quantum code. For our probe states, error correction using a subset of checks is enough to suppress noise both before and after phase imprinting. We analyze the tradeoff in noise suppression. For noise parallel to our phase imprinter of weight $l$, we achieve a suppression of $p^\delta$ where $p$ is the noise strength and $\delta = \lfloor (l+1)/2 \rfloor$. We propose an adaptive imprinter weight increasing strategy to maintain super-SQL performance as we scale up the system. In all our examples, checks and phase imprinters are chosen to be local operators avoiding non-local connectivity.

18.
Nature (Science) 2026-06-16

Daily briefing: How many elementary particles are there?

Authors:

Estimates range from 17 to 995.5. Plus, one man with paralysis is using a brain–computer interface at home and GLP-1 obesity drugs appear to boost testosterone and sperm quality. Estimates range from 17 to 995.5. Plus, one man with paralysis is using a brain–computer interface at home and GLP-1 obesity drugs appear to boost testosterone and sperm quality.

20.
arXiv (quant-ph) 2026-06-17

Coherent Control of an Embedded Bound State Without a Spectral Gap

Authors:

arXiv:2606.17685v1 Announce Type: new Abstract: Bound states in the continuum (BICs) can confine photonic excitations in open systems without conventional cavities or band gaps, making them natural candidates for long-lived quantum storage and single-photon control. Their use is limited, however, by two obstacles: they are dark to incident photons, and they lack spectral-gap protection from the surrounding continuum. We overcome both limitations in a giant atom coupled to a one-dimensional waveguide using two temporal control knobs. Atomic-frequency modulation breaks and restores the destructive-interference condition, enabling deterministic capture and release of mode-matched single photons. Coupling modulation instead preserves the BIC condition while tuning the atomic and photonic weights of the stored state. A key result is that this embedded state can nevertheless be controlled adiabatically despite the absence of a spectral gap, with an intrinsic leakage probability linear in the ramp rate. By separating radiative access from BIC-preserving deformation, the protocol turns a dark BIC into a single-photon memory whose fidelity is set by the intrinsic continuum-induced leakage law, providing a route to embedded-state control in open photonic platforms.

21.
medRxiv (Medicine) 2026-06-23

What Is the Optimal Timing and Frequency of Workload-Matched Postprandial Physical Activity Breaks? A Randomized Controlled Crossover Study of Cardiometabolic and Cognitive Responses During Sedentary Behavior

Purpose Postprandial sedentary behavior is associated with negative health effects and constitutes a large part of daily life in modern society. This study investigated how the timing of physical activity after eating influences glucose levels, cerebral and muscle oxygenation, cognitive performance, and well-being during subsequent sitting. Methods In a four-armed randomized crossover trial, healthy adults consumed four standardized meals separated by 48-hour washout periods. Each meal was followed by 2 hours of sitting combined, in random order, with one of four interventions: (1) sitting only, (2) 15 minutes of moderate intensity cycling immediately after eating, (3) 15 minutes of cycling 20 minutes after eating, or (4) three workload-matched five-minute cycling bouts during sitting. Interstitial glucose (continuous glucose monitoring), cerebral and muscle oxygenation (Functional near infrared spectroscopy), cognitive performance (Stroop test), heart rate, blood pressure, and subjective ratings were assessed every 30 minutes. Data were analyzed using repeated-measures ANOVA. Results Twenty participants (mean age 27.1{+/-}10.3 years, 12 females) completed the study. Cycling immediately after eating reduced mean glucose levels during postprandial sitting, while both 15-minute cycling bouts increased cerebral oxygenation. All active conditions enhanced muscle oxygenation. Heart rate and arousal increased with delayed cycling and active breaks. No effects were observed for blood pressure, cognitive performance, focus, or well-being. Conclusion A short bout of physical activity immediately after eating reduces postprandial hyperglycemia and improves brain oxygenation during sitting, whereas delayed activity and brief breaks increase physiological activation without cognitive or perceptual benefits.

22.
arXiv (CS.CL) 2026-06-19

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

23.
arXiv (quant-ph) 2026-06-16

Study of the triangular-lattice Hubbard model with constrained-path quantum Monte Carlo

arXiv:2603.14808v2 Announce Type: replace-cross Abstract: We benchmark constrained-path Monte Carlo (CPMC) on the triangular-lattice Hubbard model for several fillings and $U$ values and show that symmetry-adapted trial wave functions substantially improve quantitative accuracy. Away from half-filling, simple free-electron-based trials that preserve the ground state symmetry yield energy deviations $\lesssim 1\%$ from exact diagonalization and density matrix renormalization group results. At half-filling, strong frustration in the intermediate to large $U$ regimes necessitates symmetry-projected trials to reach comparable accuracy, where both free-electron and symmetry-broken Hartree-Fock trials incur substantial constraint bias. Since the computational cost of CPMC with symmetry projection scales polynomially with system size, our results motivate its use as a practical route for studying competing ground states in strongly correlated, frustrated systems.

24.
arXiv (CS.CL) 2026-06-15

Simulating Students' Java Programming Errors with Large Language Models

Understanding student errors in the programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (alignment with authentic student mistakes), and examine how these vary by struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

25.
arXiv (CS.AI) 2026-06-17

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

arXiv:2606.17668v1 Announce Type: cross Abstract: Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurate forecast of the outcomes of a MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, our ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics derived molecular datasets. Our results indicate that ASTEROID achieved not only a higher level of accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced computational cost of conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.