Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-24

Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models

The evaluation of cultural grounding context becomes complex when multiple cultures convey the same moral lesson. This challenge is particularly relevant to large language models (LLMs), which produce narratives across a wide range of languages and cultural contexts. However, it remains uncertain whether these models preserve culturally grounded meaning when equivalent moral lessons are conveyed through distinct cultural forms. This study introduces a multilingual evaluation narrative framework that integrates a cross-linguistic collection of 414 proverbs spanning 15 languages and uses four LLMs to generate 13k narratives. By employing semantically equivalent proverbs as culturally grounded prompts, the analysis assesses whether models preserve meaning across languages, how cross-lingual conditioning influences narrative realization, and whether different model families converge on similar interpretations. Results indicate that cross-lingual prompting largely preserves proverb-level semantic meaning while systematically redistributing agency, social positioning, and narrative structure. Additionally, strong inter-model convergence is observed in both monolingual and cross-lingual settings, suggesting that multilingual LLMs rely on shared semantic abstractions despite architectural and linguistic differences. These findings shed light on the need for more comprehensive evaluations of cultural grounding. Relying exclusively on semantic similarity in multilingual narrative assessments may overestimate cultural preservation by neglecting culturally meaningful variations in narrative expression.

02.
arXiv (math.PR) 2026-06-12

Counterintuitive problems in discrete probability

arXiv:2606.07516v2 Announce Type: replace Abstract: This manuscript contains a collection of counterintuitive problems in discrete probability, together with detailed solutions. The dataset was constructed as part of a broader research project investigating the capabilities of the latest-generation Large Language Models (LLMs) in solving discrete probability problems, in order to assess whether LLMs tend to make systematic reasoning errors associated with known cognitive biases. The problems collected here are specifically designed to challenge heuristic reasoning strategies that often lead to intuitively appealing but mathematically incorrect conclusions. The dataset combines several types of problems. Some are adapted from classical probabilistic paradoxes and cognitive-bias literature, while others originate from recreational mathematics sources or were developed by ourselves following similar principles. The primary purpose of this document is to provide a transparent and publicly accessible reference for the problems used in our experimental evaluation of language models, as well as providing detailed human-made solutions. At the same time, we believe that this collection may also prove useful for future research on probabilistic reasoning, cognitive biases, and the evaluation of reasoning capabilities in artificial intelligence systems.

03.
arXiv (CS.LG) 2026-06-11

A theory of learning data statistics in diffusion models, from easy to hard

arXiv:2603.12901v2 Announce Type: replace-cross Abstract: While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

04.
arXiv (CS.CL) 2026-06-16

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

Mental health industry faces growing concerns regarding hate speech directed at children's on social media, as exposure to such content can contribute to adverse psychological outcomes during critical stages of development. Current hate speech datasets and detection systems provide limited support for child-focused applications because they are primarily designed for adults and lack dedicated representations of age-specific characteristics associated with hate speech directed at children's. To address this gap, we introduce ChildGuard, a large-scale English dataset for child-targeted hate speech containing 351,877 annotated instances collected from X (formerly Twitter), Reddit, and YouTube. The dataset covers three age groups such as younger children's (under 11), pre-teens (11-12), and teens (13-17). ChildGuard contains two subsets such as a contextual subset (157K) and a lexical subset (194K). Evaluation using recent transformer-based models and LLMs achieves a best Macro-F1 of 82.07%, decreasing to 79.41%, 79.24%, 76.04%, and 74.88% on younger children's, contextual, implicit hate, and cross-subset settings, respectively.

05.
arXiv (quant-ph) 2026-06-24

Tensor-network approach to quantum optical state evolution beyond the Fock basis

arXiv:2511.15295v4 Announce Type: replace Abstract: Understanding the quantum evolution of light in nonlinear media is central to the development of next-generation quantum technologies. Yet modeling these processes remains computationally demanding, as the required resources grow rapidly with photon number and phase-space resolution. Here we introduce a tensor-network approach that efficiently captures the dynamics of nonlinear optical systems in a continuous-variable representation. Using the matrix product state (MPS) formalism, both quantum states and operators are encoded in a highly compressed form, enabling direct numerical integration of the Schrödinger equation. We demonstrate the method by simulating degenerate spontaneous parametric down-conversion (SPDC) and show that it accurately reproduces established theoretical benchmarks - energy conservation, pump depletion, and quadrature squeezing - even in regimes where conventional Fock-basis simulations become infeasible. For high-intensity pump fields ($\alpha = 100$), the MPS representation achieves compression ratios above $3\cdot 10^3$ while preserving physical fidelity. This framework opens a scalable route to modeling multimode quantum light and nonlinear optical phenomena beyond the reach of traditional methods.

06.
medRxiv (Medicine) 2026-06-22

Development and validation of a risk prediction algorithm to estimate all-cause mortality among community-dwelling Canadians: the Mortality Population Risk Tool (MPoRT)

BACKGROUND: The risk of all-cause mortality can inform decision-making for chronic disease prevention. We developed a predictive algorithm to estimate the 5-year risk of death among community-dwelling adults. METHODS: We derived and validated the Mortality Population Risk Tool (MPoRT) using data from population health surveys in Canada (the Canadian Community Health Survey) and the United States (the National Health Interview Survey), survey years 2001 to 2011, linked to vital statistics. The outcome was death within five years of the survey response. The algorithm was developed using data from Ontario respondents using a Cox proportional hazards model, then modified and re-estimated to allow cross-national assessment in Canada and the United States. Twenty-three prespecified predictors were assessed: seven sociodemographic, six behavioural, and ten general health and chronic disease. RESULTS: 527,369 respondents aged 20 to 105 years were included in the Canadian and United States development and validation cohorts, with 43,758 deaths during 3.68 million person-years follow-up. The final sex-specific MPoRT algorithms each contained 21 variables, showing strong discrimination (C-statistic: females 0.874 [0.871–0.877]; males 0.867 [0.865–0.871]) and good calibration overall and in 246 of 247 subgroups. Discrimination was modestly attenuated (0.01 decrease in C-statistic) in cross-national validation between Canada and the United States, with good calibration across all 71 subgroups. INTERPRETATION: MPoRT accurately discriminated all-cause mortality using only self-reported data, enabling broad application without clinical measures. While validation outside North America is needed to confirm broader applicability, MPoRT is designed for straightforward recalibration using routinely available national mortality data. This supports targeted chronic disease prevention strategies at both the population and individual levels, though the limitations inherent to self-reported predictors should be considered when interpreting predictions.

07.
arXiv (CS.CV) 2026-06-16

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B.

08.
arXiv (CS.CL) 2026-06-19

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding – the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

09.
arXiv (CS.CL) 2026-06-18

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150\% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.

10.
arXiv (quant-ph) 2026-06-11

Permutation-Invariant N-body gates via Tavis-Cummings Hamiltonian

arXiv:2506.03453v3 Announce Type: replace Abstract: Global control provides a promising route to implementing multi-qubit gates without individual qubit addressing. This is especially appealing for permutation-invariant (PI) gates, whose symmetry is often broken when they are compiled into individually addressed one- and two-qubit gates. Important examples include SWAP, $\sqrt{iSWAP}$, and the n-qubit controlled-Z gate, which is equivalent, up to two single-qubit Hadamard gates, to the multi-qubit Toffoli gate. Motivated by this global-control perspective, we show that all PI unitaries on an arbitrary number of qubits can be realized using the Tavis-Cummings (TC) interaction, the multi-qubit version of the Jaynes-Cummings interaction, together with global uniform z and x fields. Here, the $n$ qubits are identically coupled to a single bosonic mode (oscillator), which is initialized in and returned to its vacuum state. A corollary is that all PI states, including GHZ and Dicke states, can be prepared using the same global control. For the case n=2 qubits, which is particularly important in quantum computing, we also find explicit pulse sequences for implementing all PI qubit unitaries that conserve angular momentum in the z direction, using only the TC interaction and global z fields. This includes controlled-Z, SWAP, and $\sqrt{iSWAP}$.

11.
arXiv (CS.CV) 2026-06-11

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.

12.
arXiv (quant-ph) 2026-06-16

A Gauge-Covariant Geometric Framework for Non-Hermitian Quantum Systems

arXiv:2606.15922v1 Announce Type: new Abstract: We develop a comprehensive, gauge-covariant geometric framework for non-Hermitian quantum systems in the quasi-Hermitian regime, that is, the region of parameter space where the non-Hermitian Hamiltonian admits a real spectrum and a positive-definite metric operator. We build this framework by elevating the Dyson map to a central geometric object. This map is the transformation that converts a non-Hermitian Hamiltonian into an equivalent Hermitian one. From it we construct the Dyson connection and decompose it into Hermitian and anti-Hermitian parts, identified respectively as {\it stretching } and {\it rotation } components. This decomposition cleanly separates the genuine physical metric deformations from the unitary gauge redundancies. Working with manifestly gauge-covariant states, we then derive the complex non-Hermitian Berry phase and the quantum geometric tensor (QGT), and show that the non-Hermitian geometric curvature originates from the non-commutativity of the stretching components at the operator level. We further analyse the geometric singularities near an exceptional point (EP) and uncover a distinct hierarchy of divergences. For a general two-level non-Hermitian model, the quantum metric tensor (QMT) exhibits a leading-order divergence $\sim |\epsilon_\mu|^{-2}$, while the Berry curvature shows a weaker, subleading divergence $\sim |\epsilon_\mu|^{-3/2}$, with $\epsilon_\mu$ denoting the parameter displacement from the EP along an individual parameter axis $\mu$. Finally, we examine physical realizations of this model, including the non-Hermitian Su–Schrieffer–Heeger (SSH) and Hatano–Nelson (HN) models, where exact analytical results confirm the predicted critical scaling laws and illustrate the metric-deformation-driven non-Hermitian geometries.

13.
arXiv (CS.CV) 2026-06-11

Global Geometry Is Not Enough for Vision Representations

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input–output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input–output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

14.
arXiv (math.PR) 2026-06-18

Ergodic Properties of Non-Linear Density-Dependent Perturbations of the Ornstein-Uhlenbeck Process

arXiv:2606.18877v1 Announce Type: new Abstract: The present paper considers McKean-Vlasov SDEs with density-dependent spatially unbounded drift, which may be viewed as a non-linear density-dependent perturbation of the Ornstein-Uhlenbeck process. We develop a comprehensive theoretical framework for this class of equations. First, we establish strong well-posedness and derive optimal Gaussian pointwise bounds for both the solution density and its gradient. Then we derive an explicit expression for the stationary density and show that it satisfies logarithmic Sobolev and Poincaré inequalities. Finally, we prove exponential convergence to equilibrium in the \(\chi^2\)-metric.

15.
arXiv (CS.AI) 2026-06-19

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

arXiv:2606.20122v1 Announce Type: new Abstract: Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

16.
arXiv (CS.AI) 2026-06-11

Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment

arXiv:2606.11672v1 Announce Type: cross Abstract: This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose GenAI Large Language Model- (GenAI-) based agent when powered by three different Ollama-hosted general-purpose open source models. We assess each agent's performance using precision, recall, false positive count, and a calculated composite score based upon the interplay of the captured metrics, against the baseline performance of an existing, vetted Static Application Security Testing (SAST) tool, Bandit. Our findings refute the notion that a modern open-source GenAI LLM-based agent is currently suitable for the specialized task of SAST scanning under realistic conditions.

17.
arXiv (CS.AI) 2026-06-12

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

arXiv:2606.12425v1 Announce Type: cross Abstract: Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

18.
arXiv (quant-ph) 2026-06-12

Quantized time in quantum walks under weak rank-K measurements

arXiv:2606.13552v1 Announce Type: new Abstract: Measurements can be used to monitor the evolution of quantum systems and may lead to a universally quantized time statistics. It is known that the mean return time is quantized for strong and indirect monitoring through the winding number of the return amplitude in a one-dimensional space. Here we discuss that under multi-channel strong or indirect monitoring, where the latter is achieved through ancilla coupling, the mean return time of a quantum walk in the projected subspace is also quantized. This reflects a universal time quantization for a higher dimensional evolution.

19.
arXiv (CS.LG) 2026-06-16

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

arXiv:2606.16384v1 Announce Type: new Abstract: Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95\% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300Mbps, matching the wall-clock convergence speed of centralized models on 100Gbps interconnects.

20.
medRxiv (Medicine) 2026-06-12

Cancer care disruption during the COVID-19 pandemic in Ontario, Canada: A sequential mixed-methods study

Introduction The COVID-19 pandemic profoundly disrupted healthcare delivery worldwide, with cancer care among the most affected services. Prior studies documented delays in referrals, reduced specialist access, and increased provider burden. However, the extent to which these experiences were reflected at the system level remains unclear. Objective To document cancer care experiences and examine whether these experiences were reflected in population-level health system indicators across Ontario, Canada. Methods We used an exploratory sequential mixed-methods design. Qualitative data were collected through focus groups and semi-structured interviews with 32 participants, including patients with cancer (n=8), caregivers (n=5), healthcare providers (n=14), and decision-makers (n=5) across two hospital settings in Ontario, Canada. Emergent themes informed the development of quantitative indicators. We then conducted a retrospective population-based analysis of linked administrative health databases for cancer patients in Ontario (n=87,786) to assess the prevalence of identified themes. Results Four themes emerged: (I) delays in diagnosis and screening; (II) disrupted access to primary care; (III) barriers to specialist and mental health services; and (IV) fragmented care for patients with multimorbidity. Quantitative findings corroborated major themes. Screening rates declined for cervical (64.8% to 57.5%) and breast cancer (64.5% to 57.2%). While in-person primary care shifted almost entirely to virtual modalities (8.5% to 95.4%), overall visit volumes remained stable. Specialist care showed uneven patterns, with increased oncology visits but declines in cardiology and mental health services. Patients with multiple comorbidities experienced the largest reductions in non-oncology specialist care. Conclusion The pandemic disrupted key components of cancer care, particularly screening, access to certain specialist services, and care for patients with complex needs. Integrating qualitative and quantitative evidence highlights areas of system vulnerability and underscores the need for coordinated, resilient cancer care capable of maintaining essential services during future crises.

21.
arXiv (CS.CV) 2026-06-11

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

In cloth-changing person re-identification (CCReID), it is critical to learn clothes-invariant feature, which can provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently limits existing ReID methods from effectively extracting these clothing-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P (Y|X) to causal intervention learning P (Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.

22.
arXiv (CS.CL) 2026-06-24

Best Preprocessing Techniques for Sentiment Analysis

Sentiment analysis in Twitter datasets is important because it enables monitoring public opinion on products and analysis of political and social movements. One critical step is preprocessing: the automated processing of text for machine learning algorithms. Preprocessing plays a critical role in reducing noise and improving efficiency. However, little research has systematically examined the order in which preprocessing techniques are implemented. We find that, when accounting for order, spelling correction is the least impactful preprocessing technique, whereas tokenisation is the most impactful. Stemming and stop-word removal are interchangeable, and it is better to remove stop words without removing negation. The best order for applying the preprocessing techniques was tokenisation, text cleaning, stemming, and then stopword removal. Our results provide a systematic approach for practitioners to deploy preprocessing to improve model output without the costly preprocessing exploratory phase.

23.
arXiv (CS.CL) 2026-06-15

Simulating Students' Java Programming Errors with Large Language Models

Understanding student errors in the programming is a cornerstone of programming education, yet obtaining a representative set of student errors for any newly designed task remains slow and costly, since authentic submissions only accumulate after extensive classroom deployment. This paper explores whether large language models (LLMs) can serve as scalable proxies for students by simulating realistic logical errors in code submissions. Using the CodeWorkout dataset of 74,000+ unique student Java submissions across 37 problems, we evaluate five LLMs under three mainstream prompting strategies: Input-Output (IO), Chain-of-Thought (CoT), and iterative Self-Refine. We assess performance along two key dimensions: diversity (the range of distinct error patterns) and alignment (alignment with authentic student mistakes), and examine how these vary by struggling level of programming tasks. Our quantitative findings reveal that while all models generate diverse errors, their alignment to human submissions diverges: Claude Sonnet 4 achieves the most balanced performance. In addition, we conducted a blinded expert annotation study (N = 401) comparing synthetic and authentic errors. This qualitative analysis confirms that the generated errors are functionally indistinguishable from authentic student errors. Moreover, higher-struggling-level problems elicit more diverse but less student-like errors. These results highlight trade-offs in using LLMs to simulate human learners and suggest design considerations for integrating synthetic errors into teachable agents, intelligent tutoring systems, and large-scale learning analytics.

24.
PLOS Medicine 2026-05-20

Prescribed hormonal contraceptive use trends in the Estonian Biobank: A longitudinal observational study

by Jelisaveta Džigurski, Märt Möls, Kristi Läll, Hannah Currant, Mall Eltermaa, Estonian Biobank Research Team , Reedik Mägi, Lili Milani, Triin Laisk Background Hormonal contraceptives (HCs) are widely used and have well-documented population-level statistics. Previous studies with short follow-ups have focussed on individual HC use and side effects. However, the same aspects over longer periods, HC formulation switching, and the impact of genetic factors on HC side effects remain understudied due to the limited availability of suitable datasets. We investigated whether the Estonian Biobank (EstBB) is suitable for studying genetic risk for HC side effects. Methods and findings This is a longitudinal descriptive study combining prescribed HC purchase data collected from 2004 to 2022 with genetic and health data from 73,071 female EstBB HC users aged 15–55 at the time of purchase. HC usage was defined by the Anatomical Therapeutic Chemical (ATC) codes G02B, G03A, and G03HB01. Methods included calculating age-stratified annual user prevalence, inferring usage periods from purchases, assessing formulation switching, identifying the International Classification of Diseases, Tenth Revision (ICD-10)-based side effect-related diagnoses and thromboembolism risk factors, and assessing carrier status for Factor V Leiden (FVL, rs6025) and prothrombin G20210A (PTM, rs1799963) genetic variants as proof-of-concept. Over 19 years, 20 HC formulations with five administration routes (oral pills, transdermal patches, vaginal rings, subdermal implants, intrauterine devices) were used. In the EstBB, combined HCs were the most commonly used among users aged 15–29, while progestin-only HC use increased with age and over time, comparable to the Estonian population. Overall, 64.2% (n = 46,920) of users switched formulations at least once, with 17.7% (n = 12,929) being rapid switchers. Side effect-related diagnoses were observed in 23.1% (n = 2,982) of rapid switchers, with excessive/irregular menstrual bleeding being the most common. Genetic analysis revealed that 5.3% (n = 3,886) of users carried at least one variant previously associated with increased thrombosis risk (3.5% (n = 2,556) carried FVL only, 1.8% (n = 1,276) PTM only, and 0.07% (n = 54) both). Carriers of thrombosis-associated variants had a significantly higher percentage of thrombosis (6.5%) than non-carriers (4.2%; OR = 1.61, 95% CI [1.40, 1.84], p 

25.
arXiv (CS.LG) 2026-06-11

Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

arXiv:2606.11833v1 Announce Type: new Abstract: Flow matching and diffusion models enable conditional generation across domains ranging from images to proteins, with recent extensions to out-of-distribution contexts. Yet generative models of neural time series have largely remained restricted to categorical conditioning, precluding compositional and zero-shot generalization. In this work, we propose a per-timestep conditioned diffusion transformer for generating realistic fMRI brain dynamics during unseen cognitive tasks by injecting both compositional language and optional spatial priors in-context. Such zero-shot generation could enable counterfactual neuroscience by supporting in-silico design and evaluation of novel cognitive experiments before empirical validation. Leveraging this model, we evaluate across hundreds of held-out task conditions and characterize predictive performance in relation to the training manifold. From language alone, the model recovers region-specific recruitment across tasks and held-out spatial activation patterns. Spatial priors, when available, complement the text pathway by anchoring generation in regions of task space where language alone degrades, while retaining the compositional structure needed for counterfactual task specification. To our knowledge this is the first generative model of whole-cortex fMRI dynamics for unseen cognitive tasks, advancing counterfactual neuroscience and data-driven experimental design.