Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-16

KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing

Post-hoc context erasing over the KV cache is challenging because a local edit has a global consequence: once a span has been processed, its influence propagates into the cached states of all subsequent tokens. This issue arises naturally in long-context LLM applications, where stale retrieved facts, incorrect tool observations, retracted user preferences, or harmful prompt injections may be identified only after prefill. Exact erasing must then recompute all tokens after the deleted span, making its computational cost depend on suffix length rather than erased-span length. We introduce KVEraser, a learned KV-cache editing method for efficient localized context erasing. Given a processed context and a span to remove, KVEraser replaces only the KV states of the erased interval with learned steering states while reusing the remaining cache unchanged. To learn a transferable erasing mechanism, we build a two-stage training pipeline: generic span-neighbor pre-training teaches the eraser to suppress the influence of the erased span, while task-specific fine-tuning adapts this capability to downstream scenarios. Experiments show that KVEraser nearly matches full recomputation in post-erasure performance on in-domain tasks across 1K–32K context lengths, while its latency increases by only 24% compared with a 17.6x increase for full recomputation. KVEraser also generalizes to unseen long-document QA tasks with harmful factual distractors, achieving the best performance among approximate baselines with a 3–4x speedup over full recomputation.

02.
arXiv (CS.AI) 2026-06-19

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

arXiv:2604.13416v2 Announce Type: replace-cross Abstract: Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

03.
arXiv (quant-ph) 2026-06-19

Accelerated Rydberg electromagnetically induced transparency quantum memory via shortcuts to adiabaticity

arXiv:2603.18399v2 Announce Type: replace Abstract: Electromagnetically induced transparency (EIT) enables coherent light-matter storage, forming the basis of photonic quantum memories that are essential for scalable quantum networks and distributed quantum computing. However, accelerating the storage process violates the adiabatic condition, resulting in the excitation of the lossy intermediate state and a reduction in writing efficiency. We propose and numerically investigate a high-speed, high-fidelity quantum storage scheme by incorporating a shortcut-to-adiabaticity (STA) technique based on counter-diabatic (CD) driving. By introducing a precisely engineered auxiliary field into a conventional EIT system, our protocol significantly shortens the writing time beyond the conventional adiabatic limit while effectively suppressing the transient population of the lossy intermediate state. Furthermore, our scheme demonstrates strong flexibility in pulse design, remaining effective across different temporal profiles of both the control and signal fields. It also exhibits robustness against imperfections in the CD drive. Even with imperfect single-photon writing and non-ideal Rydberg blockade, the scheme retains clear advantages, maintaining high storage performance and overcoming the intrinsic speed-fidelity trade-off of traditional EIT protocols. These features pave the way for fast and robust quantum devices suitable for high-throughput quantum repeaters and advanced quantum information processing.

04.
medRxiv (Medicine) 2026-06-22

Sequential Deep Learning to Predict Non-Central to Central Geographic Atrophy Progression from OCT Imaging

Purpose: To develop and validate a temporal deep learning framework for predicting geographic atrophy (GA) progression across multi-year horizons using longitudinal optical coherence tomography (OCT) sequences. Design: Retrospective longitudinal cohort study. Subjects, Participants, and/or Controls: A total of 91 patients with dry age-related macular degeneration (AMD) were identified from Wake Forest University School of Medicine (2013-2023), yielding 455 OCT volumes. Two prediction cohorts were defined: 32 patients with no GA (NGA) at baseline who subsequently developed GA, and 35 patients whose earliest GA manifestation was non-central GA (NCGA). Non-progressing patients served as negative controls. Methods: OCT B-scan volumes were encoded into visit-level feature representations using three pretrained architectures (ResNet-18, ResNet-50, ViT-B/16). Chronologically ordered visit embeddings, optionally augmented with inter-visit time intervals ({Delta}t), were processed through recurrent neural networks (RNN), long short-term memory networks (LSTM), and Transformer encoders to model longitudinal disease trajectories. Models were trained and evaluated independently for prediction horizons of 2, 3, 4, 5, and 6 years using patient-level stratified splits (80/20). Performance was assessed across five random seeds. Main Outcome Measures: Area under the receiver operating characteristic curve (ROC-AUC), F1-score, and accuracy for predicting two clinically critical transitions: NGA to GA onset and NCGA to central GA (CGA) involvement. Results: For NGA to GA prediction, models achieved ROC-AUC of 0.84-0.94 at 2-4 years and 1.00 at 5-6 years. For NCGA to CGA prediction, Transformer-based models achieved peak AUC of 0.95 at 4 years and 0.96 at 5 years. Longer input sequences (8 visits vs. 4 visits) consistently improved NCGA to CGA performance at extended horizons. Temporal interval encoding improved stability in several LSTM configurations.

05.
arXiv (CS.LG) 2026-06-19

We Need to Rethink Benchmarking in Anomaly Detection

arXiv:2507.15584v2 Announce Type: replace Abstract: Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. In current benchmarks, a trivial algorithm that only checks for extreme values in individual features performs competitively with state-of-the-art deep learning methods, despite failing on simple cases such as anomalies within an annulus of normal points. Moreover, existing benchmarks do not adequately reflect the diversity of anomaly detection applications, making it difficult for practitioners to reliably select algorithms for their applications. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that group applications sharing relevant characteristics, defined through a common taxonomy. Benchmarking within scenarios enables scenario-specific choices for preprocessing, metrics, and model selection, clarifying which advances transfer across similar applications and providing practitioners with reliable guidance for their specific contexts.

06.
arXiv (CS.CV) 2026-06-12

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source–target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

07.
arXiv (CS.LG) 2026-06-17

AIMER: Calibration-Free Task-Agnostic MoE Expert Pruning

arXiv:2603.18492v3 Announce Type: replace Abstract: Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token computation, yet deployment still requires storing the full expert pool, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert-pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, making pruning decisions sensitive to calibration-data variation while introducing substantial preprocessing cost. We propose AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that identifies more distinct experts by capturing the concentration pattern of expert weights, making it well suited for task-agnostic expert pruning. Across 7B to 47B MoE language models with distinct architectures and 16 diverse benchmarks, AIMER consistently delivers stronger capability balance across diverse tasks than existing calibration-free methods. Surprisingly, AIMER also achieves better balance than strong calibration-based expert-pruning baselines calibrated on the widely used task-agnostic C4 corpus, while requiring only 0.22–2.06 seconds to score all experts.

08.
arXiv (CS.CV) 2026-06-15

Improving Lunar Topography with Deep Learning Schrödinger Bridges

Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

09.
arXiv (CS.CL) 2026-06-18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

10.
arXiv (CS.AI) 2026-06-16

Stable Menus of Public Goods: AI-Enabled Progress

Authors:

arXiv:2606.16989v1 Announce Type: cross Abstract: Using an open problem from the EC 2025 paper "Stable Menus of Public Goods" as a testbed, we conduct experiments to understand the effectiveness of different AI-for-EconCS research workflows. Specifically, we study three questions: Does providing human intuition in the prompt help? Does automated multi-turn interaction help? And, does an LLM outperform a first-year PhD student? Regarding the first two questions, we provide evidence for the following workflow suggestions: (1) prompting with human intuition can encourage the LLM to have better "taste", (2) multi-turn workflows help when the pipeline encourages "ambitious" steps. Regarding the third question, using an unpublished manuscript written by the paper's senior authors prior to collaborating with the first-year PhD student, we compare the effectiveness of the LLM with that of the first-year PhD student, and find that the LLM is slightly less effective.

11.
arXiv (CS.LG) 2026-06-16

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

arXiv:2606.17043v1 Announce Type: cross Abstract: When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

12.
arXiv (CS.CL) 2026-06-15

OLaPh: Optimal Language Phonemizer

Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. We introduce OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show OLaPh significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework's performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework's capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual grapheme-to-phoneme conversion (G2P) research.

13.
arXiv (CS.LG) 2026-06-15

LapidaryEngine: Fully Conversational Crystal Generation

arXiv:2606.14215v1 Announce Type: new Abstract: The emergence of Large Language Models (LLMs) has inspired the vision of generating bespoke crystal materials directly from natural-language instructions, enabling users to design materials through intuitive, conversational interaction. Existing text-to-crystal generative models represent important early steps toward this goal, but they suffer from two critical limitations: (i) restricted input formats that require highly structured descriptions (e.g., chemical formulas), and (ii) one-directional generation, where models can map text to crystal but cannot perform the inverse. These limitations prevent fully conversational workflows and hinder alignment with users' inherently ambiguous and evolving desiderata. We address these challenges with LapidaryEngine, the first model to support fully conversational crystal generation. LapidaryEngine accepts free-form natural-language requests and performs iterative refinement and editing in a dialogue-like manner. The key innovation is a pivot representation, a third, intermediate form that enables bidirectional translation between text and crystal structures despite the absence of direct paired datasets. Leveraging this pivot allows robust interpretation of user feedback and precise structural control. We demonstrate LapidaryEngine across diverse tasks, including insulator discovery, stability optimization, compositional modification, and structural editing, showcasing its ability to align generated materials with user intent in an interactive manner.

14.
arXiv (math.PR) 2026-06-11

Mean-field limits for stochastic particle systems on dense graphs

arXiv:2606.11369v1 Announce Type: new Abstract: We study stochastic interacting particle systems whose interaction structure is described by dense weighted directed graphs converging to a graphon. In the thermodynamic limit, we prove a law of large numbers for the empirical measure process and derive a deterministic nonlinear master equation describing the macroscopic evolution. The limiting equation retains the heterogeneous interaction structure of the microscopic system through the limiting graphon, allowing for spatially non-homogeneous behaviors such as localized or community-type interactions.

15.
arXiv (CS.LG) 2026-06-11

A Multi-Modal Sensor Fusion Instrument for Measuring Regional Human Mobility: The Distributed Human Data Engine (DHDE)

arXiv:2603.21639v2 Announce Type: replace-cross Abstract: Accurately estimating human mobility in peripheral regional economies presents a fundamental measurement challenge: physical ground-truth sensors are sparse, behavioral intent signals are heterogeneous, and environmental friction introduces systematic bias into demand inference. We introduce the Distributed Human Data Engine (DHDE), a multi-modal sensor fusion architecture that addresses this challenge by integrating physical instrumentation (Edge-AI cameras), digital intent signals (route search impression metrics), behavioral records (90,350 spending records, 97,719 standardized survey responses), and meteorological data across four geographically distributed nodes in Fukui, Japan. The primary measurement-science contribution is the design, deployment, and cross-node validation of the DHDE as a sparse-sensor compensation instrument: a heterogeneous sensor fusion architecture that anchors non-stationary digital intent signals to concurrent physical ground-truth counts, correcting for systematic bias introduced by meteorological planning friction. The instrument is implemented as an ensemble inference pipeline (Random Forest and Ordinary Least Squares with Newey-West robust inference), calibrated across 397 daily observations and validated by chronological holdout replication across four geographically distinct node types. The primary OLS specification achieved an in-sample explanatory power of R2 = 0.810 and a chronological out-of-sample predictive performance of R2 = 0.683. Results identify an Under-Vibrancy Paradox where macro-regional visitor satisfaction correlates positively with crowd density (Spearman rank correlation rs = +0.150, p = 0.002). We estimate an annual proxy gap of 865,917 intent-implied visits, corresponding to JPY 11.96 billion (USD 72.6 million) in foregone revenue.

16.
arXiv (quant-ph) 2026-06-15

Merged amplitude encoding for Chebyshev quantum Kolmogorov–Arnold networks: trading qubits for circuit executions

arXiv:2603.02818v3 Announce Type: replace Abstract: Quantum Kolmogorov–Arnold networks based on Chebyshev polynomials (CCQKAN) evaluate each edge activation function as a quantum inner product, creating a trade-off between qubit count and the number of circuit executions per forward pass. We introduce merged amplitude encoding, a technique that packs the element-wise products of all $n$ input-edge vectors for a given output node into a single amplitude state, reducing circuit executions by a factor of $n$ at a cost of only 1–2 additional qubits relative to the sequential baseline. The merged and original circuits compute the same mathematical quantity exactly; the open question is whether they remain equally trainable within a gradient-based optimization loop. We address this question through numerical experiments on 10 network configurations under ideal, finite-shot, and noisy simulation conditions, comparing original, parameter-transferred, and independently initialized merged circuits over 16 random seeds. Wilcoxon signed-rank tests show no significant difference between the independently initialized merged circuit and the original ($p > 0.05$ in 28 of 30 comparisons), while parameter transfer yields significantly lower loss under ideal conditions ($p < 0.001$ in 9 of 10 configurations). On 10-class digit classification with the $8\times8$ MNIST dataset using a one-vs-all strategy, original and merged circuits achieve comparable test accuracies of 53–78\% with no significant difference in any configuration. These results provide empirical evidence that merged amplitude encoding preserves trainability under the simulation conditions tested.

17.
arXiv (CS.AI) 2026-06-11

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

arXiv:2606.11780v1 Announce Type: cross Abstract: We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to a $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.

18.
arXiv (quant-ph) 2026-06-16

Conditions for Unitarity in Timeless Quantum Theory

arXiv:2504.01579v3 Announce Type: replace Abstract: Quantum timeless approaches solve the problem of time by recovering the usual unitary evolution of quantum theory relative to a clock in a stationary quantum Universe. For some Hamiltonians of the Universe, such as those including an interaction term with the clock, the dynamics is substantially altered and can be non-unitary. This work derives necessary and sufficient conditions for the relative dynamics to be unitary and finds the general form of the unitary evolution operator. A physical interpretation of these conditions is given in terms of the clock's rate. Unitary dynamics is associated with rates that are constant in time and independent of the clock's internal structure.

19.
arXiv (CS.CL) 2026-06-15

Residual Context Diffusion Language Models

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.

20.
arXiv (CS.CL) 2026-06-15

Multimodal Speaker Identification in Classroom Environments

Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings with LLM-derived semantic context. Using a subset of the EDSI dataset (8 math classrooms, N = 2,801 utterances), we found an acoustic baseline (ECAPA-TDNN) achieved only 39.0% accuracy. By integrating transcript-based "contextual anchoring" into a gradient boosting classifier, our multimodal approach raised student identification to 50.3%. Performance also improved for utterances over 5 seconds, reaching 76.9% accuracy (vs. 64.9% baseline) with a 90.9% Top-3 accuracy. Additionally, the model distinguished teacher vs. student roles with 99.3% accuracy. This approach advances the feasibility of automated feedback systems capable of considering individual student participation, a crucial step for supporting equitable instruction at scale.

21.
arXiv (CS.CL) 2026-06-18

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes \LaTeX-based edits from five computer science preprints, capturing nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.

22.
arXiv (CS.AI) 2026-06-17

Confusion-Aware Transfer Teacher Curriculum Learning Framework: Disentangling Scoring and Pacing Effects

arXiv:2606.17706v1 Announce Type: cross Abstract: Curriculum learning couples two design choices, how samples are scored by difficulty and how harder samples are paced into training, making it difficult to attribute observed gains to either component. We disentangle these factors with two evaluation protocols: stage-wise test subsets that validate scoring functions independently of curriculum training, and a baseline that applies the same pacing schedule to randomly ordered data. Within the Transfer Teacher framework (TTF), we use these protocols to evaluate a confusion-aware difficulty score that considers both correct-class confidence and the probability distribution over incorrect classes. On CIFAR-10 with ResNet-18 and VGG-16, the proposed score produces model-interpretable difficulty rankings that align with human intuition. However, at full data, neither curriculum nor anti-curriculum ordering improves accuracy over standard training, indicating that improving the scoring function alone is insufficient to overcome the known failure modes of curriculum learning in TTF. In contrast, We find that confusion-aware curriculum ordering result in consistent data-efficiency benefits, outperforming random ordering by up to 8.7% points at the 20% data regime, suggesting the potential of TTF as a data-efficient training method.

23.
arXiv (CS.AI) 2026-06-17

TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

arXiv:2606.17660v1 Announce Type: cross Abstract: Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question:can we predict fine-tuning performance before committing to a full training run? We present TUNEAHEAD, a lightweight framework for pre-hoc prediction of fine-tuning performance. TUNEAHEAD encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TUNEAHEAD consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TUNEAHEAD achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within +3/-3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.

24.
arXiv (CS.CL) 2026-06-17

Smarter edits? Post-editing with error highlights and translation suggestions

As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

25.
arXiv (CS.LG) 2026-06-15

Recursively Trained Diffusion Models: Limiting Collapse Distribution and Spectral Characterization

arXiv:2606.13796v1 Announce Type: cross Abstract: Recursive training of generative models on their own outputs can lead to model collapse, a compounding drift away from the true data distribution. Existing theoretical works bound finite-round error accumulation in the context of diffusion models, but two questions remain open:~what distribution does the recursion converge to, and how fast? We answer both, isolating a mechanism distinct from imperfect learning: even with perfect score estimation and exact sampling, the early stopping of the reverse diffusion (required for numerical stability) drives a progressive drift away from the data distribution. We prove that this recursion converges geometrically to a unique limiting distribution, which admits a closed-form characterization as an infinite mixture of increasingly Gaussian-smoothed versions of the data distribution. A Hermite spectral decomposition of this limit reveals that recursive training acts as a low-pass filter: higher-order modes, which encode fine non-Gaussian structure, are attenuated much more strongly than coarse modes. This spectral picture motivates annealed truncation schedules that progressively shrink truncation times across retraining rounds; we prove that any schedule converging to $0$ asymptotically eliminates recursive compounding. Finally, we show our idealized characterization is robust: in the presence of discretization and score estimation errors, the learned distribution remains in a Wasserstein-2 ball around the ideal limit, with mode-dependent contraction rates that contract high-order errors faster than low-order ones. We validate the theory on synthetic Gaussian mixtures and CIFAR-10.