Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

Pre-Training for Simulation-Based Science: A Study on Jet Foundation Model Training Objectives

arXiv:2606.14870v1 Announce Type: cross Abstract: Foundation models (FMs) trained on large datasets and fine-tuned on downstream tasks have emerged as a powerful paradigm in AI for science. Industrial FMs are typically trained using self-supervision with masking due to the lack of labels. In many scientific domains, accurate simulations are plentiful and facilitate large, labeled datasets. This opens up new possibilities for pre-training. We present a systematic comparison of pre-training methods using the OmniLearned High Energy Physics FM framework. We test supervised classification, flow-matching generation, and self-supervised masked particle modeling. All models are pre-trained on the JetClass dataset and fine-tuned on two representative downstream tasks, top jet classification and JetNet conditional generation. Among other observations, for classification tasks, we find that pure classifier pre-training is optimal when downstream labels and model capacity are plentiful, but combining it with self-supervised masked particle modeling (MPM) is uniquely powerful in the low-finetuning label regime. Flow matching-based generative pre-training seems to provide little benefit for downstream classification, and interestingly, for downstream generation, we find that flow matching must be in the pre-training objective to see a significant finetuning advantage, hinting at the orthogonality of classification and generation tasks. That is, for a model to transfer to both generative and classification downstream tasks, it must be pre-trained on both. This study provides a template for controlled scaling analysis of pre-training objectives for foundation models in simulation-based sciences.

02.
PLOS Computational Biology 2026-06-01

Supervised deep learning with gene functional annotation for cell classification

Authors:

by Zhexiao Lin, Yuanyuan Gao, Wei Sun Gene-by-gene differential expression analysis is a widely used supervised approach for interpreting single-cell RNA-sequencing (scRNA-seq) data. However, modern scRNA-seq datasets often contain large numbers of cells, leading to the identification of many differentially expressed genes with extremely small p-values but negligible effect sizes, thus making biological interpretation difficult. To overcome this challenge, we developed Supervised Deep learning with gene functional ANnotation (SDAN), a method that integrates gene functional annotation information (e.g., protein-protein interaction) with gene-expression profiles through a graph neural network. SDAN identifies functionally coherent gene sets that optimally classify cells, and the resulting cell-level classification scores can be aggregated to make individual-level predictions. We evaluated SDAN alongside three representative existing methods in three real-data applications aimed at identifying gene sets associated with severe COVID-19, dementia, and cancer immunotherapy response. Across all applications, SDAN consistently outperformed the alternative approaches by achieving two objectives simultaneously: accurate outcome classification and clear assignment of genes to functionally related gene sets.

03.
arXiv (CS.AI) 2026-06-12

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

arXiv:2606.13220v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.

04.
arXiv (CS.LG) 2026-06-17

Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

arXiv:2602.17894v2 Announce Type: replace-cross Abstract: Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities - for example, health markers, demographics, or political affiliations - and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g. attempting to "match" the target distribution) or relying on standard estimators (e.g. sample mean) can be highly suboptimal. Instead, we develop a sampling plan which maximizes the effective sample size - the total sample size divided by $D_{\chi^2}(q\mid\mid\overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems when minimizing the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.

05.
arXiv (CS.LG) 2026-06-12

Reliability of Probabilistic Emulation of Physical Systems

arXiv:2606.12997v1 Announce Type: new Abstract: Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.

06.
arXiv (CS.LG) 2026-06-11

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

arXiv:2606.12299v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

07.
arXiv (quant-ph) 2026-06-16

Theory of the correlated quantum Zeno effect in a monitored qubit dimer

arXiv:2503.22846v2 Announce Type: replace Abstract: We theoretically investigate the stochastic dynamics of two qubits subject to one- and two-site correlated continuous weak measurements. When measurements dominate over the local unitary evolution, the system's dynamics is constrained and part of the physical Hilbert space becomes inaccessible: a typical signature of the Quantum Zeno (QZ) effect. In this work, we show how the competition between these two measurement processes give rise to two distinct QZ regimes, we dubbed standard and correlated, characterised by a different topology of the allowed region of the physical Hilbert space being a simply and non-simply connected domain, respectively. We develop a theory based on a stochastic Gutzwiller ansatz for the wavefunction that is able to capture the structure of the phase diagram. Finally we show how the two QZ regimes are intimately connected to the topology of the flow of the underlying non-Hermitian Hamiltonian governing the no-click evolution.

08.
arXiv (CS.CL) 2026-06-15

Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression

Retrieval-augmented generation (RAG) often suffers from long and noisy retrieved contexts. Existing context compression methods typically rely on heuristic relevance estimation or supervised compression models rather than on how LLMs utilize retrieved context during inference. We propose Sentinel, a lightweight sentence-level compression framework that decodes inference-time contextual utilization behaviors from head-wise attention patterns of frozen LLMs. To ground supervision in retrieval-dependent answering behavior, Sentinel trains a lightweight probe using QA examples where the model succeeds only when retrieved context is available. Sentinel performs compression using only a single non-autoregressive forward pass without dedicated compression training or autoregressive scoring. Empirically, we find that effective contextual utilization signals remain accessible even in compact proxy models. On LongBench, Sentinel with a 0.5B proxy model achieves up to 5$\times$ compression while attaining question-answering performance competitive with compression methods built on 7B-scale models. Despite being trained only on English QA data, Sentinel also generalizes effectively to Chinese and out-of-domain settings.

09.
arXiv (CS.CV) 2026-06-18

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.

10.
arXiv (CS.CV) 2026-06-18

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

11.
medRxiv (Medicine) 2026-06-11

Incremental costs of transitioning from four to eight WHO-recommended antenatal care visits in Uganda: A costing analysis from a societal perspective

Background In 2016, the World Health Organization revised its antenatal care (ANC) recommendation from four to eight visits. For low- and middle-income countries like Uganda, where achieving even four visits remains a challenge, this transition has significant cost implications for both the health system and households. This study estimated the incremental costs of adopting the eight-visit model from a societal perspective. Methods The study was conducted in six government health facilities in southwestern Uganda. A micro-costing approach estimated health facility costs (personnel, equipment, consumables, and overhead). Costs incurred at patients end (transport, ultrasound, medical expenses, and time) were collected from 785 women using a questionnaire, with all costs in 2025 USD. Results For an average of 4.3 visits, total cost per woman was $100.1: facility costs $43.7 (43.7%), and patient costs $56.4 (56.3%). Transitioning to eight visits would increase total cost by $57.8 (57.8%), of which $36.4 (63.0%) would fall on households, equivalent to 68.8% of average monthly household income. Total costs would rise by 55.4% ($115.5 to $179.5) at Health Center IVs and 64.3% ($102.3 to $168.1) at Health Center IIIs, with facility costs up 43.4% and 62.9% and patient costs up 61.2% and 65.7%, respectively. Conclusion Transitioning to eight ANC visits would impose a large financial burden on households, with the incremental patient cost equivalent to more than two-thirds of average monthly household income. Equitable implementation requires improving availability of medicines and diagnostics, subsidizing transport, exploring telemedicine or community-based models, and improving efficiency at lower-tier health centers.

12.
arXiv (quant-ph) 2026-06-15

Merged amplitude encoding for Chebyshev quantum Kolmogorov–Arnold networks: trading qubits for circuit executions

arXiv:2603.02818v3 Announce Type: replace Abstract: Quantum Kolmogorov–Arnold networks based on Chebyshev polynomials (CCQKAN) evaluate each edge activation function as a quantum inner product, creating a trade-off between qubit count and the number of circuit executions per forward pass. We introduce merged amplitude encoding, a technique that packs the element-wise products of all $n$ input-edge vectors for a given output node into a single amplitude state, reducing circuit executions by a factor of $n$ at a cost of only 1–2 additional qubits relative to the sequential baseline. The merged and original circuits compute the same mathematical quantity exactly; the open question is whether they remain equally trainable within a gradient-based optimization loop. We address this question through numerical experiments on 10 network configurations under ideal, finite-shot, and noisy simulation conditions, comparing original, parameter-transferred, and independently initialized merged circuits over 16 random seeds. Wilcoxon signed-rank tests show no significant difference between the independently initialized merged circuit and the original ($p > 0.05$ in 28 of 30 comparisons), while parameter transfer yields significantly lower loss under ideal conditions ($p < 0.001$ in 9 of 10 configurations). On 10-class digit classification with the $8\times8$ MNIST dataset using a one-vs-all strategy, original and merged circuits achieve comparable test accuracies of 53–78\% with no significant difference in any configuration. These results provide empirical evidence that merged amplitude encoding preserves trainability under the simulation conditions tested.

13.
arXiv (CS.CV) 2026-06-17

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency

14.
bioRxiv (Bioinfo) 2026-06-20

Evaluation of Trypanosoma brucei Phosphofructokinase Allosteric Inhibition: An In-Silico Study

Human African trypanosomiasis, caused by a protozoan parasite Trypanosoma brucei, is a neglected tropical disease for which well-tolerated, conveniently administered, and highly efficacious medicines are still missing. Previously, T. brucei Phosphofructokinase was targeted by small-molecule inhibitor development efforts. This approach has shown promise both in vitro and in vivo. In this study, we have used these wet-lab results, evaluated the compounds already characterised by Molecular Dynamics simulations, found relationships between in silico and wet-lab data and used these observations to evaluate compounds that we selected through several different approaches of virtual screens. We observed that inhibitor-ATP interactions are highly predictive of the inhibitory activity. Several compounds selected through virtual screens have outperformed previously characterised compounds.

15.
arXiv (CS.CL) 2026-06-11

Gumbel-BEARD: Automatic Layer Selection for Self-Supervised Adaptation of Whisper in Low-Resource Domains

Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.

16.
arXiv (quant-ph) 2026-06-15

Nanostructure modelling with early fault tolerant quantum computers

arXiv:2606.06442v2 Announce Type: replace Abstract: Semiconductor nanostructures are central to many developing technologies. Notably, double quantum dots are especially important for semiconductor spin-qubit architectures, quantum sensing applications, and quantum-dot solar cells. Accurate modelling is highly desirable but conventional methods can struggle when dynamics involve more than two interacting electrons. In this work, we present a quantum simulation framework capable of addressing multi-electron double quantum dots. We adopt an efficiently scaling 1$^st$ quantised representation of the system and develop algorithms based on both Trotterisation and Qubitisation. Incorporating insights from classical simulations enables us to produce resource estimates that are more realistic than those obtained from theoretical error bounds. Using a standard surface code model with physical noise at $10^{-3}$, our results indicate that the ground-state energy of four electrons in a double quantum dot can be estimated in approximately 22 hours using 226k physical qubits, or an eight-electron system in 3.3 days with 314k qubits (with runtimes falling dramatically when more qubits are available). We anticipate that incorporating recent advances in surface code architectures may reduce these costs significantly further. Our results suggest that early fault-tolerant quantum computers may become valuable tools for designing mature-era quantum technologies.

17.
arXiv (CS.CL) 2026-06-15

Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems. In this work, we introduce Dialogue SWE-Bench, an automatic benchmark dataset for evaluating the ability of coding agents to resolve real-world software engineering problems through dialogue with a user. We design a novel, persona-grounded user simulator to support our task evaluation, and augment our task evaluation with automatic evaluations of dialogue quality. We also propose a new schema-guided agent, aimed at improving the dialogue capabilities of off-the-shelf coding agents, which improves over strong baselines by 3-14%. Our results indicate that better coding models do not always correspond to better dialogue models, suggesting that dialogue capability is a distinct and currently understudied dimension of coding agent performance.

18.
arXiv (CS.LG) 2026-06-19

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

arXiv:2606.19491v1 Announce Type: new Abstract: Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $\gamma^{-1}/\|\gamma^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on $14$ pretrained transformers ($9$ LayerNorm, $5$ RMSNorm; $160$M-$35$B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on $9/9$ LayerNorm models, and is correctly absent on $5/5$ RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ${\sim}10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on $13/14$ transformers measured on their own input distribution, the one exception (Gemma$4$-$31$B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.

19.
arXiv (CS.CL) 2026-06-17

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.

20.
arXiv (CS.AI) 2026-06-18

Veriphi: Attack-Guided Neural Network Verification with Dataset-Dependent Training Methods

arXiv:2606.18454v1 Announce Type: cross Abstract: We present Veriphi, a GPU-accelerated neural network verification system that combines fast adversarial attacks with formal bound certification using alpha,beta-CROWN methods. Through systematic experiments on MNIST and CIFAR-10 using three training methodologies (standard, adversarial, certified), we demonstrate that training method effectiveness is fundamentally dataset-dependent. Interval Bound Propagation (IBP) achieves 78% certified accuracy on simple MNIST (784 dimensions) but provides negligible certification performance on the more complex CIFAR-10 dataset, where PGD adversarial training dominates with 94% certification at small perturbations. We achieve 5x verification speedup through attack-guided falsification and scale our approach to production-size models (105.8M parameters) for real-world aerospace logistics optimization. Our results challenge the assumption that certified training universally outperforms adversarial training, showing context matters critically for verification strategy selection.

21.
arXiv (CS.LG) 2026-06-19

Utility-Aware DRL-Based TXOP Adaptation for NR-U and Wi-Fi Coexistence Networks

arXiv:2605.00457v4 Announce Type: replace-cross Abstract: The coexistence of NR-U and Wi-Fi in the unlicensed spectrum introduces a challenging resource management problem, where heterogeneous channel access mechanisms can lead to unbalanced spectrum utilization and severe Wi-Fi performance degradation. To address this issue, this paper proposes a utility-aware deep reinforcement learning (DRL) framework for adaptive transmission opportunity (TXOP) control in NR-U/Wi-Fi coexistence networks. The coexistence process is formulated as a Markov decision process (MDP), in which the NR-U TXOP duration is treated as a controllable variable for regulating post-access channel occupancy. A deep Q-network (DQN) is then employed to learn adaptive TXOP control policies through online interaction with the coexistence environment. A key feature of the proposed framework is the integration of a configurable reward and criterion design, which enables explicit control of the fairness-efficiency-utility tradeoff. Three operating policies are developed, namely absolute fairness, moderate fairness, and utility-oriented moderate fairness, to characterize different coexistence operating points. Simulation results show that the proposed framework achieves a Jain fairness index above 0.9 under strict fairness control. Compared with the absolute fairness policy, the moderate fairness policy improves aggregate throughput by 68.22%, while the utility-oriented policy achieves a 177.6% improvement under the adopted utility evaluation metric. These results demonstrate that the proposed utility-aware DRL framework provides an effective and flexible solution for adaptive TXOP control and tradeoff management in heterogeneous unlicensed coexistence networks.

22.
arXiv (CS.LG) 2026-06-18

Mixed-Precision Communication-Avoiding SGD for Generalized Linear Models on GPUs

arXiv:2606.18463v1 Announce Type: cross Abstract: Distributed stochastic gradient descent (SGD) is limited by communication rather than computation, since each iteration requires an AllReduce across processes. Communication-avoiding SGD (CA-SGD) amortizes communication over $s$ iterations by replacing $s$ consecutive AllReduces with a single AllReduce of an $sb\times sb$ Gram matrix, trading more computation and bandwidth for fewer synchronization points. Modern GPUs with matrix hardware and reduced-precision formats offset this by accelerating the Gram GEMM and shrinking BF16 traffic. We study mixed-precision CA-SGD for generalized linear models on NVIDIA GPUs. Our finite-precision analysis decomposes the local rounding error of one CA-SGD outer iteration into nine independent precision choices, depending on the hardware only through its low-precision unit roundoffs, so the resulting recipes transfer in principle across GPU generations. The recipe stores the input matrix and margin vector in low precision, computes the Gram matrix from low-precision inputs with high-precision accumulation, communicates it in high precision, and performs the inner recurrence and weight updates in high precision. On NERSC Perlmutter A100 GPUs, mixed-precision CA-SGD matches FP32 SGD loss within $0.5\%$ on logistic, linear, and Poisson problems and reaches $5.1$–$6.8\times$ speedup over FP32 SGD on epsilon, SUSY, HIGGS, synth, and Poisson-synth. Our software is available at https://doi.org/10.5281/zenodo.20448273

23.
arXiv (CS.CL) 2026-06-12

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively – suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

24.
arXiv (CS.CV) 2026-06-19

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

25.
arXiv (CS.CL) 2026-06-12

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

Authors:

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.