Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-19

A Survey of On-Policy Distillation for Large Language Models

As Large Language Models continue to grow in both capability and cost, transferring frontier capabilities into smaller, deployable students has become an important engineering problem, and knowledge distillation remains a common technique for this transfer. The prevailing recipe in industrial pipelines, static imitation of teacher-generated text, carries a structural weakness that grows more severe as tasks become longer and more reasoning-intensive. Because the student is trained on flawless teacher prefixes but generates its own at inference, small errors tend to accumulate into trajectories it has rarely been trained to recover from, and the resulting exposure bias has been shown to scale roughly with the square of sequence length. On-Policy Distillation reorganizes the training loop around this observation by having the teacher provide feedback on what the student actually produces, with the goal of reducing the compounding term toward linear and reframing distillation as an iterative correction process rather than single-pass imitation. The resulting literature has expanded along divergence design, reward-guided optimization, and self-play, yet contributions remain scattered across the knowledge distillation, RLHF, and imitation learning communities without a unified treatment. This survey provides such a treatment. We formalize OPD as f-divergence minimization over student-sampled trajectories, organize the field along three design axes (what to optimize, where the signal comes from, and how to stabilize training in practice), and consolidate success conditions, recurring failure modes, and the connection between OPD and KL-constrained reinforcement learning. We close with open problems that emerge from this synthesis, including distillation scaling laws, uncertainty-aware feedback, agent-level distillation, and the growing overlap between knowledge distillation and RL.

02.
PLOS Medicine 2026-06-04

Comparative impacts and cost-effectiveness of tuberculosis systematic screening strategies in prisons in Brazil, Colombia, and Peru: A mathematical modeling study

Authors:

by Yiran E. Liu, José Victor Bortolotto Bampi, Ronan F. Arthur, Argita D. Salindri, Caroline Busatto, Pedro Avedillo Jiménez, Daniele Maria Pelissari, Fernanda Dockhorn Costa Johansen, Robert Arana-Narvaez, Alvaro Fernando Moreno Roca, Wilfredo Santos Solís Tupes, Esther Mori Jiu, Christian Alfredo Moreno Roca, Erika Albertina Abregú Contreras, Valentina Antonieta Alarcón Guizado, Julián Trujillo Trujillo, Belkys Marcelino, Mónica Alonso Gonzalez, Mayra Cecilia Córdova Ayllon, Ted Cohen, Moises A. Huaman, Jeremy D. Goldhaber-Fiebert, Julio Croda, Jason R. Andrews Background Incarceration is a leading driver of tuberculosis in Latin America. Systematic screening in prisons may reduce tuberculosis burden, but optimal strategies and cost-effectiveness remain uncertain. We examined the population-wide health impacts and cost-effectiveness of systematic screening in prisons in Brazil, Colombia, and Peru, comparing different timepoints, frequencies, and screening algorithms. Methods and findings Using dynamic transmission models calibrated to Brazil, Colombia, and Peru, we simulated annual or biannual (twice-yearly) prison-wide screening, alone or combined with entry and exit screening from 2026 to 2035. We evaluated four algorithms: (1) symptom screening, (2) chest X-ray with computer-aided detection (CXR-CAD), (3) symptoms and CXR-CAD (follow-up testing if either is positive), and (4) GeneXpert Ultra (Xpert) with pooled sputum. Individuals screening positive then received individual Xpert. We projected impacts on within-prison and population-level tuberculosis incidence in 2035, along with discounted costs (2023 US dollars) and disability-adjusted life years (DALYs). Model projections showed that combined entry, exit, and biannual screening with CXR-CAD was highly impactful and cost-effective across countries, reducing tuberculosis incidence by 61%–87% in prisons and 18%–28% population-wide. Compared to only biannual CXR-CAD (the next best strategy), the incremental cost per DALY averted of adding entry and exit screening was $2,984 (Brazil), $2,925 (Colombia), and $645 (Peru). Adding symptom screening to CXR-CAD marginally increased benefit and was only cost-effective in Peru’s higher-incidence prisons. Biannual screening alone remained cost-effective at prison incidence levels well below national averages, as well as at far lower willingness-to-pay thresholds. In settings without CXR-CAD, pooled Xpert was an impactful, cost-effective alternative. Key limitations include the model’s simplified representation of tuberculosis disease states and lack of stratification by age, gender/sex, HIV, or drug resistance. Conclusions These modeling results support immediate national-level adoption of prison-wide tuberculosis screening twice-yearly and at entry and exit, using CXR-CAD or pooled Xpert.

03.
arXiv (CS.AI) 2026-06-24

No Certificate, No Categorical Speech Act: A Brouwerian Assertibility Constraint for Public Reason

arXiv:2603.03971v3 Announce Type: replace-cross Abstract: Generative AI can convert uncertainty into authoritative-seeming verdicts, intensifying the hypersuasive force of automated speech and displacing the justificatory work on which democratic epistemic agency depends. As a corrective, I propose a Brouwer-inspired assertibility constraint for responsible AI: in high-stakes domains, systems may assert or deny claims only if they can provide a publicly inspectable and contestable certificate of entitlement; otherwise they must return Undetermined. This constraint yields a three-status interface semantics (Asserted, Denied, Undetermined) in which statuses mark entitlement to categorical speech rather than truth values of the underlying world-claim. The semantics cleanly separates internal entitlement from public standing while connecting them via the certificate as a boundary object. It also produces a time-indexed entitlement profile that is stable under numerical refinement yet revisable as the public record changes. I operationalize the constraint through decision-layer gating of threshold and argmax decisions, using internal witnesses (e.g., sound bounds or separation margins where available, and contestable surrogates otherwise) and an output contract with reason-coded abstentions. A design lemma shows that any total, certificate-sound binary interface yields witnessed decidability of the deployed predicate on its declared scope, so Undetermined is not a tunable reject option but a mandatory status whenever no adequate forcing witness is available. By making outputs answerable to challengeable warrants rather than confidence alone, the paper aims to preserve epistemic agency against the persuasive pull of automated speech in public justification.

04.
arXiv (CS.LG) 2026-06-18

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

arXiv:2606.18844v1 Announce Type: new Abstract: Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.

05.
arXiv (quant-ph) 2026-06-12

Simple analytical flux-tuned iSWAP pulses for leakage suppression

arXiv:2606.13052v1 Announce Type: new Abstract: Fast, high-fidelity two-qubit gates are a key requirement for fault-tolerant quantum computation. Tunable coupler architectures provide a flexible approach for implementing entangling gates through flux control with large on-off ratios, but fast flux modulation can induce diabatic transitions and population leakage to non-computational states, limiting gate performance. Here we present an analytical flux control method enabling derivative removal by adiabatic gate ($\Phi$-DRAG) for suppressing leakage in flux tunable two-qubit gates. We show that $\Phi$-DRAG differs fundamentally from conventional microwave implementations and derive modified flux modulation protocols that suppress leakage below $10^{-4}$ for fast entangling gates. The method remains effective across a range of asymmetry between qubit anharmonicities and different circuit parameters, enabling high-fidelity two-qubit gates within the fifteen nanosecond range.

06.
Nature (Science) 2026-06-22

Why heritage sites are at risk in a warming world — and how to save them

As rising seas and intensifying disasters threaten historic sites worldwide, new ways to understand, preserve and adapt these places are needed urgently. As rising seas and intensifying disasters threaten historic sites worldwide, new ways to understand, preserve and adapt these places are needed urgently.

07.
arXiv (CS.CL) 2026-06-11

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

Authors:

In this work, we introduce CHAIR (Classifier of Hallucination As ImproveR), a supervised framework for detecting hallucinations by analyzing internal logits from each layer of every token. Our method extracts a compact set of features such as maximum, minimum, mean, standard deviation, and slope-from the token logits across all layers, enabling effective hallucination detection without overfitting. Experiments on TruthfulQA and MMLU datasets demonstrate that CHAIR significantly improves detection accuracy, particularly in zero-shot scenarios, showcasing its robustness and generalizability. Beyond hallucination detection, CHAIR highlights the potential of using internal representations for designing advanced decoding strategies. By leveraging patterns in logits, we suggest that more sophisticated models and adaptive decoding methods could further reduce hallucinations and enhance text completion quality. CHAIR not only offers a practical solution for detecting hallucinations but also lays the groundwork for exploring richer representations in LLMs to improve their factuality and coherence.

08.
arXiv (math.PR) 2026-06-15

Trivariate Hypergeometric Series Formulas for Pure Partition Functions of Multiple $3$-SLE$_\kappa$

Authors:

arXiv:2606.14038v1 Announce Type: new Abstract: Pure partition functions of multiple SLE are characterized by null-state partial differential equations, Möbius covariance, and boundary asymptotics. After quotienting by Möbius covariance, the case of three curves is the first genuinely multivariable one: the moduli space has three independent variables, naturally represented by the three unoriented cross-ratios of the three pairs of links. We solve this Möbius-normalized three-variable problem for the two basic link-pattern types of multiple \(3\)-SLE\(_\kappa\), namely the rainbow and neighbor patterns. Writing \(\beta=4/\kappa\), we construct explicit trivariate hypergeometric-series normal forms and identify them with the corresponding pure partition functions for all \(\beta>1/2\) in the rainbow case and all \(\beta\ge2/3\) in the neighbor case. Equivalently, these ranges are \(\kappa\in(0,8)\) and \(\kappa\in(0,6]\), respectively. The proof is analytic. The null-state PDEs and Möbius covariance yield recursion relations for the trivariate coefficient arrays. In the rainbow case, coefficient estimates give convergence and boundary regularity on the closed cube. In the neighbor case, Pfaff systems continue the local power series to a neighborhood of \([0,1)^3\), while side-face equations, regular normal estimates, and corner propagation give continuity on \([0,1]^3\) for \(\beta\ge2/3\). The endpoint \(\beta=2/3\), corresponding to \(\kappa=6\), requires a logarithmic normal term. The two-dimensional boundary degenerations are classical Appell \(F_1\) and Horn \(G_2\) functions. The probabilistic identification uses SLE martingale arguments and Itô calculus, together with positivity and boundary regularity. We also discuss boundary degenerations, including heuristic connections with boundary Green's functions.

09.
bioRxiv (Bioinfo) 2026-06-12

Generalisable tissue-wide molecular reconstruction from histology

Spatial transcriptomics technologies measure gene expression within intact tissues but remain difficult to scale across large tissue sections and patient cohorts. Consequently, many studies rely on tissue microarrays (TMAs) or sparse spatial profiling designs, where molecular measurements are available for only limited tissue regions and are often generated using heterogeneous gene panels. Existing H&E to spatial gene expression prediction methods remain challenged by sparse molecular measurements, partially overlapping gene panels and tissue-wide reconstruction across heterogeneous spatial datasets. Here, we present GHIST+, a framework for tissue-wide reconstruction of single-cell molecular states from H&E histology. GHIST+ integrates cellular morphology, local tissue context and shared tissue representations to extend sparse molecular measurements into tissue-wide molecular maps across heterogeneous spatial datasets. Across multiple cancer types and GTEx breast tissues, GHIST+ reconstructs biologically meaningful tissue-wide molecular organisation from sparse TMA-derived measurements while preserving spatial tissue structure, cell-type organisation and age-associated tissue states across cancer and non-cancer settings. GHIST+ establishes a scalable framework for transforming sparse spatial profiling experiments into tissue-wide molecular maps, enabling cohort-scale molecular reconstruction from routine histology under heterogeneous spatial transcriptomic settings.

10.
medRxiv (Medicine) 2026-06-17

Diagnostic Concordance of Immediate Versus 1-Hour Technetium-99m Hydroxydiphosphonate Scintigraphy in Suspected Transthyretin Amyloid Cardiomyopathy

Background Bone-avid tracer myocardial scintigraphy for the diagnosis of transthyretin amyloid cardiomyopathy (ATTR-CM) has traditionally employed imaging at one or 3-hour intervals. Technetium-99m hydroxydiphosphonate (99mTc-HDP) has unique characteristics that may enable earlier imaging. We investigated the diagnostic concordance of immediate versus 1-hour acquisitions. Methods Consecutive patients with suspected ATTR-CM underwent planar imaging and SPECT/CT immediately and at 1-hour following the administration of 99mTc-HDP. Perugini grades and heart to contralateral lung (H/CL) ratios were assessed. Target-to-background ratios (TBRs) were calculated on the SPECT/CT acquisitions using the left ventricular (LV) septum and three background regions: aorta, LV blood-pool, and vertebrae. We assessed diagnostic concordance using Cohen's Kappa ({kappa}), temporal stability using paired t-tests, and correlation between timepoints using Pearson's coefficient (r). The 1-hour SPECT/CT interpretation served as the protocol reference standard. Results Forty-eight patients (83% male; median age, 80 [73-85] years) were evaluated. One-hour SPECT/CT identified 19 positive and 29 negative cases. Immediate SPECT/CT demonstrated 100% diagnostic concordance with the 1-hour reference standard ({kappa} = 1.000; 95% CI: 1.00 to 1.00; p < 0.001). The LV septum/LV Blood-Pool TBR showed the highest correlation (r = 0.956; 95% CI: 0.922 to 0.975; p < 0.001). The LV Septum/Aorta TBR demonstrated high correlation (r = 0.918; 95% CI: 0.857 to 0.953; p < 0.001) and remained stable in the ATTR-negative cohort (-0.02; 95% CI: -0.08 to 0.04; p = 0.54). Significant decrease in the LV Septum/Vertebrae TBR in the ATTR-negative (-0.55; 95% CI: -0.64 to -0.47; p < 0.001) and ATTR-positive cohorts (-1.14; 95% CI: -1.39 to -0.89; p < 0.001) was observed. Conclusions Immediate 99mTc-HDP SPECT/CT is diagnostically concordant with standard 1-hour protocols. By leveraging SPECT/CT and the favorable kinetics of 99mTc-HDP, immediate-phase imaging can accurately reproduce 1-hour acquisitions in cases of suspected ATTR-CM. This expedited approach may improve nuclear laboratory throughput and patient satisfaction.

11.
medRxiv (Medicine) 2026-06-22

Dengue and chikungunya virus transmission in Kinshasa, Democratic Republic of the Congo

Dengue (DENV) and chikungunya (CHIKV) are understudied in the Democratic Republic of the Congo (DRC) and across Africa despite evidence of transmission. We measured DENV and CHIKV IgG seroprevalences in Kinshasa Province, DRC, by antigen-capture ELISA, using dried blood spots from 2021. Force of infection (FOI) was estimated from age-stratified seroprevalences using Bayesian catalytic modeling. Among 1,250 participants, DENV IgG seroprevalence was 38.1% (95% CI: 34.5%-41.8%), increasing with age, and highest within peri-urban Kimpoko sites (54.9%). CHIKV IgG seroprevalence was 24.2% (95% CI: 21.1%-27.6%), increasing with age and comparable between peri-urban Kimpoko and rural Bu, with few seropositives in the city-center. DENV-CHIKV IgG co-occurrence was detected in 12.8% of participants. Time-varying FOI models provided best fit to age-stratified seroprevalences, with spatial variation detected. Sustained DENV and CHIKV circulation across Kinshasa highlights an under-appreciated transmission risk and underscores the need for strengthened arboviral surveillance in the DRC and surrounding region.

12.
arXiv (CS.LG) 2026-06-17

Data augmented bootstrap: Unifying confidence interval construction by approximate invariance

arXiv:2606.09049v2 Announce Type: replace-cross Abstract: We propose the data augmented bootstrap (DAB), a framework for constructing confidence intervals from approximately invariant transformations of the data. As special cases, DAB recovers popular methods that rely on exact group symmetries, such as conformal prediction, wild bootstrap for Maximum Mean Discrepancy U-statistics and the recently proposed SymmPI. Meanwhile, DAB also recovers the classical bootstrap method, which exploits the dataset's approximate invariance under uniform sampling of data indices as the dataset size grows. For all DAB methods, we establish theoretical coverage results that interpolate between finite-sample and asymptotic guarantees according to the strength of the invariance, and without assuming a group structure. The approximate invariance is measured in the Kolmogorov distance and, for statistics that satisfy Gaussian universality, reduces to conditional mean and variance matching. This allows us to incorporate data augmentation (DA), a widely used machine learning heuristic based on approximate invariances, into known statistical methods. We empirically test the performance of incorporating DA into bootstrap, wild bootstrap and conformal prediction for simulated settings as well as for image, language and scientific data.

13.
arXiv (CS.CL) 2026-06-11

Language Shapes Mental Health Evaluations in Large Language Models

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.

14.
arXiv (CS.CV) 2026-06-17

HRDX: A Large-Scale Vector HD-Map Dataset

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

15.
arXiv (CS.AI) 2026-06-19

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

arXiv:2606.19725v1 Announce Type: cross Abstract: Validating changes in low-level C firmware is expensive because unit tests (UTs) are fragile under strict build constraints, where missing headers, unresolved symbols, and dependency mismatches frequently prevent compilation and linking. This study introduces an automated UT authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware codebase maintained by Advanced Micro Devices (AMD) that reduces manual effort through a large language model (LLM) guided multi-agent pipeline. The workflow combines automated generation of test scaffolds, library-aware creation or reuse of stubs, mocks, and fakes, and an iterative compile-dispatch repair loop driven by build logs and line-coverage feedback. We evaluate the approach using compilation success, repair iterations, dispatch success, and line coverage, with time, cost, and token usage as secondary measures. Across 76 functions under test, the workflow generated compilable UTs for 73 functions. In a configuration without line coverage guidance or retrieval augmentation, mean line coverage reached 73.9%. On a 48-function subset evaluated under both configurations, mean line coverage reached 98.8% with line-coverage guidance alone and reached 94.7% when combined with vector-database retrieval. Results show that automated generation-and-repair pipelines can substantially improve UT creation efficiency and coverage for constrained firmware environments while reducing manual debugging effort.

16.
arXiv (CS.AI) 2026-06-16

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

arXiv:2606.15888v1 Announce Type: cross Abstract: Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

17.
arXiv (quant-ph) 2026-06-15

An integrated ultrahigh vacuum cluster tool for diamond surface science and single nitrogen-vacancy center measurements

arXiv:2606.13961v1 Announce Type: new Abstract: We present a custom-designed ultrahigh vacuum (UHV) cluster tool developed for studying shallow nitrogen-vacancy (NV) centers in diamond, enabling in situ diamond surface preparation, characterization, and single NV center dynamics measurements within a single connected platform. The system combines a surface science chamber for controlled surface modification and analysis with a cryogenic confocal microscope chamber dedicated to NV spin and optical measurements. This integrated approach enables a direct correlation between diamond surface chemistry and the resulting NV spin and charge properties. The instrument provides a versatile platform for systematic studies of surface-induced decoherence mechanisms and charge dynamics for shallow NV centers, and establishes a pathway toward reproducible surface engineering for quantum sensing applications.

18.
arXiv (CS.AI) 2026-06-16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

arXiv:2606.14801v1 Announce Type: cross Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

19.
arXiv (CS.AI) 2026-06-16

TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning

arXiv:2606.01613v2 Announce Type: replace-cross Abstract: This paper presents an agentic multimodal retrieval-augmented generation (RAG) framework for domain-specific literature reasoning, instantiated on a curated corpus of several thousand papers in intelligent tires, vehicle dynamics, vehicle control, sensing, estimation, and machine learning. Unlike conventional single-pass RAG systems, the proposed architecture uses an autonomous, evidence-gated pipeline that classifies query intent, generates separate text and visual query rewrites, performs hybrid text retrieval with FAISS and BM25 followed by cross-encoder reranking, expands evidence through graph-guided chunk traversal over a Neo4j knowledge graph, and retrieves visual document evidence using ColSmol late-interaction embeddings with MUVERA fixed-dimensional encoding, approximate nearest-neighbor search, and MaxSim reranking. The framework scores evidence sufficiency using a 100-point rubric with hybrid rule-based/LLM review, retries retrieval through drift-guarded reformulation, searches external academic databases through optimize–search–vet loops, merges and deduplicates multimodal evidence, verifies citation integrity, and generates cited answers through Planner, Researcher, Writer, and Critic agents with self-correcting revision. Key contributions include: (i) a scalable multimodal retrieval architecture combining text, graph, and visual evidence over 40,000 document pages; (ii) an interpretable evidence sufficiency and retry mechanism; (iii) a multi-agent generation pipeline with evidence mapping and critic-driven revision; (iv) a domain knowledge graph with LLM-based entity extraction, OpenAlex author validation, and intra-corpus citation resolution; and (v) a route-dependent external search architecture for targeted literature expansion. The result is a practical, evidence-gated, multimodal agentic RAG architecture for technical reasoning over specialized research corpora.

20.
arXiv (CS.LG) 2026-06-16

A Decision-Theoretic View of Test-Time Training: When, How Far, and Which Directions to Adapt

arXiv:2606.15569v1 Announce Type: new Abstract: Test-time training (TTT) adapts a pretrained model to each prompt via parameter updates, improving accuracy under pretraining-to-test distribution shifts. Yet, its performance often suffers from instability and sensitivity to hyperparameters such as update steps and subspace. We explain this behavior through a decision-theoretic lens, treating TTT as implicit Bayesian inference in the kernel regime. Under a Gaussian process benchmark, we show that TTT reduces prediction error when updates are spectrally matched to the prompt's signal-to-noise ratio and aligned with query-relevant eigen-directions. This perspective underpins the following results: (1) we show when fixed update steps and subspaces fail under distribution shifts, motivating adaptive strategies; (2) we prove that selecting update steps via prompt evidence admits a PAC-Bayes guarantee against overfitting; and (3) we characterize the Bayes-optimal update subspace under a linear-Gaussian correction model, yielding a scoring rule for selecting Transformer blocks and heads. Our theory helps explain the empirical instability of TTT, taking a step toward principled guidance for when, how far, and which directions to adapt.

22.
bioRxiv (Bioinfo) 2026-06-23

Multi-Scale Machine Learning for Antibody-Antigen Binding Affinity Prediction Using Deep Mutational Scanning and Structural Features

Authors:

Predicting how mutations alter antibody-antigen binding affinity is essential for antibody engineering and vaccine design, yet current methods generalize poorly to unseen complexes. We present a multi-scale machine learning framework integrating 93 descriptors across four modalities: physicochemical, structural, ESM-2 protein language model, and solvent-accessible surface area (SASA)/{Delta}{Delta}G_fold features. Under leave-one-complex-out deep mutational scanning (LOCO-DMS) cross-validation on AbAgym (36,541 mutations, 68 experiments, 13 pathogens), gradient boosting achieved MCC = 0.206; a confidence-stratified ensemble reached MCC = 0.374 (83.5% accuracy, 25.5% coverage). No single modality exceeds the majority baseline alone; only multi-scale fusion succeeds. Boltzmann ceiling analysis shows 45.9% of mutations are near-neutral (|{Delta}{Delta}G| < k_BT), bounding theoretical maximum MCC at 0.473; our method achieves 79.1% of this limit. Five deep learning architectures benchmarked under LOCO-DMS showed self-attention matching gradient boosting (MCC = 0.200). Cross-pathogen transfer failed systematically (mean 46.7%), confirming universal binding predictors remain an open challenge.

23.
arXiv (CS.CV) 2026-06-17

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16{\times}256{\times}256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1\% of the tokens required by downsample-based tokenizers in our evaluation.

24.
arXiv (CS.CV) 2026-06-17

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions.The project is publicly available at: https://github.com/RSIIPAC/CloudLULC

25.
PLOS Medicine 2026-06-04

Beyond associations: Navigating the safety of non-steroidal anti-inflammatory drugs (NSAIDs) in early pregnancy

by Andrew S. C. Yuen, Kenneth K. C. Man Pain and fever in pregnancy require treatment, but fetal safety concerns complicate analgesic choice. A recent PLOS Medicine study presents new evidence on the safety of first-trimester NSAID use and congenital malformation risk, but interpreting findings across studies is challenging. In this Perspective, Kenneth Man and Andrew Yuen highlight a recent PLOS Medicine study that presents new evidence on the safety of first-trimester NSAID use and congenital malformation risk, but discuss why interpreting findings across studies is challenging.