Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-24

Bayesian control for coding agents

Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, Bayesian control proves to be most valuable when verification is costly and critics are informative but imperfect. Beyond control, the belief state yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.

02.
arXiv (CS.CL) 2026-06-24

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.

03.
arXiv (CS.LG) 2026-06-16

Distribution Alignment for One-Shot Federated Learning via Optimal Transport

arXiv:2606.16655v1 Announce Type: new Abstract: One-Shot Federated Learning (OSFL) addresses extreme communication regimes in which clients interact with the server only once, amplifying the impact of heterogeneous client data distributions. In particular, the interaction of domain shift and label shift across clients induces misaligned feature representations that cannot be corrected through iterative optimization. Existing OSFL methods rely on distillation, server-side generation or ensemble-based aggregation, but assume aligned representations or address domain and label shift separately. We introduce SLOT-Align (Single-round, Learning-free Optimal Transport Alignment), a geometry-aware feature harmonization framework for OSFL. SLOT-Align uses a shared frozen encoder to extract compact feature statistics, constructs a global reference via Bures-Wasserstein barycenters, and aligns local representations using closed-form geodesic optimal transport maps. The method is computationally efficient and can be combined with existing OSFL pipelines relying on frozen encoders without modifying their training procedures. Extensive experiments across multiple benchmarks, pretrained backbones, and OSFL methods show that SLOT-Align consistently improves accuracy and robustness under joint domain and label shift.

04.
arXiv (CS.AI) 2026-06-15

STaR-DRO: Stateful Tsallis Reweighting for Group-Robust Structured Prediction

arXiv:2604.09737v2 Announce Type: replace-cross Abstract: Structured prediction with large language models requires outputs that are label-accurate, ontology-constrained, structurally valid, and evidence-grounded under label imbalance and heterogeneous group difficulty. We present a unified framework for ontology-constrained generation. First, we introduce a modular prompt-engineering architecture combining XML-style structure, expert disambiguation rules, chain-of-thought reasoning, metadata-aware decision logic, schema contracts, and a self-validation gate. It targets recurrent in-context failures, including format drift, label ambiguity, evidence hallucination, and metadata-conditioned confusion. Second, we propose STaR-DRO, combining Tsallis mirror ascent, sparse entmax-style primal mapback, EMA-smoothed group-loss tracking, rescaled ascent signals, and bounded excess-only multipliers. Unlike conventional DRO, which relies on dense Shannon-entropy exponentiated-gradient updates, can introduce high-variance stochastic reweighting, assigns positive adversarial mass to groups that are not persistently hard, and incurs costs through simplex competition, STaR-DRO upweights only persistently hard groups without suppressing easier ones. We evaluate the framework on EPPC Miner, a clinically grounded high-stakes structured-prediction task requiring hierarchical label prediction and evidence-span extraction from patient-provider secure messages. Across 1B-70B Llama models, prompt engineering improves zero-shot extraction, yielding an average label F1 gain of +14.46 and a Span F1 gain of +17.40. Building on supervised fine-tuning, STaR-DRO further improves accuracy and robustness, increasing average label F1 by +1.08 and +2.20 while reducing mean groupwise validation cross-entropy by 21.3% and 14.8% relative to SFT and standard DRO, respectively. These results advance reliable automated communication mining for patient-centered clinical care analysis.

05.
arXiv (CS.AI) 2026-06-18

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

arXiv:2605.11287v2 Announce Type: replace-cross Abstract: A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose $Temporal Operator Attention (TOA)$, a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.

06.
PLOS Computational Biology 2026-06-01

BeetleAtlas 2: An enhanced <i>Tribolium castaneum</i> web resource for tissue and developmental transcriptomics allowing refinement of gene predictions

by David P. Leader, Muhammad T. Naseem, Janina L. Rinke, Kenneth Veland Halberg BeetleAtlas is an online resource for tissue- and stage-specific transcriptomics in the red flour beetle, Tribolium castaneum. On updating from the original Tcas5.2 genome assembly to the more recent improved icTriCast1.1 genome assembly it became evident that there were major discrepancies between the gene models of the two genome annotations in use: the OGS3 and the NCBI gene sets. As neither was clearly superior we implemented a new design in BeetleAtlas 2 (beetleatlas.org) comprising two parallel ‘modes’ — one incorporating results using the NCBI gene models and a second incorporating those using the OGS3 gene models. This allows direct comparison where equivalent gene models exist: 50–57% of cases. To aid resolution of discrepancies between the two gene model sets and verification of results, gene models are linked to a custom visualization of RNA-seq read coverage of the genome in the UCSC Genome Browser. This displays reads from 22 tissues and life stages superimposed on the icTriCast1.1 genome assembly. Reference tracks show the NCBI gene models, the OGS3 gene models after translation of their coordinates from the Tcas5.2 assembly, and 1050 discontinued NCBI gene models from the previous assembly after a similar transfer of coordinates. We document various situations in which distinct patterns of expression of the tissues can be used to confirm and extend correlations between the two gene sets, resolve discrepancies between them, make corrections and identify putative genes or exons absent from the current gene sets. BeetleAtlas 2 allows those involved in Tribolium research to avoid the pitfalls inherent in incorrect gene models when planning experiments on specific genes and interpreting the results. It also demonstrates how BeetleAtlas 2 might play an important role in establishing a revised gene set for Tribolium castaneum in the future.

07.
arXiv (CS.CV) 2026-06-11

ISAP-3D: Identity-Slot Aligned Part-Aware 3D Generation

Part-aware 3D generation aims to synthesize structured objects with semantically meaningful components, yet often suffers from structural ambiguity due to identity-layout entanglement. Existing methods either infer part identity and spatial layout implicitly, which can lead to unstable part allocation (e.g., slot swapping or part merging), or rely on strong layout conditions that are difficult to obtain in practice. We attribute this ambiguity to identity-slot permutation freedom: without explicit identity-slot alignment, the correspondence between semantic parts and generation slots is not identifiable during training, allowing multiple slot assignments to fit the same supervision and leading to inconsistent decomposition. Based on this insight, we argue that stable part-aware generation requires identity-aligned one-to-one slot modelling. We therefore propose an identity-slot aligned framework, ISAP-3D, which anchors each part with semantic identity tokens and performs identity-conditioned one-to-one layout prediction, followed by layout-conditioned geometry synthesis. Structured local-global conditioning maintains identity alignment across semantic, spatial, and geometric stages. We also construct a part-level dataset with a unified semantic protocol to enable learnable and consistent identity-slot alignment. Extensive experiments demonstrate improved structural stability, controllability, and robustness over state-of-the-art part-aware generation baselines.

08.
arXiv (CS.CV) 2026-06-24

MedP-CLIP: Medical CLIP with Region-Aware Prompt Integration

Contrastive Language-Image Pre-training (CLIP) has demonstrated outstanding performance in global image understanding and zero-shot transfer through large-scale text-image alignment. However, the core of medical image analysis often lies in the fine-grained understanding of specific anatomical structures or lesion regions. Therefore, precisely comprehending region-of-interest (RoI) information provided by medical professionals or perception models becomes crucial. To address this need, we propose MedP-CLIP, a region-aware medical vision-language model (VLM). MedP-CLIP innovatively integrates medical prior knowledge and designs a feature-level region prompt integration mechanism, enabling it to flexibly respond to various prompt forms (e.g., points, bounding boxes, masks) while maintaining global contextual awareness when focusing on local regions. We pre-train the model on a meticulously constructed large-scale dataset (containing over 6.4 million medical images and 97.3 million region-level annotations), equipping it with cross-disease and cross-modality fine-grained spatial semantic understanding capabilities. Experiments demonstrate that MedP-CLIP significantly outperforms baseline methods in various medical tasks, including zero-shot recognition, interactive segmentation, and empowering multimodal large language models. This model provides a scalable, plug-and-play visual backbone for medical AI, combining holistic image understanding with precise regional analysis.

09.
Nature (Science) 2026-06-24

Crude oil turns cheap porous membrane into a sieve to refine itself without heat

Authors: Unknown Author

An inexpensive, porous polymer membrane helps to separate raw crude oil, without heat or selective coating. Heavy hydrocarbons self-assemble inside the pores to create channels that are likely to be less than 2 nanometres wide, enriching the resulting permeate mix into light, ‘naphtha range’ hydrocarbons with 30% less energy cost and carbon dioxide emissions than conventional distillation. Long hydrocarbons are deposited on pore walls, narrowing channels until only smaller molecules can pass through.

10.
arXiv (CS.CL) 2026-06-15

Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

11.
arXiv (CS.LG) 2026-06-15

A Longitudinal Attribute-Conditioned Neural Network for Modeling Health-State Transition Probabilities in Temporally Irregular Data: The LANTERN Framework

arXiv:2606.13880v1 Announce Type: new Abstract: Accurate estimation of long-term care transition probabilities is central to disability insurance pricing, reserving, and solvency assessment. Classical actuarial multi-state models commonly rely on Markov, semi-Markov, or proportional-hazard specifications, which provide a direct connection to cohort projection but may be restrictive for irregular longitudinal health data with nonlinear aging patterns and heterogeneous covariate histories. This paper develops a well-calibrated estimator of multi-state transition probabilities for irregular longitudinal health data. The model learns from individual health history, incorporates the time elapsed between observations, and conditions transition probabilities on demographic and socioeconomic attributes. It produces a valid probability distribution over the next observed health state, with four possible states: healthy, mild disability, severe disability, and death. Individual probabilities are aggregated by age group and origin state to form transition matrices compatible with actuarial cohort projection. Using longitudinal data from the Health and Retirement Study, we compare the proposed estimator with logistic regression, gradient-boosted trees, a recurrent neural network, and a last-state persistence benchmark. The evaluation considers probabilistic accuracy, endpoint discrimination and calibration for severe disability and death, risk concentration, and transition matrix error after aggregation. The proposed estimator improves severe disability discrimination relative to logistic regression and gradient-boosted tree benchmarks, maintains strong calibration, and yields the lowest transition matrix error among the evaluated models in the held-out test analysis. Results show that a structured machine learning estimator can support long-term care transition modeling when judged by calibration and projection fidelity, beyond discrimination.

12.
arXiv (CS.AI) 2026-06-15

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

arXiv:2606.14604v1 Announce Type: cross Abstract: Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates, PatchTST leads among trained models while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60\%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.

13.
medRxiv (Medicine) 2026-06-24

Differential COVID-19 Outcomes Across Lysosomal Disorders

Background Lysosomal disorders (LDs) are a heterogeneous group of rare inherited disorders characterized by multi-system involvement and high comorbidity burden, which raises concerns about severe COVID-19 outcomes. Conversely, because SARS-CoV-2 relies on endolysosomal pathways for cellular entry and replication, certain LDs may exert a protective effect against viral pathogenesis. Prior clinical evidence investigating LDs and severe SARS-CoV-2 infection has been limited by small sample sizes and inconsistent findings. Therefore, to resolve these conflicting biological hypotheses and estimate population-level outcomes, we conducted a large-scale retrospective cohort study using nationwide U.S. harmonized electronic health record data from the National Clinical Cohort Collaborative (N3C). This design utilized longitudinal records starting January 1, 2018, to evaluate COVID-19 infections captured between January 1, 2020, and July 11, 2024. Results The study included 16,380 individuals, comprising 5,460 patients with lysosomal disorders and 10,920 matched controls. Patients with LDs had significantly higher odds of COVID-19 hospitalization compared with controls (OR = 1.86, 95% CI: 1.70-2.04). Elevated odds were observed across the evaluated categories, but varied substantially. Notably, neurodegenerative LDs such as neuronal ceroid lipofuscinosis (OR = 9.32) and metachromatic leukodystrophy (OR = 2.33) remained associated with hospitalization after adjustment for comorbidities. Contrarily, the elevated odds for Fabry disease and Gaucher disease were no longer significant after adjustment. Mortality among hospitalized patients with LDs was comparable to that of matched controls (one-year survival: 82.1% vs 82.0%), suggesting that LD status does not independently worsen survival once hospitalization occurs. Conclusions Patients with LDs were at an increased odds of COVID-19 hospitalization, driven by a combination of elevated comorbidity burden and disorder-specific effects, which vary significantly across LD categories. This study clarifies that excess risk is concentrated in the transition to hospitalization. These patients may thus require personalized clinical care to mitigate the negative consequences of COVID-19.

14.
medRxiv (Medicine) 2026-06-18

Chest X-Ray as a critical screening tool for Household Contacts of TB: Lessons from Three Years of Programmatic Data in India

Introduction: Household contacts (HHCs) of pulmonary TB patients remain at high risk for TB infection and disease progression, yet many remain asymptomatic and are missed by symptom-screening pathways. While India expanded its TB preventative guidelines to include all HHCs in 2021, chest X-ray (CXR) screening continues to be used selectively, representing a missed opportunity in early case detection. Methods: The analysis uses programmatic data from Project JEET 2.0 (Joint Effort for Elimination of Tuberculosis), implemented by the William J. Clinton Foundation in India, between October 2021 and March 2024. Eligible HHCs (>=5 years) were offered CXR screening as part of TB preventive therapy (TPT) evaluation. Descriptive and multivariable analyses examined predictors of CXR uptake and TB yield. A two-stage logistic regression model estimated potential TB yield under universal CXR coverage. Model performance was evaluated using the area under the curve (AUC), and bootstrap simulations generated counterfactual estimates of missed TB cases. Results: Among 1,034,621 HHCs, 1.02% individuals were found positive for TB, which includes 7,786 HHCs who were on TB treatment already, while an additional 2,812 were identified during pre-TPT evaluation. Among eligible HHCs (n = 1,026,835), 70% were screened with CXR, of which 2.4% had suggestive TB findings. Of these, 79% went for further TB assessment. Symptomatic HHCs were more likely to be CXR screened (84% vs 69%) and assessed for TB, yet two-thirds of all detected TB cases were asymptomatic. It is estimated that universal CXR coverage and TB testing for suggestive cases can increase TB detection by at least 87%. Conclusion: The study provides a scalable approach to expand CXR coverage through public-private partnerships, enabling early TB detection among HHCs, especially among asymptomatic contacts. Future implementations will benefit from integrating AI-enabled reading, along with systematic follow up for those with suggestive findings.

15.
arXiv (CS.CL) 2026-06-19

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.

16.
bioRxiv (Bioinfo) 2026-06-14

FENNEC: Fine-Tuned Ensemble Neural Networks Accelerate Chemically Modified siRNA Design and Screening

Small interfering RNAs (siRNAs) are a clinically validated therapeutic modality, yet designing potent chemically modified siRNAs remains a costly and iterative process, limited by scarce public data. Computational prediction of siRNA efficacy is therefore essential for rational design and accelerated preclinical development. However, despite the critical role of chemical modifications in therapeutic performance, current state-of-the-art machine learning methods either are not designed to model the chemical diversity of therapeutic siRNAs, or exhibit poor generalization performance. Here, we present FENNEC (Fine-Tuned Ensemble of Neural Networks for siRNA Efficiency Characterization), a machine-learning framework for predicting siRNA activity across chemically diverse design spaces. To support this effort, we curated the largest patent-derived dataset to date of chemically modified siRNAs from 42 patents using OCR-based table extraction and stringent filtering. FENNEC combines temporal convolutional networks with thermodynamic descriptors, experimental covariates, and embeddings from RNA foundation models to capture both local chemical determinants and broader target-context information. Importantly, we show that language-model-derived embeddings provide meaningful higher-order representations of target transcripts, particularly in data-scarce settings. FENNEC achieved robust predictive performance across both gene-level and scaffold-level validation settings, with additional experimental validation on a novel AHSA1-targeting dataset further supporting its generalizability across chemically modified siRNAs. In benchmarking, FENNEC outperformed classical machine-learning and state-of-the-art deep learning models, demonstrating generalization to unseen chemistry. Model interpretation recovered established design principles, including position-specific effects of glycol nucleic acid, 2'-fluoro modifications, and phosphorothioate backbones. Furthermore, in silico perturbation analyses suggest that FENNEC can serve not only as a predictive model, but also as an oracle for the design and optimization of chemically modified siRNAs. Together, our work addresses a key gap in the field by enabling chemically aware deep learning for siRNA design, supported by a large and diverse collection of chemically modified siRNA measurements.

17.
arXiv (CS.LG) 2026-06-17

Noise-Driven Exploration and Transient Freezing Select Flat Minima in Stochastic Gradient Descent

arXiv:2601.10962v2 Announce Type: replace Abstract: Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism that governs solution selection during training. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and migrate toward flatter regions of the loss landscape before becoming confined to a final basin. Using a tractable physical model, we show that SGD noise reshapes the loss landscape into an effective potential that preferentially stabilizes flat solutions. We further uncover a transient freezing mechanism: as training progresses, the flattening landscape suppresses transitions between competing valleys. Stronger SGD noise delays this freezing transition, prolonging the exploratory phase and thereby increasing the probability of convergence to flatter minima. Together, these results provide a unified physical framework connecting learning dynamics, loss-landscape geometry, and generalization, and suggest guiding principles for the design of more effective optimization algorithms.

18.
arXiv (CS.AI) 2026-06-12

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

arXiv:2606.12690v1 Announce Type: cross Abstract: In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.

19.
arXiv (CS.CV) 2026-06-18

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

20.
arXiv (CS.AI) 2026-06-16

Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

arXiv:2508.10967v3 Announce Type: replace-cross Abstract: Retrosynthesis prediction aims to infer the reactant molecules based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing methods rely on a static pattern-matching paradigm, which limits their ability to perform effective logical decision-making from chemical data, leading to a black-box process. We propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary strengths of Large Language Models and specialized models via pure reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models provide chemical knowledge that is distilled into a high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions with an interpretable reasoning path, and (3) knowledge-grounded policy optimization refines the interpretable decision policy. Experiments show that Retro-Expert surpasses both LLM-based and specialized models across different metrics, while generating chemically grounded explanations that enhance chemists' trust in practice. The source code for this paper is available at https://github.com/MagixRab-ll/Retro-Expert.

21.
arXiv (CS.CV) 2026-06-25

Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms

Authors:

Visual search has been one of the most productive paradigms in the study of visual attention: the way reaction time scales with the number of items distinguishes parallel, "pop-out" search from serial, attention-demanding search. I ask whether vision-language models (VLMs) exhibit the same behavioral signatures. I adapt four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry; and present them to current frontier and mid-tier models. Because a single model call has no reaction time, I use the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of search effort, and I compare against a large public human benchmark (Wolfe et al., 2010). The models reproduce several human signatures: feature search costs flat effort while conjunction effort climbs with set size; frontier models hold accuracy where mid-tier models collapse to chance; and a resolution control shows the conjunction cost is genuine search rather than difficulty resolving small shapes. They also diverge from humans in informative ways. The target-present effort slope exceeds the target-absent slope, reversing the human ordering; enumeration remains accurate where humans would lose count; and a reasoning model with adaptive deliberation declines to deliberate on detection tasks altogether, so that a single search expresses itself as an effort gradient in one model and as an accuracy cliff in another. I argue that psychophysical paradigms, applied behaviorally, are a sharp and inexpensive probe of machine visual cognition, and that the points of divergence are as informative as the points of agreement.

22.
arXiv (math.PR) 2026-06-16

Probabilities

arXiv:2601.18853v4 Announce Type: replace-cross Abstract: Probabilities is the English translation of the book Probabilités Tome 1 and Tome 2. The mathematic content is authored by Prof. Jean-Yves Ouvrard. The English version has been done by his eldest son Dr. Xavier Ouvrard. This probability theory book covers not only an introduction to this field, but also advanced concepts based on measure theory. The first part introduces the fundamentals of probability theory across 7 chapters, targeting bachelor level, including event algebras, random variables, independence, conditional probabilities, moments of discrete and continuous random variables, generating functions, and limit theorems. The second part contains 10 chapters and corresponds to master level. Following a brief introduction to measure theory, this part develops more advanced topics: probability measures and their complements, distributions and moments of random variables, modes of convergence, laws of large numbers, conditional expectation, Fourier transforms and characteristic functions, Gaussian random variables, convergence of measures, convergence in distribution, discrete-time stochastic processes, martingales, and Markov chains. The reader's work is greatly facilitated by the inclusion, in every chapter, of numerous exercises, all accompanied by detailed solutions that often provide substantial extensions to the theoretical material.

23.
arXiv (CS.LG) 2026-06-12

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

arXiv:2606.12710v1 Announce Type: new Abstract: Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

24.
medRxiv (Medicine) 2026-06-16

Recurrence After Hepatic Hydatid Cyst Surgery: Scolicidal Agent Application Technique and the Effect of Cystopiliary Fistula

Objective: This study aimed to evaluate long-term outcomes in patients who underwent surgical treatment for hepatic hydatid cyst (HCC) disease and, in particular, to investigate the effect of scolicidal agent (SA) application method and the presence of cystobiliary fistula (CBF) on the development of recurrence. Materials and Methods: This single-center, retrospective study included 197 patients who underwent surgical treatment for HCC disease. Hypertonic saline was used as SA in all patients and was classified as intracystic or pericystic application according to the application method. The presence of CBF was evaluated according to intraoperative and postoperative findings. Patients were followed for 86 months, and the development of recurrence was identified by radiological methods. Comparisons were made between the groups with and without recurrence in terms of SA application method and the presence of CBF. Results: The median age of the patients was 38 years, and the median follow-up period was 86 months. SA application was performed into the cyst in 51.3% of the patients and around the cyst in 48.7%. The presence of CBF was detected in 49.7% of the patients. No statistically significant difference was found between the recurrent and non-recurrent groups in terms of SA application method (p = 0.344). Similarly, no significant relationship was found between the presence of CBF and the development of recurrence (p = 0.721). Conclusion: This study showed that the SA application method and the presence of CBF are not determinants of recurrence in HCC disease. It is thought that recurrence rates can be kept low with appropriate surgical technique and effective biliary tract management.

25.
arXiv (CS.CL) 2026-06-25

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

Authors:

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely checked. We check it. Using 596 human-labeled completions from the HarmBench classifier validation set, we compare the two judge families against human majority votes and then attack them. The two families fail in opposite ways. The dedicated classifier over-flags (precision 0.835, recall 0.974); three different LLM-as-judges keep high precision (0.81 to 0.94) but show erratic recall (0.06 to 0.65), so the same responses produce very different ASR depending on which judge scores them. The two families also differ sharply in robustness. Wrappers that leave the harmful text untouched and only add benign framing flip every LLM-judge between 57% and 100% of the time, and a single prepended refusal sentence accounts for much of this (39% to 88%). The dedicated classifier resists these surface attacks (at most 6.7%), but a white-box GCG attack on its open weights flips 70% of confident true positives (21 of 30; 95% CI 54 to 86%) even at a small optimization budget. A two-annotator audit confirms the attacks leave the harm intact: every one of 80 sampled flips still contained the harmful content. Because a large and growing share of reported ASR comes from LLM-judges, many such numbers are unreliable both on average and under deliberate pressure. We recommend that papers report judge precision and recall on a human-labeled slice, report ASR corrected for judge precision, and include an adversarial check of the judge. Our code is released.