Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-16

Evaluating the Robustness of Proof Autoformalization in Lean 4

Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.

02.
arXiv (CS.AI) 2026-06-11

LaQual: An Automated Framework for LLM App Quality Evaluation

arXiv:2508.18636v2 Announce Type: replace-cross Abstract: Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

03.
arXiv (CS.CV) 2026-06-16

Efficient Flow Matching using Latent Variables

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

04.
arXiv (CS.CV) 2026-06-16

HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

Authors:

Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

05.
arXiv (quant-ph) 2026-06-19

Optimizing resource allocation for accuracy in noisy variational quantum algorithms

arXiv:2606.20153v1 Announce Type: new Abstract: For quantum algorithms to achieve their full potential, we need methodologies to optimize them, such as reaching a given output accuracy with minimal resource costs. Here, we develop such a methodology for a class of Noisy Intermediate-Scale Quantum (NISQ) algorithms. We leverage simulations of a Variational Quantum Eigensolver (VQE) to propose a phenomenological model of such algorithms that captures the complex relationship between algorithmic accuracy, algorithmic resource costs, and the noise that exists in realistic quantum hardware. For this, we take the algorithmic resource cost to be the total number of quantum gate-operations in the algorithm; minimizing this cost typically makes the algorithm faster and more energy-efficient. We consider the subtle trade-off between quantum circuit size (small circuits are too imprecise, but large ones are too noisy), and the number of iterations of that quantum circuit for the full algorithm to sufficiently converge. Using a noise-metric-resource methodology, we identify the sweet spot (of circuit size versus iterations) that minimizes the algorithmic resource costs for a desired algorithm accuracy. It also gives the circuit size that maximizes algorithm accuracy for a fixed resource cost. Our methodology provides a practical guideline for near-term deployment of variational algorithms on realistic noisy hardware, including hardware that uses error mitigation.

06.
arXiv (CS.CL) 2026-06-11

Context-Aware Multimodal Claim Verification in Spoken Dialogues

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

07.
arXiv (CS.CV) 2026-06-15

StereoGeo: an end-to-end stereo camera calibration method

In this work, we propose StereoGeo, an end-to-end network-based approach for stereo camera calibration. Our method estimates the focal lengths and gravity directions of the left and right cameras, as well as the relative extrinsic transformation relating them. Existing methods often rely on calibration patterns in structured environments or address only a single camera configuration, being limited to either intrinsic or extrinsic estimation, and depending on a multi-view setups. StereoGeo extends the GeoCalib algorithm, integrating deep neural network feature extraction with a differentiable optimizer. Extensive experiments on real-world benchmarks demonstrate that StereoGeo achieves competitive performance for intrinsic calibration and provides accurate stereo extrinsic estimation, outperforming existing methods that are limited to monocular settings. The dataset used in this work is partially publicly available at https://github.com/meddourimane/StereoGeo-dataset.

08.
arXiv (CS.LG) 2026-06-17

Variational autoencoders with latent high-dimensional steady geometric flows for dynamics

Authors:

arXiv:2410.10137v5 Announce Type: replace Abstract: We develop Riemannian approaches to variational autoencoders (VAEs) for PDE-type ambient data with regularizing geometric latent dynamics, which we refer to as VAE-DLM, or VAEs with dynamical latent manifolds. We redevelop the VAE framework such that manifold geometries, subject to our geometric flow, embedded in Euclidean space are learned in the intermediary latent space developed by encoders and decoders. By tailoring the geometric flow in which the latent space evolves, we induce latent geometric properties of our choosing, which are reflected in empirical performance. We reformulate the traditional evidence lower bound (ELBO) loss with a considerate choice of prior. We develop a linear geometric flow with a steady-state regularizing term. This flow requires only automatic differentiation of one time derivative, and can be solved in moderately high dimensions in a physics-informed approach, allowing more expressive latent representations. We discuss how this flow can be formulated as a gradient flow, and maintains entropy away from metric singularity. This, along with an eigenvalue penalization condition, helps ensure the manifold is sufficiently large in measure, nondegenerate, and a canonical geometry, which contribute to a robust representation. Our methods focus on the modified multi-layer perceptron architecture with tanh activations for the manifold encoder-decoder. We demonstrate, on our datasets of interest, our methods perform at least as well as the traditional VAE, and oftentimes better. Our methods can outperform this and a VAE endowed with our proposed architecture, frequently reducing out-of-distribution (OOD) error between 15% to 35% on select datasets. We highlight our method on ambient PDEs whose solutions maintain minimal variation in late times. We provide empirical justification towards how we can improve robust learning for external dynamics with VAEs.

09.
arXiv (CS.CV) 2026-06-15

HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification

Real-time fire classification systems require models that are simultaneously accurate, computationally efficient, and deployable on resource-constrained hardware. This work proposes HumP-KD, a Hybrid Uncertainty-aware Multi-stage Progressive Knowledge Distillation framework for efficient fire classification. Two datasets, FlameVision and Dataset-II, containing 8,600 and 31,309 images, are used. Various CNN and transformer baselines are applied under standard preprocessing, online augmentation, Gaussian noise and motion blur robustness conditions. The proposed HumP-KD model distills knowledge from two frozen heterogeneous transformer teachers, Swin-Tiny and ViT-Base, along with their Meta-MLP ensemble, into a lightweight MobileViT-S student via three tightly integrated components. Hierarchical Progressive Knowledge Distillation employs a Hierarchical Feature Builder. It generates a fused spatial attention mask to guide distillation toward discriminative regions selectively. Multi-Stage Knowledge Distillation progressively activates three distillation stages across training. On Dataset-II, HumP-KD achieves a mean F1 score of $0.9876 \pm 0.0063$ across 10 independent trials, significantly outperforming the MobileViT-S baseline trained without distillation ($0.9537 \pm 0.0351$), with statistical significance confirmed by both independent t-test ($p = 0.0195$) and Wilcoxon signed-rank test ($W = 1$, $p = 0.0039$). The proposed method also demonstrates strong generalization across datasets and robustness under degraded visual conditions. The student model retains only 4.94M parameters and 19.01Mb model size, representing a $5.7\times$ parameter reduction over Swin-Tiny and a $17.5\times$ reduction over ViT-Base, while achieving 37.72 CPU FPS, making it suitable for real-time deployment.

10.
arXiv (math.PR) 2026-06-11

On the $d$-rigidity phase transition in random graphs

Authors:

arXiv:2605.25711v2 Announce Type: replace-cross Abstract: We study generic $d$-dimensional rigidity in sparse random graphs. Our main result is that for every $d\ge 2$, the Erdős–Rényi random graph $G\sim G(n,c/n)$ undergoes a $d$-rigidity phase transition at the known, explicit, $d$-orientability threshold $c_d$: If $cc_d$, then $G$ is a.a.s. not independent in the generic $d$-rigidity matroid, and we give a sharp asymptotic estimate for its rank. In addition, the $d$-rigidity closure of $G$ has a giant clique of linear size, which contains all but at most $o(n)$ vertices of the $((d+1)+d)$-core of the graph. More generally, we compute, up to a $1+o(1)$ factor, the generic $d$-rigidity rank of random graphs with a given degree distribution. For example, we show that the uniform $n$-vertex $k$-regular graph a.a.s. has rank $\min(k/2,d)n+o(n).$ Our approach is to estimate the rigidity rank of a random graph from its Galton–Watson local weak limit, using a parameter that we call local flexibility.

11.
arXiv (CS.LG) 2026-06-18

Sequential Hiring of Contingent Workers Through Learning-Based Optimization

arXiv:2606.18438v1 Announce Type: cross Abstract: In this paper, we study a sequential workforce management problem in a contingent labor setting with uncertainty in both worker production and labor supply. A firm seeks to maximize cumulative profit by maintaining an active team of fixed size while learning worker productivity over time. We emphasize two critical operational frictions in this problem: replacing workers is costly, and workers may not be available immediately for hiring because of, for example, prior job commitments, scheduling constraints, or onboarding procedures. Thus, hiring decisions take effect only after a random delay. We formulate this problem as a stochastic multi-play bandit with costly switching and delayed actions, and develop a learning-based hiring policy, DR-UCB (DelayedReplacement-UCB), that makes replacement and hiring decisions sequentially through learning cycles. In each cycle, the policy uses real-time production data to determine when to initiate workforce changes and which workers to replace and hire. We show that the leading-order regret of the proposed policy matches its lower bound in its dependence on the time horizon. Our numerical experiments show that DR-UCB outperforms benchmark policies.

12.
arXiv (CS.CL) 2026-06-11

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

Authors:

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation – Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition – the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p

13.
bioRxiv (Bioinfo) 2026-06-16

Better data, better trees: GenBank-GISAID deduplication and source-specific artifact masking in viral genomics

GenBank and GISAID are the primary repositories for viral genomic data, but integrating records across them remains a challenge. The same sequence could be made available in both databases without any cross-reference linking the two entries. Consequently, there is no systematic way to identify this redundancy, which compromises the compilation of representative, non-redundant large-scale datasets. In parallel, the growth of viral genomic data has increased the risk of systematic technical artifacts introduced during sequencing or assembly. These artifacts can inflate substitution rate estimates and degrade temporal signal, biasing evolutionary rate estimates. To address both challenges, here we present a formal, reproducible workflow integrating two newly developed complementary tools: G2G matcher for cross-repository harmonization and Lab-Specific Bias FILTer (LSBFILT) for masking of laboratory-specific artifacts. Using the Eastern/Central/South African (ECSA) chikungunya virus lineage as a proof-of-concept, we demonstrate that our integrated workflow restores temporal signal and provides a robust, curated dataset for downstream phylodynamic analyses. Critically, restricting masking of homoplastic sites to specific sequences reduces the substitution rate estimate from an inflated 8.517 x 10e-4; to 5.078 x 10e-4; substitutions/site/year and increases the coefficient of determination (R2) of the root-to-tip regression analysis from 0.353 to 0.677. By enabling systematic cross-repository harmonization and source-specific artifact masking, we provide the molecular epidemiological community with scalable tools to reconcile fragmented genomic data and reduce technical biases, fostering more accurate and reproducible phylogenetic analysis. G2G matcher is available at https://github.com/andrezaleite/G2G-Matcher, and LSBFILT at https://github.com/khourious/LSBFILT.

14.
arXiv (CS.AI) 2026-06-15

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

arXiv:2606.14123v1 Announce Type: cross Abstract: Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises, from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, degrading prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.

15.
arXiv (math.PR) 2026-06-16

The Ornstein$-$Uhlenbeck process on $\mathscr P_2$ with a volatility operator

arXiv:2606.14917v1 Announce Type: new Abstract: We analyze a diffusion ${(\mu_t)}_{t\geq 0}$ on the $2$-Wasserstein space $\mathscr P_2$ over $\mathbb R^d$ for which \begin{equation*} |\mu_t|_2^2-|\mu_0|_2^2-2ct+2\int_0 ^t|\mu_s|_2^2\,d s,\qquad t\geq 0, \end{equation*} is a martingale, where the constant $c\in(0,\infty)$ equals the trace of a volatility operator on a Hilbert space and $|\mu_t|_2:=(\int_{\mathbb R^d}x^T x\mu_t(d x ))^{1/2}$. The invariant measure of ${(\mu_t)}_{t\geq 0}$ is a Gaussian on $\mathscr P_2$, as introduced by P. Ren and F.-Y. Wang. Moreover, the Dirichlet form and its generator are given explicitly on a dense subspace of $L^2$.

16.
arXiv (CS.CL) 2026-06-19

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models – DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) – both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

17.
arXiv (quant-ph) 2026-06-17

Intrinsic Pointer Basis and Irreversible Classicality from Coherence Contraction

Authors:

arXiv:2604.23304v4 Announce Type: replace Abstract: This work analyzes an operational route to classical behavior for reduced quantum states using the intrinsic reference basis (IRB). Relative to a fixed physical conjugation, the IRB separates intrinsic populations from a real antisymmetric cohesion sector. A globally bounded cohesion index is defined and its exponential contraction is proved for phase-free dephasing dynamics aligned with the IRB; for general aligned dephasing, the corresponding modulus-based coherence functional contracts at the same computable rates. The results provide distance bounds to the IRB-diagonal description and a logarithmic upper bound on the time required to reach a prescribed experimental tolerance. The IRB projectors constitute state-derived candidate pointer sectors, and they become dynamically stable pointer sectors when the effective dephasing generator is aligned with them and damps the relevant inter-sector coherences. Degenerate population sectors lead naturally to block-classicality and protected intra-block coherence. In a two-level active sector, the cohesion index equals fringe visibility, giving a direct interferometric test of the contraction law. The construction is independent of any spacetime- or unification-emergence hypothesis and is intended as a channel-level complement to environment-induced einselection.

18.
medRxiv (Medicine) 2026-06-16

Reporting patterns of adverse drug withdrawal events using individual case safety reports in United States and European databases

Introduction: Adverse drug withdrawal events (ADWEs) are a key safety concern with deprescribing but are infrequently reported in trials. Although pharmacovigilance systems have advanced our understanding of medication-related harms, it is unclear how extensively these systems have been used for ADWEs. Objectives: To examine the reporting patterns of ADWEs for all drugs recorded in United States and European pharmacovigilance databases between 2004 and 2023. Methods: A retrospective study was conducted using two pharmacovigilance databases, the publicly available FDA-FAERS dataset and EMA-EV Level 2A (individual-level) dataset. ADWE cases were identified using relevant MedDRA preferred terms. Data on patient characteristics, reporter type, drugs, indication, ADWE outcomes, dechallenge/rechallenge, seriousness criteria, time to onset, duration, and causality were summarised. Results: A total of 158,505 ADWE reports were analysed (FDA-FAERS: 145,514; EMA-EV: 12,987), with mean ages of 46.1 (FDA; 55.3% female) and 45.5 years (EMA; 57.1% female). The frequently reported drug classes were opioids (FDA: oxycodone, 29.8%; EMA: buprenorphine, 19%), antidepressants (FDA: duloxetine, 32%; EMA: venlafaxine, 25.9%) and gabapentinoids (FDA: pregabalin, 6.7%; EMA: pregabalin, 6.0%). The most common adverse outcomes were other serious medical conditions (FDA=63.9%; EMA=46.0%), hospitalisation (FDA=15.9%; EMA=28.3%), and disability (FDA=13.3%; EMA=6.2%) and these outcomes varied significantly based on sex and age group (p

19.
arXiv (CS.CL) 2026-06-18

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

20.
arXiv (CS.LG) 2026-06-12

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

arXiv:2606.12709v1 Announce Type: cross Abstract: As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

21.
PLOS Medicine 2026-06-09

Prediction of hospitalisation in young children with pneumonia in Malawi: A machine learning-based approach

by Patrick Staunton, Mohammad Adib Makrooni, Master Chisale, Billy Nyambolo, Joseph Wu, Damien McCarthy, Mark Ledwidge, Yasir Bin Nisar, Chris Watson, Balwani Mbakaya, Cathal Seoighe, Joe Gallagher Background Globally, pneumonia remains the single biggest cause of mortality in children under 5 years of age. This study sought to train and test a prediction model for hospitalisation within 7 days after initial presentation in 2- to 59-month-old Malawian children with WHO-defined pneumonia in primary care and compare its performance to existing risk prediction models. Methods and findings BIOTOPE is a cohort study of children with pneumonia in a primary healthcare setting in Malawi. The training cohort involved nine primary care centres and the testing cohort involved two primary care centres in Northern Malawi. The training cohort was recruited between December 2022 and April 2023 while the testing cohort was recruited in 2016. Participants were consecutive children aged 2–59 months presenting with cough and/or difficulty breathing and who were diagnosed as WHO-defined pneumonia in primary care of any severity. The training cohort was used to train and validate a machine learning model with a prespecified primary outcome defined as hospitalisation and/or death within 7 days as the outcome. This model was then further evaluated in the testing cohort.Median age was 15 months (interquartile range 8−27) in the training and 17 months (interquartile range 9−29) in the external testing cohort (52.1% and 54.4% male, respectively). Hospitalisation occurred in 14.3% (294) of the training cohort and 12.1% (55) of the testing cohort. There was one death in the training cohort only. WHO danger signs were present in 17.6% (360) and 15.9% (70) of children in the training and testing cohorts, respectively. The optimal machine learning model achieved an area under the receiver operating characteristic and precision recall curves of 0.87 and 0.57, respectively, in the testing cohort outperforming existing risk prediction models; furthermore, this model produced an expected calibration error of 0.16 (a logistic regression model using severity status as the response variable and the log odds of the machine learning model’s calibrated probabilities produced an intercept estimate of −0.32 and a slope estimate of 1.13). Key limitations include the use of hospitalisation and/or death as a severity outcome, which may reflect health system factors rather than true disease severity, that mortality-based comparisons were not possible due to low mortality in these primary care cohorts, and that comparator tools were developed for hospital populations rather than primary care populations. Conclusion This machine learning score outperformed traditional pneumonia risk scores in predicting hospitalisation within 7 days in Malawian children presenting to primary care. Traditional pneumonia risk scores diminish in performance when externally applied to new datasets suggesting they may not generalise well beyond their original derivation settings. Mortality-related findings are not applicable as there was only one death in this cohort. Overall these findings support the potential of machine learning to meaningfully improve early identification of children at risk of severe pneumonia in low-resource primary care settings. Further external validation and clinical impact studies are needed to confirm these results.

22.
arXiv (CS.LG) 2026-06-15

Hybrid Uncertainty Sensitivity Analysis Based on the HSIC for High-Dimensional Responses with Aleatory–Epistemic Separation

arXiv:2606.14053v1 Announce Type: cross Abstract: Quantifying the influence of hybrid aleatory and epistemic uncertainties on high-dimensional system responses remains a major challenge in global sensitivity analysis (GSA). Existing Hilbert–Schmidt Independence Criterion (HSIC)-based approaches are primarily restricted to single-output settings and lack a rigorous decomposition of heterogeneous uncertainty sources and their interactions. To address this limitation, a novel double-space tensor-product RKHS framework is proposed for sensitivity analysis under hybrid uncertainty. By constructing factorized kernels over both the latent input space and the multidimensional output space, a concurrent double Möbius inversion is derived to orthogonally decompose the global dependence measure into pure aleatory effects, pure epistemic effects, and their interaction contributions. The resulting dimension-wise sensitivity indices preserve the uncertainty attribution structure across all output dimensions. To satisfy the independence assumptions required by the decomposition, an auxiliary-variable representation based on the inverse probability integral transform is introduced, enabling the treatment of hierarchical uncertainties and Copula-induced correlations within a unified latent space. A fully vectorized single-loop implementation is further developed to avoid the computational burden of nested Monte Carlo simulation. Statistical significance and estimation uncertainty are quantified through permutation testing and Bootstrap confidence intervals. Numerical studies on a modified multi-output Ishigami function and an aerodynamic pressure-field problem demonstrate the accuracy, scalability, and practical applicability of the proposed framework.

23.
arXiv (CS.CL) 2026-06-11

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop ``Read–Write–Assess–Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

24.
arXiv (CS.CL) 2026-06-18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

25.
arXiv (CS.LG) 2026-06-16

How Post-Training Shapes Biological Reasoning Models

arXiv:2606.16517v1 Announce Type: new Abstract: Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes reasoning and generalization remains poorly understood. We study when post-training improves performance and when it induces over-specialization. Across genomics, transcriptomics, and proteins, we train and evaluate more than 100 biological reasoning models under controlled variation in backbone, continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), measuring both in-domain (ID) and out-of-domain (OOD) performance. We find that each post-training stage reshapes generalization in a distinct way rather than contributing uniform gains. CPT improves downstream performance by aligning models with biological language. SFT consistently increases ID performance but causes OOD performance to peak early and decline as models fit the training distribution. RL, when applied to strong SFT checkpoints with aligned rewards, improves OOD performance and partially recovers generalization. These results show that biological reasoning does not improve monotonically with additional supervision or compute. Instead, performance depends on how training stages are composed. Under fixed post-training budgets, the strongest ID-OOD trade-off comes from brief SFT, larger RL allocations, and asymmetric adaptation capacity across stages.