Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-19

GLARE: A Natural Language Interface for Querying Global Explanations

arXiv:2606.19735v1 Announce Type: new Abstract: While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

02.
arXiv (CS.CV) 2026-06-16

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the ``waggle dance'' of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48$\boldsymbol{\times}$ higher average strict success rate and a 7.72$\boldsymbol{\times}$ higher average QA correctness.

03.
arXiv (CS.LG) 2026-06-16

QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold Networks

arXiv:2511.18689v3 Announce Type: replace Abstract: Kolmogorov–Arnold Networks (KANs) replace linear weights with spline-based functions, offering strong expressivity but posing challenges for low-precision deployment due to heterogeneous parameter distributions. We introduce QuantKAN, the first unified framework for quantization-aware training (QAT) and post-training quantization (PTQ) of KANs. The framework employs branch-aware quantizers for base and spline parameters and extends modern QAT and PTQ methods to spline-based layers across EfficientKAN, FastKAN, PyKAN, and KAGN. Experiments on MNIST, CIFAR-10/100, TinyImageNet, and ImageNet provide the first unified QAT/PTQ KAN benchmarks and show that DSQ is the most robust QAT method at aggressive low-bit settings, while GPTQ is the strongest PTQ method at moderate precision. Sensitivity analyses reveal architecture-specific failure modes: spline/basis parameters dominate in FastKAN, while base or scaling parameters dominate in EfficientKAN, GRAM, and PyKAN. Vivado HLS estimates on a Xilinx UltraScale+ device further suggest up to 3.32$\times$ throughput and 7.7$\times$ lower estimated dynamic energy per inference under W4A4, exposing a residual basis-evaluation tax that motivates basis-aware microarchitecture. QuantKAN is available at https://github.com/OSU-STARLAB/QuantKAN/.

04.
arXiv (CS.LG) 2026-06-11

Learning from almost nothing: How neural networks survive heavy input corruption

arXiv:2606.11319v1 Announce Type: new Abstract: Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are >90% corrupted – far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.

05.
arXiv (CS.LG) 2026-06-11

PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

arXiv:2505.08784v2 Announce Type: replace-cross Abstract: As machine learning (ML) enters high-stakes domains, trustworthy uncertainty quantification (UQ) is essential for safety. In this paper we introduce PCS-UQ, a framework based on the Predictability, Computability, and Stability (PCS) principles for veridical data science. Starting with a candidate set of models or algorithms, PCS-UQ integrates a rigorous prediction-check to screen out unsuitable models in the set and utilizes bootstrap samples, in order to capture both inter-sample variability and algorithmic instability for the prediction-checked algorithms. We then introduce a novel multiplicative calibration scheme to enhance local adaptivity, which basically corresponds to a new score in conformal prediction. Moreover, we produce a compilation of 17 real-world regression datasets with manually-constructed subgroups. On this benchmark, PCS-UQ maintains the target coverage while outperforming or matching conformal methods equipped with oracle-selected algorithms in interval width. PCS-UQ achieves consistent subgroup coverage, outperforming these oracle-selected conformal methods. Notably, PCS-UQ stands out in achieving both competitive interval widths and consistent subgroup coverage.Across 6 classification datasets, PCS-UQ reduces prediction set sizes by 20\%. To scale the framework for deep learning, we propose computationally efficient variants that bypass expensive retraining. On three computer vision benchmarks, these variants reduce prediction set sizes by 20\% over conformal baselines. Finally, we provide theoretical proof that a modified PCS-UQ algorithm preserves valid coverage under exchangeability as a form of split conformal inference.

06.
arXiv (quant-ph) 2026-06-12

Generalized two-qubit Hamiltonian for Projective Quantum Feature Maps

arXiv:2606.13641v1 Announce Type: new Abstract: Projected quantum feature maps provide a strategy for using quantum processors as feature generators for classical machine-learning models. Building on counterdiabatic Ising-glass and one-dimensional Heisenberg PQFMs, we introduce a generalized two-qubit Hamiltonian-based PQFM that provides a unified way to encode classical features through local Pauli fields and pairwise two-qubit Pauli interactions. This construction allows distinct classical variables to be embedded along different Pauli axes of the same qubit, increasing the information density of shallow circuits while remaining compatible with hardware constraints. We develop and implement these methods in pqfmlib, a publicly available Python library for constructing, executing, and benchmarking Hamiltonian-based PQFMs.We then benchmark the generalized Hamiltonian PQFMs against reference PQFMs on four biomedical classification datasets under a nested cross-validation protocol with paired statistical tests. Quantum features are generated using both IBM quantum processors with up to 156 qubits and statevector simulations. Our results show that the generalized two-qubit Hamiltonian family provides the most consistent pattern of statistically supported gains over matched classical baselines, although the performance of all methods depends on the dataset, encoding strategy, measured observables, and hardware conditions. These findings support generalized Hamiltonian PQFMs as a promising route toward near-term quantum utility.

07.
arXiv (CS.LG) 2026-06-16

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

arXiv:2606.14945v1 Announce Type: new Abstract: The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring $O(n)$ token cost per iteration and $O(n^{2})$ total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90\% fewer tokens (2{,}492 vs.\ 24{,}465). On code optimization, the stateful agent consumes 52\% fewer tokens (627K vs.\ 1{,}275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at $O(n)$ cost per iteration, while the stateful agent operates within a fixed-size conversation window at $O(1)$ cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.

08.
arXiv (math.PR) 2026-06-12

A mathematical study of the excess growth rate

arXiv:2510.25740v2 Announce Type: replace-cross Abstract: The excess growth rate, defined as the gap in Jensen's inequality for the logarithm, is a fundamental functional in portfolio theory. In this paper, we present a mathematical study motivated by information theory. We begin by establishing its properties and showing that it has rich connections with information theoretic concepts such as the Helmholtz free energy, L. Campbell's measure of average code length and large deviations. Our main results consist of three axiomatic characterization theorems of the excess growth rate, in terms of (i) the relative entropy, (ii) the gap in Jensen's inequality, and (iii) the logarithmic divergence that generalizes the Bregman divergence. Furthermore, we study maximization of the excess growth rate and compare it with the growth optimal portfolio. Our results not only provide theoretical justifications of the significance of the excess growth rate, but also establish new connections between information theory and quantitative finance.

09.
arXiv (CS.CV) 2026-06-15

Temporal Backtracking Search for Test-time Generative Video Reasoning

While test-time scaling has revolutionized reasoning in large language models, generative video reasoning remains bottlenecked by a single-shot paradigm. We demonstrate that searching over denoising steps cannot rescue logically flawed rollouts because spatial trajectories commit early in the diffusion process. Root-level Best-of-N (BoN) sampling is similarly inefficient: reasoning errors cluster early in the temporal axis, and resampling blindly discards verified upstream progress. To unlock effective test-time scaling for video models, we introduce Temporal Backtracking Search (TBS), which shifts the search space to the temporal axis. TBS transforms video generation into an iterative generate-verify-restart loop via three core mechanisms: (1) variable-K conditioning to resume generation from arbitrary clean prefixes; (2) temporal process verification to localize failures and extract valid restart anchors; and (3) prefix-based search to reallocate compute toward extending correct trajectories rather than root resampling. Across algorithmic, navigation, and robotics domains, TBS Pareto-dominates matched-budget BoN. In a strict out-of-distribution setting where one-shot generation collapses (0.7% for BoN), TBS achieves 22.7%, with every solved episode stemming from a restarted branch. Ultimately, TBS reveals that the local reasoning competence of video models far exceeds what single-shot rollouts indicate, providing a scalable test-time framework to unlock it.

10.
arXiv (CS.CL) 2026-06-16

Data-Driven Decoding of Russell's Circumplex Model of Affect

Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformers' embeddings recover the geometric regularities of Russell's circumplex model. We unify two complementary experiments testing the hypothesis that, after training models on text and speech, their resulting latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, across naturalistic datasets like MSP-Podcast and controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models, demonstrating that Russell's circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.

11.
arXiv (CS.AI) 2026-06-15

Transforming Shape Schemas with Composable Property-Graph Queries (Extended Version)

arXiv:2606.14309v1 Announce Type: cross Abstract: Property graphs may be constrained by schemas that inform both query engines and human users about the shape of valid data, enforcing a contract between data provider and consumer. Composable property-graph queries transform input graphs into output graphs. Then, the question arises of which schema can be expected after one (or several) transformation steps. We investigate how schema constraints can be inferred given an input schema and a transforming query. Specifically, we propose a reasoning procedure that, given an input schema in ProGS and a query in G-CORE infers an output schema. Since graph updates will happen frequently, our inference procedure does not rely on graph instances, such that the computed output schema applies to all graphs originating from any input graph complying with the input schema. Related work has addressed this problem for SPARQL CONSTRUCT queries, encoding it in Description Logics (DLs) so that the output schema is entailed by axioms inferred from input schema and queries. Property graphs and their queries, however, complicate the matter, as property graphs feature label and property annotations as well as first-class edges. Thus, reification has to be used in one way or another, though available DLs lack the means to encode such features directly. We approach this novel challenge via a family of mappings for i) property graphs reified in RDF, aligned with ii) a mapping from ProGS to SHACL and iii) a mapping from G-CORE to SPARQL CONSTRUCT queries. In this manner, schema inference for property graphs becomes manageable, as we break apart the problem through the extra mapping layer and utilize efficient DL reasoners. We develop the metatheory regarding the soundness of inferred schema constraints and the semantic equivalence of mapped schemas and queries.

12.
PLOS Computational Biology 2026-06-02

Linking reduced prefrontal microcircuit inhibition in schizophrenia to EEG biomarkers in silico

by Sana Rosanally, Frank Mazza, Heng Kang Yao, Faraz Moghbel, Hannah Seo, Etay Hay Reduced cortical inhibition by parvalbumin-expressing (PV) interneurons in schizophrenia is thought to be associated with impaired processing in the prefrontal cortex and altered EEG signals such as oddball mismatch negativity (MMN). Recent studies also suggest loss of somatostatin (SST) interneuron inhibition. However, establishing the link between reduced interneuron inhibition and reduced MMN experimentally in humans is currently not possible. To overcome these challenges, we simulated spiking activity and EEG during baseline and oddball response in detailed models of human prefrontal microcircuits in health and schizophrenia, with reduced PV and SST interneuron inhibition as constrained by postmortem patient data. We showed that reduced PV interneuron inhibition can account for the decreased MMN amplitude seen in schizophrenia, with a threshold below which the amplitude effect was low as seen in at-risk patients. In contrast, reduced SST interneuron inhibition did not affect the MMN amplitude. We further showed that both types of inhibition loss were necessary to account for changes in resting EEG in schizophrenia, with reduced SST interneuron inhibition increasing broadband power, and reduced PV and SST interneuron inhibition both leading to a right shift from alpha to beta frequencies. Our study thus links reduced PV and SST interneuron inhibition in schizophrenia to distinct EEG biomarkers that can serve to improve stratification and early detection using non-invasive brain signals.

13.
arXiv (CS.CL) 2026-06-16

TMASC: Transmasculine Attitude and Speech Corpus

作者:

We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

14.
medRxiv (Medicine) 2026-06-17

Frequency-dependent cognitive effects of Deep Brain Stimulation in Parkinson's Disease: A Systematic Review and Meta-Analysis

Background: Subthalamic nucleus deep brain stimulation (STN-DBS) improves levodopa-induced motor complications and cardinal motor symptoms of Parkinson's disease (PD), but stimulation frequency may differentially shape outcomes. This is evident for axial and gait symptoms, which may respond differently to lower-frequency stimulation. Whether frequency-dependent effects extend to cognition remains unclear. Objective: To investigate the cognitive effects of DBS at distinct frequencies in PD. Methods: We conducted a systematic review and meta-analysis (PROSPERO - CRD42024618253). PubMed, Web of Science, and EMBASE were searched for studies assessing cognitive outcomes under different stimulation frequencies. Eight cognitive domains were defined: verbal fluency, cognitive flexibility, executive control, working memory, attention, processing speed, episodic memory, and time processing. Multilevel random-effects meta-analyses were performed, with effect sizes expressed as Hedges' g. Results: Forty-three studies met the inclusion criteria, the majority (n = 31) involving STN-DBS. Twenty-one STN-DBS studies, including 355 patients, were included in the meta-analysis. Compared with HFS ([≥] 130 Hz), lower frequencies (4-80 Hz) were associated with better verbal fluency (g = 0.27) and cognitive flexibility (g = 0.38), with consistent effects across sensitivity and leave-one-out analyses. Accuracy-based executive control measures also favored lower-frequency stimulation. OFF-stimulation comparisons showed a concordant pattern. Evidence for other targets (PPN and NBM) was limited. Conclusions: Lower-frequency STN-DBS was associated with modest benefits in specific cognitive domains compared with HFS. These findings highlight the need for future research to determine how frequency interacts with stimulation location and symptom-specific networks to shape cognitive and cognitive-motor outcomes in PD.

15.
arXiv (CS.CV) 2026-06-11

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

We study whether grounded reasoning supervision from abundant 2D medical images can improve 3D medical VQA when both input types are aligned through a common reasoning interface. We introduce UniReason-Med, a single-checkpoint framework that processes either a 2D image or a slice-serialized 3D volume at inference time, generating interleaved textual reasoning and localized visual evidence through shared box syntax, region-token injection, and a common grounded reasoning policy. To train this interface, we construct UniMed-CoT, a 220K instruction-tuning dataset with interleaved textual reasoning and grounded visual evidence, including 170K 2D and 50K 3D samples. Through supervised fine-tuning followed by outcome-level reinforcement learning, UniReason-Med learns to generate grounded reasoning traces without IoU/Dice-based localization rewards during RL. Data-mixture and component ablations show that joint 2D+3D grounded supervision substantially improves 3D reasoning over 3D-only training, while grounding and region-token injection consistently benefit both 2D and 3D tasks. These results suggest that a shared grounded reasoning interface can transfer reasoning structure from 2D images to slice-serialized volumetric medical understanding. The code and data are publicly available at https://github.com/IQuestLab/unireason-med.

16.
arXiv (quant-ph) 2026-06-16

Optimising Entanglement Distillation Policies

arXiv:2606.14908v1 Announce Type: new Abstract: Entanglement distillation is a fundamental operation in quantum information processing used to obtain higher-fidelity entangled pairs from a supply of less entangled quantum states using local operations aided by classical communication (LOCC). In a physically relevant setting, where states with an initial fidelity of $f_0$, probabilistically generated over multiple, $m$, memory pairs distributed between two parties, Alice and Bob, are pairwise distilled, the optimal policy identifies the system-configuration dependent sequence of entanglement generation and distillation operations that need to be performed in order to minimize the expected time to reach some target fidelity $f_T>f_0$. Here, we formulate and systematically analyze this task as a Markov decision problem and using a value iteration algorithm, obtain optimal deterministic policies that minimize the expected waiting time required to reach a target fidelity. Our results show that the expected waiting time under the optimal policy decreases with increasing generation probability $p$ and number of quantum memories $m$ - as expected. In contrast, it exhibits non-monotonic behavior with respect to $f_0$ for a fixed fidelity gap, $(\Delta f = f_T-f_0)$. While the optimal policy consistently outperforms baseline policies such as the greedy, nested and entanglement pumping policies, its relative advantage is regime-dependent, being determined by the system parameters ($p,f_0,f_T,m$), and exhibits a nontrivial dependence on the fidelity gap $\Delta f$. Our results highlight the value of formulating entanglement distillation as a Markov decision problem, enabling the systematic design of policies that achieve target fidelity thresholds for quantum information tasks in realistic resource-constrained settings.

17.
arXiv (CS.CV) 2026-06-18

RUB: Evaluating Residual Knowledge in Unlearned Models

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

18.
arXiv (CS.CV) 2026-06-18

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lacks of training indoor images, we introduce a novel approach leveraging Stable Diffusion (SD) for the generation of synthetic images, which serve as a powerful data augmentation tool. The utilization of SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a counter measure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE presentation enables training robust classifiers only using lightweight deep models. Experiments show that our approach can perfectly recognize SD generated images with the accuracy of 100% using MobilenetV3.

19.
arXiv (CS.CL) 2026-06-17

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.

20.
medRxiv (Medicine) 2026-06-17

Menopausal symptoms in peri- and postmenopausal women: systematic review and meta-analysis of prevalence, incidence, comorbidities, and clinical outcomes

Introduction: The global epidemiology of menopausal symptoms among middle-aged and elderly women remains unclear. Methods: Data on prevalence, comorbidities, incidence and outcomes of menopausal symptoms published up until March 1st 2019 were searched in PubMed, Embase and Cochrane databases. We used a random-effects model to compute point estimates of prevalence for 24 types of menopausal symptoms. We narratively summarized the patterns of the comorbidities, incidence and outcomes of menopausal symptoms due to limited data. Results: A total of 239 studies (n{approx}2.5 million middle-aged and elderly women) from 56 countries and regions were included in the analysis. The global pooled prevalence analysis revealed that hot flashes (48%) and night sweats (30%) were highly prevalent, alongside psychological symptoms like insomnia (47%), irritability (46%), anxiety (39%), and depression (30%). Physical symptoms including joint aches/pain (50%), backache (47%), and tiredness (61%) were also commonly reported. Heat intolerance showed the highest prevalence (76%), while symptoms like urinary incontinence (24%) and poor appetite (8%) were less frequent. These findings highlight the diverse and widespread impact of menopause on women globally, with significant variations across symptom types. Africa showed the highest pooled prevalence across a series of symptoms, compared with other continents. We observed high prevalence in developing countries, especially for psychological and physical symptoms; significant intra-Asian variation in vasomotor symptoms; hypertension and obesity as the most common comorbidities; joint pain, urinary incontinence, and vasomotor symptoms as the most incident complaints; and positive associations with cardiovascular disease in the psychological (depression and insomnia) and physical (joint pain) domains. Conclusion: This study highlights the global burden of menopausal symptoms, with significant differences across continents. The findings call for more inclusive research on underrepresented groups (particularly in Africa) and further investigation into drivers of this marked global heterogeneity in prevalence of menopausal symptoms and their comorbidities, incidence and outcomes.

22.
arXiv (CS.AI) 2026-06-12

What Type of Inference is Active Inference?

arXiv:2606.04935v2 Announce Type: replace Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

23.
arXiv (CS.LG) 2026-06-12

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

arXiv:2606.12979v1 Announce Type: new Abstract: JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74% - Outcome C: a null result - by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes - buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021 - rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.

24.
arXiv (CS.CL) 2026-06-18

Enhancing Multilingual Reasoning via Steerable Model Merging

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

25.
arXiv (CS.AI) 2026-06-16

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

arXiv:2606.15029v1 Announce Type: new Abstract: LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters – a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.