Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

The Complexity of Min-Max Optimization for Quadratic Polynomials

arXiv:2606.17000v1 Announce Type: cross Abstract: We prove that computing approximate stationary points of min-max optimization over the hypercube is PPAD-hard for quadratic polynomials. This holds even when the polynomials are multilinear, each variable appears in at most three monomials, and the approximation factor is inverse polynomial. As a direct consequence, we obtain the first PPAD-hardness results for two-team zero-sum polymatrix games.

02.
arXiv (CS.CL) 2026-06-16

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation to conduct communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits, and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.

03.
arXiv (quant-ph) 2026-06-12

Simple analytical flux-tuned iSWAP pulses for leakage suppression

arXiv:2606.13052v1 Announce Type: new Abstract: Fast, high-fidelity two-qubit gates are a key requirement for fault-tolerant quantum computation. Tunable coupler architectures provide a flexible approach for implementing entangling gates through flux control with large on-off ratios, but fast flux modulation can induce diabatic transitions and population leakage to non-computational states, limiting gate performance. Here we present an analytical flux control method enabling derivative removal by adiabatic gate ($\Phi$-DRAG) for suppressing leakage in flux tunable two-qubit gates. We show that $\Phi$-DRAG differs fundamentally from conventional microwave implementations and derive modified flux modulation protocols that suppress leakage below $10^{-4}$ for fast entangling gates. The method remains effective across a range of asymmetry between qubit anharmonicities and different circuit parameters, enabling high-fidelity two-qubit gates within the fifteen nanosecond range.

04.
arXiv (CS.CV) 2026-06-18

Recognizing and Reconstructing a Multi-Unit Floor Plan

Digital twins have a major potential to form a significant part of urban management in emergency planning, as they allow more efficient designing of the escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins still remains a largely manual effort, due to a lack of 3D-representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective together with a reconstruction part of the pipeline, which vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques and several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods have achieved the mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.

05.
arXiv (CS.CL) 2026-06-16

Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.

06.
arXiv (CS.CV) 2026-06-15

Rethinking One-Step Image Editing through ChordEdit: Reproduction, Simplification, and New Insights

One-step image editing is important for making text-guided editing fast, practical, and easy to deploy, but its underlying mechanism is still not fully understood. We revisit ChordEdit through reproduction, ablation, and simplification. Our analysis shows that a) the chord window $\delta$ largely acts as an effective timestep shift from $t$ to $t - \delta$; b) chord transport acts on high-noise images and mainly performs low-frequency semantic editing; and c) proximal alignment acts on low-noise images and complements it by adding high-frequency target details. In this view, ChordEdit naturally decomposes editing into a coarse low-frequency transport stage and a fine high-frequency alignment stage. These findings suggest a path toward prompt-conditioned dynamic timestep selection for adaptive image editing. All code and results can be found at \href{https://github.com/Harvard-AI-and-Robotics-Lab/ChordEdit-Reproduction}{link}.

07.
arXiv (CS.CL) 2026-06-11

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these do not scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

08.
arXiv (CS.AI) 2026-06-18

Target-confidence Recourse Using tSeTlin machines: TRUST

arXiv:2606.18832v1 Announce Type: cross Abstract: Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

09.
arXiv (CS.CV) 2026-06-17

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B parametered model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding window subsampling training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.

10.
arXiv (CS.LG) 2026-06-19

Data Bias Mitigation under Coverage Constraints & The Price of Fairness

arXiv:2606.20461v1 Announce Type: new Abstract: Machine learning models have been shown to exhibit discriminatory outcomes or degraded performance for individuals at the intersection of multiple sensitive attributes, such as race and gender. This stems in part from two interrelated challenges: the lack of principled measures for quantifying bias (potentially intersectional), and insufficient representation of intersectional subgroups in training data. We extend a recent bias mitigation framework to incorporate coverage constraints that enforce sufficient representation across groups, including intersectional subgroups. Since achieving exactly zero bias for all groups may not be data efficient (meaning it may require large amounts of data), our solution trades small approximation errors in bias for greater data efficiency while satisfying coverage constraints. We also formulate bias mitigation as an integer linear program that optimizes over all mitigation strategies, and characterize the price of fairness, the minimum data modification cost, as a function of fairness tolerance. This is essential both for legal compliance, where regulations may mandate specific fairness thresholds, and for data governance, enabling practitioners to make informed trade-offs between bias reduction and data modification (particularly, data purchasing) costs. We evaluate our techniques on publicly available datasets, demonstrating that bias mitigation via our framework preserves predictive accuracy across multiple classifiers, and that coverage constraints, while motivated by statistical considerations, are essential for preserving downstream ML performance.

11.
arXiv (CS.AI) 2026-06-11

MLaGA: Multimodal Large Language and Graph Assistant

arXiv:2506.02568v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs–where nodes are associated with diverse attribute types, such as texts and images–remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

12.
bioRxiv (Bioinfo) 2026-06-17

Posterior-calibrated multimodal motor states reveal longitudinal and imaging-associated heterogeneity in Parkinson's disease

Parkinson's disease (PD) motor heterogeneity is commonly summarized by hard subtype labels, although clinical states vary longitudinally, severity can dominate unsupervised structure, and model uncertainty is rarely calibrated. We developed a posterior and refit-stability calibrated multimodal motor state framework that assigns probabilistic MDS-UPDRS-III motor states, aggregates them at the patient level, separates global burden from residual tremor-axial profile, and tests whether imaging can recover the resulting posterior distribution. In 29,366 aligned PPMI motor-posterior visits spanning 4,773 participant identifiers, patient-level state families were stable on average (modal-family fraction 0.925; 95% CI 0.921 - 0.930), but 25.5% of patients transitioned state over follow-up (95% CI 24.1 - 26.7%). PD-only cohort definitions produced smaller denominators and are reported as sensitivity cohorts with rerun calibration and imaging-posterior checks. Severity and covariates explained substantial motor-domain variance, especially bradykinesia (rsecond=0.850), but residual profile modeling retained five active components across total-severity, principal-component, leave-one-domain, non-target-burden, and clinical-only severity axes. Refit-stability calibration with 250 patient-blocked bootstrap refits showed high nominal posterior confidence (0.989) but lower empirical label consistency (0.849), quantifying overconfidence rather than hiding it. Patient-held-out temporal modeling predicted future axial burden (best XGBoost rsecond=0.605) and future state transition (XGBoost AUC=0.830; 95% CI 0.822 - 0.837). DaTSCAN plus FreeSurfer ROI features predicted patient-level soft motor posterior vectors (RF jsd=0.209; 95% CI 0.199 - 0.220; macro-AUROC=0.692), while severity/demographic-adjusted imaging features further improved soft posterior recovery (jsd=0.188). BioFIND transfer reproduced clinically meaningful endpoint gradients after state assignment in 225 external patients, supporting external face validity rather than definitive transportability. These results support PD motor phenotypic states as calibrated, dynamic, clinically interpretable profiles with convergent imaging associations, not as definitive biological subtypes.

13.
arXiv (CS.CV) 2026-06-16

Latent Action Pretraining Through World Modeling

Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way, by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is able to transfer learned knowledge across tasks, environments, and embodiments. It outperforms models pretrained with ground-truth robot actions and other similar pretraining methods on the LIBERO benchmark and real-world setup, while being efficient and practical for real-world settings.

14.
arXiv (math.PR) 2026-06-18

On the Singular Control of a Diffusion and its Running Infimum or Supremum

arXiv:2501.17577v2 Announce Type: replace-cross Abstract: We study a class of singular stochastic control problems for a one-dimensional diffusion $X$ in which the performance criterion to be optimised depends explicitly on the running infimum $I$ (or supremum $S$) of the controlled process. We introduce two novel integral operators that are consistent with the Hamilton-Jacobi-Bellman equation for the resulting two-dimensional singular control problems. The first operator involves integrals where the integrator is the control process of the two-dimensional process $(X,I)$ or $(X,S)$; the second operator concerns integrals where the integrator is the running infimum or supremum process itself. Using these definitions, we prove a general verification theorem for problems involving two-dimensional state-dependent running costs, costs of controlling the process, costs of increasing the running infimum (or supremum) and exit times. Finally, we apply our results to explicitly solve an optimal dividend problem in which the manager's time-preferences depend on the company's historical worst performance.

15.
arXiv (CS.CV) 2026-06-17

MOCHI: Motion Enhancement of Collaborative Human-object Interactions

Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose, where these optimized grasps are extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants are refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

16.
arXiv (math.PR) 2026-06-16

Free energy of non-convex multi-species spin glasses with centered Ising spins

arXiv:2606.16636v1 Announce Type: new Abstract: We identify the limit free energy of all multi-species spin glasses with centered $\pm 1$ spins. The result was previously known only under a convexity assumption on the covariance function of the Hamiltonian. We also obtain a one-species reduction of the formula for balanced multi-species models.

17.
Nature (Science) 2026-06-10

Light slows down carbon nanotubes in water

Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’. Water-suspended carbon nanotubes move more slowly in green light, suggesting that excited electrons in the tubes couple to the water through ‘quantum friction’.

18.
medRxiv (Medicine) 2026-06-12

Order-Based Bayesian Network Modeling of Early Detection and Post-Diagnosis Control for Cardiovascular Disease Risk in Type 2 Diabetes

Patients diagnosed with type 2 diabetes (T2D) are at increased risk of developing cardiovascular disease (CVD), the leading cause of morbidity and mortality in this population. Early detection and glycemic control within the first year after diagnosis reduce CVD risk. However, gaps remain in how to operationalize early detection of T2D using Electronic Health Record (EHR) data and quantify its relationship with subsequent CVD risk using longitudinal observations. We developed a probabilistic graph model to analyze the interdependencies between early detection of T2D, post-diagnosis glycemic control, and CVD occurrence. Using a temporally structured Bayesian Network (BN) learned from EHR data of 9,450 primary care patients between 2017 and 2023, we quantified probabilistic dependencies between demographics, diagnostic delay surrogates, glycemic control, and post-diagnosis CVD occurrence. Percentile based thresholds defined risk groups, where individuals with predicted probabilities in the bottom decile ([≤] 10th percentile) were classified as low risk, and those in the top decile ([≥] 90th percentile) as high risk. Results demonstrated heterogeneity in predicted risks across glycemic and cardiovascular outcomes. Predicted probability of developing CVD within the first year after T2D diagnosis ranged from a mean of 5.2% in the low-risk group to 28.9% in the high-risk group, while predicted probabilities of mean Hemoglobin A1c (HbA1c) [≥] 8% during the first year post-diagnosis ranged from 1.6% in low-risk to 55.1% in high-risk group. Patients with HbA1c at diagnosis [≥] 8% had higher predicted probabilities of first-year post-diagnosis mean HbA1c [≥] 8% (53.3% vs. 1.9%) and high HbA1c coefficient of variation (18.7% vs. 3.1%) compared with those with HbA1c [≤] 6.5%. Incorporating early clinical outcomes refined later risk predictions, with long-term CVD risk reaching 33.5% among high-risk individuals. The proposed model achieved predictive performance comparable to conventional machine learning approaches while providing interpretable relationships for risk stratification in primary care populations.

19.
arXiv (CS.CL) 2026-06-15

FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Large language models are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many natural language processing applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate or non-verifiable facts, making the use of a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on the HybriDialogue, an open-domain dialogue dataset, is only 0.74, indicating that the benchmark remains a challenging task for future research. We release our dataset and code at https://github.com/XiangyanChen/FineDialFact.

20.
arXiv (CS.LG) 2026-06-17

AoiZora: Topology-Aware Auto-Parallel Optimization for Inference of Diffusion Transformers

arXiv:2606.17566v1 Announce Type: cross Abstract: Video diffusion has quickly grown into a key generative serving workload, yet producing each clip demands many denoising iterations over large spatio-temporal latents, which puts low-latency inference out of reach on a single device. A denoising step is therefore typically distributed across multiple accelerators, and TPU sub-slices have become an attractive and practical fabric for doing so. Current auto-parallel systems, however, search almost exclusively over logical device meshes and disregard how a chosen sharding is actually laid out on the physical TPU interconnect – an oversight that leaves large, topology-dependent performance on the table. We address this gap with AoiZora, a compiler-mediated topology planner built for low-latency video diffusion inference on TPU sub-slices. Its guiding principle is to reconnect logical sharding with physical placement by drawing on different points in the compilation flow: AoiZora first eliminates weak sharding candidates from inexpensive pre-compilation IRs, then compiles only the ones that survive and orders their physical placements using compiled HLO together with a topology-aware communication model. The winning plan is realized along the ordinary compiler path, leaving model code, compiler lowering, collective kernels, and network routing entirely intact. On TPU v5e sub-slices, AoiZora reduces Wan 2.1 one-step denoising latency by as much as 1.42x relative to existing solutions.

21.
arXiv (CS.CV) 2026-06-16

Variational Test-time Optimization for Diffusion Synchronization

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

22.
bioRxiv (Bioinfo) 2026-06-13

Virus-human protein-protein interactions predict viral phenotypes

Viral phenotypes such as host and tissue tropism are critical determinants of viral infection and transmission. Inferring viral phenotypes presents unique challenges compared to cellular organisms, as viruses rely entirely on host machinery for replication and survival. Current methods for predicting viral phenotypes mainly rely on viral genomic data, often overlooking host-related information. Here, we evaluated the utility of predicted virus-human protein-protein interactions (PPIs) in inferring diverse viral phenotypes using machine-learning algorithms. For predicting human infectivity, a PPI-based machine learning model outperformed both virus genomic and protein sequence-based models that used large language model embeddings. It also surpassed previous methods that incorporated both viral and host genomic data. The human proteins identified by the model were significantly enriched in functions related to viral infection and immune response. In predicting various phenotypes of human RNA viruses, PPI-based models performed better than virus sequence-based models in forecasting virulence, human transmissibility and transmission routes, while showing comparable performance to genomic sequence-based models in predicting tissue tropism. Finally, we demonstrated that a PPI-based model could distinguish high-risk HPV genotypes from low-risk ones. Proteins associated with high-risk HPV were involved in apoptosis and immune regulation, whereas those linked to low-risk HPV were enriched in telomere maintenance and DNA repair. Collectively, this study is the first to demonstrate the value of predicted virus-human PPIs in inferring viral phenotypes, thereby enhancing our understanding of the molecular mechanisms underlying these phenotypes. It also provides effective tools for risk assessment of emerging viruses, contributing to improved pandemic preparedness.

23.
medRxiv (Medicine) 2026-06-15

SPIRIT-CONSORT-ELM: Element-Level Assessment of Randomized Controlled Trial Reporting Using Large Language Models

Randomized controlled trials (RCTs) play a central role in assessing the benefits and harms of interventions. Incomplete reporting in RCT publications can compromise the verifiability and usefulness of RCTs. SPIRIT and CONSORT reporting guidelines aim to improve the completeness of RCT protocols and results publications, respectively. However, many RCTs are not reported completely. Checking manuscripts automatically could help authors improve the completeness of reports prior to publication. We previously annotated SPIRIT-CONSORT-TM, a corpus of 200 articles (comprising 100 protocol-results publication pairs) using 83 checklist items drawn from SPIRIT 2013 and CONSORT 2010. We also trained machine learning models to automatically assess reporting at the item level. Each checklist item can include multiple constituent elements (i.e., specific details required for that item), and an item might be considered fully reported when all of its elements are present. However, prior work does not explicitly capture or evaluate reporting at the element level. To address this gap, we extended SPIRIT-CONSORT-TM by incorporating element-level annotations and using them to assess reporting completeness (SPIRIT-CONSORT-ELM). We formulated element-level assessment as a machine reading comprehension task, operationalized through 119 questions, where each question targets a specific reporting element within a checklist item. Using the 200 articles included in SPIRIT-CONSORT-TM, two annotators independently answered 119 questions for 50 articles (25 protocol-results pairs) and resolved any discrepancies through discussion; the remaining 150 articles (75 protocol-results pairs) were assessed by a single annotator. We then developed an automated pipeline for element-level assessment using SPIRIT-CONSORT-ELM. The pipeline first applies a PubMedBERT-based model to identify sentences containing item-level reporting information, then it uses a generative large language model (LLM; GPT-5) with chain-of-thought reasoning to answer element-level questions based on the retrieved evidence. Agreement between the two annotators was high (Gwet's AC1: 0.782) and our pipeline achieved high accuracy in identifying element-level reporting evidence (F1: 0.822, Gwet's AC1: 0.796). Ablation studies indicate that chain-of-thought reasoning and the inclusion of illustrative in-context examples modestly improve LLM performance on the machine reading comprehension task. SPIRIT-CONSORT-ELM provides a benchmark for evaluating reporting guideline completeness at the element level, enabling assessment of RCT transparency beyond the simple presence or absence of checklist items and is publicly available at https://osf.io/kznx4/. The automated pipeline establishes a robust baseline for assessing RCT reporting and demonstrates potential as a practical aid for authors, reviewers, and editors to identify and address gaps in completeness and transparency of RCT reports.

25.
arXiv (CS.CL) 2026-06-16

When Correct Edges Cannot Be Verified: A Provenance Gap in Incomplete KGQA and a Provenance-Favoring Completion Policy

Incomplete Knowledge Graph Question Answering (IKGQA) requires completing missing edges to continue reasoning. A growing line of work verifies completed edges against retrieved text, treating textual support as a proxy for edge quality. We ask a question that, to our knowledge, has not been systematically tested: does textual verifiability actually track correctness? Exploiting the gold deleted triples provided by the standard random-deletion protocol, we measure both. The finding is counterintuitive: among gold-correct completed edges, 76-96% have no supporting passage even under exhaustive retrieval, robustly across deletion rates (20%/40%), datasets (CWQ/WebQSP), and relation types (structural, commonsense, long-tail). Most Freebase-style facts simply do not occur as head-tail co-mentions in text. Textual faithfulness therefore measures provenance, not correctness – separated by a paradigm-level gap no in-corpus retrieval closes. This reframes edge completion. Since most completed edges – correct or not – are causally redundant for the answer (95-97% of correct answers do not depend on any unsupported edge), the central question shifts from "is the edge correct?" to "admit or abstain under provenance uncertainty?" Within this framing we present TGComplete, a provenance-favoring admission policy that retrieves evidence at a reasoning breakpoint, verifies a candidate through a lightweight loop, and abstains when support is absent. Against the generate-to-complete baseline GoG, it attains higher edge precision against gold (15-21% vs 3-14%), with no statistically detectable EM loss and 3.1-7.4 times higher strict faithfulness of admitted edges – at the cost of lower recall. We position TGComplete not as uniformly better, but as a principled point on a precision/provenance-recall trade-off, appropriate when auditability matters.