Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
bioRxiv (Bioinfo) 2026-06-14

Transposable elements as evolutionary substrates of proteindisorder in the human proteome

Intrinsically disordered regions (IDRs) are central contributors to protein function, evolution and human disease, yet the evolutionary routes that seed new disordered segments within pre-existing proteins are still poorly understood. Sequence insertions provide a powerful mechanism for disorder expansion, but the genomic donors of inserted IDR and its long-term conformational fate remain largely unknown. Transposable elements (TEs), abundant mobile genetic elements with distinctive compositional biases, represent compelling candidates for generating disorder within proteins. Here, we systematically mapped TE-derived segments across human proteins and isoforms, and we found that these insertions are strongly enriched in intrinsic disorder. The structural consequences of their insertion are shaped by TE class and family, reflecting the sequence biases of the elements from which they originate. Recent, Primate specific insertions preferentially generate disordered segments, whereas older insertions more frequently occupy ordered structural contexts, revealing an age-dependent transition in the conformational state of TE-derived sequences. TE-containing isoforms are expressed at lower levels than TE-free isoforms, particularly when insertions are young and disorder-rich, suggesting that intrinsic disorder may constrain the cellular tolerance of newly exonized sequences. These findings identify TEs as a major evolutionary mechanism linking genome mobility to the emergence of new disordered conformational ensembles in the human proteome.

02.
arXiv (CS.CL) 2026-06-11

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.

03.
medRxiv (Medicine) 2026-06-22

Multisite Real-World Validation of an Electronic Health Record-Integrated Generative Artificial Intelligence Tool for Venous Thromboembolism Risk Stratification

Background: Guiding risk-appropriate inpatient thromboprophylaxis requires venous thromboembolism (VTE) risk stratification; however, reliable risk determination remains inconsistent in routine care. Health systems increasingly pilot artificial intelligence (AI) tools, yet few studies demonstrate rigorous evaluation in the context of a learning health system (LHS). We evaluated the performance of a pilot electronic health record (EHR)-integrated generative AI (GenAI) system, inHealth General Reasoner (iHGR), for VTE risk stratification versus clinician order set classifications and physician-adjudicated chart review. Methods: This multisite retrospective validation study included adult inpatient admissions at Johns Hopkins Medicine between June 21, 2025, and Dec 18, 2025 (checklist-based order set from June 21, 2025 - November 19, 2025, and clinician judgement-based order set from November 29 - December 18, 2025). From 758 eligible admissions, we randomly sampled 500 balanced by site and order set periods. iHGR and clinician-selected order set classifications were compared with the reference standard (RS). Primary outcomes were iHGR sensitivity and specificity. Secondary analyses compared the order sets with the same RS to evaluate workflow comparators and error patterns. Results: iHGR achieved 81.8% sensitivity (95% CI 77.3-85.6) and 70.9% specificity (63.6-77.3). The checklist-based order set had 61.3% sensitivity (53.7-68.5) and 86.2% specificity (77.4-91.9). The clinician judgement-based order set had 78.1% sensitivity (71.3-83.7) and 65.4% specificity (54.3-75.0). False-negative iHGR classifications were associated with missed narrative risk factors. Conclusion: iHGR showed higher sensitivity for VTE risk than checklist-based order sets and clinician judgement without introducing systematic bias. In silico evaluation of pilot AI systems within LHSs can identify clinically important performance trade-offs and implementation targets before operational scale-up. Narrative clinical data abstraction remained a key limitation, supporting the use of GenAI to support rather than supplant clinician judgement.

04.
arXiv (CS.CV) 2026-06-16

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

Early detection of plant diseases is crucial to plants and for the farmers. Plant diseases reduce fruit yield and quality, and plants are more susceptible to other stresses when they are infected. The lemon leaf disease dataset contains 1354 images. The dataset has 9 classes. Among the 9 classes only one class is for healthy leaf, and the other 8 classes are leaf diseases. The dataset was split into training (70%), testing (15%) and validation (15%) sets after comprehensive preprocessing. Two pretrained models (InceptionV3 and MobileNetV2) were applied and then combined these models using an ensemble technique to boost robustness. Ensemble models showed a promising performance of 99.27% accuracy. Adversarial Training is applied to improve models' ability and ensure reliable predictions under noisy data. Grad-CAM visualization highlights the important regions of leaf images that validate the model prediction with confidence level.

05.
arXiv (CS.CL) 2026-06-24

Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

Czech has been part of Universal Dependencies since its first release in 2015. It has also been one of the best represented languages, with the Prague Dependency Treebank being order of magnitude larger than most other UD treebanks. More recently, three other datasets from the Prague family were added and the annotations thoroughly revisited, forming the "Prague Dependency Treebank-Consolidated" (PDT-C). In comparison to the original PDT, PDT-C is more than twice as large, but it is also much more diverse in terms of genres and domains. In this paper, we describe the conversion of the new resource to Universal Dependencies. While the two annotation schemes are relatively similar at the first sight, there are numerous small differences in topology of the dependency structures and in granularity of the POS and relation type inventories. We demonstrate a selection of such differences on examples, discuss the diverging motivations, as well as ways to overcome the differences during conversion. We argue that while PDT is less "universal" and more tightly bound to one language, its multi-layer annotation is rich and provides all information needed for basic UD trees, and much more.

06.
arXiv (CS.CL) 2026-06-15

"I Didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

07.
Nature Medicine 2026-06-08

Effects of SGLT2 inhibition on incident heart failure in carriers of cardiomyopathy-associated genetic variants

Although the beneficial effects of sodium–glucose cotransporter 2 (SGLT2) inhibition in heart failure (HF) have been well established, it is unknown whether SGLT2 inhibition confers benefit in carriers of rare variants in cardiomyopathy-associated genes. Here we evaluated whole-exome sequencing data from the randomized DECLARE-TIMI 58 trial, in which adults with type 2 diabetes and increased cardiovascular risk were randomized to dapagliflozin or placebo treatment. Pathogenic or likely pathogenic variants (P/LP) in high-confidence cardiomyopathy genes were identified, and treatment effects on hospitalization for HF (HHF) were compared between carriers of such variants and noncarriers. Among 12,685 patients for whom sequence data were obtained, 121 carried a cardiomyopathy variant (76 dilated cardiomyopathy, 25 hypertrophic cardiomyopathy and 25 arrhythmogenic cardiomyopathy). Over a median follow-up of 4.2 years, dapagliflozin lowered the risk of HHF more strongly in carriers (hazard ratio 0.18, 95% confidence interval 0.04–0.86) than in noncarriers (hazard ratio 0.70, 95% confidence interval 0.57–0.86; P interaction 0.03). Absolute risk reduction was 13.0% in carriers and 1.0% in noncarriers (P interaction 0.03). Most carriers (82%) had no prior HF, and in carriers without prior HF, treatment with dapagliflozin reduced the absolute risk of HHF by 12.8%, compared with a reduction of 0.6% in noncarriers (P interaction 0.01). The findings from this cohort of older and high-risk patients raise the possibility that SGLT2 inhibitor treatment should be started early to prevent HF in individuals who carry P/LP cardiomyopathy variants. These results need to be confirmed in a prospective, dedicated trial of preventive HF treatments in carriers of P/LP cardiomyopathy-associated variants. In a whole-exome sequencing analysis, the beneficial effects of the SGLT2 inhibitor dapagliflozin in reducing the risk of future heart failure hospitalization in individuals with type 2 diabetes were markedly greater in individuals who carried a cardiomyopathy-associated genetic variant compared with noncarriers, suggesting a personalized preventative therapy based on genetic information.

08.
arXiv (CS.AI) 2026-06-19

Efficient and Sound Probabilistic Verification for AI Agents

arXiv:2606.20510v1 Announce Type: cross Abstract: Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.

09.
medRxiv (Medicine) 2026-06-18

Web-based education on Metabolism and Obesity is associated with improved lifestyle and health behaviours among Brazilian school teachers

Background: Obesity is a major global public health challenge, and teachers play a critical role in school-based health promotion. This study examined the perceived impact of a web-based educational program on metabolism and obesity delivered to Brazilian school teachers. Methods: This analytical cross-sectional study included 217 teachers who responded to the evaluation questionnaire after attending the course between 2017 and 2022. Statistical analyses included logistic regression and chi-square tests. Findings: Course completion rate was 81.98%, substantially exceeding the 5-15% typical of global MOOCs. However, ethnic disparities were observed: White respondents were 4.95 times more likely to complete the course than Black respondents (p=0.00097) and Brown respondents were 3.05 times more likely (p=0.0268) than Black respondents. Among non-completers, lack of time (64.7%) was the primary barrier. Participation was concentrated in Sao Paulo (77%), with no respondents from three northern states. Perceived difficulty showed a non-significant trend (p=0.0893) where by Black respondents had the lowest predicted difficulty; the most challenging course material was Scientific Content/Reading papers (50%). Completion was strongly associated with applying learned activities in teaching (p

10.
arXiv (quant-ph) 2026-06-24

Generalised simultaneous transmission of arbitrary quantum states and classical information

arXiv:2606.03181v3 Announce Type: replace Abstract: We present a protocol which allows for arbitrary optical quantum states to simultaneously carry and transmit classical data, without sacrificing the integrity of either the quantum or classical information. Our scheme encodes classical information via displacements in the phase space prior to transmission and retrieves each classical symbol via a Gaussian continuous-variable teleportation. The original quantum state is then restored by guessing the the original displacement and performing the appropriate inverse operation. In the limit of sufficiently high classical signal and high squeezing, we show that our scheme is capable of perfectly reconstructing both the input classical signal and the input quantum state without loss of coherence. An example is given in terms of the transmission of a dual-rail Bell state.

11.
arXiv (quant-ph) 2026-06-11

Compressed minimum-purity time evolution for late-time quantum dynamics

arXiv:2606.11392v1 Announce Type: cross Abstract: Unitary time evolution of initially simple quantum many-body states rapidly generates entanglement and complex correlations, which limits direct numerical simulations. The late-time dynamics of physical observables, however, typically exhibits an effective simplicity in the form of hydrodynamics or kinetic theory. This leads to the question whether microscopic equations of motion can remain accurate and tractable up to long time scales by discarding irrelevant information in a controlled manner. Here, we introduce compressed minimum-purity time evolution (CoMPuTE) as an approach to keep track of a consistent set of reduced local density matrices, closing the hierarchical equations of motion using a minimum-purity principle. In benchmark applications we demonstrate (i) accurate description of energy diffusion in the one-dimensional mixed-field Ising model, (ii) the applicability to genuinely out-of-equilibrium Floquet dynamics starting from a pure state, and (iii) the limitations of the local reduced density matrix approximation when describing transport in the XXZ chain at $\Delta=1$ that is governed by increasingly non-local integrals of motion. The CoMPuTE method enhances computational efficiency in comparison to the closely related local-information time evolution algorithm, opening a possible route towards an extension to systems in higher spatial dimensions.

12.
arXiv (CS.CV) 2026-06-24

Pocket-SLAM: Rendering-Area-Aware Pruning for Memory-Efficient 3DGS-SLAM

3D Gaussian Splatting (3DGS) has garnered significant attention in Simultaneous Localization and Mapping (SLAM) due to its advances in capturing fine-grained geometry features and synthesizing novel views. For SLAM in large-scale scenes, such as autonomous driving, 3DGS-SLAM faces a critical limitation: memory consumption increases continuously over time as Gaussian points accumulate, leading to poor memory efficiency and limiting its applicability. In this work, we propose a rendering-area-aware pruning strategy that selectively removes Gaussians based on their contribution to the effective rendering area, rather than solely relying on Gaussian-level heuristics such as opacity or gradient magnitude. This perspective directly targets the sources of memory redundancy, effectively reducing the peak memory footprint of 3DGS-SLAM during runtime. Evaluations on the EuRoC and KITTI datasets demonstrate that our method consistently outperforms existing pruning approaches in large-scale outdoor scenes, achieving over 60% memory reduction and more than 2 times FPS improvement while preserving localization and mapping accuracy. These results highlight rendering-area-aware pruning as a promising direction for scaling 3DGS-SLAM to real-world autonomous driving scenarios. Our code is publicly available at https://github.com/UMN-ZhaoLab/Pocket-SLAM.git.

13.
arXiv (CS.CV) 2026-06-16

Structural Energy Guidance for View-Consistent Text-to-3D Generation

Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

14.
arXiv (CS.AI) 2026-06-19

TerraMind: Large-Scale Generative Multimodality for Earth Observation

arXiv:2504.11171v5 Announce Type: replace-cross Abstract: We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) – the capability of generating additional artificial data during finetuning and inference to improve the model output – and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

15.
medRxiv (Medicine) 2026-06-18

Rare Coding Variants Reveal Distinct Genetic Architectures Across Multidimensional Sleep Phenotypes

Sleep and circadian traits have been widely studied using common variants, but the contribution of rare coding variation remains unclear. We analyzed rare coding variants in 397,065 whole-exome sequenced UK Biobank participants across 36 sleep phenotypes from self-report, diagnoses, sleep medication use and accelerometry, and meta-analyzed results with 171,536 whole-genome sequenced All of Us participants of diverse ancestries, with replication in the Mass General Brigham Biobank (N = 31,275). We identified 260 genes associated with sleep phenotypes, including novel associations with sleep medication use in 29 genes and 24 out of 29 have not previously been reported with any sleep phenotypes. We observed modest but significant rare variant heritability and strong genetic correlations between sleep medication use, insomnia and fatigue. Temporal gene expression trajectory analyses indicate that genes associated with self-reported sleep traits show constant high prenatal expression, whereas genes linked to sleep medication phenotypes exhibit peak expression in the late prenatal period. These findings highlight distinct biological mechanisms captured by different measurement sources of sleep phenotypes and reveal rare-variant-informed targets for therapeutic discovery.

16.
arXiv (CS.CV) 2026-06-12

Surflo: Consistent 3D Surface Flow Model with Global State

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens-one global state-and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

17.
arXiv (CS.AI) 2026-06-16

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

arXiv:2606.16210v1 Announce Type: new Abstract: Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

18.
arXiv (math.PR) 2026-06-17

A Tanaka-Type Formula for Compact Sets and Equilibrium Measures of L\'{e}vy Processes

arXiv:2606.17472v1 Announce Type: new Abstract: Tanaka's formula is a classical identity for Brownian motion, and Tsukada (2018) extended it to L\'{e}vy processes not necessarily symmetric. From a potential-theoretic point of view, this formula shows that the invariant function for the process killed upon hitting a singleton can be decomposed into the sum of a martingale part and a local time. In this paper, we generalize this singleton setting and derive a Tanaka-type formula for a compact set $B$. To this end, we introduce the equilibrium measure, defined as the rescaled limit of the $q$-capacity measures, and show that the invariant function for the process killed upon hitting $B$ can be represented as the integral, with respect to the equilibrium measure, of the invariant functions associated with processes killed upon hitting singletons, up to an additive constant called the Robin constant. Moreover, when $B$ is an interval, we obtain explicit representations of the equilibrium measure, the Robin constant, and the martingale part for recurrent stable processes as well as for recurrent spectrally negative L\'{e}vy processes. Finally, we discuss how an analogous Tanaka-type formula can also be established for transient L\'{e}vy processes.

19.
arXiv (CS.AI) 2026-06-19

Sensorimotor World Models: Perception for Action via Inverse Dynamics

arXiv:2606.20104v1 Announce Type: cross Abstract: Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.

20.
arXiv (quant-ph) 2026-06-11

A semi-definite programming formulation of the device-dependent guessing probability

arXiv:2606.12079v1 Announce Type: new Abstract: In quantum mechanics, a measurement applied to a state in general produces some amount of intrinsic randomness. This is not only a fundamental feature of the theory, but is also at the basis of any quantum process to generate random numbers. The simplest of such processes consists of a single, fully charaterized, measurement acting on a single, fully characterized, state. Unfortunately, no general method to estimate the intrinsic randomness produced in such setups is known. In this work, we address this issue by presenting a semidefinite programming formulation of the maximum probability with which an adversary, Eve, can guess the outcomes of characterized but untrusted prepare-and-measure setups. We then present several applications of this construction. First, we apply our method to a variety of specific setups, allowing us both to benchmark the approach and, more importantly, to determine the exact amount of certifiable randomness in scenarios where only upper bounds were previously available. Then, we show that the presence of entanglement between the device preparing the state and the measurement strictly increases Eve's predictive power, already in the most elementary setup of a binary measurement acting on a qubit state.

21.
bioRxiv (Bioinfo) 2026-06-24

Beyond statistical significance: ranking transcription factor binding motifs by effect size

Chromatin immunoprecipitation-sequencing (ChIP-seq) has wide use in identifying transcription factor binding sites. DNA sequence motifs specific to a targeted transcription factor occur more frequently near ChIP-seq peak centres. The most common approach to quantifying relative motif enrichment ranks motifs by p-value . Because sample sizes can vary substantially across examined motifs, p-value magnitudes may reflect this heterogeneity rather than the biological effect of interest. As alternatives, we considered four ranking methods based on effect sizes: (a) a modified Cliffs delta, (b) the lower bound of a frequentist asymptotic confidence interval, (c) the lower bound of a frequentist finite-sample confidence interval, and (d) the lower bound of a Bayesian credible region. Through extensive simulations, the four alternatives better recovered the simulated central- enrichment ordering under heterogeneous sample sizes. Using published ChIP-seq data for GATA3, the effect size methods ranked the known targeted motif highest, even compared to highly similar motifs for other GATA family members, while p-value ranking did not. In a separate SRF application, all four alternative methods also consistently ranked the known motif highest. We recommend the asymptotic confidence interval lower bound for its simplicity, ease of implementation, and intuitive interpretation. The software is freely available (https://github.com/ScottMastro/motif-ranking).

22.
bioRxiv (Bioinfo) 2026-06-11

STITCH links cellular morphology and gene expression in spatial transcriptomics

In situ spatial (ISS) sequencing can uncover co-variation between cellular morphology and gene expression in vivo. However, a principled and interpretable mathematical representation of morphology has not yet been applied in this context. In particular, current deep learning-based representations of cell images confound a cell's shape with its size. We present an interpretable representation of cellular boundary contours, based on tangent principal component analysis (TPCA) in a Kendall shape manifold, that captures size-independent contour shape features. This approach successfully recovers shape-perturbing genes in an RNAi screen than a previous metric geometry-based approach. We build on TPCA to develop STITCH (Shape-TranscriptomIc Correlation and Harmonization), an approach to reveal covariation between cell morphology with gene expression in ISS datasets. In a Xenium dataset, STITCH outperforms a deep learning-based approach in both recovering the layered organization of keratinocytes and a spatial gradient in nuclear eccentricity. Across samples in a melanoma CosMx dataset, STITCH reproducibly associates elongated and triangular fibroblasts with proximity to malignant cells and myofibroblast-like transcriptional program. Finally, STITCH independently recovers a known link between mesenchymal-like malignant cell states and increased cell area in two melanoma cohorts. STITCH can thus yield interpretable morphology-transcriptome relationships across cell types, patients, and spatial transcriptomics platforms.

23.
arXiv (quant-ph) 2026-06-11

Handbook of Error-Correcting Codes

arXiv:2606.11484v1 Announce Type: new Abstract: Barcode scans, clear phone calls, reliable data storage, satellite communication, and large-scale quantum computation are all made possible by error correction. We present a handbook version of The Error Correction Zoo, a curated reference of methods for protecting classical or quantum information from errors during storage and transmission. The handbook includes descriptions of these error-correcting codes and a classification according to the symbols they use. It also catalogues relations among codes and related objects such as sphere packings, lattices, designs, groups, and classical and quantum phases of matter. The collection is intended both as a rigorous reference and as a practical aid for tracing the web of code relationships and uncovering new connections.

24.
arXiv (CS.CL) 2026-06-16

Evaluating LLM Personalization via Semantic Constraint Verification

Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to 2100 inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

25.
bioRxiv (Bioinfo) 2026-06-19

Nickel-Driven Dynamics of Urease in Sporosarcina pasteurii: Integrated Computational and Experimental Insights

Urease is a nickel-dependent enzyme that plays an important role in urea hydrolysis and in a process named as microbial-induced calcium carbonate precipitation (MICP), which is widely used in sustainable environmental biotechnology. Despite its ecological importance, urease powers Biogrout (biocementation), a promising green technology for soil stabilization and infrastructure repair. Yet, the relationship between nickel availability, enzyme activation, and bacterial fitness remains poorly understood. In this study, we reveal a striking dual effect of nickel on Sporosarcina pasteurii: while high Ni2+ concentrations strongly inhibit growth (IC50 {approx} 637.7 {micro}M), they simultaneously boost specific urease activity up to six-fold. This uncoupling between biomass and enzymatic efficiency highlights a previously overlooked adaptive strategy under metal stress. Using structural bioinformatics and molecular docking, we show that Ure1–the catalytic subunit–exhibits the strongest nickel affinity (-4.3 kcal{middle dot}mol-1), supported by highly conserved active-site residues, whereas accessory proteins UreE and UreG display moderate and weak binding, consistent with their roles in metal delivery and GTP-dependent maturation. In addition, microscopic observations confirmed that calcium carbonate precipitation was most pronounced at intermediate nickel concentrations (approximately 400-1000 {micro}M), whereas higher concentrations ([≥]1000-1300 {micro}M) led to reduced mineral formation due to loss viable cells. Taken together, these results indicates that nickel availability controls both urease activation and bacterial fitness, and that an optimal balance is required to maximize biomenerilization efficiency in environmental applications, particularly in biocementation technology.