Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-16

In-Domain Supervised Pathology Report Classification: A Reproducible Pipeline from Data Curation to Production-Matched Evaluation

We introduce an in-domain supervised pipeline designed to counter the out-of-distribution performance drop that hampers supervised biomedical NLP models, a problem observed when models trained on pathology reports are moved across cancer registries. Our contribution is a reproducible recipe for training a supervised classifier from routinely collected cancer registry data. It describes how to build the in-domain training set and a production-matched holdout, and to choose operating points that keep the false-negative rate (FNR) very low while keeping reviewer workload manageable. The pipeline standardizes data curation with facility-stratified sampling and separate handling of reports linked to registry cases, and includes a blinded manual audit to estimate positive-case prevalence and label noise. On a 418k-report holdout set, the Kentucky model achieved FNR 0.003 and false-positive rate (FPR) 0.097, improving over the Seattle-trained MOSSAIC OncoID baseline (FNR 0.010, FPR 0.183) and raising F1 from 0.860 to 0.922. In a blinded manual review of 600 reports, estimated positive prevalence declined from 0.500 to 0.398, indicating substantial label noise with errors concentrated in rare primary sites.

02.
arXiv (CS.CL) 2026-06-19

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple : the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction – ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction – over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

03.
arXiv (CS.AI) 2026-06-15

Mask, Sample, Revise: A Revisable CTMC Inference Stack for Guided Discrete Flow Matching Text-to-Speech

arXiv:2606.13989v1 Announce Type: cross Abstract: Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.

04.
arXiv (quant-ph) 2026-06-19

Ultrafast nonadiabatic dynamics of tetraphenylsubstituted nitrogen-based heterocycles

arXiv:2604.16897v2 Announce Type: replace-cross Abstract: Tetraphenylpyrazine (TPP) and 2,3,4,5-tetraphenyl-1H-pyrrole (TePP) are closely related heterocycles bearing four phenyl substituents, whose structural similarity makes them a useful pair for comparing how intramolecular flexibility influences excited-state relaxation and emission in the gas phase and in the solid state. TPP is a prototypical solid-state luminescence enhancement (SLE) emitter, exhibiting a markedly increased quantum yield upon molecular aggregation. In contrast, TePP displays similar quantum yields in solution and solid state, characteristic of dual-state emission (DSE). This behaviour indicates that intramolecular rotations are already significantly hindered in the isolated-molecule regime, consistent with our previous observations for TPP and other solid-state emitters (Hernández-Rodríguez et al., ChemPhysChem, 2024, 25, e202400563). To unravel the excited-state dynamics underlying this contrasting behaviour, we performed mixed quantum-classical trajectory simulations on a single molecule of TPP and TePP employing the surface-hopping method. Twelve singlet states were included at the TD-B3LYP-D3/def2-SVP level, which were previously benchmarked against coupled cluster methods. Simulated observables such as gas phase ultrafast electron diffraction (GUED) and time-resolved fluorescence (TR-FL) signals allow us to dissect the distinct deactivation pathways operating in both systems in the gas phase, while also providing mechanistic insight into how these pathways are expected to evolve in solution and solid-state environments.

05.
arXiv (math.PR) 2026-06-11

On Skorokhod Problems for Reflected and Singular Stochastic Heat Equations

arXiv:2606.11951v1 Announce Type: new Abstract: We prove a Skorokhod decomposition for the Markov processes $X^a$ and $X$ associated to the gradient Dirichlet forms with respect to the measures $\rho^a\mu^{\beta}$ and $\rho\mu^{\beta}$, respectively. Here, $\mu^{\beta}$ is the law of the standard Brownian bridge $\beta$, while $\rho^a$ and $\rho$ denote densities which are given by $\rho^a(z) := \mathbf{1}_{[0,\infty)}(\bar{z}_a)$ and $\rho(z) := \int_0^1 \mathbf{1}_{[0,\infty)}(\bar{z}_x) \, dx$, respectively, for all $z\in L^2(0,1)$ which have a (unique) continuous representative $\bar{z}$ which vanishes at zero and one. To this end, we derive infinite-dimensional integration by parts formulas (IbPFs) w.r.t. $\rho^a\mu^{\beta}$ and $\rho\mu^{\beta}$, which contain Hida distributions alongside the usual drift terms. We represent these Hida distributions by integration w.r.t. vector measures of bounded variation. The vector measures in question are constructed via an approximation argument, making use of a generalization of Prokhorov's theorem for vector measures. We further prove that, almost surely, the sample paths of $X^a$ and $X$ take values in the equivalence class of continuous functions vanishing at zero and one for all and $dt$-almost all times, respectively. The main motivation for studying $\rho^a\mu^{\beta}$ and $\rho\mu^{\beta}$ lies in the fact that the distributional terms in their IbPFs are simplifications of the distributional term in the IbPF w.r.t. the law of the reflected Brownian bridge on the unit interval $\mu^{|\beta|}$. Representing the latter by integration w.r.t. a vector measure of bounded variation is still an open problem.

06.
arXiv (CS.CV) 2026-06-19

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.

07.
arXiv (CS.AI) 2026-06-12

Can I Buy Your KV Cache?

arXiv:2606.13361v1 Announce Type: new Abstract: Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

08.
arXiv (CS.AI) 2026-06-19

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

arXiv:2606.19867v1 Announce Type: cross Abstract: Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

09.
arXiv (CS.LG) 2026-06-11

Finding Multiple Interpretations in Datasets

arXiv:2606.12277v1 Announce Type: new Abstract: In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models with highly different gene expressions than those found by the control methodology without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

10.
arXiv (CS.CL) 2026-06-16

Understanding, Detecting, and Repairing Real-World In-Context-Learning-Based Text-to-SQL Errors

Large language models (LLMs) have been adopted for text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into SQL queries. However, such a technique faces correctness problems. In this paper, we conduct the first comprehensive study of text-to-SQL errors of ICL-based techniques. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 27 error types of 7 categories. We also find that existing repairing attempts have limited correctness improvement while having high computational overhead and many mis-repairs. Based on these findings, we propose MapleDoctor, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleDoctor outperforms existing solutions by repairing 13.8% more queries with a negligible number of mis-repairs and reducing 67.4% repair latency. The artifact is publicly available at GitHub.

11.
arXiv (CS.CL) 2026-06-12

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

Language model agents are increasingly effective in solving realistic tasks through multi-turn tool use. However, training reliable tool-using agents remains challenging in practice. While reinforcement learning provides an on-policy paradigm for improving agents from their own environment interactions, its effectiveness depends heavily on the training task distribution. When tasks are fixed before training, the task distribution can become increasingly mismatched with the policy's evolving capabilities, causing many rollouts to be spent on uninformative tasks. We propose SENTINEL, a failure-driven reinforcement learning framework that turns the Solver's rollout failures into targeted training tasks. SENTINEL follows a Controller–Proposer–Solver loop: the Controller analyzes failed trajectories and summarizes recurring error patterns, the Proposer generates executable tasks that stress these weaknesses, and the Solver is trained on the targeted tasks. On Tau2-Bench Retail with Qwen3-4B-Thinking-2507, SENTINEL improves Pass\^{}1 from 66.4 to 74.9 and outperforms RL on general synthetic tasks across Pass\^{}k metrics. These results demonstrate that model failures provide an effective and scalable source of targeted training signal for improving tool-using language model agents.

12.
arXiv (quant-ph) 2026-06-17

Quantum Information Processing: A brief overview on Quantum Teleportation

作者:

arXiv:1604.00852v3 Announce Type: replace Abstract: Quantum Information Processing (QIP) exploits the principles of quantum mechanics to perform information storage, communication, and computation in ways that are fundamentally impossible within classical frameworks. This article presents a pedagogical overview of the mathematical foundations of quantum information theory, including qubits, Hilbert spaces, linear operators, quantum measurements, tensor products, density operators, and quantum entanglement. Building upon these concepts, we provide a detailed introduction to quantum teleportation, one of the most remarkable protocols in quantum communication. The discussion covers the no cloning theorem, the original teleportation protocol by Bennett et al., experimental realisations of quantum teleportation, and extensions involving probabilistic and multiqubit teleportation schemes. Particular emphasis is placed on the role of entanglement as a communication resource, together with the study of teleportation channels based on bipartite and multipartite quantum states. Various quantitative measures of entanglement, including concurrence, negativity, entanglement of formation, and relative entropy of entanglement, are reviewed alongside teleportation fidelity as a performance metric. Furthermore, the interplay between Bell nonlocality, mixed state entanglement, and teleportation efficiency is examined, followed by a survey of advanced developments such as controlled teleportation, bidirectional teleportation, cluster state teleportation, and recent advances in the Quantum 2.0 era. This review aims to provide students, researchers, and engineers with a coherent introduction to the theoretical foundations and practical significance of quantum teleportation in emerging quantum technologies.

13.
arXiv (quant-ph) 2026-06-17

Canonical regularization of the stationary Coulomb problem and an Aufbau-like spectral ordering

arXiv:2606.17359v1 Announce Type: new Abstract: The stationary hydrogen atom has Coulomb degeneracy across orbital levels, whereas the Aufbau/Madelung ordering is an empirical, many-electron rule established in atomic physics. We examine the hydrogen atom through a regularized de Broglie–Bohm representation, in which stationary amplitude current constraints generate separable Sturm–Liouville branches. In this formulation, the radial, orbital, and magnetic sectors acquire canonical Langer-like inverse square corrections. The modified boundary value problems allow analytical solutions and produce a hydrogen-like spectrum with regularized radial and angular indices. Consequently, radial Coulomb quantization acquires an orbital dependent shift, lifting the Coulomb degeneracy and producing a spectral ordering that follows the Aufbau/Madelung sequence. On this basis, we construct the ordering of the regularized de Broglie–Bohm states and show that the spectral structure retains the standard degenerate Rydberg sequence in the l=0 sector. The separated amplitudes are represented by generalized special function branches, including the associated Laguerre, Legendre, and Bessel functions with non-integral parameters arising from regularized separation. Therefore, the treatment is intended as an analytical examination of spectral ordering in a regularized one center Coulomb problem rather than as a replacement for the many electron atomic structure theory. Keywords: de Broglie–Bohm representation; Coulomb spectrum; canonical regularization; Langer correction; Sturm–Liouville equations; Aufbau principle; Madelung ordering; associated Legendre functions; associated Laguerre functions; Bessel functions.

14.
arXiv (CS.AI) 2026-06-12

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

arXiv:2606.13400v1 Announce Type: cross Abstract: While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.

15.
arXiv (math.PR) 2026-06-11

Stochastic epidemic model with varying infectivity and waning immunity: the law of large numbers with unbounded infectivity

arXiv:2606.11845v1 Announce Type: new Abstract: We revisit the large population limit of our epidemic model with infection age dependent infectivity and progressive immunity waning, under the assumption that the supremum in $t$ of the random infectivity function has a finite expectation, while the previous proofs assumed that this supremum admits a deterministic upper bound.

16.
arXiv (quant-ph) 2026-06-11

Necessary and Sufficient Conditions for Universal Gates with Pauli Strings and Beyond

arXiv:2606.12096v1 Announce Type: new Abstract: Any quantum computation consists of a sequence of unitary evolutions described by a finite set of Hamiltonians. For the case where this set consists of only products of Pauli operators, known as Pauli strings, we provide a necessary and sufficient condition for it to generate $\mathfrak{su}(2^n)$, i.e., to be universal for quantum computation on $n$ qubits. When combining Pauli strings with a general Hamiltonian, we show a sufficient (and in certain circumstances even necessary) condition for universality based on the Pauli-basis expansion of the Hamiltonian. As an application of these results, we prove two corollaries: (i) a necessary and sufficient condition for the universality of a general Hamiltonian given arbitrary single-qubit control on all qubits, and (ii) the universality of an XYZ Heisenberg Hamiltonian with local control of just two adjacent qubits.

17.
arXiv (CS.AI) 2026-06-11

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

arXiv:2601.21898v2 Announce Type: replace Abstract: The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap$^2$, an architecture-agnostic protection framework that encodes protection into updates during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap$^2$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized recomposition.

18.
arXiv (CS.AI) 2026-06-16

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

arXiv:2606.08090v2 Announce Type: replace-cross Abstract: Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.

19.
arXiv (CS.LG) 2026-06-18

Ultrafast On-chip Online Learning via Spline Locality in Kolmogorov-Arnold Networks

arXiv:2602.02056v3 Announce Type: replace-cross Abstract: Ultrafast online learning is essential for high-frequency systems, such as controls for quantum computing and nuclear fusion, where adaptation must occur on sub-microsecond timescales. Meeting these requirements demands low-latency, fixed-precision computation under strict memory constraints, a regime in which conventional Multi-Layer Perceptrons (MLPs) are both inefficient and numerically unstable. We identify key properties of Kolmogorov-Arnold Networks (KANs) that align with these constraints. Specifically, we show that: (i) KAN updates exploiting B-spline locality are sparse, enabling superior on-chip resource scaling, and (ii) KANs are inherently robust to fixed-point quantization. By implementing fixed-point online training on Field-Programmable Gate Arrays (FPGAs), a representative platform for on-chip computation, we demonstrate that KAN-based online learners are significantly more efficient and expressive than MLPs across a range of low-latency and resource-constrained tasks. To our knowledge, this work is the first to demonstrate model-free online learning at sub-microsecond latencies.

20.
arXiv (math.PR) 2026-06-11

Martingale Solutions to a Stochastic Keller-Segel System with nonlocal Source and Super-linear Noise

arXiv:2606.11774v1 Announce Type: new Abstract: Global nonnegative martingale solutions are shown to exist for a stochastic Keller-Segel system with a nonlocal Fisher-KPP source and super-linear multiplicative noise. The result is obtained for nonnegative initial data with no smallness assumption, provided that the nonlocal source term is dominant. The main difficulty stems from the absence of a coercive structure and the super-linear nature of the noise. An additional cut-off with finite L^2 norm in the classical Galerkin method is added to establish a well-posed approximation problem. Moreover, due to the nonlocal Fisher-KPP structure, it is necessary to prove the positivity of the approximating solution in order to obtain uniform estimates. In the compactness arguments, the usual tightness argument in the framework of Hilbert spaces cannot be directly applied to the uniform estimates obtained in this paper. As a result, we develop a more general version of the compactness argument and tightness criterion, presented in the appendix, which will be applied throughout the paper. This allows for the global existence of nonnegative martingale solutions to be derived from Jakubowski's version of the Skorokhod Theorem, along with a thorough discussion of the convergence properties.

21.
arXiv (CS.CV) 2026-06-17

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

22.
Nature (Science) 2026-06-17

Structure of the pre-initiation complex explains CMGE biogenesis

When cells enter S phase, bidirectional DNA replication is initiated through the kinase-regulated recruitment of three activators (Cdc45, GINS and Pol ε) to a duplex-DNA-loaded double hexamer of minichromosome maintenance (MCM) ATPases. Together, these proteins form two CMGE helicases that establish divergent replication forks as they become separated1. Here, to gain an understanding of CMGE biogenesis, we reconstituted the pre-initiation complex with purified yeast proteins. The cryo-electron-microscopy structure shows a set of firing factors caught in the act of assembling two symmetrical CMGEs. We show how stepwise complex formation reshapes MCM in preparation for DNA opening, and we explain how ATP promotes firing-factor ejection and CMGE maturation. We find that although Sld2 facilitates the recruitment of GINS to MCM, as expected, it also aids the efficient separation of the CMGE dimer, and is essential for the ejection of the lagging strand from MCM. These findings have direct implications for our understanding of the metazoan Sld2 orthologue, RECQL4, and point to a replication-fork establishment mechanism that is conserved across eukaryotes. Cryo-electron microscopy and biochemical reconstitution experiments in yeast provide insight into the assembly of the CMGE complex, a helicase that establishes bidirectional DNA replication in eukaryotic cells, and elucidate the role of the firing factor Sld2.

24.
arXiv (CS.LG) 2026-06-17

Gradual Fine-Tuning for Flow Matching Models

arXiv:2601.22495v2 Announce Type: replace Abstract: Fine-tuning flow matching models is a central challenge in settings with limited data, evolving distributions, or computational constraints. While recent work has produced significant advances, particularly in the area of reward-based fine-tuning, current methods fail to demonstrate both theoretical correctness as well as strong empirical results in terms of stability, efficiency, and diversity preservation. In this work, we propose Gradual Fine-Tuning (GFT), a simple yet principled annealing-based framework for fine-tuning flow generative models when only samples from the target distribution are available. For stochastic flows, GFT defines a temperature-controlled sequence of intermediate objectives that smoothly interpolate between the pretrained and target drifts, provably approaching the true target as the temperature approaches zero. We analytically demonstrate that sample generation after GFT can be made substantially more efficient with the use of arbitrary (e.g., optimal transport) couplings, as well as by utilizing few-step inference methods. Empirically, GFT significantly improves convergence stability, while maintaining or improving generation quality, training speed, and generation diversity compared to other fine-tuning methods. Our results position GFT as a simple yet theoretically grounded and practically effective alternative for scalable adaptation of flow matching models under distribution shift.

25.
medRxiv (Medicine) 2026-06-15

Toward a National Registry for Inborn Errors of Immunity in Peru: A Qualitative Implementation Study

Background: Peru lacks an integrated information system for patients with Inborn Errors of Immunity (IEI). Although disease registries are essential tools for data management and health planning, their success depends on implementation science approaches that account for local contextual factors. This study reports Phase I of a three-phase mixed-methods implementation project to design and develop a national IEI registry. Methods: Phase I consisted of a phenomenological qualitative study exploring stakeholder perspectives. Semi-structured focus groups and in-depth interviews were conducted with 29 key stakeholders across four groups: policy-makers, clinical experts, end-users (immunologists, residents, allied health personnel), and patient organization representatives. Interviews followed a guide structured around four a priori domains (structure, navigation, feasibility, and perception of existing systems). Discussions were conducted in Spanish, audio-recorded, transcribed verbatim, and coded using ATLAS.ti. A hybrid thematic analysis combining deductive and inductive coding was performed. Data elements proposed for the registry were triangulated with qualitative findings. Results: Thirty-six initial codes were consolidated into 15 categories, which were further integrated into four overarching themes conceptualized as pathways toward intention to use: (1) Environment, where governance, regulatory backing, and sustainable financing were identified as key enablers, while limited interoperability emerged as a structural barrier; (2) Technical Dimension, emphasizing usability, alignment with clinical workflow, and a hierarchical data architecture (demographic, clinical, therapeutic); (3) Users, highlighting clinical leadership, protected time, digital readiness, and perceived usefulness as stronger motivators than financial incentives; and (4) Patients, underscoring data protection, transparency, trust, and advocacy as essential for legitimacy and sustainability. Conclusions: A national IEI registry in Peru is perceived as necessary and feasible if implemented with strong regulatory foundations, interoperable design, robust data security, and user-centered architecture. These findings informed the development of an initial functional prototype and the operational plan for Phase II, focused on usability evaluation.