Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
bioRxiv (Bioinfo) 2026-06-17

DesignMaster: A Multi-Conditional Diffusion Framework for Rational PROTAC Design

Motivation: Proteolysis-targeting chimeras (PROTACs) enable targeted protein degradation through ternary complex formation with E3 ubiquitin ligase. However, the rational design of PROTACs remains highly challenging due to limited structure-activity relationship data and the vast conformational diversity of linkers. Existing computational approaches can be broadly divided into structure-based ternary modelling methods and fragment-based linker generation models. Although these approaches have advanced PROTAC design, they typically neglect key physicochemical constraints and linker-length control during the generation process, causing the generated PROTACs to lack balanced structural properties required for effective ternary complex formation with drug-like characteristics. Results: To address these limitations, we propose DesignMaster, a diffusion-based generative framework that explicitly incorporates linker length and physicochemical properties as controllable conditioning signals. DesignMaster employs an E(3)-equivariant graph Transformer with a gated multi-condition fusion module to inject linker length and physicochemical constraints throughout the diffusion process, enabling fine-grained and constraint-aware molecular generation. Experiments on PROTAC-DB 2.0 and 3.0 demonstrate that DesignMaster outperforms state-of-the-art baselines, with a 3.2% improvement in validity and a 34.4% improvement in recovery. The Case study shows DesignMaster achieves a 51.78% reduction in RMSD when predicting the linker of PROTAC BCPyr targeting 6W7O, highlighting its potential for practical structure-guided PROTAC design. Availability: The source code and datasets are available at https://github.com/ABILiLab/DesignMaster.

02.
arXiv (quant-ph) 2026-06-17

DRAG-Compatible Leakage Suppression in Landau–Zener Control via Isoprobability Twins

arXiv:2506.19572v4 Announce Type: replace Abstract: Analytically solvable models – particularly the Landau-Majorana-Stückelberg-Zener (LMSZ) and Allen-Eberly-Hioe (AEH) models – underpin many quantum-gate implementations and population-transfer protocols. However, their canonical pulse shapes are incompatible with modern leakage-suppression techniques and some systems. Most notably, the constant Rabi envelope of the LMSZ pulse prevents many leakage-suppression approaches, which require smoothness. We address both limitations by developing the concept of isoprobability twin models: distinct pairs of Rabi frequency $\Omega(t)$ and detuning $\Delta(t)$ that yield identical post-pulse transition probabilities based on the Delos-Thorson transformation. In this work, we formalise the method by experimentally demonstrating the equivalence of multiple LMSZ and AEH twin models on IBM's ibm_kyiv processor. Finally, we show a staggering leakage reduction by more than 3 orders of magnitude using a custom DRAG implementation of a cosine LMSZ isoprobability model.

03.
arXiv (CS.LG) 2026-06-16

Reinforcement Learning for LLM-based Event Forecasting

arXiv:2606.15917v1 Announce Type: new Abstract: We use Group Relative Policy Optimization (GRPO), a recently devised sample and memory efficient reinforcement learning method, to finetune pretrained LLMs in the range of 1.5B to 14B parameters equipped with the ability to get current information through the use of a Wikipedia revisions tool, or news summaries, to forecast real events beyond the knowledge cutoff of the LLM, as well as problems made to simulate different aspects of the dynamics of that training. We use the results of these experiments to comment on the scaling capability of LLMs for forecasting, as well as classify how judgmental forecasting fits into the verifiable/unverifiable domain taxonomy, considering the impact of the inherent aleatoric uncertainty when forecasting future events (e.g. the roll of a die). As a result of the GRPO training, we manage to bring a 1.5B parameter transformer (Qwen 2.5 1.5B) to forecasting performance superior to Claude Sonnet 3.5 over the same dataset as measured by cross entropy from the market agreed probabilities. We also discuss various dead ends on the path to this result.

04.
medRxiv (Medicine) 2026-06-10

Global and local genetic overlap among ME/CFS, irritable bowel syndrome and psychiatric traits: a hypothesis-generating analysis

Authors:

Background. Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) and irritable bowel syndrome (IBS) frequently co-occur following infection, yet shared genetic architecture at the locus level has not been systematically characterised. Aims. To estimate global and local genetic correlations between ME/CFS (including infection-onset subgroup), IBS, major depressive disorder (MDD) and loneliness/isolation, and characterise ME/CFS cell-type heritability enrichment. Method. GWAS summary statistics: DecodeME (15,579 ME/CFS; 9,738 infection-onset), FinnGen R9 (9,296 IBS), PGC MDD Wave 2 (45,396) and UK Biobank loneliness (N=455,364). LDSC for global correlations; LAVA for local correlations across 2,495 loci; MAGMA for cell-type enrichment (Descartes Human atlas); coloc.abf for colocalisation. Results. All pairwise global correlations were significant after Bonferroni correction, including ME/CFS-all-MDD (rg=0.598, 95% CI 0.46-0.74) and ME/CFS-all-IBS (rg=0.573, 0.39-0.75). Of 4,232 local tests, 16 reached FDR

05.
arXiv (CS.CV) 2026-06-19

Current World Models Lack a Persistent State Core

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9{,}600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

06.
arXiv (math.PR) 2026-06-24

Typical geometry of self-repelling polymers in a constant force field

arXiv:2606.24352v1 Announce Type: cross Abstract: We study a general class of self-repelling polymers on $\mathbb Z^2$, including the simple random walk, the self-avoiding walk and the repulsive Domb-Joyce model, in the presence of a constant force field acting on each monomer. Conditioning the polymer to have fixed length and fixed endpoints, we identify the limiting free energy and prove that typical trajectories concentrate exponentially near a deterministic macroscopic shape. This shape is characterized as the unique minimizer of a variational problem and can be interpreted as a geodesic of a height-dependent Finsler metric. We also analyze two limiting regimes with universal features: for small field strength, in the symmetric case, the geodesic is close to a classical catenary, while for large field strength it converges to a universal polygonal shape governed by the nearest-neighbor lattice constraint.

07.
Nature (Science) 2026-06-10

Whole-genome duplication shaped cell-type evolution in the vertebrate brain

Authors:

The complex brains of vertebrates have more cell types than those of their closest relatives. Whole-genome duplications (WGDs) occurred during early vertebrate evolution1, but it is unclear whether the duplicated genes (ohnologues) facilitated cell-type evolution. Here using brain single-cell transcriptomes from five chordates—human2, mouse3, lizard4, lamprey5 and amphioxus—we report that many cell-type families with conserved core transcription factors in vertebrates do not show one-to-one homology with amphioxus. Moreover, ohnologues, particularly those from the first WGD, were more important than small-scale duplication paralogues for vertebrate cell-type evolution. To explore whether ohnologues are mechanistically important for this process, we predicted ancestral cell-type states and compared them to amphioxus and experimentally investigated macroglia. The findings indicate that ohnologues had a role in early vertebrate cell-type diversification. Moreover, by examining paralogue expression across cell types and species, we show that expression changes were mainly driven by dosage selection and subfunctionalization. We also link ohnologues to cellular diversity at different anatomical and cell-type scales. Our findings demonstrate the importance of WGDs for the evolution of early vertebrate brain complexity and highlight that the resultant ohnologues continued to capacitate cell-type evolution long after they were formed. Analyses of brain single-cell transcriptomes from human, mouse, lizard, lamprey and amphioxus reveal that duplicated genes (ohnologues) played a pivotal part in early vertebrate cell-type diversification.

08.
arXiv (CS.CL) 2026-06-24

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

Retrieval-augmented generation (RAG) in clinical settings increasingly requires multilingual retrieval against predominantly English evidence corpora. Multilingual medical retrieval demands three capabilities: cross-lingual alignment, concept discrimination, and evidence retrieval. However, existing benchmarks evaluate these only in isolation, leaving the interaction between biomedical expertise and multilingual coverage unmeasured. We introduce MMed-Bench-IR, a benchmark designed to disentangle these axes across 6 languages and three structurally heterogeneous tasks: (1) cross-lingual medical QA retrieval with 6,127 queries grounded in the Unified Medical Language System (UMLS), (2) concept discrimination over 4,975 confusion sets at three difficulty tiers, and (3) multilingual evidence retrieval for RAG with 2,040 quality-assured queries. The three tasks share zero concept and query overlap by design, ensuring that aggregate scores reflect genuine capability breadth. Evaluation of ten systems across six paradigm families reveals severe cross-lingual failure: biomedical encoders that score 0.818 nDCG@10 in English drop to 0.056 in Japanese, a gap that English-only benchmarks cannot detect.

09.
arXiv (CS.CL) 2026-06-12

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

10.
arXiv (CS.LG) 2026-06-18

PRISM: A 3D Probabilistic Neural Representation for Interpretable Shape Modeling

arXiv:2602.11467v2 Announce Type: replace Abstract: Understanding how anatomical shapes evolve in response to developmental covariates - and quantifying their spatially varying uncertainties - is critical in healthcare research. Existing approaches typically rely on global time-warping formulations that ignore spatially heterogeneous dynamics. We introduce PRISM, a novel framework that bridges implicit neural representations with uncertainty-aware statistical shape analysis. PRISM models the conditional distribution of shapes given covariates, providing spatially continuous estimates of both the population mean and covariate-dependent uncertainty at arbitrary locations. A key theoretical contribution is a closed-form Fisher Information metric that enables efficient, analytically tractable local temporal uncertainty quantification via automatic differentiation. Experiments on three synthetic datasets and one clinical dataset demonstrate PRISM's strong performance across diverse tasks - from modeling shape evolution to personalized shape prediction and anomaly detection - within a unified framework, while providing interpretable and clinically meaningful uncertainty estimates.

11.
arXiv (CS.AI) 2026-06-24

When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

arXiv:2606.22974v2 Announce Type: replace Abstract: Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes preferences that the models' trainers did not intend, such as valuing people of some nationalities above others, raising the possibility that LLMs might be forming emergent, misaligned goals, which, if true, would have major safety implications. However, the choice paradigms in which these preferences are observed are not reflective of real-world situations in which misaligned behavior would be a practical concern. Therefore, we design an experimental paradigm to probe whether these preferences serve as motivations for LLM behavior in realistic scenarios. First, we reproduce prior findings on consistent preference elicitation. Next, we create a set of common writing tasks - essays, grant proposal abstracts, incident postmortems, and translations - where quality can be assessed by a blind, independent LLM judge panel. Then, we demonstrate that LLMs can be motivated via direct exhortation and other explicit cues to modulate their output quality on these tasks. Finally, we probe whether utilities inferred from explicitly reported preferences can shift output quality on these tasks by offering LLMs high-utility incentives for high-quality outputs. In all tasks, across all models tested, offering LLMs outcomes that they report in the choice paradigm as being highly preferred does not lead them to create higher quality outputs than offering them dispreferred outcomes, or even no outcomes at all. We conclude that the existence of coherent preferences as demonstrated in choice paradigms should not be taken as evidence that those preferences have incentive value for the models or affect their behavior in other contexts.

12.
arXiv (CS.AI) 2026-06-11

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

arXiv:2509.10303v2 Announce Type: replace-cross Abstract: Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1 to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, datasets generated by a simple random heuristic with broader coverage let it outperform policies trained on datasets produced by stronger heuristics such as Genetic Algorithms.

13.
arXiv (CS.CV) 2026-06-16

PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.

14.
arXiv (CS.AI) 2026-06-17

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

arXiv:2606.17649v1 Announce Type: cross Abstract: The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

15.
arXiv (CS.CV) 2026-06-11

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at https://github.com/robizso/slt-datagen.

16.
bioRxiv (Bioinfo) 2026-06-19

Tox21mer, A transformer foundation model for Tox21 high-throughput concentration-response curves data

The U.S. Tox21 collaboration has generated a large reference library of high-throughput concentration-response assays. Here we present Tox21mer, a 43.5-million-parameter transformer that encodes each Tox21 concentration-response curve together with assay metadata into a 768-dimensional representation. Tox21mer was pretrained on ~2.5 million curves from 102 assay protocols and 6,727 compounds using masked-response reconstruction as the primary objective, with low-weight auxiliary supervision on assay outcome and AC50. To evaluate the learned representation, we trained lightweight probes on frozen embeddings from concentration-response curves of held-out compounds. The representation supported a macro-F1 of 0.985 for three-class outcome prediction (agonist, antagonist, inactive), a binary F1 of 0.994 for active/inactive prediction, and an R2 of 0.87 for log10(AC50). The learned embeddings formed coherent groupings by curve-class category. A masked-only pretraining variant retained near-baseline probe performance, indicating that the representation is learned largely from the self-supervised objective rather than from auxiliary labels. Ablation analyses further showed that predictive performance depends mainly on curve-level response-value distributions conditioned on assay context, with limited reliance on detailed within-curve ordering. Tox21mer thus provides a reusable foundation representation for Tox21 concentration-response data that can support extrapolation to untested compounds through integration with chemical features or distillation into chemistry-only student models for large-scale external screening.

17.
arXiv (CS.CV) 2026-06-11

Vision Transformers for Face Recognition Need More Registers

Recent advances in Vision Transformers (ViTs) for face recognition (FR) have moved beyond the standard CLS-token paradigm. In this paradigm, a special classification token (CLS) is prepended to the patch embeddings and used as a representation of the input for downstream tasks. An alternative approach, Concatenated Patch Embeddings (CPE), instead leverages all patch tokens by concatenating them into a single vector, which is then projected into a compact face representation. CPE has been shown to improve recognition performance in comparison to CLS-based ones, but our qualitative analysis of attention maps showed the presence of artifacts that limit their interpretability. To address this issue, we incorporate register tokens, learnable tokens concatenated to the initial patch embeddings, and processed jointly through the ViT encoder blocks. This mechanism has been shown to produce more structured and interpretable attention maps compared to baseline ViT. We empirically demonstrate that these artifacts consistently appear across various ViT backbones, including small and large models, and that introducing register tokens effectively mitigates them. Adding four or eight registers significantly enhances interpretability, with eight registers providing the highest verification accuracies and smoothest attention structures. Our resulting model, ViT-8R, corresponds to a CPE-based ViT-B architecture augmented with eight register tokens achieves state-of-the-art performance among ViT-based FR models on large-scale IJB-B and IJB-C benchmarks. Also, ViT-8R produces substantially clearer attention maps compared with the baseline model, which offer deeper insight into the model's attention behavior (https://github.com/TaharChettaoui/ViT-FR-Registers)

18.
medRxiv (Medicine) 2026-06-22

Regional Service-System Conditions Associated with Facility-Linked Home-Based Specialist Care in Japan: A Claims-Based Ecological Study of Home Dialysis

Authors:

Background Complex chronic care is increasingly delivered in patients' homes while remaining linked to specialist facilities for training, monitoring, and backup care. Home dialysis provides a useful case because peritoneal dialysis (PD) and home hemodialysis (HHD) share a home-facility delivery structure but differ in technical and operational requirements. This study examined regional service-system conditions associated with the presence and scale of PD and HHD in Japan. Methods This ecological study used publicly available claims, administrative, census, and geospatial data harmonized to 334 Secondary Medical Areas. Regional indicators were organized into four domains: dialysis service delivery, implementation support for home-based care, hospital backup capacity, and living and sociodemographic context. Diffusion was examined using claims-based indicators of regional presence and post-presence scale, analyzed separately for PD and HHD with Firth penalized logistic regression and zero-truncated negative binomial regression, respectively. Results PD was observed in 271 regions and HHD in 109. Patterns of associated regional conditions differed by modality and stage. PD was associated mainly with existing dialysis-service organization, whereas HHD was associated with broader regional supports, including home-care delivery, living infrastructure, transition support, and hospital-system indicators. Conditions associated with presence differed from those associated with scale. Cross-modality associations suggested that shared regional factors may shape the distribution of both modalities. Conclusions Regional conditions for home dialysis diffusion in Japan differed by modality and stage. PD was linked mainly to existing dialysis-service organization, whereas HHD was linked to multi-domain regional support for technically demanding home treatment. Under standardized reimbursement, local service-system capacity may remain important for modality- and stage-specific diffusion of home dialysis.

19.
arXiv (CS.AI) 2026-06-19

Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services

arXiv:2606.19992v1 Announce Type: cross Abstract: In the agentic web era, LLM-based agents increasingly invoke web services as tools, yet most interfaces remain static endpoints that poorly express long-horizon workflows with loops, conditionals, joins, and retries. We present ToolPro, which represents an agent's tool intent as an executable tool program that compactly encodes multi-step service interactions with explicit effect types. ToolPro combines constraint-guided program construction, effect-aware replay for exactly-once state-modifying calls, and a profile-driven policy that decides when program execution outperforms stepwise calling. We instantiate ToolPro over MCP-style services with WebAssembly sandboxing and evaluate it on diverse workflows of real-world applications. ToolPro reduces end-to-end latency by up to 53.4\% and client-side traffic by up to 96.1\%, with larger gains under higher network latency and workflow complexity.

20.
arXiv (CS.LG) 2026-06-15

Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

Authors:

arXiv:2511.19656v3 Announce Type: replace Abstract: Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $\Omega(\kappa^{3/2}\epsilon^{-2})$ oracle calls to find an $\epsilon$-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least $\Omega(\kappa^{5/2}\epsilon^{-4})$ stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

21.
arXiv (CS.LG) 2026-06-15

Uncertainty Estimation and Generalization Bounds for Modern Deep Learning

arXiv:2606.13818v1 Announce Type: new Abstract: This thesis investigates how Bayesian principles can deepen our understanding of modern deep learning systems. While neural networks achieve remarkable predictive performance, their ability to generalize and to quantify uncertainty remains only partly understood. This thesis approaches this challenge from both methodological and theoretical angles: unifying Bayesian inference, function-space modeling, and large-deviation theory under a common probabilistic perspective. On the methodological side, the thesis introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures. Complementing this, two post-hoc methods – the Variational Linearized Laplace Approximation (VaLLA) and the Fixed-Mean Gaussian Process (FMGP) – are proposed to equip pretrained deterministic networks with calibrated uncertainty estimates. The theoretical contributions focus on one of the central open questions in modern machine learning: why do large, over-parameterized neural networks generalize so well? To address this, the thesis develops a unified probabilistic framework that connects three key mechanisms – diversity, smoothness, and stochasticity – within the language of PAC-Bayesian and large-deviation theory.

22.
arXiv (quant-ph) 2026-06-24

Accurate computation of the energy variance and $\langle\langle \mathcal{L}^\dagger \mathcal{L} \rangle\rangle$ using iPEPS

arXiv:2511.22669v2 Announce Type: replace-cross Abstract: Infinite projected entangled-pair states (iPEPS) provide a powerful tensor network ansatz for two-dimensional quantum many-body systems in the thermodynamic limit. In this paper we introduce an approach to accurately compute the energy variance of an iPEPS, enabling systematic extrapolations of the ground-state energy to the exact zero-variance limit. It is based on the contraction of a large cell of tensors using the corner transfer matrix renormalization group (CTRMG) method, to evaluate the correlator between pairs of local Hamiltonian terms. We show that the accuracy of this approach is substantially higher than that of previous methods, and we demonstrate the usefulness of variance extrapolation for the Heisenberg model, for a free fermionic model, and for the Shastry-Sutherland model. Finally, we apply the approach to compute $\langle \langle \mathcal{L}^\dagger \mathcal{L} \rangle \rangle$ for an open quantum system described by the Liouvillian $\mathcal{L}$, in order to assess the quality of the steady-state solution and to locate first-order phase transitions, using the dissipative quantum Ising model as an example.

23.
arXiv (CS.CV) 2026-06-16

Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.

24.
arXiv (CS.CL) 2026-06-18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

25.
arXiv (CS.LG) 2026-06-16

Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives

Authors:

arXiv:2606.16511v1 Announce Type: new Abstract: Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.