Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (quant-ph) 2026-06-17

Fabless Quantum Chip Design and Commercial Production

arXiv:2606.17956v1 Announce Type: new Abstract: This paper proposes a fabless quantum-chip design and production architecture for superconducting quantum computing, centered on the SPICE-Q multiphysics simulation framework. The proposed ecosystem connects process-certified quantum PDKs, parameterized device cells, traceable model cards, SPICE-Q physical modeling languages, unified Q-EDA flows, foundry sign-off rules, cryogenic test feedback, and reusable quantum IP. In this model, design firms do not merely outsource fabrication; they prepare verified tape-outs under standardized process constraints and calibrated physical models. Its economic value lies in reducing repetitive device debugging, process exploration, and low-level layout effort, while its feasibility depends on PDK maturity, foundry yield, cryogenic test throughput, model-prediction accuracy, data-feedback mechanisms, and IP licensing boundaries. We argue that superconducting quantum chips can move from the current largely vertically integrated development model toward a fabless-foundry ecosystem only when hardware design is supported by standardized, verifiable, and reusable software and process interfaces. The required pillars are certified PDKs, PCell-based parameterized design, SPICE-Q cross-physics simulation, end-to-end Q-EDA automation, and a tradable quantum-IP market. By adapting lessons from the classical semiconductor industry to quantum hardware, this framework defines a path toward scalable, manufacturable, and commercially reusable superconducting quantum-chip design.

02.
arXiv (CS.LG) 2026-06-15

Binary Black Hole Parameter Estimation with Hybrid CNN-Transformer Neural Networks

arXiv:2606.13941v1 Announce Type: cross Abstract: The detection of gravitational waves has revolutionized our ability to explore fundamental aspects of the Universe. Traditionally, modeled gravitational-wave signals have been identified using template-based matched filtering, followed by coincidence analysis across multiple detectors in the signal-to-noise ratio time series. Recent advances in Machine Learning and Deep Learning have sparked growing interest in their application to both signal detection and parameter estimation. In this study, a hybrid Deep Learning strategy is proposed that leverages the effectiveness of Transformer encoders alongside well-established Convolutional Neural Network architectures in an attempt to estimate the intrinsic and extrinsic parameters of non-precessing binary black hole systems. The primary focus of this work is point estimation, producing single best-fit values for each parameter rather than full posterior distributions. This method is evaluated on both simulated signals embedded in Gaussian noise and real gravitational-wave events, and it demonstrates strong predictive performance and robustness across key astrophysical parameters.

03.
arXiv (CS.CV) 2026-06-16

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects reasoning depth. Evaluations show Gen-VCoT improves spatial (25% better) and depth (50% better) questions, but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing task-dependent optimal representations. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.

04.
arXiv (quant-ph) 2026-06-19

Quantum models with the Yang-Lee phase transition

arXiv:2606.19732v1 Announce Type: cross Abstract: In this article, we present four different $1+1$D quantum models that realize the Yang-Lee (YL) phase transition under a deformation that preserves $PT$ symmetry. These are the antiferromagnetic Ising spin chain in transverse and longitudinal magnetic fields, the massive Schwinger model, the Blume-Capel model, and the three-state quantum clock model. Using the state-operator correspondence, we identify the YL critical point, compute the scaling dimensions of the lowest operators in each model, and find perfect agreement with the exact results for the YL criticality in two dimensions. Using bosonization for the Schwinger model and the Polyakov-Hubbard transformation for the other models, we show that in all of these quantum models the YL critical point is described, as expected, by a massless bosonic field with an $i \phi^3$ interaction. In the quantum clock model, this critical field interacts with a massive bosonic field, and we identify the massless and massive states in the Hamiltonian spectrum. In addition, we numerically compute the two-point function of $\phi$ at the Yang-Lee critical point and show that it grows with distance, in agreement with theoretical expectations.

05.
arXiv (CS.AI) 2026-06-19

GLARE: A Natural Language Interface for Querying Global Explanations

arXiv:2606.19735v1 Announce Type: new Abstract: While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

06.
arXiv (CS.LG) 2026-06-19

Weighted Bayesian Conformal Prediction

arXiv:2604.06464v2 Announce Type: replace Abstract: Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell \& Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet $\Dir(1,\ldots,1)$ with a weighted Dirichlet $\Dir(\neff \cdot \tilde{w}_1, \ldots, \neff \cdot \tilde{w}_n)$, where $\neff$ is Kish's effective sample size. We prove four theoretical results: (1)~$\neff$ is the unique concentration parameter matching frequentist and Bayesian variances; (2)~posterior standard deviation decays as $O(1/\sqrt{\neff})$; (3)~BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4)~the HPD threshold provides $O(1/\sqrt{\neff})$ improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.

07.
medRxiv (Medicine) 2026-06-22

Multisite Real-World Validation of an Electronic Health Record-Integrated Generative Artificial Intelligence Tool for Venous Thromboembolism Risk Stratification

Background: Guiding risk-appropriate inpatient thromboprophylaxis requires venous thromboembolism (VTE) risk stratification; however, reliable risk determination remains inconsistent in routine care. Health systems increasingly pilot artificial intelligence (AI) tools, yet few studies demonstrate rigorous evaluation in the context of a learning health system (LHS). We evaluated the performance of a pilot electronic health record (EHR)-integrated generative AI (GenAI) system, inHealth General Reasoner (iHGR), for VTE risk stratification versus clinician order set classifications and physician-adjudicated chart review. Methods: This multisite retrospective validation study included adult inpatient admissions at Johns Hopkins Medicine between June 21, 2025, and Dec 18, 2025 (checklist-based order set from June 21, 2025 - November 19, 2025, and clinician judgement-based order set from November 29 - December 18, 2025). From 758 eligible admissions, we randomly sampled 500 balanced by site and order set periods. iHGR and clinician-selected order set classifications were compared with the reference standard (RS). Primary outcomes were iHGR sensitivity and specificity. Secondary analyses compared the order sets with the same RS to evaluate workflow comparators and error patterns. Results: iHGR achieved 81.8% sensitivity (95% CI 77.3-85.6) and 70.9% specificity (63.6-77.3). The checklist-based order set had 61.3% sensitivity (53.7-68.5) and 86.2% specificity (77.4-91.9). The clinician judgement-based order set had 78.1% sensitivity (71.3-83.7) and 65.4% specificity (54.3-75.0). False-negative iHGR classifications were associated with missed narrative risk factors. Conclusion: iHGR showed higher sensitivity for VTE risk than checklist-based order sets and clinician judgement without introducing systematic bias. In silico evaluation of pilot AI systems within LHSs can identify clinically important performance trade-offs and implementation targets before operational scale-up. Narrative clinical data abstraction remained a key limitation, supporting the use of GenAI to support rather than supplant clinician judgement.

08.
arXiv (CS.CV) 2026-06-11

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

09.
bioRxiv (Bioinfo) 2026-06-23

Multi-Scale Machine Learning for Antibody-Antigen Binding Affinity Prediction Using Deep Mutational Scanning and Structural Features

Authors:

Predicting how mutations alter antibody-antigen binding affinity is essential for antibody engineering and vaccine design, yet current methods generalize poorly to unseen complexes. We present a multi-scale machine learning framework integrating 93 descriptors across four modalities: physicochemical, structural, ESM-2 protein language model, and solvent-accessible surface area (SASA)/{Delta}{Delta}G_fold features. Under leave-one-complex-out deep mutational scanning (LOCO-DMS) cross-validation on AbAgym (36,541 mutations, 68 experiments, 13 pathogens), gradient boosting achieved MCC = 0.206; a confidence-stratified ensemble reached MCC = 0.374 (83.5% accuracy, 25.5% coverage). No single modality exceeds the majority baseline alone; only multi-scale fusion succeeds. Boltzmann ceiling analysis shows 45.9% of mutations are near-neutral (|{Delta}{Delta}G| < k_BT), bounding theoretical maximum MCC at 0.473; our method achieves 79.1% of this limit. Five deep learning architectures benchmarked under LOCO-DMS showed self-attention matching gradient boosting (MCC = 0.200). Cross-pathogen transfer failed systematically (mean 46.7%), confirming universal binding predictors remain an open challenge.

10.
arXiv (CS.CV) 2026-06-11

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.

11.
arXiv (math.PR) 2026-06-18

Multi-floor generalization of TASEP

arXiv:2603.13610v2 Announce Type: replace Abstract: We consider an interacting particle system, which generalizes the classical totally asymmetric simple exclusion process (TASEP), in that each site can contain up to a fixed finite number of particles, and the particle movement is governed by a back-pressure (BP) algorithm (also often called MaxWeight). There are $N$ sites (with $N$ finite or infinite), each may contain at most $c$ particles, $1 \le c < \infty$. New particles enter the system at the left-most site $1$ as a Poisson process of rate $\alpha\le 1$, unless site $1$ has $c$ particles. Particles (if any) are removed from the right-most site $N$ as a Poisson process of rate $\beta \le 1$. The left-to-right movement of particles between neighboring sites is governed by the BP rule: one particle moves from site $n$ to $n+1$ at epochs of a rate $1$ Poisson process, as long as the former site has strictly more particles than the latter. When $c=1$, this is the standard TASEP. Our main results address the asymptotics of the stationary distribution of a finite system, and especially the limit of the flux (current) as $N\to\infty$. In particular, we prove that interesting non-trivial phase transitions take place in a system with $c>1$. For example, if $c>1$ and $1/2 \le \beta \le 1$, the maximum limiting flux $1/4$ is achieved as long as $\alpha \ge \alpha_c^*$, where $\alpha_c^* < 1/2$ is some non-trivial threshold. (For the standard TASEP the threshold is $1/2$.) We also put forward a general conjecture about the stationary distribution asymptotics under an arbitrary parameter setting. We illustrate our formal results and the conjecture by simulations, and identify interesting directions for further research.

12.
arXiv (CS.CV) 2026-06-12

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/

13.
arXiv (quant-ph) 2026-06-19

Asymmetric and chiral dynamics of two-component anyons with synthetic gauge flux

arXiv:2512.19139v3 Announce Type: replace-cross Abstract: In this work, we investigate the non-equilibrium dynamics in a one-dimensional two-component anyon-Hubbard model, which can be mapped to an extended Bose-Hubbard ladder with density-dependent hopping phase and synthetic gauge flux. Through numerical simulations of two-particle dynamics and the symmetry analysis, we reveal the asymmetric transport with broken inversion symmetry and two dynamical symmetries in the expansion dynamics. The expansion of two-component anyons is dynamically symmetric under spatial inversion and component flip, when the sign of anyonic statistics phase or the signs of gauge flux and interaction are changed. In the non-interacting case, we show the dynamical suppression induced by both the statistics phase and gauge flux. In the interacting case, we demonstrate that both chiral and antichiral dynamics can be exhibited and tuned by the statistics phase and gauge flux. The dynamical phase regimes with respect to the chiral-antichiral dynamics are obtained. These findings highlight the rich dynamical phenomena arising from the interplay of anyonic exchange statistics, synthetic gauge fields, and interactions in multi-component anyons.

14.
arXiv (CS.CV) 2026-06-17

Effective Gaussian Management for High-fidelity Object Reconstruction

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation and pruning of Gaussian. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SHs) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for above mangement process, we additionally introduce CoRe, an regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing size of the model. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

15.
arXiv (CS.CV) 2026-06-15

Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

16.
arXiv (CS.CL) 2026-06-16

Misinformation Propagation in Benign Multi-Agent Systems

Multi-agent systems, in which multiple large language model agents solve problems through turn-based interaction, are increasingly deployed in high-stakes settings such as medical diagnosis, legal analysis, and forensic decision-making. Their reliability can be at risk when single agents reason from incorrect or misleading context, e.g., from tool calls, since errors may propagate through agent interactions. This work studies this risk by injecting intent-based misinformation into benign single-agent and multi-agent systems across reasoning, knowledge, and alignment tasks. We find that misinformation can degrade single-agent performance and persists across multi-agent debate, with agents often retaining answers introduced by misinformed peers. Nevertheless, multi-agent debate reduces the resulting performance degradation compared to single-agent prompting, especially when most agents are not exposed to misinformation. Robustness depends on group composition and decision protocol. Consensus can be more stable than voting under peer pressure, while majorities can often steer misinformed agents back toward correct answers. Our results show that misinformation robustness in multi-agent systems depends on the underlying model and also on how agents exchange information and aggregate decisions.

17.
arXiv (CS.LG) 2026-06-19

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

arXiv:2606.19451v1 Announce Type: new Abstract: We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

18.
arXiv (CS.CL) 2026-06-18

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

Authors:

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4–99.7\% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5–12.9\% reduction at 5.7\% benign collateral.

19.
medRxiv (Medicine) 2026-06-15

Unveiling the Awareness of Private Health Insurance Coverage among Healthcare Professionals in Freetown, Sierra Leone: Insights Extracted from Their Perspectives.

Our study is an assessment of the knowledge, personal coverage, and related determinants of private health insurance as revealed by healthcare professionals in Freetown, the urban capital of Sierra Leone. This study stands as a precursor for Low- and Middle-Income Countries (LMICs), like Sierra Leone, seeking to establish Universal Health Coverage (UHC) to provide healthcare access and coverage through publicly arranged risk pooling, designed to help protect against unmanageable medical costs. In parallel, such countries face significant challenges with achieving sustainable universal coverage due to limited public resources, inefficient allocation systems, uneasy reliance on out-of-pocket payments, and large struggling populations. Our research sheds particular light on how healthcare professionals view their own participation with private healthcare options. A cross-sectional, analytical study was conducted, openly recruiting individuals from various facilities in Freetown. Using the Yamane Formula, a sample size of 109 participants was calculated. STATA 14.0 was used for data analysis. Our findings revealed that 96 (88.9%) participants did not have private health insurance, while 12 (11.1%) did have private coverage. However, 105 (97.2%) reported other modes of health insurance, with only 3 (2.8%) uninsured. Notably, 97.2% expressed willingness to join a private health insurance scheme. Our study found no statistically significant associations between selected indicators (demographic or socioeconomic fac tors) and current insurance coverage among study participants. These results highlight a low prevalence and understanding of private health insurance among healthcare professionals in a representative urban center in Sub-Saharan Africa (SSA), while acknowledging high willingness to enroll. The lack of any significant determinants suggests other unexamined factors, such as cost, accessibility, or awareness, capable of influencing the adoption and implementation of a universal health program.

20.
arXiv (CS.CL) 2026-06-16

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of "one model fits all" in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

21.
arXiv (CS.CV) 2026-06-18

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

Authors:

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.

22.
arXiv (CS.CV) 2026-06-17

DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.

23.
arXiv (math.PR) 2026-06-17

The Erdős-Hajnal High-Girth Subgraph Conjecture Holds in the Polynomial Chromatic-Sparsity Regime

Authors:

arXiv:2606.17901v1 Announce Type: cross Abstract: For a graph $G$ put $h_r(G)=\max{\chi(H):H\subseteq G,\operatorname{girth}(H)\ge r}.$ Erdős and Hajnal asked whether $h_r(G)\to\infty$ as $\chi(G)\to\infty$, for every fixed $r\ge4$. We prove this in every fixed polynomial edge-density regime: for all $r\ge4$, $k\ge2$, $P,C>0$, there is $M=M_{r,k}(P,C)$ such that $\chi(G)\ge M,\ e(G)\le C\chi(G)^P\Longrightarrow h_r(G)\ge k.$ Quantitatively, after replacing $P$ by $P\vee2$ and $C$ by $C\vee2$, $M_{r,k}(P,C)\le \exp!\left(O_{r,k}\bigl((P+2+\log(C\vee2))^2\bigr)\right),$ and consequently the same conclusion holds throughout the quasi-polynomial range $e(G)\le \exp\bigl(C_0(\log\chi(G))^a\bigr),\ 1 < a < 3/2,$ for all sufficiently large $\chi(G)$. In each fixed polynomial-density regime we also obtain $f_{P,C}(k,r)\le k^{O_{r,P,C}(1)}.$ The proof combines a chromatic-defect random extraction lemma, compact and near-quadratic sparse-core bases, and a peeling/thinning bootstrap increasing the admissible edge exponent by $1/(r-1)$. We also prove structural saturation results for possible counterexamples, including Moore-strength exact-cycle packings and quadratic saturation in projected colour-pair space. Finally, writing $h_r^{\mathrm f}(G)=\max{\chi_{\mathrm f}(H):H\subseteq G,\operatorname{girth}(H)\ge r},$ we develop a fractional random-extraction framework based on Mohar-Wu preservation. We prove sufficient cheap-cycle-killing criteria and verify them for several structured families, including clique-organised families, line graphs of incidence graphs of equal-order generalized quadrangles and generalized hexagons, and the Bohman-Keevash tracking-time triangle-free-process graph. We also isolate a density-free obstruction that any proof using this fractional surgery route must overcome.

24.
arXiv (CS.LG) 2026-06-16

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

arXiv:2602.01394v2 Announce Type: replace-cross Abstract: This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream detection of the acoustic scene. Code and pretrained models will become available upon acceptance. Demo page: https://ssnaps2026.github.io/ssnaps2026/

25.
arXiv (CS.AI) 2026-06-11

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

arXiv:2606.12406v1 Announce Type: cross Abstract: Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at https://jasonjzliu.com/factr2