Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-25

Neural Scaling Universality: If Exponents Are Fixed, Time to Understand Coefficients

Neural scaling laws describe how pre-training loss decays as power laws with training time, model size, and compute. This position paper argues that the exponents of these power laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients, however, are expected to be sensitive to data and architecture details, and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier. We therefore argue that understanding the coefficients is the key to near-term performance improvements, and that a closer examination of the current universality class may reveal pathways to better universality classes.

02.
arXiv (CS.CV) 2026-06-25

Dual Distribution Estimation for Zero-shot Noisy Test-Time Adaptation with VLMs

While test-time adaptation (TTA) empowers vision-language models to adapt without costly retraining, it remains highly vulnerable to out-of-distribution (OOD) outliers prevalent in real-world applications. This discrepancy motivates Noisy TTA (NTTA), an online task to filter noisy OOD samples on the fly while maximizing in-distribution (ID) classification accuracy. Existing zero-shot NTTA approaches typically rely on test-time discriminative training, leading to overconfident misclassifications and significantly degraded inference efficiency. To address these limitations, we propose a novel framework named Dual Distribution Estimation (DDE), shifting the zero-shot NTTA paradigm from instance-level learning to training-free Gaussian distribution modeling. DDE incorporates two novel modules: Positive Feature Distribution Estimation (PFDE) and Negative Label Distribution Estimation (NLDE). PFDE explicitly models class-wise inclusion and exclusion Gaussian distributions to formulate a calibrated contrastive score, robustly enhancing ID accuracy. In parallel, NLDE improves OOD identification by explicitly modeling the negative label distribution to mine highly discriminative labels, effectively mitigating spurious correlations. Extensive experiments show that on the large-scale ImageNet benchmark, DDE achieves an improvement of 3.70\% in harmonic mean accuracy and reduces the FPR95 for OOD detection by 6.20\%, while ensuring highly scalable and efficient online inference. Furthermore, DDE is zero-shot and training-free, demonstrating remarkable robustness in data-scarce scenarios. Codes are available at https://github.com/ZhuWenjie98/DDE.

03.
arXiv (quant-ph) 2026-06-19

Propagating Collective Spin-valley Modes in Twisted WSe2

arXiv:2507.18770v2 Announce Type: replace-cross Abstract: The emergence of neutral collective modes is a hallmark of correlated quantum phases but is often challenging to probe experimentally. In two-dimensional flatband systems, charge responses have been intensively investigated yet neutral excitations remain largely unexplored. In particular, intervalley coherent state (IVC) features a neutral Goldstone mode due to spontaneously broken valley U(1) symmetry. While IVC state has been proposed as a unifying theme across graphene and semiconductor based systems, its defining feature, the neutral Goldstone mode, remains elusive in experiment. Here we investigate space and time resolved transport of neutral modes in twisted WSe2 moire superlattices through a novel ultrafast imaging technique. We uncover two new propagating collective modes with very different velocities, which emerge near the van Hove singularity (VHS) in both intermediate (3.5 to 4 degree) and large (around 5 degree) angle twisted WSe2. The fast-propagating mode has a large speed of about 3 km/s and is consistent with a Goldstone mode for an IVC state, while the slow-moving mode is likely a gapped amplitude mode. They can be understood as the spin-valley analogues of collective modes of a superfluid, whose propagation is imaged for the first time in a condensed matter system. Our study demonstrates a powerful new approach for probing charge-neutral modes in quantum materials and offers key insights into the interplay between charge and spin-valley physics in moire superlattices.

04.
arXiv (CS.CL) 2026-06-15

Deja Vu at Scale: Paraphrase-Robust Detection of Duplicate Gherkin Steps in Behaviour-Driven Software Testing with Sentence-Transformer Embeddings and a 1.1M-Step Open Benchmark

Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step pairs under a released rubric (60-pair overlap, Fleiss kappa = 0.84). We report precision, recall, and F1 with bootstrap 95% CIs under the primary rubric and a score-free relabelling, and benchmark against SourcererCC-style and NiCad-style lexical baselines. Results. Step-weighted exact-duplicate rate is 80.2%; median-repository rate is 58.6% (Spearman rho = 0.51). The top hybrid cluster has 20,737 occurrences across 2,245 files. Near-exact reaches F1 = 0.822 on score-free labels; semantic F1 = 0.906 under the primary rubric reflects a disclosed stratification artefact. Lexical baselines reach F1 = 0.761 and 0.799. The savings model estimates 893,357 corpus-wide eliminable step occurrences; on the median repository 62.5% of step lines are eliminable.

05.
arXiv (quant-ph) 2026-06-24

Mode-selective nonlinear interference for high-brightness and high-purity fiber-coupled SPDC sources

arXiv:2606.23836v1 Announce Type: new Abstract: Single-mode-fiber-coupled spontaneous parametric down-conversion (SPDC) sources are a key resource for photonic quantum technologies, but in single-crystal geometries brightness, heralding efficiency, and spectral purity remain constrained by intrinsic trade-offs. Here, we show how nonlinear interference in a cascaded two-crystal type-II SPDC source can be used to engineer the modal structure of SPDC emission, improving the brightness–heralding-efficiency trade-off by more than one order of magnitude beyond the single-crystal limit. We further demonstrate two routes to near-unity spectral purity while retaining high brightness and/or heralding efficiency, even with standard periodically poled crystals, and study the additional advantages of aperiodic poling with Gaussian phase matching. Using a spectrally resolved Laguerre–Gauss modal decomposition, we show that these improvements arise from mode-selective interference of spatial-spectral SPDC modes within the nonlinear interferometer. We experimentally validate the model through sum-frequency-generation measurements of the spatial-spectral state.

06.
medRxiv (Medicine) 2026-06-24

ADVISE: A Machine Learning Framework for Early Recognition of a Surrogate Marker for Ventilator-Associated Pneumonia Using Routinely Collected Critical Care Data

Background Ventilator-associated pneumonia (VAP) is the most frequent nosocomial infection in critical care, affecting 20-36% of mechanically ventilated patients. Early prediction is hampered by the absence of a reliable, objective diagnostic standard. We developed ADVISE (Automated Dudley Ventilation Infection Series Evaluation), a machine learning model to predict physiological deterioration consistent with developing VAP using routinely collected electronic health record data from a UK NHS intensive care unit. Methods Retrospective observational study of admissions at Russell's Hall Hospital ICU (2008-2026). Following National Data Opt-Out exclusion (158 admissions, 4.2%), 3,566 admissions generated 33,208 candidate 48-hour observation blocks. Six temporal variables - FiO2, ventilator mode, P:F ratio, procalcitonin (PCT), secretion amount, and secretion description - were extracted across the baseline window (hours 1-24). A composite VAP-surrogate outcome required concurrent P:F ratio decline (>=5%) and PCT rise (>=0.5 ng/mL) across the outcome window (hours 25-48). After sequential quality filters, 2,134 blocks (18 positive, 0.84% prevalence) were retained. An XGBoost classifier was trained using nested 5-fold cross-validation with scale_pos_weight=114.0 and ROC-based hyperparameter optimisation on 1,495 training blocks, evaluated on 639 held-out test blocks. Performance was assessed via AUROC, AUPRC, and calibration (Brier score). Bootstrap resampling (1,000 iterations) generated 95% confidence intervals. Results On the held-out test set (n=639, 5 positive outcomes), ADVISE achieved AUROC 0.874 [95% CI: 0.771-0.939] and AUPRC 0.031 [0.008-0.069], representing a 4.0-fold improvement over the no-skill baseline. Nested cross-validation mean AUROC was 0.844 +/- 0.078 (range 0.716-0.915). At the Youden-optimal threshold, sensitivity was 0% with specificity 97.8%, reflecting extreme class imbalance (0.78% test prevalence). A threshold targeting 80% sensitivity achieved sensitivity 80.0% [33.3-100.0%], specificity 87.4% [84.8-89.9%], positive predictive value 4.8% [1.1-9.9%], and negative predictive value 99.8% [99.4-100.0%], detecting 4 of 5 VAP cases with approximately 80 false alarms (12.6% false positive rate). Brier score was 0.0078. Feature importance identified baseline P:F ratio as the dominant predictor (41.3% total gain), followed by ventilator mode (26.1%), secretion amount (13.2%), secretion description (9.1%), procalcitonin (5.9%), and FiO2; (4.5%). Conclusions ADVISE demonstrates that baseline oxygenation trajectory and ventilatory support patterns - derived exclusively from routinely charted ICCA variables - can identify admissions at risk of VAP-related physiological deterioration with meaningful discrimination (AUROC 0.874) despite severe class imbalance. The 80% sensitivity operating point offers a clinically actionable alert rate (12.6% FPR), supporting integration into existing ICU workflows. This proof-of-concept study establishes feasibility; multi-site prospective validation is required before clinical deployment.

07.
arXiv (math.PR) 2026-06-17

Decay of correlations and zeros for the hard-core model

arXiv:2603.17858v2 Announce Type: replace Abstract: In a recent paper the last author proved that absence of complex zeros of the partition function of the hard-core model near a parameter $\lambda>0$ implies a form of correlation decay called strong spacial mixing. In this paper we investigate the reverse implication. We introduce a strengthening of strong spatial mixing that we call very strong spatial mixing (VSSM). Our main result is that if VSSM holds at a parameter $\lambda>0$ for a family of graphs, this implies that the partition function has no zeros near that parameter for each graph in the family. We also demonstrate that a closely related variant of very strong spatial mixing does not imply zero-freeness. As a consequence of our main result, we moreover obtain that VSSM implies spectral independence. Our proof relies on transforming the problem to the analysis of an induced non-autonomous dynamical system given by Möbius transformations.

08.
arXiv (CS.AI) 2026-06-16

PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

arXiv:2510.15966v2 Announce Type: replace Abstract: Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task-oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych-inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema updation, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema-grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state-of-the-art by significantly enhancing adaptability and long-term knowledge retention.

09.
arXiv (CS.AI) 2026-06-24

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

Authors:

arXiv:2606.24169v1 Announce Type: new Abstract: Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts: a multilingual (ML) encoder or an English-only (EN) encoder. The common intuition is that the multilingual encoder should help most at low data, but it is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization. We answer these questions with a controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, up to five target-language data scales (100 h to 2500 h), three streaming tiers plus offline decoding, and up to four public test sets. The main result is that multilingual initialization is a data-limited advantage, not a latency-limited one. On FLEURS at 160 ms, the mean EN-ML word error rate (WER) gap falls from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h; a power-law fit summarizes this decay, with each doubling of target-language data roughly halving the remaining advantage. Across the three streaming tiers, the across-language mean EN-ML gap is approximately stable at each scale from 100 to 1000 h, and is near zero by 2500 h. Finally, 4-bit weight-only encoder quantization at the matched 560 ms streaming tier reduces the encoder footprint by about 3x, with an average FLEURS WER increase of about 0.5 pp. The resulting guideline is simple: use multilingual initialization in low-data regimes, treat the choice as effectively irrelevant at large data, and make latency and quantization decisions independently.

10.
arXiv (CS.CL) 2026-06-24

The Warrant Gap: Claim-Conditioned Re-scoring for Fact-Checking

Fact-checking systems built on LLMs achieve high verdict accuracy on standard benchmarks, yet routinely output Supports labels whose cited evidence does not license the claim. Structured decomposition is the natural way to inspect those warrants, but rigid extraction protocols strip the full-claim context that facets need. We introduce SIFT – claim-conditioned re-scoring of extracted evidence spans against the full claim – paired with WSP (Warranted Supports Proportion), an automatic NLI check that the cited warrant entails the claim. We evaluate on FEVER, SciFact, 5PILS, and DP across four open-source backbones. SIFT recovers accuracy on cells where naive decomposition costs up to 27.6 points, while raising WSP above direct prompting; WSP itself calibrates against human gold evidence at AUC 0.92 and precision 0.98.

11.
arXiv (CS.CV) 2026-06-16

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present LentiAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, LentiAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, LentiAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

12.
arXiv (CS.CL) 2026-06-17

E2Vec: Feature Embedding with Temporal Information for Analyzing Student Actions in E-Book Systems

Digital textbook (e-book) systems record student interactions with textbooks as a sequence of events called EventStream data. In the past, researchers extracted meaningful features from EventStream, and utilized them as inputs for downstream tasks such as grade prediction and modeling of student behavior. Previous research evaluated models that mainly used statistical-based features derived from EventStream logs, such as the number of operation types or access frequencies. While these features are useful for providing certain insights, they lack temporal information that captures fine-grained differences in learning behaviors among different students. This study proposes E2Vec, a novel feature representation method based on word embeddings. The proposed method regards operation logs and their time intervals for each student as a string sequence of characters and generates a student vector of learning activity features that incorporates time information. We applied fastText to generate an embedding vector for each of 305 students in a dataset from two years of computer science courses. Then, we investigated the effectiveness of E2Vec in an at-risk detection task, demonstrating potential for generalizability and performance.

13.
arXiv (CS.AI) 2026-06-15

When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More

arXiv:2606.14476v1 Announce Type: new Abstract: A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an explicit tool and measure, on node classification over a text-attributed graph (ogbn-arxiv, replicated on WikiCS), whether the agent uses the tool or merely obeys it. We find the agent does not exercise judgment: its predictions agree with the raw GNN's 97.6-99.2% of the time (5 seeds), collapsing into a GNN parrot that adopts the tool's output wholesale and bypasses its own reasoning. Sweeping backbone capability (Qwen2.5 0.5B-7B), the deference is not a weak-model artifact: among models able to invoke the tool, agreement rises with capability (0.60 to 0.98 from 1.5B to 7B). Crucially, the cost of deference does not shrink as capability grows and grows where alternatives emerge: a per-node oracle over the available actions beats the parrot by 0.09-0.18 at 3B and 0.12-0.22 at 7B, roughly doubling at high homophily, because the parrot is pinned to the frozen GNN while the agent's alternatives improve; at 7B a simple neighbour-label tool overtakes the GNN at high homophily (0.81 vs 0.71) yet the agent still defers. A simple selective-invocation gate recovers about half of that high-homophily gap (0.71 to 0.83) but yields no net global gain, and held-out estimates bound the best achievable gate over standard test-time features to at most a third of the oracle headroom: reliable selective invocation looks limited by available information, not merely router design. Our results are a cautionary measurement: evaluations of agent+tool systems cannot assume the agent adds judgment on top of the tool, and selective invocation must be designed in rather than expected to emerge from scale.

14.
arXiv (CS.LG) 2026-06-15

Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

arXiv:2602.12379v2 Announce Type: replace Abstract: Estimating longitudinal treatment effects is essential for sequential decision-making but is challenging due to treatment-confounder feedback. While Iterative Conditional Expectation (ICE) G-computation offers a principled approach, its recursive structure suffers from error propagation, corrupting the learned outcome regression models. We propose D3-Net, a framework that mitigates error propagation in ICE training and then applies a robust final correction. First, to interrupt error propagation during learning, we train the ICE sequence using Sequential Doubly Robust (SDR) pseudo-outcomes, which provide bias-corrected targets for each regression. Second, we employ a multi-task transformer with a covariate simulator head for auxiliary supervision, regularizing representation learning, and a target network to stabilize training dynamics. For the final estimate, we discard the SDR correction and instead use the uncorrected nuisance models to perform Longitudinal Targeted Minimum Loss-Based Estimation (LTMLE) on the original outcomes. This second-stage, targeted debiasing ensures robustness and optimal finite-sample properties. Comprehensive experiments demonstrate that our model, D3-Net, robustly reduces bias and variance across different horizons, counterfactuals, and time-varying confoundings, compared to existing state-of-the-art ICE-based estimators.

15.
bioRxiv (Bioinfo) 2026-06-11

Tumour evolution as ground truth for cancer whole-genome sequencing

Cancer genomes are shaped by evolutionary processes that couple mutagenesis, clonal selection, chromosomal instability, spatial growth and treatment response into structured genomic patterns, yet current benchmarking strategies largely ignore this evolutionary dependency. Here, we present SCOUT, a large-scale synthetic whole-genome sequencing resource of over 200 samples, designed for systematic benchmarking of tumour genomic analysis and evolutionary inference under controlled evolutionary ground truth. Unlike conventional task-specific simulations, SCOUT models tumour evolution as a latent generative process that simultaneously shapes mutations, copy-number alterations, variant allele frequencies, mutational signatures and clonal architectures. SCOUT recapitulates key features of solid and haematological malignancies, including driver mutations, chromosomal instability, intratumour heterogeneity, spatial sampling and treatment-associated evolutionary dynamics in tumour and matched-normal longitudinal and multi-region sequencing designs. Using SCOUT, we benchmarked widely used methods for somatic variant detection, copy-number analysis, mutational signature inference and tumour evolutionary reconstruction. Across analytical tasks, performance deteriorated in low-purity, highly subclonal and structurally complex tumours, while spatial sampling bias and hypermutation generated spurious evolutionary signals that confounded tumour interpretation across multiple inference layers. Evolutionary simulations further distinguished lineage-restricted genetic bottlenecks from multi-lineage resistance dynamics associated with tumour plasticity. Tumour purity consistently exerted a stronger effect on inference accuracy than sequencing depth. Together, our results establish evolutionary ground truth as a prerequisite for reproducible benchmarking and biologically interpretable analysis of cancer whole-genome sequencing data.

16.
arXiv (CS.CV) 2026-06-11

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by $+3.43$, $+8.01$, and $+15.74$ over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

17.
arXiv (CS.LG) 2026-06-24

Efficient reduction of stellar contamination and noise in planetary transmission spectra using neural networks

arXiv:2602.10330v3 Announce Type: replace-cross Abstract: Context: The characterization of exoplanetary atmospheres has been transformed by the James Webb Space Telescope (JWST), whose infrared sensitivity enables transmission spectroscopy at unprecedented precision. However, stellar heterogeneities (e.g., spots and faculae) remain a dominant source of contamination that can bias atmospheric retrievals if not properly corrected. Aims: We present a methodology for reducing stellar contamination and instrument-specific noise from exoplanet transmission spectra using neural networks, in particular the so-called Denoising AutoEncoders (DAE). Our goals are to enable fast, accurate corrections that improve the reliability of atmospheric parameter retrievals and to promote the use of unsupervised algorithms for efficient data processing. Methods: We designed and trained DAE architectures using large synthetic datasets of terrestrial (TRAPPIST-1e analogues) and sub-Neptune (K2-18b analogues) planets. Atmospheric retrieval experiments were then performed on contaminated spectra in order to compare our deep-learning approach against standard correction methods in terms of accuracy and computational cost. Results: Our autoencoders successfully reconstruct uncontaminated spectra, preserving essential molecular features even in low-S/N regimes. In retrieval tests, the denoising autoencoder pre-processing reduces bias in retrieved abundance parameters compared to uncorrected observations. Notably, our method matches the accuracy of simultaneous stellar-contamination fitting while maintaining a much lower computational cost, typically one order of magnitude smaller. Conclusions: These results demonstrate that DAEs outperform conventional correction methods in computational efficiency while maintaining high accuracy, paving the way for their integration into future atmospheric characterization pipelines for both rocky and giant exoplanets.

18.
arXiv (CS.AI) 2026-06-19

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

arXiv:2505.22829v2 Announce Type: replace-cross Abstract: This paper bridges distribution shift and AI safety through a comprehensive analysis of their conceptual and methodological synergies. While prior discussions often focus on narrow cases or informal analogies, we establish two types connections between specific causes of distribution shift and fine-grained AI safety issues: (1) methods addressing a specific shift type can help achieve corresponding safety goals, or (2) certain shifts and safety issues can be formally reduced to each other, enabling mutual adaptation of their methods. Our findings provide a unified perspective that encourages deeper integration between distribution shift and AI safety research.

19.
arXiv (CS.CV) 2026-06-25

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

General-purpose models often struggle to reliably identify and understand real-world multimodal risks, largely due to the inherent multimodal adversarial nature of content and AI safety. We present Yuvion VL, a family of multimodal large language models purpose-built for content and AI safety, with both instruction-tuned and reasoning-oriented variants. Yuvion VL addresses this gap by treating safety as an inherently adversarial and multimodal problem and designing the entire pipeline around adversarial robustness. For data construction, we develop an automated pipeline integrating adversarial-aware data synthesis with multi-stage quality control, producing large-scale, high-quality multimodal samples augmented with domain knowledge and reasoning annotations. For training, we adopt a three-stage pipeline that includes continued pretraining for risk-concept cross-modal alignment, instruct post-training for production-grade safety tasks, and reasoning post-training for enhanced interpretability and performance in complex tasks. We further introduce Confuse-then-Contrast Fine-Tuning, a contrastive framework that mines model-specific confusions and constructs multi-image contrastive groups to enforce explicit discrimination of fine-grained visual-semantic elements, enabling the model to distinguish between visually similar cases with different safety implications in adversarial safety tasks. To support rigorous evaluation, we further introduce Yuvion VL RiskEval (YVRE), a collection of benchmarks covering diverse open and internal evaluations, with a focus on content and AI safety, adversarial robustness, and real-world capability requirements. Experiments show that Yuvion VL-32B achieves industry-leading safety performance, surpassing comparably sized open-source models and best closed-source commercial models, while maintaining comparable general capabilities.

20.
arXiv (CS.LG) 2026-06-24

A Comparative Study of Bayesian Contextual Bandits for Real-Time Warehouse Sorter Optimization

arXiv:2606.23977v1 Announce Type: new Abstract: Efficient sorter diversion control of automated material handling systems (MHS) is critical for optimizing operational efficiency in large-scale warehouse environments. In this study, we use an inbound receiving sorter at a high-volume e-commerce warehouse as our primary use case, where the sorter diversion system relies on cost functions with static weight configurations that fail to adapt to highly dynamic system contexts, such as volume mode, congestion level, equipment physical status, and upstream/downstream dependencies. To address this real-time sorter diversion optimization challenge, we conducted a comparative study of three candidate hybrid machine learning frameworks: Linear Regression with Gradient Descent Optimization (LR+GDO), XGBoost with Bayesian Optimization (XGB+BO), and Bayesian Contextual Bandits (BCB). Model training and evaluation were enabled by leveraging a high-fidelity physics-aware emulator to overcome the cold-start problem and allow a safe transition from offline to online learning. We performed comprehensive evaluations including reward model predictive accuracy, contextual sensitivity, action distribution, and projected reward uplift. Our results demonstrate that while tree-based reward models offer slightly better predictive power, the BCB framework achieved overall higher performance with 2.03% reward uplift over the heuristic baseline. Furthermore, BCB exhibits several superior characteristics, such as its decisive time-optimal policy backed by Bang-Bang control theory, continuous online learning capability, strategic balance between exploration and exploitation, and significantly shorter inference latency. These results demonstrate the potential of the BCB framework for real-time control optimization in large-scale warehouse environments, motivating further investigation toward operational deployment.

21.
arXiv (CS.CL) 2026-06-18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

22.
arXiv (CS.CL) 2026-06-11

SOMA-SQL: Resolving Multi-Source Ambiguity in NL-to-SQL via Synthetic Log and Execution Probing

Natural language interfaces to databases aim to translate user questions into executable SQL, yet remain brittle in real-world settings where questions are underspecified and schemas are large and ambiguous. Ambiguity across user questions, database schemas, and model interpretations are central failure modes in NL2SQL, leading to misaligned intent, incorrect schema grounding, and erroneous SQL generation. Existing approaches rely on human clarification or treat ambiguity as a schema representation problem, but these do not scale nor resolve ambiguity autonomously. We propose SOMA-SQL to automatically resolve ambiguity via targeted synthetic query log and ambiguity-driven probing. SOMA-SQL constructs synthetic query log to ground schema interpretation and guide candidate SQL generation; it then executes targeted probing queries, driven by a structured ambiguity taxonomy and candidate disagreements, to produce disambiguation evidence for final SQL selection and repair. This active approach to ambiguity discovery and resolution generalizes across unseen schemas and query distributions without human-in-the-loop. Experiments on six public benchmarks demonstrate that SOMA-SQL improves execution accuracy by 13.0% on average over state-of-the-art baselines, with gains of up to 16.7% on ambiguous questions.

23.
arXiv (CS.CL) 2026-06-11

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

When a language model processes a hallucinated response, its attention routing tends to fail in one of two shapes: over-concentrating on a narrow set of positions, or spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. We study these shapes as a diagnostic characterization, computed from attention matrices under forced scoring of benchmark-labeled responses rather than during live generation. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport capacity; we prove that every transpose-invariant spectral diagnostic of this operator is structurally orientation-blind (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a converse to the blindness theorem bounding any Lipschitz diagnostic's transpose sensitivity by the asymmetry coefficient $G$. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. This floor is an idealized-architecture benchmark, not an empirical attractor: the fraction of real attention heads that pierce it is itself an architectural signature. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (0.62-0.84 LC-AUROC) across the tested decoder-only, encoder-only, and encoder-decoder models, with polarity reversing as predicted between HaluEval and MedHallu.

24.
medRxiv (Medicine) 2026-06-12

Reduced nighttime smartphone use among cohabiting partners: a longitudinal study under the lens of social control of health behaviors theory

Objective: We examined the link between cohabitation with a partner and nighttime smartphone use through the social control of health behavior theory. Background: Nighttime smartphone use is a behavioral risk factor for sleep problems. While previous research has predominantly focused on individual-level risks of sleep disturbances, the role of social context remains underexplored. Theoretical frameworks, specifically the Social Control of Health Behavior, suggest that social relationships regulate health-related behaviors; however, it is unclear how far this regulation extends to modern digital behaviors among couples. Method: We analyzed survey data from three waves of the SmartSleep Study (2018, 2020, and 2023; total N = 25,028), including a longitudinal follow-up subset (N = 1,003). We tested multivariate associations between living with a partner, changes in cohabitation status and frequent nighttime smartphone use by fitting generalized linear mixed-effects models. Additionally, we mapped the complex interplay between indicators of social integration, social support, smartphone use, and sleep quality using hierarchical clustering of non-linear correlations. Results: Cohabiting participants had lower odds of frequent nighttime smartphone use compared to those living alone (OR = 0.66; 95% CI: 0.61, 0.72). This lower risk was driven primarily by cohabitation with a partner (OR = 0.49; 95% CI: 0.36, 0.66). Longitudinal analysis supported these findings, showing that sustained cohabitation was associated with less frequent nighttime use (OR = 0.56; 95% CI: 0.38, 0.82). Clustering analysis revealed that indicators of social integration and support clustered with favorable sleep quality. Conclusion: Our findings suggest that the health-protective effects of cohabitation with a partner extend to digital behaviors. Consistent with social control of health behavior theory, the presence of a partner appears to reduce frequent nighttime smartphone use, highlighting the critical importance of considering social context when addressing digital health hygiene and promoting sleep.

25.
arXiv (CS.AI) 2026-06-16

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

arXiv:2606.16214v1 Announce Type: cross Abstract: Modern deep learning models remain notoriously prone to overconfidence, limiting their reliability in high-stakes applications. Bayesian methods aim to counter this by learning a distribution over model parameters, and recent advances now make this feasible for large-scale architectures at costs comparable to AdamW. However, a challenge remains at test time: predictions must be averaged across many forward passes with weights sampled from the posterior, which is prohibitively expensive. Variance propagation offers an efficient alternative, computing layer-wise analytical approximations of uncertainty in a single forward pass. While such techniques are effective for MLPs, their extension to modern architectures remains challenging, due to increased depth and diversity of layer types. To fill this gap, we propose Calibrated Variance Propagation (CVP), which introduces a new propagation method for normalization layers, combines it with recent techniques for handling activation functions, and absorbs residual error through a light calibration step. CVP yields comparably accurate uncertainty estimates to MC sampling across transformers and CNNs, at a fraction of the cost. Against prior variance propagation work, CVP improves coverage at $0.5\%$ risk from $8.2\%$ to $14.6\%$ with BEiT-3 on Visual Reasoning (NLVR2) and from $2.6\%$ to $10.8\%$ with ViLT on VQAv2, with gains extending to convolutional architectures.