Academic Intelligence · Curated Daily

Explore the Global Frontier of Academic Research

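AcademicHub brings together real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.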

01.
arXiv (CS.LG) 2026-06-16

Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization

arXiv:2606.16899v1 Announce Type: new Abstract: Matrix based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3 style models up to 1.2B parameters, Muon Hyperball achieves 20–30% token equivalent speedup over weight decay baselines. Hyperball also improves learning rate transfer across widths and depths compared to decoupled weight decay. This method is motivated by prior theory showing that training with weight decay leads to an equilibrium weight norm that only depends on the training hyperparameters. Through this mechanism, the weight decay then decides the angular learning rate, i.e. how fast the direction of the weight matrix changes.
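
The norm-fixing idea described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the order of operations (rescale the update, take the step, then rescale the weights), and the way the learning rate enters are hypothetical and are not taken from the paper.

```python
import math

def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

def rescale_to_norm(matrix, target_norm, eps=1e-12):
    """Return a copy of the matrix scaled so its Frobenius norm equals target_norm."""
    scale = target_norm / max(frobenius_norm(matrix), eps)
    return [[x * scale for x in row] for row in matrix]

def hyperball_step(weight, base_update, weight_norm, update_norm, lr):
    """One wrapped step (assumed form, not the authors' exact rule).

    base_update is whatever the base optimizer (e.g. Adam or Muon) proposes.
    """
    # 1. Fix the Frobenius norm of the optimizer update.
    update = rescale_to_norm(base_update, update_norm)
    # 2. Take the step.
    stepped = [
        [w - lr * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(weight, update)
    ]
    # 3. Put the weight matrix back on the sphere of fixed Frobenius norm.
    return rescale_to_norm(stepped, weight_norm)

weight = [[3.0, 0.0], [0.0, 4.0]]          # Frobenius norm 5
proposed = [[0.1, -0.2], [0.3, 0.05]]      # stand-in for a base optimizer update
new_weight = hyperball_step(weight, proposed, weight_norm=5.0, update_norm=1.0, lr=0.1)
print(frobenius_norm(new_weight))
```

With both norms held constant, each step can only rotate the weight matrix, so under these assumptions `lr * update_norm / weight_norm` plays the role of the angular learning rate mentioned in the abstract.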

02.
arXiv (quant-ph) 2026-06-16

Hardy-type self-testing and exposedness of tripartite GHZ correlations

arXiv:2512.16242v2 Announce Type: replace Abstract: Nonlocality can be witnessed either through Bell-inequality violations or through logical contradictions such as Hardy's paradox. In the bipartite two input two outcome scenario, these two routes have distinct geometric behavior: CHSH-maximal correlations are exposed points of the quantum set, whereas known Hardy-type self-testing correlations on the no-signaling boundary are non-exposed. Here we show that this bipartite intuition fails in the tripartite two input two outcome scenario. We study the tripartite instance of a multipartite Hardy-type paradox and prove that the correlation attaining the maximal Hardy success probability self-tests the Greenberger–Horne–Zeilinger state and the associated measurements. Although this correlation lies on the no-signaling boundary, we show that it is an extremal and exposed point of the quantum correlation set. Moreover, it coincides with the correlation attaining the maximal violation of the Mermin inequality. Thus, in the tripartite GHZ scenario, the logical-paradox and Bell-inequality routes to nonlocality select the same exposed quantum boundary point. We also establish a robust version of the self-test, showing that small deviations from the ideal Hardy constraints imply quantitative closeness to the target state and measurements. Our results reveal a qualitative geometric difference between bipartite and tripartite Hardy-type nonlocality and suggest a broader investigation of exposedness for multipartite Hardy correlations in the multiparty setting.
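
As a small worked check of the Mermin side of this statement, the sketch below evaluates one common form of the Mermin expression, M = XXX - XYY - YXY - YYX, on the three-qubit GHZ state. The choice of this particular form and of Pauli X/Y measurements is an assumption for illustration; the paper's own conventions may differ. Local hidden-variable models satisfy |M| <= 2, while the quantum maximum is 4.

```python
import math

def apply_pauli(pauli, bit):
    """Act with X or Y on a computational-basis bit; return (flipped_bit, phase)."""
    if pauli == "X":
        return 1 - bit, 1
    if pauli == "Y":
        # Y|0> = i|1>,  Y|1> = -i|0>
        return 1 - bit, (1j if bit == 0 else -1j)
    raise ValueError(f"unsupported Pauli: {pauli}")

def expectation(paulis, state):
    """<state| P1 (x) P2 (x) P3 |state> for a state stored as {bit_tuple: amplitude}."""
    total = 0j
    for bits, amp in state.items():
        phase = 1
        flipped = []
        for pauli, bit in zip(paulis, bits):
            new_bit, factor = apply_pauli(pauli, bit)
            flipped.append(new_bit)
            phase *= factor
        bra_amp = complex(state.get(tuple(flipped), 0)).conjugate()
        total += bra_amp * phase * amp
    return total.real

def mermin_value(state):
    """Mermin expression M = XXX - XYY - YXY - YYX (assumed sign convention)."""
    return (expectation("XXX", state) - expectation("XYY", state)
            - expectation("YXY", state) - expectation("YYX", state))

amp = 1 / math.sqrt(2)
ghz = {(0, 0, 0): amp, (1, 1, 1): amp}   # (|000> + |111>) / sqrt(2)
print(mermin_value(ghz))
```

Up to floating-point rounding the printed value is 4, the algebraic maximum of the expression and twice the local bound.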

04.
arXiv (CS.AI) 2026-06-19

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

arXiv:2606.19741v1 Announce Type: new Abstract: Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

05.
arXiv (CS.CL) 2026-06-19

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial "Recency Effect" in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.
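
The abstract does not define the metric precisely, so the following is only one plausible reading: treat the average confidence as the mean of a per-step confidence recorded during iterative decoding, and route to the query placement with the highest mean. The function names, the trace format, and the placement labels are all hypothetical.

```python
from statistics import fmean

def average_confidence(step_confidences):
    """Mean confidence over all decoding steps (an assumed reading of C-bar)."""
    return fmean(step_confidences)

def pick_query_position(traces):
    """Choose the placement whose decoding trace has the highest average confidence.

    traces maps a placement label to the list of per-step confidences
    recorded while decoding with the query at that placement.
    """
    return max(traces, key=lambda position: average_confidence(traces[position]))

# Illustrative numbers only: the final-step confidence favours "leading",
# but the average over the whole trajectory favours "trailing".
traces = {
    "leading": [0.40, 0.60, 0.90],
    "trailing": [0.70, 0.70, 0.75],
}
print(pick_query_position(traces))
```

The toy numbers are chosen so that a single-step score and a trajectory-level score disagree, which is the kind of case where the two selection rules would route differently.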

06.
arXiv (CS.CL) 2026-06-16

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Social interaction depends on both language and visible social signals, such as facial expressions, posture, gaze, and emotional shifts. Yet existing social-agent benchmarks are largely text-based and rarely test whether multimodal agents can use visual cues to guide interaction. We introduce \textsc{\benchmarkname{}}, a benchmark evaluating visual social intelligence in multimodal social simulation. It contains 240 scenarios, 585 role instances, and 2,340 role-task instances, combining aligned textual-visual evidence, structured role profiles, and four role-level tasks: expression task, characteristic task, interaction regulation task, and interaction outcome task. Evaluating seven recent MLLMs under verbalized-vision and direct-vision reveals a clear gap between local role enactment and interaction management: role-specific expression and conflict handling are near saturation, whereas interaction regulation and visually grounded outcome achievement remain substantially more difficult. The code is released at https://github.com/JunsWan/AgentViSS, and the dataset is available at https://huggingface.co/datasets/JunsWan/AgentViSS.

07.
medRxiv (Medicine) 2026-06-18

Early-life Urban Environment, Nutrition, and Pubertal Timing in Southern Europe: An Exposome Analysis

Background: Urban environmental and lifestyle factors during early life may influence pubertal timing, but the combined effects of multiple environmental exposures within an exposome analytical framework remain poorly understood. Objective: To examine the association between early-life urban environmental exposures and pubertal timing, and to explore whether these exposures interact with early-life nutritional factors, namely breastfeeding duration and childhood diet quality. Methods: Data from two European population-based birth cohorts were analysed: Generation XXI (G21, Portugal; n=5263; 51.5% girls) and INfancia y Medio Ambiente (INMA, Spain; n=1019; 50.1% girls). Urban environmental exposures including indicators of air pollution, traffic, built environment, and natural spaces were estimated at 4 early-life stages in both cohorts: pregnancy (INMA only), birth, 1 year, and 4-5 years of age. Pubertal development timing was assessed using Tanner staging and/or the Pubertal Development Scale (PDS), and age at menarche was self-reported. Exposome-Wide Association Study (ExWAS) models and unsupervised clustering followed by ordinal logistic regression models were used to examine single- and multi-exposure associations, respectively. Regression models were fitted adjusting for relevant child characteristics, maternal factors, and household socioeconomic conditions, and corrected for multiple testing. Results: Individuals living in more unfavourable urban environments characterised by higher building density, air pollution, and lower access to natural spaces showed earlier pubertal timing according to multiple outcomes, across multiple early-life exposure periods, and in both cohorts.
In the G21 cohort, these environmental profiles were associated with earlier age at menarche, particularly for exposures at 1-1.5 and 4-5 years (e.g., 1-1.5y: β=-0.172, FDR-adjusted p-value=0.041), while in the INMA cohort, boys exposed to more unfavourable environmental profiles showed more advanced pubertal development, also particularly for exposures at 1-1.5 and 4-5 years of age (e.g., 1-1.5y: β=0.572, FDR-adjusted p-value=0.008). Among environmental domains, air pollution and traffic were the factors most consistently associated with pubertal timing. Regarding early-life nutritional factors, longer duration of exclusive breastfeeding was associated with a lower Tanner stage among girls in G21. No significant interactions between breastfeeding duration and environmental exposure clusters were observed. Conclusion: Early-life urban environmental exposures, particularly air pollution and traffic, may influence pubertal timing. Exclusive breastfeeding may have a protective role against earlier pubertal development. These findings highlight the importance of improving urban environmental conditions and promoting breastfeeding to support healthy developmental trajectories.

08.
arXiv (CS.CV) 2026-06-16

Question-Aware Evidence Ledgers for Video Relational Reasoning

The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
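
The conservative gate described here can be illustrated with a short sketch. The data format (each evidence source reports the set of answer options it is consistent with) and the `min_sources` threshold are assumptions made for this example, not details from the paper.

```python
def conservative_gate(current_answer, evidence, min_sources=2):
    """Keep the current answer unless evidence uniquely supports another option.

    evidence is a list of sets; each set holds the options one independent
    evidence source is consistent with. min_sources is a hypothetical
    threshold on how many decisive sources must agree before switching.
    """
    # Only sources that single out exactly one option count as decisive.
    decisive_votes = [next(iter(options)) for options in evidence if len(options) == 1]
    candidates = set(decisive_votes)
    # Switch only if every decisive source points at the same, different option.
    if len(candidates) == 1:
        candidate = candidates.pop()
        if candidate != current_answer and len(decisive_votes) >= min_sources:
            return candidate
    return current_answer

print(conservative_gate("A", [{"B"}, {"B"}, {"A", "B"}]))  # B: two decisive sources agree
print(conservative_gate("A", [{"B"}, {"C"}]))              # A: decisive sources conflict
print(conservative_gate("A", [{"B", "C"}]))                # A: no source is decisive
```

The default outcome is always the solver's original answer, so ambiguous or conflicting tool output cannot override it.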

09.
arXiv (CS.CV) 2026-06-15

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital-twin-driven robotic sorting system that integrates grasp prediction, multimodal perception, and semantic reasoning for real-world textile classification. A dual-arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state-of-the-art Visual Language Models (VLMs). We benchmark nine VLMs from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per-class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9%), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed-accuracy trade-offs for edge deployment. A digital twin combined with MoveIt enables collision-aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

10.
bioRxiv (Bioinfo) 2026-06-11

EditorForge: An Active-Site-Aware Framework for Inverse-Folding-Based Protein Redesign

Inverse-folding models can rapidly generate protein sequences compatible with a supplied backbone, but unconstrained redesign is poorly suited to enzyme and genome-editor-associated domains, where catalytic, substrate-proximal, and conserved structural regions must remain protected. In this paper, we present EditorForge, a modular constraint-and-audit suite for editor-domain protein redesign that wraps fixed-backbone inverse folding with explicit design masks, fixed-position enforcement, active-site-proximity auditing, active-site-shielded regeneration, and downstream structural quality control. Using full-length Moloney murine leukemia virus reverse transcriptase structure 4MH8 (MMLV RT 4MH8) as a demonstration target, EditorForge first restricted redesign to a bounded 25-position envelope while fixing 428 residues. An initial audit detected active-site-proximal failure modes despite fixed-position integrity. The Active Site Shield module then removed five unsafe design positions, replaced them with lower-contact alternatives, and regenerated candidates under stricter constraints. Post Shield Audit evaluated 24 regenerated candidates, all of which satisfied the hard sequence/mask and active-site-shield constraints. For the eight candidates that were selected or returned for structure-prediction/refolding quality control, Enhanced RefoldQC found that all eight predicted structures passed the computational structure-QC screen, with global Cα RMSD values of 1.2061–1.5555 Å, active-site Cα RMSD values of 0.4098–1.8397 Å, mutation-neighborhood Cα RMSD values of 1.3155–1.6848 Å, and average pLDDT-like confidence values of 94.87–95.11.
In short, EditorForge provides a reproducible triage layer that converts general inverse-folding output into constrained and editor-specific candidate sets for downstream structural and biological review on top of existing structural prediction tools.

11.
arXiv (quant-ph) 2026-06-16

Enhancing Quantum Machine Learning with Anyons

arXiv:2606.16090v1 Announce Type: new Abstract: The power of quantum computing and quantum machine learning relies on harnessing uniquely quantum phenomena as computational resources. While superposition, coherence and entanglement have been central to this effort, the role of particle exchange statistics remains largely unexplored. Here, we introduce a quantum kernel framework that unifies bosonic, fermionic, and anyonic (fractional) exchange statistics within a single learning paradigm. We study this family of kernels from three perspectives. At the representation level, Haar-averaged effective-dimension analysis shows that fractional exchange phases access feature-space directions inaccessible to the purely symmetric or antisymmetric limits. At the level of kernel geometry, the corresponding Gram matrices show greater separation from the distinguishable-particle baseline and reduced label-dependent model complexity. Finally, on learning benchmarks, anyonic kernels consistently outperform their bosonic and fermionic counterparts, with stronger target alignment and more favorable class geometry. Together, these findings show that exchange statistics reshape the structure and geometry of quantum feature space, leading to enhanced learning performance. Our work identifies particle exchange statistics as an overlooked computational ingredient for quantum machine learning and provides the first systematic comparison of quantum learning models across exchange phases.

12.
medRxiv (Medicine) 2026-06-18

Can Vision-Language Models See the Vital Signs? Benchmarking and Fine-Tuning for Intraoperative Monitor Reading

Background: Vital-sign deterioration is a leading contributor to preventable perioperative death, yet manual monitor reading is intermittent, error-prone, and subject to alarm fatigue. Automating this perceptual step could enable continuous surveillance, but existing solutions depend on device-specific hardware integration or cloud-hosted vision-language models (VLMs), which raise privacy, cost, and connectivity barriers in resource-limited healthcare facilities. Methods: We constructed a benchmark of 200 in-the-wild intraoperative monitor photographs (spanning multiple vendors, angles, and illumination conditions) annotated for eight vital-sign parameters: heart rate, SpO2, ETCO2, respiratory rate, systolic/diastolic/mean blood pressure, and temperature. We evaluated an optical character recognition (OCR)-based pipeline, nine instruction-tuned VLMs (four commercial, five open-weight ranging from ≤4B to 31B parameters) under two prompting regimes, and a compact open model (Qwen3.5-9B) adapted via low-rank fine-tuning (LoRA, 0.46% of parameters updated). Results: Under a domain-aware prompt, frontier VLMs reached 0.98-0.997 exact-match accuracy zero-shot, whereas the OCR pipeline and ≤4B model scored approximately 0.20 lower, defining a 9B-class usable floor. LoRA fine-tuning Qwen3.5-9B on 80-120 images raised accuracy from 0.953 to 0.994 (statistically indistinguishable from the best commercial model) and reduced the critical-error rate fivefold (0.0313 → 0.0063). Ablations showed that performance saturated at 80 training images and rank-8 adapters. Conclusion: Monitor reading is a solved perception problem for VLMs above the 9B scale. A lightweight fine-tuned open model achieves frontier accuracy while running entirely on local hardware, preserving data privacy, offline capability, and near-zero marginal cost. Residual errors stem from blood-pressure source ambiguity and are addressable with explicit disambiguation logic.

13.
Nature (Science) 2026-06-15

Daily briefing: Iron-Age human bones were made into tools before interment

Newly uncovered bones hint at how Iron Age Britons treated their dead. Plus, AI models have failed to beat human mathematicians at research-level problems and the everyday items that make great scientific tools.

14.
arXiv (CS.AI) 2026-06-18

Towards Understanding What State Space Models Learn About Code

arXiv:2602.06774v2 Announce Type: replace Abstract: State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.

15.
arXiv (CS.LG) 2026-06-15

Ensembling Sparse Autoencoders

arXiv:2505.16077v2 Announce Type: replace Abstract: Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we introduce and formalize SAE ensembles. Furthermore, we propose to ensemble multiple SAEs through naive bagging and boosting. In naive bagging, SAEs trained with different weight initializations are ensembled, whereas in boosting, SAEs trained sequentially to minimize the residual error are ensembled. Theoretically, naive bagging and boosting are justified as approaches to reduce reconstruction error. Empirically, we evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that, compared to an expanded SAE that matches the number of features in the ensemble, ensembling SAEs improves the reconstruction of language model activations along with SAE stability. Additionally, on downstream tasks such as concept detection and spurious correlation removal, SAE ensembles achieve better performance, showing improved practical utility.
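
The two ensembling schemes can be sketched at the level of reconstructions. The sketch assumes, beyond what the abstract states, that a bagged ensemble averages its members' reconstructions and that a boosted ensemble sums them, with each member applied to the residual left by the earlier ones. The members below are toy callables standing in for trained SAEs; all names are hypothetical.

```python
def bagging_reconstruct(members, activation):
    """Average the reconstructions of independently trained members."""
    reconstructions = [member(activation) for member in members]
    return [sum(values) / len(reconstructions) for values in zip(*reconstructions)]

def boosting_reconstruct(members, activation):
    """Sum reconstructions, feeding each member the residual left so far."""
    total = [0.0] * len(activation)
    residual = list(activation)
    for member in members:
        piece = member(residual)
        total = [t + p for t, p in zip(total, piece)]
        residual = [a - t for a, t in zip(activation, total)]
    return total

def squared_error(target, estimate):
    return sum((a - b) ** 2 for a, b in zip(target, estimate))

# Toy stand-ins for trained SAEs: each one can only reproduce a single coordinate.
def keep_coordinate(index):
    return lambda vec: [v if i == index else 0.0 for i, v in enumerate(vec)]

activation = [3.0, 4.0, 5.0]
boosted = [keep_coordinate(0), keep_coordinate(1)]
print(squared_error(activation, boosting_reconstruct(boosted[:1], activation)))
print(squared_error(activation, boosting_reconstruct(boosted, activation)))
```

In this toy setting the second boosted member removes the part of the residual it can represent, so the reconstruction error drops from the first printed value to the second.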

16.
arXiv (CS.CL) 2026-06-16

TMASC: Transmasculine Attitude and Speech Corpus

We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

17.
arXiv (CS.AI) 2026-06-17

Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

arXiv:2606.17915v1 Announce Type: cross Abstract: Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.

18.
medRxiv (Medicine) 2026-06-10

Transcriptomic Architecture of Type 2 Diabetes in Human Pancreatic Islets: An Integrative Meta-Analysis and Machine Learning Framework for Biomarker Discovery

Background. Type 2 diabetes mellitus (T2D) is defined by progressive pancreatic β-cell dysfunction whose molecular underpinnings remain incompletely understood. Single-cohort transcriptomic analyses of donor islets have yielded heterogeneous gene lists of limited cross-study reproducibility, constraining both mechanistic interpretation and biomarker development. Methods. We combined two complementary analytical strategies applied to four public human islet transcriptomic cohorts (GSE25724, GSE20966, GSE38642, and GSE164416; n = 7-57 donors per contrast). For the integrative arm, three microarray datasets and one bulk RNA-seq dataset were processed independently and unified through gene-level random-effects meta-analysis, hallmark pathway scoring (GSVA/MSigDB), and iterative module refinement, yielding a two-axis disease framework. For the diagnostic arm, a consensus multi-method machine learning pipeline, combining LASSO penalized logistic regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest importance scoring, was applied to 184 differentially expressed genes from the RNA-seq cohort, with all normalization steps performed within leave-one-out cross-validation (LOOCV) folds to prevent data leakage. Machine learning classification of the RNA-seq cohort was additionally subjected to external transportability testing in the independent bulk human islet RNA-seq cohort GSE50244 using an overlap-restricted reduced score and a threshold fixed in the discovery cohort. Results. Meta-analysis across all four cohorts identified 337 high-confidence T2D-associated genes (96.1% directional concordance in beta-cell-enriched tissue).
These were distilled into two refined 14-gene modules: ImmuneStress (MICB, HLA-DRA, HLA-DPA1, IL1R2, and others) and BetaCellIdentitySecretion (RASGRP1, PPP1R1A, SLC2A2, and others), whose composite IsletDysfunctionScore provided the most stable cross-platform separation of non-diabetic from T2D islets (Hedges' g = 1.80, $p = 9.83 \times 10^{-17}$, $I^2 = 0\%$). Consistent with progressive disease, IsletDysfunctionScore increased monotonically from non-diabetic to impaired glucose tolerance to T2D. Separately, the machine learning pipeline derived a 10-gene diagnostic panel (GABRA2, SLC2A2, ARG2, DKK3, PRIMA1, TAFA4, HHATL, PARVG, RNU1-70P, and the novel lncRNA ENSG00000284653) that achieved perfect discrimination in LOOCV (AUC = 1.000, sensitivity = 1.000, specificity = 1.000, zero misclassifications across all 57 donors). A leakage-verification experiment confirmed that this performance reflected genuine biological signal: global quantile normalization prior to cross-validation collapsed AUC to 0.380. External testing showed that 8 of the 10 panel genes were measurable in GSE50244. The frozen 8-gene reduced score retained strong discrimination (external AUC = 0.907), with 6 of 8 genes preserving directional concordance, but the discovery-derived threshold did not transfer because the external score distribution was shifted upward and compressed, yielding complete sensitivity but zero specificity at the frozen cutoff. Conclusions. Integrating pathway-level meta-analysis with machine learning classification, we present a coherent two-axis model: immune/stress activation and loss of beta-cell identity/secretory competence, together with a compact, biologically interpretable 10-gene diagnostic signature. Panel genes converge on GABA signaling, glucose transport, arginine metabolism, WNT pathway inhibition, and a novel lncRNA, providing both mechanistic hypotheses and high-priority targets for external validation.
These findings offer a reproducible transcriptomic scaffold for future mechanistic, biomarker, and clinical translation studies of human islet dysfunction. They also support external transportability of the core biological signal, while indicating that absolute operating thresholds are cohort-dependent and would require recalibration before deployment in independent datasets.
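
The leakage-avoidance step mentioned in the Methods (normalization fitted inside each leave-one-out fold) can be illustrated with a small sketch. It deliberately swaps in simpler components than the study used: z-scoring stands in for the study's normalization and a nearest-centroid rule stands in for the LASSO, SVM-RFE, and Random Forest pipeline. All function names and the toy data are hypothetical.

```python
from statistics import fmean, pstdev

def fit_zscore(rows):
    """Per-feature mean and standard deviation, estimated from training rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against zero variance
    return means, sds

def apply_zscore(row, means, sds):
    return [(x - m) / s for x, m, s in zip(row, means, sds)]

def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier: predict the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        members = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [fmean(col) for col in zip(*members)]
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], query))

def loocv_accuracy(features, labels):
    """Leave-one-out CV with normalization fitted inside each fold."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # The held-out sample never influences the normalization parameters.
        means, sds = fit_zscore(train_x)
        train_z = [apply_zscore(row, means, sds) for row in train_x]
        query_z = apply_zscore(features[i], means, sds)
        correct += nearest_centroid(train_z, train_y, query_z) == labels[i]
    return correct / len(features)

features = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [5.0, 3.0], [5.2, 3.1], [4.9, 2.9]]
labels = [0, 0, 0, 1, 1, 1]
print(loocv_accuracy(features, labels))
```

The key design point is that `fit_zscore` is called on `train_x` inside the loop; computing it once on all samples before the loop would let the held-out sample shape its own preprocessing.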

19.
arXiv (CS.LG) 2026-06-17

Amortized Probabilistic Retrieval of Atmospheric CO2 from OCO-2 Spectra Using Deep Learning with Laplace Approximations and Normalizing Flows

arXiv:2606.17413v1 Announce Type: new Abstract: Space-based monitoring of atmospheric carbon dioxide (CO2) is essential for constraining the global carbon budget. NASA's Orbiting Carbon Observatory-2 (OCO-2) estimates column-averaged dry-air mole fractions of CO2 (XCO2) using high-resolution spectra. However, current operational retrieval algorithms are computationally expensive and do not properly quantify uncertainties. We present a novel deep learning framework that addresses these challenges. Given the difficulty of obtaining ground-truth data for real satellite observations, we develop and validate our approach using a high-fidelity simulation dataset. This dataset, created to support OCO-2 uncertainty quantification (UQ), incorporates realistic forward model errors. Our architecture encodes spectral bands using a multi-branch neural network and estimates posteriors of the full CO2 column or desired summaries thereof using two scalable UQ methods: Laplace approximations and normalizing flows. Our approach has five key advantages relative to operational "full-physics" solvers: (1) Amortization: Inference is orders of magnitude faster, enabling real-time processing of massive data streams; (2) Model error robustness: By training on simulations that explicitly include model discrepancies, our method accounts for systematic errors often neglected by standard inversions; (3) Point estimate accuracy: We achieve superior predictive accuracy compared to baseline methods; (4) Improved UQ: The probabilistic outputs yield better-calibrated uncertainty estimates; and (5) Non-Gaussian posteriors: When utilizing normalizing flows, our framework successfully models complex, asymmetric posterior distributions, overcoming the limitations of the Gaussian assumption. These results suggest that simulation-based deep learning is a viable path toward next-generation operational processing systems.

20.
arXiv (CS.CV) 2026-06-17

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.

21.
arXiv (CS.AI) 2026-06-12

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

arXiv:2606.12500v1 Announce Type: cross Abstract: Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.
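As a rough illustration of the conflict-detection step described above, the following sketch computes a two-dimensional Time-to-Collision under simplifying assumptions that are not taken from the paper: each vehicle is approximated by a disc, and both vehicles keep a constant velocity. The function name `ttc_2d` and the disc approximation are hypothetical; the authors' actual metric may treat vehicle geometry differently.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float]


def ttc_2d(p1: Vec, v1: Vec, p2: Vec, v2: Vec,
           r1: float, r2: float) -> Optional[float]:
    """Time until two discs first touch, assuming constant velocities.

    Returns None when the discs never touch on their current courses.
    The disc shape is an illustrative assumption, not the paper's model.
    """
    # Position and velocity of vehicle 2 relative to vehicle 1.
    px, py = p2[0] - p1[0], p2[1] - p1[1]
    vx, vy = v2[0] - v1[0], v2[1] - v1[1]
    contact = r1 + r2

    # Solve |p + v t|^2 = contact^2, a quadratic a t^2 + b t + c = 0.
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - contact * contact

    if c <= 0.0:
        return 0.0      # already touching or overlapping
    if a == 0.0:
        return None     # no relative motion, the gap never closes
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None     # paths pass each other without contact
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # earlier of the two roots
    return t if t >= 0.0 else None          # negative root: moving apart


# Head-on approach: gap of 10 m, closing speed 2 m/s, radii 1 m each.
print(ttc_2d((0, 0), (1, 0), (10, 0), (-1, 0), 1.0, 1.0))
```

In a pipeline of the kind the abstract outlines, the minimum of such values over an encounter would be the quantity passed on to an extreme value model; that step is omitted here.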

22.
arXiv (CS.CV) 2026-06-18

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

23.
arXiv (CS.AI) 2026-06-12

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

arXiv:2606.12852v1 Announce Type: new Abstract: Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become performance bottlenecks due to repeated execution failures. We argue that a key limitation is not only the lack of episodic memory, but also the decoupling of what-where-when memory from which-why reasoning. To address this, we propose WISE (Which-Why Informed Semantic Explorer), a long-horizon agent framework with an enhanced low-level controller equipped with a Causal Event Graph that augments episodic memory with explicit causal structure linking observations to task relevance. Unlike prior work such as MrSteve, which relies on feature similarity for retrieval, WISE enables robust recall under viewpoint changes and supports opportunistic task reordering through causal reasoning. Building on this memory, we propose an Opportunistic Task Scheduler that dynamically re-prioritizes subtasks when causally relevant opportunities are detected. We further equip WISE with a multi-scale progressive exploration strategy to provide spatially comprehensive observations for downstream reasoning. Experiments show that WISE substantially improves task success and efficiency on long-horizon sparse tasks, particularly in settings requiring adaptive decision-making.

24.
arXiv (CS.LG) 2026-06-12

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

arXiv:2606.13338v1 Announce Type: new Abstract: Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
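To make two of the metric families named above concrete, here is a minimal sketch of the standard sample-based CRPS estimator and an empirical CVaR at level alpha. These are textbook definitions, not PowerPhase's implementation: the abstract does not specify how the benchmark computes CVaR-alpha or CRPS, and the function names are hypothetical.

```python
import math
from typing import Sequence


def crps_ensemble(samples: Sequence[float], y: float) -> float:
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    spread_to_obs = sum(abs(x - y) for x in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return spread_to_obs - 0.5 * spread_within


def cvar(losses: Sequence[float], alpha: float) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses."""
    n = len(losses)
    # Rounding guards against float noise such as (1 - 0.8) * 10 != 2 exactly.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


# Three scenario samples for one channel, observed value 2.0.
print(crps_ensemble([1.0, 2.0, 3.0], 2.0))
# Ten constraint-violation losses, tail level alpha = 0.8.
print(cvar([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.8))
```

CRPS rewards scenario sets that are both sharp and centred on the observation, while CVaR looks only at the worst tail, which is one way a model can rank well on one and poorly on the other.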

25.
arXiv (CS.AI) 2026-06-16

Runtime Analysis of Cartesian Genetic Programming in Evolving Boolean Functions

arXiv:2606.15923v1 Announce Type: cross Abstract: Cartesian Genetic Programming (CGP) is among the most practical and popular forms of Genetic Programming as it uses a graph-based representation of programs. This paper presents a first runtime analysis of CGP in evolving Boolean functions using complete training sets. We prove an asymptotic bound $O(n D^5)$ for the expected number of fitness evaluations of CGP to construct a conjunction of $n$ inputs using at most $D \geq n-1$ binary gates, a minimal function set, and even with strict survival selection. When non-strict selection is used, the bound is improved to $O(n D^4)$. Our analysis reveals interesting characteristics of CGP-induced search, which have so far been observed only empirically. In particular, enabling the acceptance of equally good solutions, including those with connected gates non-contributing to fitness, can lead to a speedup, and consequently a better asymptotic time bound. In contrast to conjunctions, we also prove a negative result which shows that CGP requires exponential time to evolve an exclusive disjunction. Experiments evolving conjunctions complement our theoretical findings. The use of incomplete training sets is found to further reduce the average number of fitness evaluations while maintaining a good level of generalisation.
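The setting analysed above can be pictured with a small sketch of how a CGP genome might be evaluated against a complete training set, that is, every row of the truth table. The encoding below (a feed-forward list of `(function, input, input)` genes plus one output gene) and the function set {AND, OR} are assumptions for illustration; the abstract does not state the paper's exact representation or which minimal function set it uses.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# Assumed function set; index 0 is AND, index 1 is OR.
FUNCS: List[Callable[[bool, bool], bool]] = [
    lambda a, b: a and b,
    lambda a, b: a or b,
]

# One gate gene: (function index, first input index, second input index).
# Indices refer to program inputs first, then to earlier gates.
Gate = Tuple[int, int, int]


def evaluate(genome: Sequence[Gate], output: int,
             inputs: Sequence[bool]) -> bool:
    """Run the feed-forward gate list and return the selected output value."""
    values = list(inputs)
    for func, i, j in genome:
        values.append(FUNCS[func](values[i], values[j]))
    return values[output]


def fitness(genome: Sequence[Gate], output: int, n: int,
            target: Callable[[Sequence[bool]], bool]) -> int:
    """Number of the 2^n truth-table rows on which the genome is correct."""
    return sum(
        evaluate(genome, output, row) == target(row)
        for row in product([False, True], repeat=n)
    )


# Conjunction of n = 3 inputs built from D = n - 1 = 2 AND gates.
and_genome = [(0, 0, 1), (0, 3, 2)]   # gate 3 = x0 AND x1, gate 4 = gate 3 AND x2
print(fitness(and_genome, output=4, n=3, target=all))   # all 8 rows correct
```

A mutation-and-selection loop would repeatedly perturb the genes and keep the offspring when its fitness is strictly better (strict selection) or at least as good (non-strict selection), which is the distinction the two runtime bounds refer to.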