Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
bioRxiv (Bioinfo) 2026-06-16

THEOBROMA: an aggregated open database of 1.13 million natural products with per-compound license auditing, three-tier classification, and stereochemistry-aware deduplication

Natural products remain one of the most productive sources of pharmacologically active compounds for drug discovery, yet the current open aggregator landscape attributes licenses at database rather than compound granularity, with consequences that have become tangible as the field grows. A recent relicensing event in one constituent source (the September 2024 transition of the Natural Products Atlas to CC BY-NC 4.0) demonstrates how database-level licensing propagates across an aggregate and motivates the per-compound audit framework presented here. The same peer cohort separately leaves classification provenance and stereoisomer-family relations coarser than either layer warrants. THEOBROMA, accessible at url{https://theobroma.l3s.uni-hannover.de}, integrates 1{,}133{,}004 natural products from 29 open sources under a per-compound license audit that resolves each compound's license tier across all attesting sources under a most-restrictive-wins rule, identifying 900{,}170 compounds (79.4%) under open-use licenses and exposing the per-source attestation chain and resolved tier through a dedicated audit endpoint and a query-time license filter. A three-tier classification stratifies 89.3% coverage into 35.1% curated, 43.9% high-confidence inferred, and 10.3% exploratory tiers, with 486{,}215 stereoisomer families preserved by full 27-character InChIKey deduplication and exposed via a dedicated texttt{/api/stereoisomers/} endpoint and a radial-family display. Per-compound license provenance is the primary differentiator. Classification stratification and stereoisomer-family exposure add finer-grained access to two related axes, supporting license-compatible virtual screening and isomer-specific bioactivity analysis at corpus scale. As an evolving open resource, THEOBROMA pairs continuous pipeline maintenance with interactive geographic, taxonomic, and chemical-space exploration.

02.
arXiv (CS.AI) 2026-06-24

Reentrant value fields as delayed coupled reaction-diffusion systems on finite graphs

arXiv:2605.03940v4 Announce Type: cross Abstract: We describe a dynamical system in which a symbolic field is coupled to a geometric field via a bipartite Hilbert-Schmidt kernel. The system is fully described by a retarded functional differential equation (RFDE) on the history space, subject to Lipschitz and small gain conditions. We show that the RFDE is well-posed under constant input and that it admits a compact global attractor. The principal subsystem $(H_L, X_R, P)$, which is comprised of the two primary fields as well as an executive field, is shown to be globally stable independent of delay, provided that the interfield coupling satisfies $C_{\mathcal{K}}^2

03.
arXiv (CS.AI) 2026-06-12

GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns

arXiv:2606.12419v1 Announce Type: cross Abstract: Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interactions. This limits the development of AI tutors that can teach in visually grounded ways used by human instructors. Thus, we introduce GeoDial, a multimodal tutoring dataset of over 1.3K teacher-student dialogs in the domain of geometry collected from experienced math teachers, where instructional turns are explicitly grounded in diagram highlights. We propose a scalable annotation protocol that integrates dialog acts, visual highlighting, and feedback, enabling fine-grained supervision of both language and visual tutoring behavior. To illustrate the challenges posed by this setting, we fine-tune several vision-language models on GeoDial and evaluate their ability to generate tutoring utterances and diagram highlights. While supervised fine-tuning substantially improves the quality of generated dialog, it struggles to produce accurate diagram highlights, revealing a key limitation of current methods and highlighting the need for approaches that more effectively integrate visual reasoning with pedagogical interaction.

04.
arXiv (CS.LG) 2026-06-18

Be Your Own Teacher: Steering Protein Language Models via Unsupervised Reward Optimization

arXiv:2606.18961v1 Announce Type: new Abstract: Protein language models (PLMs) have emerged as powerful tools for controllable biomolecular design, yet their post-training adaptation typically relies on costly wet-lab validation or curated preference datasets. To overcome this supervision bottleneck, we introduce unsupervised reward optimization of PLMs, a comprehensive framework for steerable protein generation without ground-truth labels. Our key insight is that task-agnostic rewards, which combine intrinsic model uncertainty with extrinsic semantic consistency informed by protein representation models, exhibit strong correlation with controllability measures across base models and temperature regimes. Building upon this discovery, we propose two offline algorithms: Soft Reward Optimization (SRO) and Binarized Reward Optimization (BRO), which effectively maximize the classical RLHF objective induced by these proxy rewards. Extensive experiments on compositional out-of-distribution prompts demonstrate that both methods significantly outperform competitive baselines (DPO, KTO), while approaching oracle performance across multiple sampling temperatures, model scales and protein families. Moreover, PLMs fine-tuned with unsupervised rewards can achieve consistently higher coverage compared to their base model in pass@k evaluations. By enabling self-improvement of PLMs through their own generated experience, our framework provides a scalable pathway toward controllable biomolecular design in settings where labeled preferences or experimental feedback are scarce or unavailable.

05.
arXiv (CS.CL) 2026-06-15

Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.

06.
arXiv (CS.CL) 2026-06-16

Enhancing LLM Safety Through a Theoretical Minimax Game Lens

The rapid advancement of large language models (LLMs) necessitates effective mechanisms to ensure their responsible deployment by accurately distinguishing unsafe content from benign content. While substantial safety datasets are available in English, multilingual safety modeling remains underexplored due to limited open-source safety datasets in other languages. Even within English datasets, safe yet sensitive corner-case content is scarce, leading to shortcut learning by models and non-trivial false-positive rates. To mitigate these issues, we introduce a novel minimax reinforcement learning (RL) framework wherein a data generator and a classifier model co-evolve, facilitating the production of high-quality synthetic multilingual safety data. We theoretically formalize this interaction as a minimax game and rigorously demonstrate convergence to a Nash equilibrium. Empirical evaluations confirm that our synthetic data generation method significantly enhances the classifier model performance, enabling a substantially smaller model to surpass the state-of-the-art by nearly 10% on English benchmarks while achieving 4.5x faster inference speed. These results establish a scalable and efficient methodology for synthetic data generation, advancing the development of safer and more robust multilingual LLM deployments.

07.
arXiv (quant-ph) 2026-06-12

New bounds on private simultaneous quantum message passing

arXiv:2606.12557v1 Announce Type: new Abstract: In the private simultaneous message (PSM) setting, $k$ players obtain inputs $x_i\in\{0,1\}^n$ and then each send messages to a referee, who should learn $f(x_1,...,x_k)$ but no other information about $(x_1,...,x_k)$. The PSM setting was introduced as a minimal model for secure multiparty computation and has connections to Boolean function complexity. In the quantum setting, PSM has been related to non-local quantum computation (NLQC). The communication and correlation cost of implementing PSM remains poorly understood. Here, we give new upper and lower bounds on the (quantum) PSM model. For lower bounds, we show: 1) Nečiporuk's measure lower bounds the entanglement required for $k$-player quantum PSM with perfect correctness. This leads to quadratic lower bounds for explicit functions. 2) The rank of the communication matrix of $f(x_1,x_2)$ lower bounds 2-player quantum PSM with perfect privacy but imperfect correctness. This implies a previously unknown lower bound on classical PSM with imperfect correctness. When allowing quantum communication and shared entanglement, these are the first lower bounds on quantum PSM that make use of the privacy condition. For upper bounds, we show: 1) Letting $s$ be the size of a quantum circuit computing $f$, $d_f$ be the circuit depth, $k$ the number of players, $n$ the number of bits received by each player, and $\epsilon$ a correctness parameter, we obtain $\mathsf{PSM}_k^*(f) \leq (kn +s) \cdot \log^{O(d_f)}(s/\epsilon)$. 2) The square of the Fourier 1 norm of $f$, $\Vert \hat{f}\Vert_1^2$, upper bounds the classical PSM complexity, $\mathsf{PSM}(f)\leq O(\Vert \hat{f} \Vert^2_1)$. In proving the first upper bound, we generalize existing $T$-depth based techniques for NLQC from $2$ to $k\geq 2$ parties, and consider cases where the Clifford layers are restricted to having small light cones.

08.
arXiv (CS.AI) 2026-06-18

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

arXiv:2606.18561v1 Announce Type: cross Abstract: The quality of recorded data depends on the stability of the sensor system that acquires it. Sensor motion and aging can degrade the performance and stability of downstream data-driven methods. We present a Wasserstein-GAN-inspired approach for unsupervised inference of physically interpretable transformation parameters that map a changed detector response distribution back to a nominal reference distribution. In contrast to standard generative modeling, the generator is used as a learnable calibration transformation whose trainable weights represent the sought parameters, while the critic provides a distributional distance signal via the Wasserstein objective. We validate the approach on a tracking-detector toy model with controlled layer shifts and demonstrate its application on high-granularity Geant4-simulated calorimeter data with cell-wise aging effects. The method recovers aging coefficients for individual cells with correlation to ground truth and improves agreement between calibrated and reference energy-sum distributions, while exhibiting the expected degradation at increasing channel-to-channel noise levels. These results indicate that adversarial distribution matching can serve as a data-driven component of calibration strategies in settings where direct labels for degradation parameters are unavailable.

09.
medRxiv (Medicine) 2026-06-10

Developing a Unified Criminal Justice Pathway into Drug and Alcohol Treatment from Police Custody: A Public Health Service Evaluation and Pathway-Design Project in Blackpool, United Kingdom

Introduction: Blackpool, England's most deprived local authority, has the highest drug-related death rate in the country. People in police custody with problem substance use are a key Core20PLUS5 inclusion-health group, yet referral from the police into structured drug and alcohol treatment is fragmented and relies heavily on self-report. We evaluated the current police-to-treatment route in Blackpool and designed an evidence-informed unified pathway. Materials and Methods: A mixed-methods service evaluation and pathway-design project was conducted during a six-month General Practice / Public Health rotation. Routinely collected referral data from Horizon (the local specialist drug and alcohol service) covering the 47-month period from December 2019 to October 2023 were analysed. Findings were triangulated with national policy, the Project ADDER and Liaison and Diversion evaluations, and the international evidence on police-led pre-arrest diversion. Results: Of 5,900 total referrals into Horizon over 47 months, only 269 (4.56%) originated from the police. Police referrals accounted for fewer than 5% of monthly referrals in 30 of 47 months, for 5 to 9.9% in 16 months, and for >/= 10% in only one month (10.8%, December 2022). Blackpool recorded 76 drug-misuse deaths in 2019-21 (19.4 per 100,000, approximately four times the England rate). A six-step unified pathway is proposed: Initiate Referral (opt-out, from ADDER Police and Liaison and Diversion); Initial Assessment; Tailored Treatment Plan; Continuous Support; Collaboration and Monitoring; and Evaluation and Adjustment. Conclusions: Police contact is markedly under-used as a gateway to treatment despite Blackpool having the highest drug-related mortality in England. An opt-out, multi-agency pathway anchored in Core20PLUS5 has the potential to narrow the treatment gap, reduce re-offending, and address the structural health inequalities that drive premature mortality.

10.
arXiv (CS.CV) 2026-06-16

Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges, 82.9%, a 0.8% drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.

11.
arXiv (CS.AI) 2026-06-19

Optimal Scheduling in a Question-Answering Forum of Knowledge Workers

arXiv:2606.19759v1 Announce Type: new Abstract: As individuals turn to the Internet to find answers to questions they may have, several Question Answering (QA) forums have evolved, where users knowledgeable in certain topics can contribute their expertise to answering these requests for information. While these are currently volunteer based, we consider a future version employing knowledge workers who are experts in certain topics. In such a system, the request-answer processes forming the queuing system may utilize schedulers that assign requests in different topics to the experts in the forum, who may be able to answer them according to their expertise levels in different topics. With this model, we calculate the capacity of the system for handling the requests while keeping the system stable, and design schedulers that achieve capacity. We also investigate how collaboration between experts in answering requests can potentially increase capacity.

12.
medRxiv (Medicine) 2026-06-12

Crimean-Congo haemorrhagic fever virus transmission: exploring perceptions of human-animal-tick interactions across six districts in Uganda

Crimean-Congo haemorrhagic fever virus (CCHFV) causes a viral zoonotic disease transmitted through tick bites and direct contact with infected blood or tissue of infected animals. Socio-ecological and behavioural risk factors for CCHFV exposure in Uganda remain poorly understood, which can lead to the omission of key risk factors in quantitative survey design and limit our wider understanding. In this study, we explored human-animal-tick interaction transmission risks in Uganda. We conducted 24 focus group discussions (FGDs) and 31 key-informant interviews (KIIs) across six environmentally and socio-ecologically diverse districts, between October 2023 and March 2024. Study sites were selected using K-prototype analysis, which combined environmental and socio-ecological variables to identify distinct clusters within Uganda. FGDs were conducted separately with groups of community leaders, men, women and teenagers with stratified purposive sampling. Medical doctors, veterinarians, traditional healers, district surveillance officers, and herdsmen were individually interviewed as key informants and purposively sampled. Data were transcribed and translated into English, and analysed thematically using iterative categorisation in NVivo 14. Most participants reported tick bites, some as frequently as every day. Close contact with animals was common, including sleeping next to them in the same building, largely due to concerns about animal theft. Less frequent but notable practices included slaughtering animals for consumption or sacrifice and interactions with wild animals during hunting. Slaughtering and butchering an animal which was sick or had died was reportedly performed by participants in most districts. Plucking and roasting engorged ticks was a practice described in the Kaabong and Arua districts of Northern Uganda. These practices and behaviours highlight potential key risks of CCHFV transmission and underscore the need for future studies to address specific behaviours, to quantify if, and to what extent, they present an exposure risk. Further work should include underlying reasons for the behaviours, which would help ensure that culturally appropriate interventions are targeted.

13.
PLOS Computational Biology 2026-06-05

StPedf: Cell trajectory inference of spatial transcriptomics via spatial proximity embedding and spatial density-adaptive fusion

作者:

by Yuan Zhang, Ziyan Sun, Zhixin Shi, Mengdi Nan, Yuhan Fu, Qing Ren, Jie Gao Spatial transcriptomics is transforming our multidimensional understanding of cellular spatial organization and its functional mechanisms in processes such as development and disease by systematically resolving the spatial heterogeneity of gene expression within tissues. To delve deeper into the dynamic processes underlying spatial expression patterns, spatial trajectory inference integrates genetic and spatial information to reconstruct the spatial developmental trajectories of cells within tissues. This approach reveals the patterns of differentiation and dynamic changes as cellular states evolve continuously along spatial axes. However, existing methods often struggle to uniformly model the complex, nonlinear interactions between high-dimensional gene expression and spatial coordinates. Here, we introduce StPedf, whose core lies in employing a neural network with a masking mechanism to capture complex nonlinear interactions between high-dimensional genes and spatial positions. It further leverages spatial proximity information as a guiding cue, dynamically and adaptively adjusting the embedding of gene and spatial information and the weighting of spatial proximity information based on spatial density. This enables trajectory inference guided by spatial information. This enables optimal transport to derive intercellular transition matrices, reconstruct cellular differentiation trajectories, and construct pseudo-spatiotemporal maps. StPedf demonstrates superior performance over existing methods on five structurally distinct simulated datasets. Using StPedf, we successfully mapped distinct lineages in the spatial trajectories of telencephalon regeneration in the Ambystoma mexicanum, multiple malignant lineages expanding within primary tumors, and developmental spatial trajectories and pseudo-spatiotemporal maps in human dorsolateral prefrontal cortex (DLPFC). StPedf significantly enhances the accuracy and interpretability of spatial trajectory inference, providing critical technical support for revealing the dynamic patterns of cellular fate transitions within tissue microenvironments.

14.
arXiv (math.PR) 2026-06-18

On two overlooked stick-breaking constructions of the normalized inverse Gaussian process

arXiv:2606.19306v1 Announce Type: new Abstract: We shed light on two alternative stick-breaking constructions of the normalized inverse Gaussian (NIG) random discrete distribution which appear to have been overlooked so far in the Bayesian nonparametric setting. The first is derived from a result in Aldous and Pitman (1998) for the conditional Brownian excursion partition, mixing over the local time at zero up to time one. The second arises as a particular case of a result in James (2013) for priors obtained by a random spatial and temporal change of the normalized generalized Gamma subordinator. Both constructions are in terms of straightforward transformations of standard random variables and can be easily generalized to provide the stick-breaking construction of any element, respectively, in a) the family of mixed Poisson-Kingman models driven by the $1/2$ stable Lévy measure and b) the family of Poisson-Gamma processes driven by the Inverse Gaussian subordinator.

15.
arXiv (CS.AI) 2026-06-18

Practical Anonymous Two-Party Gradient Boosting Decision Tree

arXiv:2605.26903v2 Announce Type: replace-cross Abstract: Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (Usenix Security' 23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

16.
arXiv (quant-ph) 2026-06-12

Quasi-local Edge Mode in XXX Spin Chain/Circuit with Interaction Boundary Defect

arXiv:2603.17835v2 Announce Type: replace-cross Abstract: We study the Heisenberg spin-1/2 model on a semi-infinite chain - or, equivalently, a trotterized unitary SU(2) symmetric six-vertex quantum circuit - with a boundary defect where the interaction between the two spins nearest the edge differs from that in the bulk. For sufficiently strong boundary interaction we explicitly construct a conserved operator quasi-localized near the boundary using a matrix-product ansatz. This quasi-local edge mode leads to non-decaying boundary correlation functions, corresponding to a nonzero boundary Drude weight. The correlation length of the edge mode diverges at a finite critical value of the boundary interaction, signaling a transition to ergodic boundary dynamics for subcritical interactions.

17.
medRxiv (Medicine) 2026-06-17

Accounting for Human Movement to Improve Exposure-Health Models

Background. Current exposure-health models rely on averaged, residential-based environmental exposures, failing to account for human movement. This aggregation can lead to exposure misclassification and biased exposure-response estimates, potentially distorting our understanding of the true health effects of environmental conditions. We developed exposure disaggregation regression models that explicitly account for human movement when linking environmental exposures to health outcomes. Methods. By weighting pixel-level exposures according to distance from home as a simple proxy for human movement, our model linked disaggregated environmental exposures to individual-level health outcomes. Weights were either fixed a priori or derived from a latent distance-decay power parameter learned from the data. We additionally evaluated model performance under a nonlinear exposure-response relationship. Model performance was assessed across multiple sample sizes (N = 1,114; 50,000; and 100,000). A simulation study examined parameter recovery using bias, empirical standard error (EmpSE), and credible interval coverage. As a case study, Demographic and Health Surveys (DHS) data from Albania were used to link acute respiratory infection (ARI) outcomes among children under five to pixel-level NDVI within a 3 km buffer around DHS cluster centroids, and the proposed models were applied to these data. Results. Across all models (fixed-weight, learned-weight, and restricted cubic spline models), parameter recovery improved with increasing sample size. At N = 1,114, estimates were biased and imprecise, with incorrect effect direction for exposure-response parameters (e.g., learned-weight {beta}1 bias = - 0.79; EmpSE = 2.61; coverage = 0.88). In contrast, the models accurately recovered parameters at larger sample sizes, including the latent distance-decay parameter (bias = - 0.02; EmpSE = 0.15; coverage = 0.95 at N = 100,000), demonstrating their ability to reliably learn movement-based exposure weights when sufficient data were available. Conclusion. Instead of relying on arbitrarily-sized buffers, this statistical framework provides a novel method for studying environmental exposure-health relationships whilst accounting for human movement. With sufficiently large sample sizes, it can accurately estimate the influence of disaggregated environmental exposures on individual-level health and help address exposure misclassification arising from residential-only metrics. This methodological framework remains scalable, interpretable, and adaptable to other exposures and outcomes, offering a foundation for future work that integrates richer mobility-informed exposure-health research.

18.
Science (Express) 2026-05-21

Observation of quantum vortex core fractionalization and skyrmion formation in a superconductor | Science

作者: 未知作者

Magnetic fields can penetrate a superconductor in the form of quantum vortices, which consist of a core singularity with circulating currents. London’s quantization implies that there is one core singularity per quantum of magnetic flux in single-component superconductors. Here, we report signatures of quantum vortex core fractionalization on the potassium-terminated surface of a multiband superconductor KFe 2 As 2 . The observed splitting of single integer-flux vortices into several fractional vortices results in a disparity between the numbers of flux quanta and vortex cores. These fractional vortices often arrange in chains, which calculations show are characterized by a ℂP 2 skyrmionic topological invariant; this constitutes a different type of topological defect: the chiral skyrmion. The disparate natures of integer and fractional vortices comprising skyrmions lead to distinct spectroscopic signatures.

19.
arXiv (CS.CL) 2026-06-11

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.

20.
arXiv (CS.CL) 2026-06-18

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

21.
arXiv (CS.AI) 2026-06-17

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

arXiv:2606.17073v1 Announce Type: cross Abstract: While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.

22.
arXiv (CS.LG) 2026-06-12

Disparate Impact in Synthetic Data Generation

arXiv:2606.13105v1 Announce Type: new Abstract: We revisit the fairness notion of disparate impact for synthetic data generation (SDG), that assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, that address the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

23.
medRxiv (Medicine) 2026-06-15

Poly-Social Risk for Hypertension Among Black and Latina Women

Background: Hypertension is a leading modifiable cardiovascular risk factor prominently influenced by health-related social needs (HRSN). Whether detailed information on HRSN can improve identification of hypertension among minoritized women is unknown. Methods: Black and Latina women aged 18-65 years completed the Centers for Medicare and Medicaid Services Accountable Health Communities Screening Tool, assessing 13 HRSN domains. Hypertension was ascertained by a validated EHR-based algorithm or self-report of hypertension. Logistic regression tested associations of HRSN with hypertension. LASSO regression with 10-fold cross-validation was used to derive a poly-social risk score in the training set (random 70%) and tested in the validation set (30%) against a sociodemographic model (age, race, income, education). Results: Among 1302 participants (mean [SD] age 40.1 [11.3] years, 70.4% Black, 44.3% Latina), higher cumulative burden of HRSN was associated with increased odds of hypertension (adjusted odds ratio [aOR] for each additional domain of HRSN: 1.07 [95% CI 1.01-1.14], P=0.02). Food insecurity (aOR 2.30 [1.37-3.87], P= 0.002), lapse in utilities (aOR 1.44 [1.04-1.96], P=0.02), poor concentration (aOR 1.57 [1.13-2.17], P=0.007), and social isolation (aOR 1.77 [1.14-2.73], P=0.01) were associated with hypertension. In the validation set, the poly-social risk score did not improve discrimination for hypertension vs. the sociodemographic model (AUC 0.76 [95% CI 0.71-0.81] vs. AUC 0.80 [0.75-0.85]). Conclusion: In this cross-sectional analysis of Black and Latina women, greater cumulative social disadvantage was associated with hypertension. While inclusion of HRSN did not improve hypertension prediction beyond conventional sociodemographic indices, findings may inform targeted interventions among minorities at cardiometabolic risk.

24.
arXiv (CS.AI) 2026-06-16

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

arXiv:2604.18827v2 Announce Type: replace-cross Abstract: Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling – even in the mouse visual cortex, a relatively simple system – models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

25.
arXiv (CS.AI) 2026-06-15

AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

arXiv:2606.14295v1 Announce Type: cross Abstract: Frontier AI systems are increasingly capable of cybersecurity tasks, including codebase inspection, vulnerability detection, and exploitation. However, evaluating their offensive capabilities remains constrained by limited access to open, reproducible, multi-host cyber ranges. Existing public benchmarks capture isolated skills such as CTF solving, vulnerability reproduction, and exploit generation, but often abstract away realistic intrusion workflows: discovering exposed services, gaining a foothold, collecting internal information, and expanding compromise across hosts. This gap makes it difficult to observe emerging risks early, because frontier AI systems are rarely evaluated under realistic attack conditions. We introduce AgentCyberRange, the first open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges. It combines 110 vulnerabilities across 15 real web applications and 8 enterprise-like cyber ranges with 156 internal hosts, plus Cage, a toolchain for execution, orchestration, result collection, and verification. The benchmark covers two core stages: web exploitation, where agents explore exposed applications and validate vulnerabilities, and post exploitation, where agents turn an initial foothold into broader internal compromise. We evaluate six frontier AI systems under matched prompts and budgets. GPT-5.5 with Codex performs best, solving 16.1% of web exploitation tasks and 31.7% of post-exploitation tasks; with more concrete hints, these rates increase to 33.0% and 46.3%. We also observe out-of-benchmark findings, including unknown vulnerabilities in popular projects, and payload mutation that bypasses host defenses. These results show that open cyber-range evaluation is necessary for observing emerging offensive capabilities under realistic and reproducible conditions.