Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-17

Evaluating Uplift Modeling under Structural Biases: Insights into Metric Stability and Model Robustness

arXiv:2603.20775v2 Announce Type: replace Abstract: In personalized marketing, uplift models estimate the incremental effect of an intervention by modeling how customer behavior would change under alternative treatments using counterfactual analysis. However, real-world marketing data often exhibit various biases, such as selection bias, spillover effects, measurement error, and unobserved confounding. These biases can adversely affect both the accuracy of uplift estimation and the validity of evaluation metrics. Despite the importance of bias-aware assessment, there remains a lack of systematic studies evaluating how different models and metrics perform under such biased conditions. To bridge this gap, we design a systematic benchmarking framework. Unlike standard predictive tasks, real-world uplift datasets inherently lack counterfactual ground truth. This limitation renders the direct validation of evaluation metrics infeasible and prevents the precise quantification of biases. Therefore, a semi-synthetic approach serves as a critical enabler for systematic benchmarking. This approach effectively bridges the gap by retaining real-world feature dependencies while providing the ground truth needed to isolate structural biases. Our investigations reveal that (i) uplift targeting and prediction can manifest as distinct objectives, where proficiency in one does not ensure efficacy in the other; (ii) while many models exhibit inconsistent performance under diverse biases, TARNet shows notable robustness, providing insights for subsequent model design; (iii) the stability of evaluation metrics is linked to their mathematical alignment with the ATE, suggesting that ATE-approximating metrics yield more consistent model rankings under structural data imperfections. These findings suggest the need for more robust uplift models and evaluation metrics under real-world data imperfections.

02.
arXiv (CS.AI) 2026-06-16

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

arXiv:2606.15096v1 Announce Type: new Abstract: The Riemann Hypothesis remains one of the central unsolved problems in mathematics. Rather than claiming proof, we investigate whether a verifiable AI-assisted reasoning system can produce reliable, formally checked partial progress while explicitly identifying the remaining mathematical obstructions. We apply the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) to two RH-adjacent certification tasks. First, we construct and verify a finite RH-boundary certificate for inequality on a parameterized safe lower curve over a region. The numerical boundary curve is converted into a certificate-backed lower curve, audited using outward-rounded interval arithmetic and Arb/FLINT ball arithmetic, and then checked in Rocq/CoqInterval for the parameterized theorem. Second, we initiate a formal Lagarias-route certificate. Lagarias criterion states that RH is equivalent to the global inequality. We formalize the finite quantity and produce a Coq-checked finite certificate. The final system identifies the exact unresolved mathematical bottlenecks: formalizing the Lagarias equivalence, proving the global tail theorem beyond any finite cutoff, and potentially reducing counterexamples to colossally abundant or related extremal integers. These results demonstrate that VGPT-RSI can produce certified RH-adjacent formal progress, organize proof dependencies, and avoid overclaiming when the remaining obstruction is genuinely mathematical.

03.
bioRxiv (Bioinfo) 2026-06-19

HTS-Oracle v2: Prospective AI-Guided Discovery and Experimental Validation of Small Molecule Modulators Across Multiple Targets

High-throughput screening (HTS) remains the cornerstone of early-phase small molecule discovery yet consistently underperforms against immunotherapy targets, yielding validated hit rates below 0.1%. Here we introduce HTS-Oracle v2, which features rigorous cross-validation that ensures honest performance estimates. HTS-Oracle v2 was trained and validated across four clinically significant immune checkpoint targets (CD28, ICOS, LAG-3, and TIGIT) achieving ROC-AUC values of 0.968, 0.969, 0.875, 0.928 respectively under rigorous cross-validation. For prospective experimental validation, HTS-Oracle v2 was applied to an 8,960-compound Enamine Protein Mimetic Library, selecting only 25 compounds per target for experimental testing using temperature-related intensity change (TRIC) technology, a 99.7% reduction in screening burden. HTS-Oracle v2 identified 4, 5, 4, and 6 validated binders from 25 prospectively selected compounds per target, corresponding to validated hit rates of 16%, 20%, 16%, and 24%, respectively. Notably, 67-80% of all experimentally confirmed hits across the full 8,960-compound library were captured within just 25 model-selected compounds per target. For CD28, this represents a 28-fold improvement over HTS-Oracle v1 (239x versus 8.4x), establishing HTS-Oracle v2 as an efficient platform for AI-guided prospective hit discovery across immunotherapy targets.

04.
arXiv (CS.AI) 2026-06-17

Constitutional On-Policy Safe Distillation

arXiv:2606.03089v2 Announce Type: replace-cross Abstract: On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study show that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety–helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.

05.
arXiv (CS.CL) 2026-06-18

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

Text-based telemedicine increasingly relies on lightweight patient feedback, however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.

06.
arXiv (CS.AI) 2026-06-12

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

arXiv:2606.12896v1 Announce Type: cross Abstract: While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserve more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose \texttt{PolicyGuard}, a test-time step-level backdoor defense which leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time step. Besides, we also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

07.
arXiv (CS.CL) 2026-06-17

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

08.
arXiv (CS.LG) 2026-06-15

NeST: Neuron Selective Tuning for LLM Safety

arXiv:2602.16835v2 Announce Type: replace-cross Abstract: Safety alignment is essential for the responsible deployment of Large Language Models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods, e.g., Low-Rank Adaptation (LoRA), trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. We present NeST, a Neuron-Selective Tuning framework for efficient post-hoc safety alignment. NeST identifies safety-relevant feed-forward neurons via activation probing on vanilla harmful and benign prompts, clusters neurons with similar activation profiles, and trains shared cluster-level updates while freezing the rest of the model. Importantly, NeST is trained only on vanilla malicious prompts, without using jailbreak-specific attack data, yet generalizes robustly to diverse jailbreaks. The learned updates are then folded into the original weights, incurring no inference-time overhead. Evaluated on 14 open-weight language and multimodal models, NeST outperforms lightweight baselines and approaches full fine-tuning robustness with significantly fewer trainable parameters. On text-only models, NeST reduces average jailbreak attack success rate from 44.5% to 1.1% while training only 0.4M parameters on average. Across multimodal settings, it reduces ASR from 55.3% to 1.1%, and for downstream fine-tuned variants, it restores safety by reducing ASR from 53.8% to 0.8%. These results show that robust, maintainable safety alignment can be achieved by concentrating adaptation on localized, functionally coherent safety structures.

09.
arXiv (CS.CV) 2026-06-15

Multi-Agent Embodied Autonomous Driving: From V2X Information Exchange to Shared World Models

Autonomous driving is shifting from isolated vehicle intelligence toward multi-agent embodied systems that share perception, infer intent, and coordinate action under uncertainty. This survey examines this transition through the lens of Shared World Models (SWMs): predictive cross-agent representations maintained across vehicles, infrastructure, and other traffic participants. We review more than 380 publications spanning vehicle-to-everything (V2X) communication, collaborative perception, inter-agent cognition, cooperative planning, end-to-end cooperative driving, and simulation and data engines for closed-loop validation. The organizing question is how exchanged observations become aligned state, intent-aware interaction, and coordinated downstream action. Across the surveyed literature, evaluation remains concentrated in simulation, curated benchmarks, and offline protocols. Foundation-model-based coordination also lacks verified real-time safety guarantees in open traffic. These gaps motivate key research priorities for multi-agent embodied autonomous driving (MAEAD): verifiable shared-state maintenance, robust intent and plan alignment, and safe coordinated action under communication, latency, and deployment constraints.

10.
arXiv (CS.AI) 2026-06-15

Application of Artificial Intelligence and Machine Learning in Libraries: A Systematic Review

arXiv:2112.04573v2 Announce Type: replace-cross Abstract: As the concept and implementation of cutting-edge technologies like artificial intelligence and machine learning has become relevant, academics, researchers and information professionals involve research in this area. The objective of this systematic literature review is to provide a synthesis of empirical studies exploring application of artificial intelligence and machine learning in libraries. To achieve the objectives of the study, a systematic literature review was conducted based on the original guidelines proposed by Kitchenham et al. (2009). Data was collected from Web of Science, Scopus, LISA and LISTA databases. Following the rigorous/ established selection process, a total of thirty-two articles were finally selected, reviewed and analyzed to summarize on the application of AI and ML domain and techniques which are most often used in libraries. Findings show that the current state of the AI and ML research that is relevant with the LIS domain mainly focuses on theoretical works. However, some researchers also emphasized on implementation projects or case studies. This study will provide a panoramic view of AI and ML in libraries for researchers, practitioners and educators for furthering the more technology-oriented approaches, and anticipating future innovation pathways.

11.
arXiv (CS.LG) 2026-06-15

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

arXiv:2606.14560v1 Announce Type: cross Abstract: Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded $p$-th central moments, $p \in (1,2]$. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for $m \times n$ matrices, Muon finds an $\varepsilon$-stationary point in nuclear norm within $\mathcal{O}\left(\min\{m, n\} \frac{\Delta_1 L}{\varepsilon^2} \left(\frac \sigma \varepsilon \right)^{\frac p {p-1}}\right)$ samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.

12.
arXiv (CS.CL) 2026-06-18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.

13.
arXiv (CS.AI) 2026-06-15

When Should Agent Trust Be Conditional? Characterizing and Attacking Skill-Conditional Reputation in Agent Swarms

arXiv:2606.14200v1 Announce Type: new Abstract: Open platforms increasingly route tasks among heterogeneous LLM agents–differing in base model, scaffold, and tool stack–whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent by a single global trust score, but that scalar is the wrong object here, because routing every task to the globally most-trusted agent leaves the value of specialization unclaimed. We study skill-conditional trust R(i | k)–the trust to place in agent i for a task requiring skill k, rather than one score per agent–and pose three falsifiable questions: when is conditioning worth it, how much cross-skill evidence should be borrowed, and whether that borrowing is safe. A controlled phase-diagram analysis answers the first two: conditional trust wins only in a specific regime–high agent heterogeneity, sparse per-skill evidence, and correlated skills–and the coupling strength beta that buys this data efficiency is dual-use, because the same cross-skill borrowing is also a laundering channel. On a public benchmark of 14 genuinely heterogeneous AppWorld agents, real pools land inside the beneficial regime–a small but genuine gain, with the per-skill best agent genuinely changing across skills. We then show that an attacker with cheap evidence in one skill and none in a target skill hijacks the conditional router, driving routing regret from 0 to 0.94 on a pool our zero-cost Conditional Information Value Test (CIVT) rates GREEN–while the ungated trust verdict it contaminates reads -0.06 instead of the honest +0.19. A zero-evidence gate bounds the attack but does not eliminate it; we characterize the residual cost under an explicit budget. We do not claim Sybil-resistance–we quantify the trade-off.

14.
arXiv (CS.AI) 2026-06-11

Irresponsible AI: big tech's influence on AI research and associated impacts

arXiv:2512.03077v2 Announce Type: replace-cross Abstract: The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence in irresponsible AI development through strategies that build on the responsibility of implicated actors and collective action.

15.
arXiv (CS.LG) 2026-06-18

Hierarchical Planning with Latent World Models

arXiv:2604.03208v2 Announce Type: replace Abstract: World models are a promising path to zero-shot embodied control through planning. However, existing world model planners struggle on long-horizon, multi-stage tasks: prediction errors compound and naive search is exponential in the planning horizon. Hierarchy mitigates both by decomposing tasks into shorter, tractable subproblems; yet prior hierarchical approaches either amortize control into task-specific policies (hierarchical RL) or assume low-dimensional states and known dynamics (classical hierarchical MPC). We present Hierarchical Planning with Latent World Models (HWM), an architecture and planning paradigm for hierarchical model predictive control (MPC) directly on visual world models trained solely via next-latent prediction. HWM learns world models at multiple temporal scales within a shared latent space, so predictions from the long-horizon model serve as subgoals for the short-horizon model via latent matching, without task-specific rewards, skill learning, or hierarchical policies. To keep long-horizon search tractable, HWM learns an action encoder that compresses primitive action chunks into latent macro-actions. On real-world Franka manipulation, HWM solves pick-and-place from a single goal image at 70% success vs. 0% for single-level planning. Across simulated push manipulation and maze navigation, HWM consistently improves performance on long-horizon tasks while requiring up to 3x less planning compute.

16.
arXiv (CS.CV) 2026-06-19

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

作者:

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

17.
arXiv (math.PR) 2026-06-18

A Stochastic ISCS Markov Model for Fake News Propagation

arXiv:2606.18282v1 Announce Type: cross Abstract: This paper studies the propagation of fake news through a stochastic rumor spreading model based on Markov chains. Inspired by classical epidemiological SIR models, we consider a generalization of the Daley-Kendall framework for rumours that incorporates fact-checkers, following the Ignorant/Spreader/Checker/Stifler model introduced in Piqueira (2020). The model analyzes the influence of checkers on fake news dynamics. Numerical simulations are used to illustrate the behavior of the system and the impact of fact-checkers.

18.
arXiv (CS.LG) 2026-06-16

Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

arXiv:2606.14945v1 Announce Type: new Abstract: The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring $O(n)$ token cost per iteration and $O(n^{2})$ total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90\% fewer tokens (2{,}492 vs.\ 24{,}465). On code optimization, the stateful agent consumes 52\% fewer tokens (627K vs.\ 1{,}275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at $O(n)$ cost per iteration, while the stateful agent operates within a fixed-size conversation window at $O(1)$ cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.

19.
arXiv (CS.CV) 2026-06-17

NTIRE 2024 Challenge on Image Super-Resolution (x4): Methods and Results

This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.

20.
arXiv (CS.AI) 2026-06-16

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

arXiv:2606.15231v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details, dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate the state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

21.
arXiv (CS.AI) 2026-06-12

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

arXiv:2606.12767v1 Announce Type: new Abstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

22.
arXiv (CS.CL) 2026-06-19

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

23.
Nature (Science) 2026-06-10

Measurement of reactor neutrino oscillation with the first JUNO data

Neutrino oscillations (see refs. 1,2 and references therein), a quantum effect manifesting at macroscopic scales, are governed by lepton flavour mixing angles and neutrino mass-squared differences3 that are fundamental parameters of particle physics, representing phenomena beyond the Standard Model. Precision measurements of these parameters are essential for testing the completeness of the three-flavour framework, determining the mass ordering of neutrinos and probing possible new physics. The Jiangmen Underground Neutrino Observatory (JUNO)4 is a 20-ktonne liquid-scintillator detector located 52.5 km from multiple reactor cores, designed to resolve the interference pattern of reactor neutrinos with sub-percent precision5,6. Here we report, using the first 59.1 days of data collected since detector completion in August 2025, the first simultaneous high-precision determination of two neutrino oscillation parameters, $${\sin }^{2}{\theta }_{12}=0.3092\,\pm \,0.0087$$ and $$\Delta {m}_{21}^{2}=(7.50\,\pm \,0.12)\times 1{0}^{-5}\,{\mathrm{eV}}^{2}$$ for the normal mass ordering scenario, improving the precision by a factor of 1.6 relative to the combination of all previous measurements. These results advance the basic understanding of neutrinos, validate the design of the detector and indicate the readiness of JUNO for resolving the neutrino mass ordering with a larger dataset. The rapid achievement with a short exposure highlights the potential of JUNO to push the frontiers of precision neutrino physics and paves the way for its broad scientific programme. The first data of the Jiangmen Underground Neutrino Observatory deliver high-precision neutrino oscillation parameters, improving measurements and demonstrating readiness to determine neutrino mass ordering.

25.
arXiv (CS.LG) 2026-06-19

MolGraphBench: A Benchmark of GNN Architectures for Molecular Regression Tasks

arXiv:2602.20573v3 Announce Type: replace Abstract: Molecules are often represented as SMILES strings, which can be readily converted to hand-crafted descriptors or fingerprints (FP) for molecular property prediction. Research has demonstrated that SMILES can be converted to molecular graphs $G = (V, E)$, with atoms as nodes $(V)$ and bonds as edges $(E)$. These molecular graphs can subsequently be used to train graph neural networks (GNN) models. Despite the recent surge in application of GNN (existing and novel architectures) for molecular property prediction, a rigorous benchmark is still lacking. We propose MolGraphBench, a comprehensive benchmark of four commonly used GNN models for molecular property prediction. Benchmarking results demonstrate graph convolutional network (GCN) and graph isomorphism networks (GIN) as the optimal GNN architectures for molecular graph regression tasks, based on absolute performance, training efficiency, transfer learning and prediction quality. The study also indicates the non-complementary nature of molecular fingerprints in the fusion (GNN-FP) framework. Furthermore, our GNN models achieved performance superior or comparable performance to current state-of-the-art GNN baselines across three datasets (GCN with RMSE of $0.518$ on B3DB, GIN-FP with RMSE of $1.022$ on FreeSolv and GIN with MAE of $63.783$ on RT datasets). Findings from this study indicate that type of GNN-layer, should be treated as a tunable hyperparameter rather than a fixed design choice to achieve superior performance.