Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-16

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

arXiv:2606.14817v1 Announce Type: cross Abstract: This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

02.
arXiv (CS.CV) 2026-06-12

OccAny: Generalized Unconstrained Urban 3D Occupancy

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .

03.
arXiv (CS.CL) 2026-06-11

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power, and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG ( Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 ($p < 0.05$). Furthermore, our investigation of multi-agent orchestration revealed that a high degree of configuration dependence results –creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and Data will be made public upon acceptance.

04.
arXiv (quant-ph) 2026-06-12

Reservoir-controlled electromagnetically induced gratings in a weakly driven two-level medium

arXiv:2606.13085v1 Announce Type: cross Abstract: We theoretically investigate the transmission and diffraction of a weak probe field from an electromagnetically induced grating formed in a weakly driven two-level medium coupled to engineered quantum reservoirs. Using a perturbative solution of the optical Bloch equations in the weak-driving regime, we analyze how normal-vacuum, thermal, and broadband squeezed-vacuum environments modify the probe susceptibility and consequently reshape both the spatial transmission function and the far-field diffraction patterns. We show that reservoir statistics have a pronounced impact on the diffraction response by altering the amplitude and phase of the induced grating. Thermal reservoirs enhance the transmission modulation and increase the intensity of the dominant diffraction orders, whereas squeezed-vacuum reservoirs generate strongly phase-sensitive modifications that selectively redistribute optical power among diffraction channels. We further demonstrate that the detuning between the squeezed reservoir and the driving field provides an efficient mechanism for controlling diffraction directionality, leading to substantial amplification of selected angular orders. In two-dimensional geometries, squeezed-vacuum correlations produce highly structured phase landscapes and strongly anisotropic diffraction patterns, enabling directional enhancement of specific diffraction channels while suppressing others. These results establish reservoir engineering as a versatile approach for controlling transmission, diffraction efficiency, and angular selectivity in minimal two-level systems, with potential applications in programmable photonic devices, beam steering, and quantum optical platforms.

05.
Nature (Science) 2026-06-10

Gene ancestries reveal diverse microbial associations during eukaryogenesis

The origin of eukaryotes remains a central enigma in biology1. Continuing debates agree on the pivotal role of a symbiosis between an alphaproteobacterium and an Asgard archaeon2,3. However, the nature, timing and contributions of other potential bacterial partners4–6 and the role of interactions with viruses7–9 remain contentious. To address these questions, we used advanced phylogenomic approaches and comprehensive datasets spanning the known diversity of cellular life and viruses. Our analysis provided a revised reconstruction of the last eukaryotic common ancestor (LECA) proteome, in which we traced the phylogenetic origin of each protein family. We found compelling evidence for multiple waves of horizontal gene transfer from diverse bacterial donors, with some likely to have preceded mitochondrial endosymbiosis. We inferred plausible traits of the major donors and their functional contributions to the LECA. Our findings support a contribution of horizontal gene transfers to shaping the proteomes of pre-LECA ancestors and suggest a facilitating role of Nucleocytoviricota viruses. Taken together, our results suggest that ancient eukaryotes may have originated within complex microbial ecosystems through a succession of diverse associations that left a footprint of horizontally transferred genes. Phylogenomic reconstruction of the proteome of the last eukaryotic common ancestor sheds light on the origin of eukaryotes, indicating an important role of horizontal transfer of genes from diverse bacterial and viral donors.

06.
arXiv (CS.CL) 2026-06-12

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

07.
arXiv (CS.CV) 2026-06-15

FoleyGenEx: Unified Video-to-Audio Generation with Multi-Modal Control, Temporal Alignment, and Semantic Precision

We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.

08.
arXiv (CS.CL) 2026-06-17

EmoFSM: A Finite State Machine for Emotional Support Conversation

Emotional support conversation (ESC) aims to alleviate people's emotional distress through effective conversations. Although large language models (LLMs) have made remarkable progress in ESC, most of these studies may not define the diagram from a state-model perspective, thereby providing a suboptimal solution for long-term satisfaction. To address such an issue, we leverage the Finite State Machine (FSM) on LLMs, and propose a framework called EmoFSM. Our framework allows a single LLM to bootstrap the planning during ESC, and self-reason the seeker's emotion, support strategy, and the final response upon each conversation turn. Substantial experiments in ESC datasets suggest that EmoFSM outperforms many baselines, including direct inference, self-fine, chain of thought, finetuning, and externally supported methods, even those with many more parameters.

09.
arXiv (CS.CV) 2026-06-16

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

10.
arXiv (CS.CL) 2026-06-18

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

11.
arXiv (CS.AI) 2026-06-11

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

arXiv:2601.17717v3 Announce Type: replace Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the LLM Data Auditor framework. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

12.
arXiv (CS.CL) 2026-06-11

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

A social highlighter's most useful signal – which passages a crowd of readers marks – exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies – it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

13.
arXiv (CS.CV) 2026-06-11

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

In cloth-changing person re-identification (CCReID), it is critical to learn clothes-invariant feature, which can provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently limits existing ReID methods from effectively extracting these clothing-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P (Y|X) to causal intervention learning P (Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.

14.
arXiv (CS.CV) 2026-06-11

Non-frontal face recognition using GANs and memristor-based classifiers

Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

15.
Nature (Science) 2026-06-17

Optical fibre gripper for high-performance 3D micromanipulation

作者:

Optical tweezers offer precise, non-contact control, but operate in a limited force regime and impose strict requirements on the characteristics of the targets as well as the environmental conditions1–4. Millimetre-scale mechanical tweezers can offer higher gripping force but are not suitable for precise manipulations5–11. Integrating microgrippers directly at the optical fibres provides a new approach for precise micromanipulation. However, existing fibre-integrated tweezers still face challenges in achieving high-performance manipulation of micro-objects (for example, single cells) within narrow spaces, mainly due to simplified architectures, constrained designs and millimetre-scale footprints12–14. Here we report a three-dimensional (3D) optical fibre gripper (OFG), which is fabricated by two-step, two-photon polymerization. The OFG consists of rigid photoresist microclaws and soft thermoresponsive hydrogel muscle doped with silver nanoparticles, and its size is only 38 × 38 × 61 μm3. The OFG exhibits a force-to-mass ratio of about 340 μN mg−1, outperforming previously reported fibre-integrated tweezers by one to two orders of magnitude. The OFG can manipulate opaque particles, irregular micromechanical components and diverse single-cell types. We further demonstrated its potential in 3D microassembly of complex microdevices (bearings, shafts and gearboxes) and biomimetic sampling in the narrow environment (&lt;300 μm). These results position the OFG as a compact fibre-tip manipulator for 3D micromanipulation, offering reversible and tunable gripping in an intermediate force regime between optical field trapping and millimetre-scale mechanical tweezers. A miniature three-dimensional optical fibre gripper enables powerful, precise micromanipulation of particles and single cells in confined spaces, bridging the gap between optical and mechanical tweezers.

16.
bioRxiv (Bioinfo) 2026-06-18

Bayesian modeling of longitudinal metatranscriptomes of broiler meat spoilage microbiomes shows shared predictive signature associated with spoilage at refrigerated temperatures

Microbial spoilage of packaged meat is driven by complex microbial succession and related metabolic activity, yet conventional shelf-life assessment is mainly based on shelf-life studies relying on culturing and sensory analysis. In routine quality assurance, results are obtained retrospectively, and they are only indirectly linked to the metabolic activity related to sensory deterioration. Functional, time informative approaches that capture the active metabolic state of the spoilage microbiome and predict the rate of spoilage are lacking. We developed a censoring-aware Gaussian process (CAGP) framework to model longitudinal pathway expression profiles from broiler meat metatranscriptomes collected over consecutive storage days at 4 or 6{degrees}C. Samples were annotated using odor-based sensory scores defining fresh, early-spoilage, and late-spoilage phases. Because observed zeros in pathway-level data may reflect non-detection rather than true absence, the model treats low values as left-censored observations below a detection threshold while estimating smooth temporal trajectories with uncertainty. In leave-one-out prediction within the 4{degrees}C time series, predicted sampling days differed from the true days by an average of 0.43 days, and predicted spoilage phases agreed with the sensory classification. Trajectories learned at 4{degrees}C also transferred to an independent 6{degrees}C time series at the spoilage-phase level, suggesting that shared functional spoilage programs are preserved despite temperature-dependent changes in spoilage rate. Cross-entropy ranking further identified pathway modules carrying time- and phase-informative signals across temperatures. Overall, this framework provides a probabilistic approach for linking metatranscriptomic functional dynamics to sensory spoilage progression, supporting shelf-life assessment beyond retrospective microbial enumeration.

17.
arXiv (CS.AI) 2026-06-11

Sustainability assessment using multimodal AI agents

arXiv:2507.17012v2 Announce Type: replace Abstract: Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi-agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life-cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This reduces data gaps and data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprint within 19% of expert LCAs with zero proprietary data (typical of the variation between human LCAs). We also show that by encoding domain-specific knowledge, environmental impact estimation can be reframed as a data-driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.

18.
arXiv (quant-ph) 2026-06-17

$\mathcal{PT}$-Symmetric Spin–Boson Model with a Continuous Bosonic Spectrum: Exceptional Points and Dynamics

arXiv:2512.20277v2 Announce Type: replace Abstract: This work studies a $\mathcal{PT}$-symmetric non-Hermitian spin–boson model, consisting of a non-Hermitian two-level system coupled to a continuous bosonic bath. The static properties of the system are analyzed through a projection method derived from the displacement operator. We find that only a single exceptional point (EP) emerges, in contrast to non-Hermitian spin–boson models with finite modes, which typically exhibit multiple EPs. Notably, only a single real eigenvalue is found before the EP, which differs markedly from typical non-Hermitian systems where a pair of real eigenvalues precedes the EP. The time evolution of observables is further investigated via the Dirac–Frenkel time-dependent variational principle. Compared to its Hermitian counterpart, the non-Hermitian model exhibits distinct dynamical signatures, most notably the emergence of oscillations with periodic amplified amplitude. In the $\mathcal{PT}$-unbroken phase, the system exhibits sustained oscillatory dynamics with suppressed decoherence, whereas in the $\mathcal{PT}$-broken phase, additional dissipative channels accelerate decoherence and drive rapid convergence toward a stable steady state. These results shed light on how $\mathcal{PT}$ symmetry protects coherent light–matter interactions in non-Hermitian quantum systems.

19.
arXiv (CS.LG) 2026-06-16

Imbalanced Classification under Capacity Constraints

arXiv:2605.03289v2 Announce Type: replace-cross Abstract: Detecting observations from a minority class under severe class imbalance is a central challenge in applications such as fraud detection, medical screening, and industrial quality control. In these settings, each positive prediction triggers a costly follow-up action, an MRI scan, a transaction audit, whose execution is subject to real operational constraints. This paper proposes a formal classification framework under capacity constraints: given a user-defined bound limit $b$ on the proportion of observations that can be labeled as belonging to the minority class, the goal is to find the classifier that maximizes sensitivity on that class. We characterize the optimal classifier under this constraint and establish its equivalence with the classical Bayes classifier under a reweighting of the prior probabilities. We also introduce a capacity-adjusted performance metric $M$ that accounts for the effective detection rate when the capacity constraint is binding. The framework is implemented on top of standard learning methods, k-NN, SVM, random forests, and neural networks, and statistical consistency is established for each. We further show that these methods reduce to post-hoc thresholding when no hyperparameters are oriented toward the capacity-constrained objective, and introduce a capacity-aware support vector machine that exploits the constraint during training and achieves the strongest empirical performance. Experiments on the Taiwanese credit card default dataset confirm that capacity-constrained classifiers substantially outperform both classical approaches and SMOTE under high imbalance regimes. The framework extends naturally to multiclass settings and online environments.

20.
arXiv (CS.CL) 2026-06-12

GENIE: A Fine-Grained Measure for Novelty

Large Language Models have consistently demonstrated a lack of creativity and diversity across tasks. Prior work has focused on addressing whether models are capable of generating creative outputs. Here, we aim to consider novelty and investigate what makes model-generated content novel or not novel in a task-specific manner. We propose a fine-grained evaluation metric GENIE to measure the novelty of responses along task-specific features with respect to a population of responses. We show that unlike GENIE, holistic metrics struggle to capture the high-dimensionality of novelty and do not provide insight on which properties they target. Finally, we use GENIE to measure the effectiveness of mitigation methods that address creativity to better understand where these methods can improve novelty.

21.
arXiv (CS.AI) 2026-06-11

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

arXiv:2606.04145v2 Announce Type: replace-cross Abstract: Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p

22.
arXiv (CS.AI) 2026-06-16

Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course

arXiv:2606.16842v1 Announce Type: cross Abstract: Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

23.
medRxiv (Medicine) 2026-06-17

Treatment of Multi-Drug-Resistant Tuberculosis with Second-Line All-Oral Drugs in Ghana: Incidence of Adverse Events.

Introduction: The treatment of multidrug-resistant tuberculosis (MDR-TB) remains challenging due to the toxicity of second-line medications and suboptimal treatment outcomes. This study aimed to determine the incidence of adverse events and identify factors associated with these events in patients undergoing treatment for MDR-TB with second-line all-oral drugs in Ghana. Methods: This retrospective cohort study reviewed the medical records of 384 MDR-TB patients treated with second-line all-oral drugs at selected health facilities in Ghana, including the Greater Accra Regional Hospital, Eastern Regional Hospital, and Kumasi South Hospital. Data were extracted using the Kobo Collect tool, capturing patient demographics, baseline clinical and laboratory characteristics, treatment regimens, and adverse events. The study period spanned from 2020 to August 2024. Results: The study included a total of 384 MDR-TB patients, with a mean age of 45 years (SD = 15). The majority of patients were male (65.78%), and most were within the 45-64 years age group (33.85%), followed by those aged 25-44 years (31.25%). Regionally, the highest number of cases were reported from the Greater Accra Region (39.06%), followed by the Eastern Region (31.25%) and Kumasi South Hospital (29.69%). Approximately one in four patients (25%) presented with comorbidities, with HIV being the most common (19.5%). The most frequently reported adverse events were diarrhea (14%), dizziness (13.7%), and vomiting (12.3%). Most of these were mild to moderate in severity and tended to decrease as treatment progressed. Severe adverse events, such as leukopenia and acute kidney injury, were rare, occurring in less than 5% of patients. Over the course of treatment, gastrointestinal adverse events such as vomiting and nausea showed a significant decline, indicating possible patient adaptation or improved clinical management. Results from the multivariate Poisson regression analysis revealed that age and comorbidities were significant predictors of adverse events. Patients aged 65 years and above had a 56% lower risk of developing adverse events compared to younger patients (Adjusted Risk Ratio [aRR] = 0.44, 95% CI: 0.25-0.79, p = 0.005). Conversely, patients with comorbid conditions such as diabetes or hypertension were approximately 2.6 times more likely to experience adverse events compared to those without comorbidities (aRR = 2.65, 95% CI: 1.58-4.43, p < 0.001). The effect of sex was not statistically significant after adjustment (aRR = 1.03, 95% CI: 0.70-1.50, p = 0.86). At the end of the treatment period, 74.9% of patients achieved successful outcomes, including both those who were cured and those who completed treatment without being classified as cured. However, 25.1% had unsuccessful outcomes, which included treatment failure, relapse, or death. Conclusion: In conclusion, adverse events are common in the treatment of MDR-TB with second-line All-Oral drugs, with gastrointestinal adverse events being the most prevalent. These findings highlight the importance of monitoring and managing adverse events to optimize treatment outcomes for MDR-TB patients in Ghana.

24.
medRxiv (Medicine) 2026-06-16

Investigating naming error patterns after non-invasive brain stimulation and language treatment in persons with aphasia

Abstract Background: Transcranial direct current stimulation (tDCS) paired with behavioral language therapy can improve naming in persons with aphasia (PWA), yet naming errors persist. Little is known about how naming error patterns change after non-invasive brain stimulation is combined with language treatment. Aims: To examine whether right cerebellar tDCS plus computerized aphasia therapy changes the types of naming errors in people with chronic aphasia across timepoints, and to determine whether effects differ by cerebellar tDCS polarity (anode vs. cathode). Methods and Procedures: In a randomized, double-blind, sham-controlled, within-subject crossover study, we retrospectively analyzed behavioral data from 24 individuals with post-stroke aphasia. Each participant completed two 15-session intervention periods (3-5 sessions/week) with active cerebellar tDCS + computerized aphasia therapy and sham + computerized aphasia therapy, separated by a two-month washout. General linear models (GLMs) assessed longitudinal changes in six error types (semantic, phonological real word, phonological nonword, no response, mixed, unrelated) on an untrained picture naming task (Philadelphia Naming Test; PNT) and a trained task (Naming 80; N80). Additional GLMs evaluated polarity effects with 2 (Group: anode vs. cathode) x 2 (Treatment) interactions, and treatment-order effects with 2 (Group: tDCS-first vs. sham-first) x 2 (Treatment) interactions. Outcomes and Results: Active cerebellar tDCS did not significantly change error types for trained items (N80). For untrained items (PNT), active tDCS reduced several error types relative to sham, with the clearest and most durable reduction in phonological nonword errors; more moderate reductions occurred for phonological real word and unrelated errors. Mixed errors showed a marginally opposite pattern, tending to increase after tDCS and decrease after sham. Polarity analyses indicated broadly similar effects across anodal and cathodal stimulation overall, but only the anode group showed a reliable treatment effect for phonological nonword errors on the PNT. Treatment-order analyses revealed no significant order effects. Conclusions: Our results indicate a shift in naming error types, particularly after tDCS treatment for the untrained naming task (PNT). These findings may help guide the course of treatment approaches of those with aphasia and what error naming pattern types may show changes post stroke when combining non-invasive brain stimulation and computerized aphasia therapy. Clinical Trial Registration: Cerebellar Transcranial Direct Current Stimulation and Aphasia Treatment [NCT02901574] Keywords: aphasia, naming errors, non-invasive brain stimulation, cerebellar tDCS, computerized aphasia treatment

25.
arXiv (CS.LG) 2026-06-12

Reliability of Probabilistic Emulation of Physical Systems

arXiv:2606.12997v1 Announce Type: new Abstract: Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.