Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

On the Energy Distribution of the Galactic Center Excess' Sources

arXiv:2507.17804v2 Announce Type: replace-cross Abstract: The Galactic Center Excess (GCE) may yet herald the discovery of annihilating dark matter. Weighing against that conclusion are analyses showing evidence for dim point sources within the spatial structure of the emission. Due to technical limitations these analyses are purely spatial with all spectral information that could disentangle the excess from astrophysical backgrounds discarded. Here, we demonstrate that a neural network simulation-based inference approach can jointly analyze the spatial and spectra data. The addition is profound: energy information drives the putative point sources to be significantly dimmer, indicating either the GCE is truly diffuse in nature or made of an exceptionally large number of sources. Quantitatively, for our best fit background model, the excess is essentially consistent with Poisson emission as predicted by dark matter. If due to point sources, our median prediction is $\mathcal{O}(10^5)$ sources, or more than 35,000 at 90\% confidence, both orders of magnitude larger than the hundreds preferred by earlier point-source analyses of the GCE, although variations allowed by background systematics could reduce the required number of sources by roughly an order of magnitude.

02.
arXiv (CS.CV) 2026-06-12

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Vision-language models (VLMs) achieve strong singleshot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.

03.
arXiv (CS.CV) 2026-06-15

NEST3D: A High-Resolution Multimodal Dataset of Sociable Weaver Tree Nests

Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine-grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open-access, 1.4 TB multimodal drone dataset of 104 nest-bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert-annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA-Net, and Point Transformer V3, with PT-v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer-based and point-wise methods, they also highlight architecture-dependent challenges, particularly for convolution-based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture-dependent performance under extreme class imbalance.

04.
arXiv (CS.LG) 2026-06-18

Unraveling the Mechanism of Drug Binding to SARS-CoV-2 RNA Pseudoknot with Thermodynamics-Driven Machine Learning

arXiv:2604.14906v3 Announce Type: replace-cross Abstract: The pseudoknot secondary structure in SARS-CoV-2 RNA is essential for regulating protein synthesis through $-$1 programmed ribosomal frameshifting ($-1$ PRF), a mechanism that allows the virus to generate both structural and non-structural proteins from overlapping reading frames. This pseudoknot exhibits both threaded and unthreaded long-lived topologies. The influence of ligand binding on its folding is a process critical for the development of $-$1 PRF small-molecule inhibitors. Understanding this process through unbiased molecular dynamics (MD) simulations can be facilitated by introducing collective variables (CVs) that capture the corresponding slowest dynamical modes. Here, we use spectral map (SM), a thermodynamics-driven machine learning technique, to learn such CVs directly from all-atom MD trajectories of the SARS-CoV-2 RNA pseudoknot in complex with the $-$1 PRF inhibitor merafloxacin and its two structural analogs in neutral and ionized forms. Free-energy landscapes (FELs) derived from the learned CVs indicate that ligand-induced destabilization is topology-selective. In the threaded pseudoknot, the inhibitors destabilize the S2 stem, while in the unthreaded pseudoknot, destabilization occurs in the S1 and S3 stems. Furthermore, the extent to which each ligand reshapes the FEL matches experimentally reported antiviral potency, whereas the protonation state qualitatively alters dynamics within the same RNA topology. Overall, our results show how pseudoknot topology, ligand type, and protonation state collectively influence the slow conformational dynamics of viral RNA and establish physiological protonation as a critical factor for modeling RNA-targeted drug action.

05.
arXiv (CS.LG) 2026-06-12

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

arXiv:2605.18231v2 Announce Type: replace Abstract: Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attacks, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as we want. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures without requiring direct interaction with them. We further validate our approach with a practical demonstration, involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. This work, to our best knowledge, is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine learning-based binary function classifiers.

06.
arXiv (CS.CV) 2026-06-12

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.

07.
medRxiv (Medicine) 2026-06-22

Modelling the decadal expansion of West Nile virus in Italy: the role of climatic, anthropogenic, and macroecological drivers

Abstract BACKGROUND West Nile virus (WNV) is a growing health burden in Italy. Anticipating human infection risk is hampered by the pathogen's complex ecology, highlighting the need for comprehensive early-warning tools. AIM We aimed to model municipal-level WNV risk in Italy and characterize its decadal expansion in Italy, providing a comprehensive ecological understanding of viral emergence. METHODS We applied a machine learning framework to annual human WNV case data from 2014 to 2024. The model integrated a suite of environmental, socio-economic, and macroecological predictors to generate risk projections. We evaluated the model's performance through multiple validation settings. We also performed an anticipation test for the 2025 epidemic season, using 2024 environmental data to assess the model's predictive accuracy against observed 2025 human cases. RESULTS Our model achieved robust performance (True Skill Statistic > 0.4) and captured WNV progressive expansion from 184 predicted positive municipalities in 2014 to 2,012 in 2024 (an 11-fold increase in 11 years). Seasonal minimum temperature was the primary risk driver, followed by monitoring year and population density, indicating active spatial spread. Environmental suitability consistently preceded clinical detection. Municipalities with cases in 2023-2024 exhibited significantly higher predicted suitability during 2018-2022 than those without cases (average risk 0.58 vs 0.20). Our model successfully identified emerging risk hotspots along the Adriatic coast and southern Italy before the official human spillover of 2025. CONCLUSION Embedding macroecological drivers into WNV risk modelling provides an improved understanding of drivers of rapid WNV expansion. Our model enables proactive risk mapping, surveillance efforts, and targeted public health measures.

08.
arXiv (CS.CV) 2026-06-16

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatialguidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho$ = 0.93, p < 10$^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and hostto-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.

09.
PLOS Computational Biology 2026-06-01

On real-time calibrated prediction for complex model-based decision support in pandemics: Part 2

by Trevelyan J. McKinley, Daniel B. Williamson, Xiaoyu Xiong, James M. Salter, Robert Challen, Leon Danon, Ben Youngman, Doug McNeall Calibration of complex stochastic infectious disease models is challenging. These often have high-dimensional input and output spaces, with the models exhibiting complex, non-linear dynamics. Coupled with a paucity of necessary data, this results in a large number of non-ignorable hidden states that must be handled by the inference routine. Likelihood-based approaches to this missing data problem are very flexible, but challenging to scale, due to having to monitor and update these hidden states. Methods based on simulating the hidden states directly from the model-of-interest have an advantage that they are often more straightforward to code, and thus are easier to implement and adapt in real-time. However, these often require evaluating very large numbers of simulations, rendering them infeasible for many large-scale problems. We present a framework for using emulation-based methods to calibrate a large-scale, stochastic, age-structured, spatial meta-population model of COVID-19 transmission in England and Wales. By embedding a model discrepancy process into the simulation model, and combining this with particle filtering, we show that it is possible to calibrate complex models to high-dimensional data by emulating the log-likelihood surface instead of individual data points. The use of embedded model discrepancy also helps to alleviate other key challenges, such as the introduction of infection across space and time. We conclude with a discussion of major challenges remaining and key areas for future work.

10.
arXiv (CS.LG) 2026-06-18

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

arXiv:2606.18640v1 Announce Type: new Abstract: Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized that the lack of standardized model performance evaluation benchmarks makes fair comparison difficult and hinders further innovation, and thus benchmark standardization is in urgent need. Furthermore, many published glucose forecasting algorithms are limited to CGM data alone, ignoring other multimodal signals such as insulin dosing and carbohydrate intake. Here, we introduce MetaboNet-Bench, a benchmark for multimodal glucose forecasting for patients with type 1 diabetes that provides an extensible open-source evaluation framework for comparison of glucose forecasting algorithms that leverage glucose, insulin, and carbohydrate data. We then demonstrate its utility by benchmarking several recently published glucose forecasting models and a custom multimodal time-series model, representing different model architectures. The results show that the benefit of adding data modalities is conditioned on the complexity of the model and that incorporating more clinical metrics helps identify meaningful gaps to fill for future research.

11.
arXiv (CS.CV) 2026-06-16

PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fr\'{e}chet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.

12.
arXiv (CS.LG) 2026-06-11

Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

arXiv:2506.01396v2 Announce Type: replace Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at https://github.com/TrustworthyMLHelsinki/adaptive-clipping-fairness.

13.
arXiv (CS.CV) 2026-06-18

EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) – an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

14.
arXiv (CS.AI) 2026-06-19

Implicit Semantic-Aware Communication Based on Hypergraph Reasoning

arXiv:2606.20162v1 Announce Type: new Abstract: Semantic-aware communication has emerged as a transformative paradigm for next-generation communication systems, shifting the fundamental goal from transmitting bit-level symbols to reliably recovering and understanding the semantic meaning of information. Previous studies have demonstrated that representing the semantic content of source messages as graph-based structures can significantly improve communication efficiency and the accuracy of semantic inference at the receiver. However, existing solutions typically employ graphs that capture only pairwise relationships, thereby neglecting higher-order implicit correlations commonly observed in real-world scenarios, such as group interactions, multi-entity associations, and complex relational contexts. This limitation reduces semantic expressiveness and makes semantic inference susceptible to ambiguity and performance degradation, particularly under noisy or corrupted channel conditions. To address these issues, this paper proposes a novel hypergraph-based implicit semantic reasoning framework, HISR, which leverages hypergraphs to represent complex multi-entity relationships among semantic knowledge entities. In HISR, entities and their associated higher-order relations are mapped into dedicated semantic subspaces tailored to distinct relational contexts. This design not only disentangles diverse semantic interactions to mitigate the over-smoothing effects commonly found in traditional graph embedding methods but also enables robust semantic inference even when partial information loss occurs during transmission. Numerical results show that the proposed HISR achieves up to a 36.6% improvement in implicit semantic interpretation accuracy over the state-of-the-art benchmarks.

15.
arXiv (CS.AI) 2026-06-17

Embedded Machine Learning for Microcontroller-Class Edge Devices: Data, Feature, Evaluation, and Deployment Pipelines

arXiv:2606.18122v1 Announce Type: cross Abstract: Embedded machine learning moves inference from cloud services to resource-constrained devices that must acquire data, preprocess signals, run a model, and act within tight limits on memory, energy, and latency. This paper presents a systems-oriented synthesis of an embedded machine-learning workflow for microcontroller-class platforms. The emphasis is placed on engineering decisions that are often hidden in generic machine-learning introductions: sampling and buffering, feature extraction as dimensionality reduction, validation under class imbalance, model/runtime co-design, and streaming deployment. Two representative signal families are used throughout the paper. The first is inertial motion recognition, where a two-second, three-axis accelerometer window is transformed from raw samples into root-mean-square and spectral features before classification. The second is keyword spotting, where audio is sampled, anti-aliased, transformed into mel-frequency cepstral coefficients, and processed by a compact one-dimensional convolutional network. The paper concludes with practical design rules for robust on-device inference, including data curation, quantization, thresholding, scheduling, and field monitoring.

16.
arXiv (CS.CV) 2026-06-11

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

The explosion of generative 3D assets has created a massive demand for animation, yet current motion capture methods remain brittle, restricted to species-specific templates (e.g., SMPL) or requiring labor-intensive manual rigging. We introduce TopoCap, the first unified framework capable of extracting motion from monocular video and retargeting it onto characters with arbitrary, unseen skeletal topologies, i.e., from bipeds to hexapods and inanimate objects, without test-time optimization. Our key insight is that while skeletal structures are combinatorial and discrete, the underlying physics of motion occupy a continuous, low-dimensional manifold. We materialize this insight via a two-stage generative pipeline. First, we learn a Universal Motion Manifold using a Graph CVAE that compresses heterogeneous kinematic chains into a shared, fixed-length latent code. By explicitly conditioning the decoder on a structural embedding of the target rig, we disentangle motion dynamics from skeletal topology. Second, we treat video-to-animation as a conditional flow matching problem, predicting these topology-agnostic codes from visual features. To learn this generalized prior, we introduce Mobjaverse, a massive-scale dataset curated from Objaverse-XL. Comprising over 5,000 unique skeletal topologies and 2 million frames, it exceeds the structural diversity of existing datasets by two orders of magnitude. Extensive experiments demonstrate that \MethodMotion outperforms specialist models on human and quadruped benchmarks while enabling zero-shot retargeting for the long tail of 3D creatures. Dataset is publicly available at https://huggingface.co/datasets/duckduckplz/Mobjaverse.

17.
arXiv (CS.AI) 2026-06-19

Advancing DialNav through Automatic Embodied Dialog Augmentation

arXiv:2606.19948v1 Announce Type: new Abstract: For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav[han2025dialnav] provides a framework for holistic evaluation of the dialog–execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline, and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates cost-efficient and high-quality dataset. Then, we introduce two additional complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme to align the navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both Val Seen (58.24, +89\%) and Val Unseen (29.05, +100\%) splits, establishing a new state of the art.

18.
arXiv (CS.AI) 2026-06-16

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

arXiv:2606.15141v1 Announce Type: cross Abstract: While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on MMAR benchmark show EChO-Agent improves both accuracy and rubric scores over baseline and ablation studies show evidence integration is the key factor.

19.
arXiv (CS.LG) 2026-06-15

PostDeg: Placement Beats Parameterization in LayerNorm GNNs

arXiv:2606.14022v1 Announce Type: new Abstract: LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, $k$-core) that node-selection policies should depend on, but the literature has not located where in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted before LayerNorm is divided out up to a stabilizer term, while the same scalar inserted after LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains $+3.5\%/+2.5\%/+5.6\%$ over the LN backbone on influence maximization, network dismantling, and maximum independent set, with $10/10$ paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain – a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.

20.
arXiv (CS.AI) 2026-06-16

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

arXiv:2606.14817v1 Announce Type: cross Abstract: This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

21.
arXiv (CS.CL) 2026-06-11

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.

22.
arXiv (CS.AI) 2026-06-16

Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Authors:

arXiv:2606.14769v1 Announce Type: cross Abstract: Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces Agentomics, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human–AI workflows.

23.
arXiv (CS.CL) 2026-06-16

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.

24.
arXiv (CS.LG) 2026-06-19

We Need to Rethink Benchmarking in Anomaly Detection

arXiv:2507.15584v2 Announce Type: replace Abstract: Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. In current benchmarks, a trivial algorithm that only checks for extreme values in individual features performs competitively with state-of-the-art deep learning methods, despite failing on simple cases such as anomalies within an annulus of normal points. Moreover, existing benchmarks do not adequately reflect the diversity of anomaly detection applications, making it difficult for practitioners to reliably select algorithms for their applications. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that group applications sharing relevant characteristics, defined through a common taxonomy. Benchmarking within scenarios enables scenario-specific choices for preprocessing, metrics, and model selection, clarifying which advances transfer across similar applications and providing practitioners with reliable guidance for their specific contexts.

25.
arXiv (CS.AI) 2026-06-16

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

arXiv:2606.16925v1 Announce Type: new Abstract: Time-series foundation models show strong transfer performance when given a non-empty history window. However, true cold-start scenarios, where a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion) a framework, which replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID maps textual metadata into a shared semantic space using a frozen multilingual embedding model and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines this forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.