Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

02.
arXiv (CS.LG) 2026-06-16

How Should World Models Be Evaluated? A Decision-Making-Centric Position

arXiv:2606.15032v1 Announce Type: new Abstract: World models have rapidly become one of the central abstractions in modern AI. Yet the term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. The result is not only metric diversity but also a recurring problem of claim/evidence mismatch: papers frequently make a stronger claim about what their model is useful for than their evaluation can actually establish. This paper surveys the recent literature and argues that the central question is use-dependent. When a model is presented as a world model for embodied decision-making, a more decisive issue is not whether it generates visually compelling videos, but whether it supports reliable counterfactual reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the literature using an L0–L7 ladder that ranges from visual plausibility to policy optimization utility. In our interpretation, L0–L3 are most naturally read as diagnostics of generated artifacts, L4 is often the first genuinely interventional test, and L5–L7 provide the most direct evidence of decision usefulness. Based on this diagnosis, we propose a decision-making-centric evaluation framework and a benchmark protocol that foreground counterfactual action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration.

03.
arXiv (CS.CV) 2026-06-15

Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

Hallucinations in Large Vision-Language Models (LVLMs) remain a persistent challenge, often stemming from inadequate integration of visual information during multimodal reasoning. A key cause is the model's over-reliance on textual priors and underutilization of visual cues, leading to outputs that are linguistically fluent but visually inaccurate. For example, given an image of an empty kitchen countertop, an LVLM might hallucinate a "bowl of fruit" or "cup of coffee", relying on language associations rather than visual evidence. Most LVLMs incorporate visual features by appending them to the input stream of a pre-trained LLM and training on large-scale vision-language datasets. Our systematic analysis reveals that this strategy often leads to over-dependence on textual information due to the inherent bias of LLMs towards language-dominant representations. This imbalance skews attention towards the text over visual content, weakening the model's ability to ground outputs in visual inputs. To address this, we propose a simple yet effective visual feature incorporation method that encourages the model to learn visually-informed textual embeddings distinct from those of the base LLM and promotes a more balanced attention distribution. Experimental results across multiple hallucination benchmarks demonstrate that our method significantly reduces hallucinations and fosters more balanced multimodal reasoning. Notably, our approach achieves substantial gains, including +9.33% on MMVP-MLLM, +2.99% on POPE-AOKVQA, up to +3.4% on Merlin, and +3% on the hard-data split of HallusionBench.

04.
arXiv (CS.LG) 2026-06-15

A Water Efficiency Dataset for African Data Centers

arXiv:2412.03716v3 Announce Type: replace Abstract: Artificial intelligence (AI) computing and data centers consume large amounts of freshwater, both directly for cooling and indirectly for electricity generation. While most attention has been paid to developed countries such as the U.S., this paper presents the first-of-its-kind dataset that combines nation-level weather and electricity generation data to estimate water usage effectiveness for data centers in 41 African countries across five different climate regions. We also use our dataset to evaluate and estimate the water consumption of inference on two large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African countries. Our estimates suggest that writing a 10-page report using Llama-3-70B could consume as much as {0.66 liters} of water, while the water consumption by GPT-4 for the same task may go up to about {59 liters}. For writing a medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about {0.13 liters} and {2.9 liters} of water, respectively. All the numbers for generative model inference tasks are based on public information available in 2024, when we initially prepared the analysis. Since then, AI inference systems have improved substantially. For example, recent disclosures suggest that energy efficiency improved by more than 30x between May 2024 and May 2025. Accordingly, our 2024 estimates should be interpreted as historical reference values rather than as representative of current performance. Interestingly, given the same AI model, 9 of the 11 selected African countries consume less water than the global average, mainly because of lower water intensities for electricity generation.

05.
arXiv (CS.CL) 2026-06-12

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target–text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution–abstention axis rather than removing the high-complexity bottleneck.

06.
arXiv (CS.CV) 2026-06-12

DIMOS: Disentangling Instance-level Moving Object Segmentation

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

07.
arXiv (CS.AI) 2026-06-16

A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

arXiv:2510.22266v3 Announce Type: replace-cross Abstract: Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.

08.
arXiv (CS.CL) 2026-06-11

Neuron-based Personality Trait Induction in Large Language Models

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training and modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves comparable performance as fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at https://github.com/RUCAIBox/NPTI.

09.
arXiv (CS.AI) 2026-06-18

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

arXiv:2606.18747v1 Announce Type: cross Abstract: Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

10.
arXiv (CS.AI) 2026-06-17

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

arXiv:2606.18192v1 Announce Type: new Abstract: As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

11.
arXiv (CS.LG) 2026-06-15

AGORA: Can Deliberation and Governance Gates Absorb Participation Bias in Transit Planning?

arXiv:2606.13696v1 Announce Type: cross Abstract: Transit network design depends not only on the optimization algorithm but also on who shows up to the public hearing. Current practice often collects one-directional comments from self-selected attendees, leaving participant mix as an uncontrolled source of outcome variation. We present AGORA, a framework that holds the network, demand, and solver fixed while systematically varying meeting composition through stakeholder agents, structured deliberation, and governance gates. Across two standard benchmark networks at different scales, we find that (i) aggregate outcomes vary little across compositions, but on tail risk and fairness disparity, representative sampling still tends to outperform skewed compositions; (ii) without deliberation, composition produces no variation at all, showing that deliberation is the mechanism through which who attends affects outcomes; and (iii) governance gates compress cross-profile variance without shifting the average outcome on Mandl, but low acceptance on Mumford0 shows thresholds require instance-specific calibration. These findings reframe participation bias from an uncontrollable input to a process-design problem: even without guaranteed representative attendance, well-structured deliberation and governance criteria can substantially reduce how much outcomes depend on who is in the room.

12.
arXiv (quant-ph) 2026-06-19

Indefinite Quantum Causality

arXiv:2606.19438v1 Announce Type: new Abstract: In recent years, operational approaches to quantum foundations have been developed as a means of understanding the core principles and distinctive features of quantum theory. Such approaches typically view physical processes as sequences of operations, with earlier operations serving as causes of later effects. However, a growing literature is emerging on the possibility of relaxing this assumption and allowing for quantum indefiniteness in the causal order. This development stems from a variety of motivations, both fundamental and applied, including exploring the role of causality in quantum theory, the interplay between quantum theory and general relativity, and higher-order quantum computing. A prominent offshoot of this development is the emergence of indefinite causal order as a feasible resource for quantum information processing. This review provides an overview of the current state of the art in the field, covering the methodology underlying indefinite quantum causality within the so-called "process matrix formalism", outlining key results and experimental implementations, and discussing recent advances.

13.
arXiv (quant-ph) 2026-06-24

The most discriminable quantum states in the multicopy regime

arXiv:2604.26927v2 Announce Type: replace Abstract: This work investigates which sets of quantum states give rise to the highest achievable success probability in minimum-error state discrimination if multiple copies of the unknown state are given. Specifically, we consider uniformly distributed ensembles of the form $\left\{\frac{1}{N},\rho_i^{\otimes k}\right\}_{i=1}^N$, where $N$ states in dimension $d$ are provided in $k$ identical copies, and derive universal limits in this scenario. For pure state ensembles, we prove that whenever $N$ is large enough to support a state $k$-design, these designs will exactly give rise to the maximally discriminable sets. We further show that when $N$ exceeds the size required for a $k$-design, mixed states can outperform all pure state ensembles. We then recognise that the problem of most discriminable classical states in the multi-copy regime is in one-to-one correspondence to the concept of the multiplicative Bayes capacity of independent uses of classical channels, a concept that emerges naturally in the context of classical information leakage. This connection allows us to completely solve the classical analogue of our problem when $N\geq \binom{d + k - 1}{k}$, and to prove that quantum systems offer a quadratic advantage (in number of copies $k$) over classical ones. Then, we prove that this classical over quantum advantage is strongly reduced when one is restricted to real quantum states, more precisely, when $N \geq k + 1$, pure real qubits only offer a constant advantage over classical bits. Finally, we introduce computational techniques to find sets of most discriminable ensembles and to obtain rigorous universal upper bounds on the maximal success probability for multi-copy state discrimination in cases that are analytically intractable.

14.
arXiv (CS.CV) 2026-06-18

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive ''alignment gap'', where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract compositional reasoning tasks.

15.
arXiv (CS.CL) 2026-06-19

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding – the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

16.
bioRxiv (Bioinfo) 2026-06-18

Bayesian modeling of longitudinal metatranscriptomes of broiler meat spoilage microbiomes shows shared predictive signature associated with spoilage at refrigerated temperatures

Microbial spoilage of packaged meat is driven by complex microbial succession and related metabolic activity, yet conventional shelf-life assessment is mainly based on shelf-life studies relying on culturing and sensory analysis. In routine quality assurance, results are obtained retrospectively, and they are only indirectly linked to the metabolic activity related to sensory deterioration. Functional, time informative approaches that capture the active metabolic state of the spoilage microbiome and predict the rate of spoilage are lacking. We developed a censoring-aware Gaussian process (CAGP) framework to model longitudinal pathway expression profiles from broiler meat metatranscriptomes collected over consecutive storage days at 4 or 6{degrees}C. Samples were annotated using odor-based sensory scores defining fresh, early-spoilage, and late-spoilage phases. Because observed zeros in pathway-level data may reflect non-detection rather than true absence, the model treats low values as left-censored observations below a detection threshold while estimating smooth temporal trajectories with uncertainty. In leave-one-out prediction within the 4{degrees}C time series, predicted sampling days differed from the true days by an average of 0.43 days, and predicted spoilage phases agreed with the sensory classification. Trajectories learned at 4{degrees}C also transferred to an independent 6{degrees}C time series at the spoilage-phase level, suggesting that shared functional spoilage programs are preserved despite temperature-dependent changes in spoilage rate. Cross-entropy ranking further identified pathway modules carrying time- and phase-informative signals across temperatures. Overall, this framework provides a probabilistic approach for linking metatranscriptomic functional dynamics to sensory spoilage progression, supporting shelf-life assessment beyond retrospective microbial enumeration.

17.
arXiv (quant-ph) 2026-06-17

Microwave-free vector magnetometry and crystal orientation determination with Nitrogen-Vacancy centers using Bayesian inference

arXiv:2512.13835v2 Announce Type: replace Abstract: Nitrogen-vacancy (NV) centers in diamond provide a solid-state platform for quantum sensing. While optically detected magnetic resonance techniques offer high sensitivity, their reliance on microwaves introduces heating and stray electromagnetic fields that can perturb nearby samples. Optical approaches based on cross-relaxation between differently oriented NV centers remove this constraint but have so far required stringent alignment of the external field with crystallographic axes, restricting their practicality. Here we introduce a general framework for microwave-free vector magnetometry at near-zero field that leverages Bayesian inference to extract both the magnetic field vector and the NV orientation directly from photoluminescence maps. An analytical model of cross-relaxation resonances enables efficient inference under arbitrary field and orientation configurations, while naturally incorporating the discrete degeneracies of the NV symmetry. We experimentally demonstrate robust orientation determination and vector-field reconstruction, establishing a general route toward compact and alignment-free NV magnetometers for practical sensing applications.

19.
arXiv (CS.LG) 2026-06-16

Multi-Fidelity SINDy: Sparse Discovery of Nonlinear Dynamical Systems with Fidelity-Weighted Measurements

arXiv:2606.15690v1 Announce Type: new Abstract: Data from simulations and experiments are rarely noise-free and often exhibit heterogeneous levels of fidelity. Measurement uncertainty may vary across repeated observations, sensing devices, or even within a single experiment. This work addresses the problem of discovering nonlinear dynamical systems from such inhomogeneous data. We extend the Sparse Identification of Nonlinear Dynamical Systems (SINDy) framework to account for variable noise levels by combining Ensemble SINDy and Weak SINDy within a weighted regression formulation derived from generalized least squares. A statistical justification for the weighting strategy is also provided. The methodology is validated on several benchmark systems, including ordinary and partial differential equations. In addition, we show the benefit of multi-fidelity integration for forecasting the dynamics of a double pendulum system. The results confirm that the proposed approach mitigates the adverse effects of heteroscedastic noise and that repeated, low-cost, low-quality measurements can improve model recovery, in some cases matching or outperforming reconstructions obtained using only high-fidelity data.

20.
arXiv (CS.CL) 2026-06-15

SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model

Large language models (LLMs) are increasingly deployed as planners for autonomous agents in household environments. While existing benchmarks evaluate whether LLM-generated plans execute successfully, they overlook a critical type of failure: latent failures. Unlike immediate failures that trigger instant feedback at execution time and enable timely correction, latent failures do not immediately halt plan execution but silently compromise goal achievement. In severe cases, they cause irreversible harm. To address this gap, we introduce SIMMER, a benchmark for evaluating latent failures in LLM planning through a human-curated symbolic world model grounded in the kitchen domain. SIMMER defines a world model comprising 77 actions, 262 unique objects, and approximately 46,800 possible interactions that are semantically realistic, derived from real-world cooking scripts. It then leverages a state machine executor that validates plans against the world model and detects immediate precondition violations, latent hazards, and irreversible failures. Experiments across six LLMs show that even frontier models achieve at most 17% error-free plans. Moreover, up to 56% of plans contain latent failures, the majority of which lead to irreversible consequences. We further demonstrate that explicit state reasoning via counterfactual foresight simulation can reduce latent failures by up to 72% and irreversible cases by up to 75%, suggesting a promising direction for more robust LLM planners.

21.
arXiv (CS.CV) 2026-06-18

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

22.
arXiv (CS.AI) 2026-06-15

FactoryLLM: A Safe and Open-Source AI Playground for Evaluating LLMs in Smart Factories

arXiv:2606.14119v1 Announce Type: new Abstract: Fault diagnostics and recovery in smart factories is challenging because critical information is dispersed across manuals of multiple machines which are interconnected through the manufacturing process. Large Language Models (LLMs) can provide a promising approach. In this paper, we propose FactoryLLM, a safe and open-source AI playground designed for evaluating different LLM-based retrieval-augmented generation (RAG) models by analysing documents from multiple machines across the manufacturing process. FactoryLLM enables the user to configure the LLM, and assess performance when reasoning over multiple documents, through a dual evaluation setup using both RAGAS and NVIDIA's LLM-as-a-Judge metrics. FactoryLLM is safe because it allows users to run local or open-source LLMs without sharing sensitive industrial data, providing a controlled environment for experimentation. We demonstrate the efficacy of FactoryLLM through a case study which involves an Autonomous Intelligent Vehicle and its Mobile Planner software, evaluating three LLMs across 30 maintenance queries derived from approximately 600 pages of cross-machine documentation. The results suggest that FactoryLLM is effective in cross-machine document reasoning: every model achieved a groundedness score above 0.88. The full code and documentation for community to test FactoryLLM with their manufacturing specific scenarios are publicly available.

23.
arXiv (math.PR) 2026-06-16

Higher-order spectral perturbation expansions II: Kernel matrices and manifold learning

arXiv:2606.16373v1 Announce Type: cross Abstract: We study spectral concentration bounds for kernel matrices as approximation of the corresponding kernel integral operator. Results are established under weak assumptions on the data setting and the reproducing kernel relying only on a Mercer condition and a local Weyl law. This allows us to deal with key features of kernel matrices, such as large multiplicities, large effective dimension, and heavy-tailed distributions. Our results apply to infinite dimensional principal component analysis, manifold learning, and Bayesian nonparametric statistics. We illustrate this via two prototypical examples: The heat kernel on the sphere and a wavelet prior from Bayesian nonparametrics.

24.
arXiv (CS.LG) 2026-06-12

Mixing Makes Markovian Contexts Cheap for Linear Bandits

arXiv:2603.12530v2 Announce Type: replace Abstract: Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This ``contexts are cheap'' perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.

25.
arXiv (CS.AI) 2026-06-12

Representing Time Series as Structured Programs for LLM Reasoning

arXiv:2606.12481v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks – editing, captioning, and question answering – where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.