Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
medRxiv (Medicine) 2026-06-12

Heterogeneity of Treatment Effect of Aspirin and Clinically Significant Bleeding in Older Adults

Aim: The global population of older adults is growing, and older age is linked to higher bleeding risk. Although guidelines discourage aspirin for primary prevention in healthy older adults due to bleeding harms outweighing benefits, many continue taking it without a clear indication. It remains unclear whether all older adults face uniform aspirin-related bleeding risk or if certain subgroups are more vulnerable. Methods: We analyzed data from 19,114 ASPREE trial participants to develop machine learning models using 116 baseline variables. Random forest (RF) and random survival forest (RSF) models predicted 5-year bleeding risk, and participants were stratified into low, intermediate, and high-risk groups based on the 20th and 80th percentiles of predicted risk. We assessed heterogeneity of treatment effect (HTE) by testing treatment-by-risk group interactions on the relative scale using Fine-Gray models, and on the absolute scale using observed 5-year cumulative incidence rates. Results: Over a median follow-up of 4.7 years, 626 major bleeding events occurred. The RF model had moderate discrimination (AUC = 0.65, 95% CI: 0.63-0.67) and good calibration (Brier = 0.032, 95% CI: 0.029-0.034). Statistically significant HTE was observed on the relative scale, with the greatest relative increase in bleeding risk seen in the low-risk group (subdistribution hazard ratio = 2.26, 95% CI: 1.27-4.01). On the absolute scale, low-risk participants experienced higher bleeding with aspirin (absolute risk difference (ARD) = 1.17%, 95% CI: 0.37-1.95), but heterogeneity in ARDs was not statistically significant (Cochran's Q p > 0.45). Similar findings were observed when using the RSF model. Conclusion: Participants at lowest baseline bleeding risk experienced the greatest relative increase in bleeding risk with aspirin therapy. We found statistically significant heterogeneity in treatment effects on the relative but not absolute scale. These findings support an individualized, risk-based approach to aspirin therapy decision-making in older adults.

02.
arXiv (CS.AI) 2026-06-19

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

arXiv:2606.20235v1 Announce Type: cross Abstract: Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.

03.
arXiv (CS.AI) 2026-06-24

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

arXiv:2606.24388v1 Announce Type: new Abstract: We introduce a large-scale, open-source dataset of pre-generated adversarial attacks for vision-language models (VLMs). The dataset is designed to be diverse, representative, and practical, extending existing benchmarks by covering 10 high-level categories and 55 subcategories of harmful intents. Our primary goal is to make adversarial data accessible to the research community, given the computational cost and complexity of generating large numbers of attacks. The dataset comprises 47 524 adversarial samples, generated using state-of-the-art attack strategies from recent literature. Our work complements existing efforts by consolidating and extending prior benchmarks from multiple established sources, resulting in 7 826 intents, and introduce an additional category to broaden coverage. This provides realistic evaluation resources for studying model robustness and alignment. Our dataset intends to enable researchers and practitioners to systematically evaluate the robustness and safety of VLMs, fine-tune attack-generation models, and develop or stress-test defensive guardrails under diverse adversarial conditions. By releasing this resource, we aim to lower the barrier to adversarial research and foster more reproducible, comprehensive, and comparable evaluations of VLM safety.

04.
arXiv (CS.LG) 2026-06-25

Neural operator-based digital twins for modeling amyloid-$\beta$ and tau propagation and treatment optimization in Alzheimer's disease

arXiv:2606.25185v1 Announce Type: new Abstract: Accurately predicting the spatiotemporal evolution of amyloid-$\beta$ and tau proteins at the individual level is critical for improving the diagnosis and treatment of Alzheimer's disease. We consider the problem of constructing patient-specific digital twins that model the propagation of these biomarkers on the cortical surface using reaction–diffusion dynamics. A major challenge is that the underlying nonlinear aggregation mechanisms are unknown and must be inferred from sparse, noisy, and heterogeneous longitudinal PET imaging data. To address this, we develop a data-driven framework that learns biomarker dynamics directly from clinical observations. The approach combines operator learning with reduced-order representations to infer governing equations of disease progression from data. Using this framework, we achieve predictive accuracies of 87\% for amyloid-$\beta$ and 81\% for tau. Building on the learned dynamics, we further formulate a PDE-constrained optimal control problem to design personalized therapeutic strategies that regulate pathological protein propagation. By integrating data-driven dynamical modeling with treatment optimization, the proposed digital twin framework provides an interpretable and predictive platform for understanding disease progression and enabling precision interventions in neurodegenerative disorders.

05.
arXiv (CS.AI) 2026-06-16

SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills

arXiv:2606.15899v1 Announce Type: cross Abstract: Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risk - natural-language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines - so what is needed is a semantic, multi-dimensional vetting system rather than another signature matcher. We present SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM-as-Judge to vet agent skills. What is new: SARS (Skill Agentic Risk Score), a five-dimensional agentic-risk metric with a principled weighted formula for instruction-following systems. What is integrated: full CVSS v4.0 vector decomposition and a ClawHub dual-view that places our LLM-generated review beside the official marketplace verdict. What is demonstrated: drawing on our companion benchmark paper [ 1], the LLM-as-Judge stage achieves zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%; for instruction-layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (e.g., CODEBERT detects none of nine memory-poisoning skills). Detection rates vary from 35% to 95% across four LLM evaluators, motivating ensemble scoring in production deployments.

06.
arXiv (CS.AI) 2026-06-18

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

arXiv:2606.18309v1 Announce Type: cross Abstract: Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.

07.
arXiv (CS.AI) 2026-06-11

Estimating Tail Risks in Language Model Output Distributions

arXiv:2604.22167v2 Announce Type: replace-cross Abstract: Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk

08.
arXiv (quant-ph) 2026-06-11

Shadow Engineering of Quantum Processes

arXiv:2606.12035v1 Announce Type: new Abstract: Characterizing quantum processes is essential for hardware benchmarking, error diagnosis, and algorithm verification. While recent work [PRX QUANTUM 4, 040337 (2023)] extended classical shadows from quantum state to quantum process, enabling efficient single-channel $\mathcal{E}$ property prediction, its applicability to composite processes $f(\mathcal{E}_1, \mathcal{E}_2,\cdots, \mathcal{E}_k)$ remains unexplored. We introduce shadow engineering, a framework encoding the classical shadows of processes into sparse transfer matrices to predict $f(\mathcal{E}_1, \mathcal{E}_2,\cdots, \mathcal{E}_k)$ properties with proven polynomial sample complexity, matching single-channel efficiency while exponentially lower than quantum process tomography. Crucially, this approach repurposes existing $\mathcal{E}_m$-shadow data without physical execution of $f(\mathcal{E}_1, \mathcal{E}_2,\cdots, \mathcal{E}_k)$, enabling flexible quantum process characterization with minimal hardware overhead. We demonstrate the framework's effectiveness and practicality on a superconducting quantum processor for typical applications such as error mitigation and Hamiltonian dynamical simulation. This framework unlocks new capabilities for predicting complex quantum behaviors without physical re-execution, with immediate applications in near-term device calibration and quantum simulation.

09.
arXiv (CS.AI) 2026-06-24

Computing Evolutionarily Stable Strategies in Imperfect-Information Games

Authors:

arXiv:2512.10279v3 Announce Type: replace-cross Abstract: We present an algorithm for computing evolutionarily stable strategies (ESSs) in symmetric perfect-recall extensive-form games of imperfect information. Our main algorithm is for two-player games, and we describe how it can be extended to multiplayer games. The algorithm is sound and computes all ESSs in nondegenerate games and a subset of them in degenerate games which contain an infinite continuum of symmetric Nash equilibria. The algorithm is anytime and can be stopped early to find one or more ESSs. We experiment on an imperfect-information cancer signaling game as well as random games to demonstrate scalability.

10.
arXiv (CS.CL) 2026-06-25

BitNet Text Embeddings

LLM-based text embedders have substantially improved retrieval and semantic representation quality, but their deployment remains costly: large backbone models slow down embedding inference, while high-dimensional full-precision embeddings impose substantial storage and bandwidth overhead on large-scale indexes. In this paper, we present BITEMBED, an extreme low-bit framework for LLM-based text embedding that jointly targets encoding efficiency and vector storage. BITEMBED converts pretrained LLM backbones into BitNet-style embedding encoders with ternary weights, quantized activations, and lightweight normalization refinement. The converted model is adapted to representation learning through continual contrastive pre-training, followed by supervised contrastive fine-tuning with both similarity-distribution distillation and attention-relation distillation from a full-precision teacher. Beyond quantizing the backbone, BITEMBED further trains output embeddings to support multiple storage precisions meeting different storage needs in various scenarios. Experiments on MMTEB (eng, v2) with Qwen3-0.6B and Gemma3-270M show that BITEMBED is largely comparable to full precision teacher embedders. Moreover, BITEMBED flexibly obtains text embeddings of various precisions, achieving a trade-off between performance and storage cost.

11.
arXiv (CS.LG) 2026-06-12

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

arXiv:2512.23566v2 Announce Type: replace-cross Abstract: How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system's invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

12.
arXiv (math.PR) 2026-06-24

The Zeta Tail Distribution: A Novel Event-Count Model

arXiv:2506.17496v3 Announce Type: replace-cross Abstract: We introduce the Zeta Tail$\left(a\right)$ probability distribution as a new model for random damage-event counts in risk analysis. Although a natural analogue of the Geometric$\left(p\right)$ distribution, Zeta Tail$\left(a\right)$ has received little attention in the scholarly literature. In the present work, we show this distribution to be reasonably tractable by deriving various fundamental properties, including moments, generating functions, and reliability functions. We then assess its usefulness as an alternative to Geometric$\left(p\right)$, both theoretically and through application to a set of meteorological data. Finally, we discuss conceptual differences between employing the Zeta Tail$\left(a\right)$ model conditionally (i.e., given observed data with certain known characteristics) and unconditionally (i.e., for arbitrary, as yet unobserved data).

13.
arXiv (quant-ph) 2026-06-17

Quantum statistical enhancement of collective behaviour in a bosonic active Ising model

arXiv:2606.18091v1 Announce Type: new Abstract: Collective behaviour such as flocking (the collective motion of a spontaneously formed group along a common direction) or aster formation (the binding of opposing flocks, inhibiting each others motion) are intriguing emergent phenomena in active systems with local alignment rules. Until recently, their occurrence was mainly studied for classical systems, a prime example being the active Ising model (AIM), which translates the main ingredients of flocking and aster formation (i.e., alignment and self-propulsion) to a lattice framework. Here we introduce and study a one-dimensional (1D) quantum lattice variant of the AIM, based on ideal bosons with a spin degree of freedom. We find that both the collective behaviours of the 1D classical model, flocking and aster formation, are markedly enhanced by the bosonic quantum statistics. This contrasts with a recent quantum generalization of the AIM based onto hard-core bosons [Khasseh et al., Phys. Rev. Lett. 135, 248302 (2025)], where flocking, but neither its quantum-statistical stabilization nor aster states were observed as a consequence of interactions. Moreover, we investigate the competition of this quantum statistical stabilization of collective phases with their suppression by the quantum fluctuations induced by a transverse external magnetic field.

14.
arXiv (CS.AI) 2026-06-16

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

arXiv:2606.15157v1 Announce Type: cross Abstract: KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that considers design space with method selection and budget allocation. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same 512-token average KV budget, PolyKV recovers 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV, respectively. Across 128-1024 budget sweep, PolyKV consistently improves over the strongest baseline by 1.7%-6.4%, corresponding to 40.0%-54.5% recovery of the FullKV gap.

15.
arXiv (math.PR) 2026-06-17

Non-asymptotic Tail Bounds for the Kostlan–Shub–Smale Field: Tensor PCA and Spherical $k$-Spin Complexity

arXiv:2606.17665v1 Announce Type: cross Abstract: This paper builds a hierarchy of explicit, non-asymptotic tail bounds for the supremum of the Kostlan–Shub–Smale (KSS) random field on the sphere, and applies it to two problems: Spiked Tensor PCA and the landscape of the spherical $k$-spin model. For Tensor PCA, we study the non-asymptotic statistical limits of estimating a rank-$R$ symmetric signal tensor of order~$k\ge 3$ and dimension~$d\ge 3$ from a single Gaussian observation at signal-to-noise ratio~$\lambda$, through the profile maximum likelihood estimator, the MLE restricted to normalized rank-$R$ tensors of coherence at least~$\kappa$. Our analysis uses a single reduction: a deterministic geometric inequality (the Tube Method) and a rank-reduction step bound the estimation error by the supremum of the canonical KSS field, which the Kac–Rice formula turns into a Gaussian integral against the expected absolute characteristic polynomial of a shifted Gaussian Orthogonal Ensemble, controlled in turn by the four explicit tail bounds of our hierarchy (three from a Mehta–Fyodorov representation, one from a Ben Arous–Dembo–Guionnet large deviation). The same reduction yields two results, each with explicit constants. For estimation, a finite-$(k,d)$ error bound recovers the asymptotically optimal rate~$\sqrt{d\log k}$ of Perry, Wein and Bandeira, with explicit dependence on the rank~$R$ and the coherence~$\kappa$. For the landscape, a two-sided non-asymptotic bracketing of the annealed complexity of the spherical $k$-spin Hamiltonian recovers the Auffinger–Ben Arous–\v{C}ern\'y complexity function in the high-dimensional limit.

16.
arXiv (CS.AI) 2026-06-16

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

arXiv:2606.14817v1 Announce Type: cross Abstract: This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

17.
arXiv (CS.CV) 2026-06-19

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference. We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference. Our approach accelerates autoregressive image generation by up to 13.3x while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

18.
arXiv (CS.LG) 2026-06-25

Structured Approximations of Measures

arXiv:2310.09149v3 Announce Type: replace-cross Abstract: We study the approximation of probability measures in the Wasserstein-$p$ distance by structured classes of approximators, motivated by applications in imaging, machine learning, and physical measurement under sensor constraints. We obtain three sets of results. First, for measures with densities bounded away from zero on a bounded Lipschitz domain $\Omega$, we prove that any approximation scheme for functions in $\mathrm{L}_p(\Omega)$ transfers, with linear rate, to a corresponding approximation scheme for measures in $\mathrm{W}_p(\Omega)$. The argument applies a theorem of Bogovskii on regularity of solutions to the continuity equation in the Benamou-Brenier formulation of optimal transport. We exhibit concrete approximation schemes (polynomials, shift-invariant spaces, cardinal interpolation with radial basis functions, kernel density estimators, and piecewise approximations on nonuniform Voronoi partitions) that fit the framework. As a matter of independent interest, we prove a negative Sobolev lower bound that generalizes existing bounds from $p=2$ to all $p\in(1,\infty)$. We also consider deterministic bounds for discrete approximations to arbitrary measures in terms of the mesh norm of a quasi-uniform set of points. We specialize these bounds to show that compactly supported measures admit a deterministic $N$-term approximation $\mu_N$ such that $\mathrm{W}_p(\mu,\mu_N) = O(N^{-\frac{1}{d}})$ for all $d\geq 1$, which matches the asymptotic optimal quantizer rate. We also extend these results to non-compactly supported measures with appropriate tail decay.

19.
arXiv (quant-ph) 2026-06-15

Certifying Macroscopic Quantum Mechanics via Hypothesis Testing with Finite Data

arXiv:2506.22092v2 Announce Type: replace Abstract: We address the challenge of certifying quantum behavior with single macroscopic massive particles, subject to decoherence and finite data. We propose a hypothesis testing framework that distinguishes between classical and quantum mechanics based on position measurements. While interference pattern visibility in single-particle quantum superposition experiments has been commonly used as a sufficient criterion to falsify classical mechanics, we show that, from a hypothesis testing perspective, it is neither necessary nor efficient. Focusing on recent proposals to prepare macroscopic superposition states of levitated nanoparticles, we show that the likelihood ratio test – which leverages differences across the entire probability distribution – provides an exponential reduction in measurements needed to reach a given confidence level. These results generalize to a broad class of quantum states, and offer a principled, efficient method to falsify classical mechanics in interference experiments, relaxing the experimental constraints faced by current efforts to test quantum mechanics at the macroscopic scale.

20.
arXiv (CS.CL) 2026-06-18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.

21.
arXiv (CS.AI) 2026-06-11

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

arXiv:2606.11463v1 Announce Type: cross Abstract: Accurate loss reserving is foundational to insurer solvency, yet accelerating climate driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than Chain Ladder, Bornhuetter Ferguson, and Cape Cod methods. Using 15 plus years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15, 20% in reserve accuracy for catastrophe exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

22.
arXiv (CS.CL) 2026-06-16

The Value Axis: Language Models Encode Whether They're on the Right Track

We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

23.
arXiv (CS.LG) 2026-06-16

Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection

arXiv:2510.24043v4 Announce Type: replace Abstract: This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.

24.
arXiv (CS.AI) 2026-06-16

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

arXiv:2606.14929v1 Announce Type: cross Abstract: Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions like adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. We first establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret – where $s, M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds – thus avoiding a curse of dimensionality. Finally, we also provide an computationally efficient and parameter-free implementation of HPG.

25.
arXiv (CS.CL) 2026-06-15

Cross-Dataset Bloom Question Classification: Supervised Models and Prompted LLMs

Automatic Bloom's taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches reported strong within-dataset results, yet were rarely evaluated in cross-dataset settings, leaving real-world generalizability unclear; meanwhile, LLM effectiveness for Bloom question classification has not been systematically studied. We evaluated the cross-dataset generalization of existing ML/DL methods and assessed LLMs with multiple prompting strategies on five datasets; the best prompting strategy combined in-context examples with course-specific action verbs. Supervised ML/DL models degraded substantially on unseen datasets, whereas LLMs were more stable, suggesting a robust alternative across diverse educational contexts. Based on the best prompting strategy, we also presented a lightweight UI that supports instructors in automatically classifying large question banks; a usability study indicated low workload and high usability.