Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-18

DeepInflation: an AI agent for research and model discovery of inflation

arXiv:2601.14288v2 Announce Type: replace-cross Abstract: We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.

02.
arXiv (CS.AI) 2026-06-12

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

arXiv:2606.13176v1 Announce Type: new Abstract: Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

03.
arXiv (CS.CV) 2026-06-19

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token ; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and no segmentation decoder or bounding-box head.

04.
arXiv (CS.CV) 2026-06-18

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

05.
arXiv (CS.LG) 2026-06-18

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

arXiv:2606.18539v1 Announce Type: new Abstract: Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

06.
arXiv (CS.CL) 2026-06-12

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

作者:

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.

07.
arXiv (CS.CV) 2026-06-16

Fi-Gaussian: Frequency-Aware Implicit Gaussian Splatting for Single Image Dehazing

Single image dehazing continues to be hindered by the loss of high-frequency details and the difficulty of accurate physical scattering modeling. To address these issues, we propose Fi-Gaussian, a frequency-aware implicit Gaussian splatting network for single image dehazing. Unlike explicit rendering methods that rely on 3D point clouds, our method employs implicit Gaussian splatting to adaptively model the underlying distribution of clear images as a continuous representation in 2D feature space. The core of the network is a frequency-aware implicit Gaussian splatting module, which decouples low-frequency structural information and high-frequency texture information in the frequency domain and then performs adaptive Gaussian aggregation with complex-valued weights to recover fine details. In addition, a physics-driven scattering renormalization mechanism is introduced to estimate the transmission map and atmospheric light under the guidance of implicit Gaussian priors. Extensive experiments on multiple benchmark datasets demonstrate that Fi-Gaussian achieves state-of-the-art quantitative performance and produces visually superior dehazed results, validating the effectiveness of implicit Gaussian splatting for low-level vision tasks.

08.
arXiv (CS.AI) 2026-06-24

AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models

作者:

arXiv:2606.24616v1 Announce Type: new Abstract: Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.

09.
arXiv (CS.AI) 2026-06-16

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

arXiv:2606.08151v2 Announce Type: replace Abstract: Modern large language model (LLM) agents do not simply need longer contexts; they need decision-relevant evidence at the moment of action. We study decision-aware context selection: ranking retrieved files, tests, traces, rules, and memories by their expected effect on an agent's next action rather than by semantic similarity alone. We present the Counterfactual-Inspired Context Layer (CICL), which builds an instance context graph, estimates decision-oriented utility for candidate units, and compresses selected evidence into typed memory cards. The same schema can be instantiated with hosted LLM judges, local surrogates, or lightweight rankers, making the selection protocol auditable across model choices. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-Plus reranking of BM25 top-50 candidates improves hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show that CICL identifies action-critical evidence: removing the top-utility semantic unit reduces F1 from 0.245 to 0.000. In selected-then-compressed mode, memory cards save 44.93 tokens per query while preserving selected evidence. CICL provides a practical layer for measuring, ranking, and compressing decision-critical context for tool-using agents. Code is available at https://github.com/stephen-guan-researcher/CICL.

10.
arXiv (quant-ph) 2026-06-12

A ribbon ZX calculus for gauge theory

arXiv:2606.13551v1 Announce Type: cross Abstract: ZX calculus provides a graphical formalism for reasoning about quantum processes, built from two interacting Frobenius algebras associated with the Z and X bases of a qubit. While it has found widespread application in quantum information and computing, its relationship to quantum field theory has only recently begun to be explored. In this work, we further develop this connection by providing a generalization of ZX calculus to two-dimensional Yang Mills theory with a compact gauge group. The key observation is that both frameworks can be organized around the Hopf Frobenius algebraic structure associated with a group algebra, which can in turn be described by the diagrammatics of two dimensional topological quantum field theory. Given the well known relationship between gauge theory and gravity in two and three dimensions, our work paves the way for applications of ZX to low dimensional gravity.

11.
arXiv (CS.LG) 2026-06-16

A Compositional Framework for Open-ended Intelligence

arXiv:2606.15386v1 Announce Type: new Abstract: Open-ended intelligence is the capacity to adapt to novel problems and environments that are substantially different from those in training. We formalize open-ended intelligence as the closure induced by a finite primitive set \(P\) and a set of composition operators \(C\). We characterize properties of the induced closure \(\mathcal{L}(P,C)\) that support unbounded compositional generation across families of tasks and worlds. A mathematics of open-ended intelligence requires two pillars: a minimal set of representational primitives (e.g., states, actions) and algorithmic primitives (e.g., nearest neighbor), together with composition motifs (e.g., recursion, sequencing) that reflect an acquired compositional grammar. The closure of these two pillars enables the generation of infinite adaptive responses across a wide range of settings. The mathematics supports complementary research agendas, including evaluation metrics for explanation and interpretability, as well as building architectures where compositional generalization is native. We propose next primitive prediction as a novel architectural objective, where the training objective encourages the acquisition of reusable algorithmic primitives and their compositional grammar, such that new solutions are generated through recombination. Curriculum learning and self-play enable lifelong learning and expansion of the closure by discovering reusable primitives and transition motifs across families of tasks and worlds. We ground the framework through case studies in physics, evolution, and neuroscience.

12.
arXiv (CS.AI) 2026-06-16

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

arXiv:2605.29874v2 Announce Type: replace-cross Abstract: Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

13.
arXiv (CS.CL) 2026-06-11

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through graduated, semantically-aware eviction: the agent annotates its trajectory as typed, dependency-linked episodes as work proceeds, and a deterministic, LLM-free policy evicts content in priority order within that structure when a token budget is exceeded. CWL preserves user turns and the exploratory context the agent is actively reasoning over, while aggressively shedding action episodes whose effects are already persisted in the environment, keeping active context near a stable ceiling that also avoids the performance degradation associated with very large prompts. Compared to summarization-based compaction, CWL avoids four well-known limitations: unpredictable lossiness, destruction of causal structure, blocking model cost, and compression-induced hallucination. Compared to recency truncation, CWL is semantically aware: it drops the oldest-and-most-recoverable content according to the dependency graph rather than oldest-in-time regardless of relevance. We describe the annotation protocol, the episode graph, the eviction policy, and the token-accounting loop, and evaluate CWL on long-horizon agentic benchmarks: a single agent session completing 89 sequential tasks across 80 million tokens with no measurable degradation in task accuracy relative to per-task isolated sessions

14.
medRxiv (Medicine) 2026-06-22

''Circumstantial Determinants'': An Efficient Approach to Reaching People in Need of HIV Prevention?

HIV prevention and testing programmes primarily reach people who self-refer or attend routine health services. Higher-risk individuals are missed if they are healthy, under-estimate their risk of infection or under-report sexual risk-behaviours. We assess a new approach to address limitations in existing programmes by targeting HIV services on ''Circumstantial Determinants'' (CDs) of HIV risk - the social circumstances, settings, and norms associated with behaviours that increase risk of HIV acquisition. Data on potential CDs and sexual behaviour were collected in a population survey in Zimbabwe in 2018/19 (N=9141). HIV-negative individuals reporting [≥] 1 sexual risk-behaviours were defined as the 'priority population' for HIV prevention. For each sex, six circumstantial determinants were associated with being in the priority population (aOR [≥] 1.30; p [≤] 0.01). Reach and efficiency of CDs (and combinations) were calculated; ROC curve algorithms evaluated their ability to identify priority population membership; and HIV prevention condom cascades were compared between CD-defined priority population subgroups. Example findings include that targeting men at bars and beerhalls could reach 48.5% of the priority population and 25.1% of lower-risk men. These percentages increase to 77.1% and 53.7% if men with poor mental health, no religious affiliation, negative social capital, or living on agricultural estates are also targeted. Targeting women with poor mental health could reach 32.0% of the priority population and 21.3% of lower-risk women. Targeting additional circumstantial determinants increases these percentages to 54.1% and 37.5%, respectively. Cascade barriers to condom use differed between CD-defined subgroups. The Circumstantial Determinants approach demonstrates proof-of-concept potential to strengthen HIV prevention services.

15.
arXiv (CS.CV) 2026-06-24

MILE: A Mechanically Isomorphic Hand Exoskeleton and Visuotactile Robotic Hand for Data Collection in Dexterous Manipulation

Dexterous robotic hands are expected to perform complex, contact-rich object manipulation, but learning such skills remains challenging because high-dimensional hands require high-fidelity demonstrations. Imitation learning provides a practical route for acquiring dexterous manipulation skills from human demonstrations, yet collecting synchronized multimodal demonstrations with accurate hand actions and tactile observations remains a key bottleneck. We present MILE, a teleoperation-based data-collection system comprising the human-first MILE exoskeleton and the mechanically corresponding MILE-Tac robotic hand. The system integrates custom-designed and fabricated modular joint encoders and compact MILE fingertip visuotactile sensor modules. The exoskeleton is informed by human-hand anatomy and ergonomic constraints, while the robotic hand is co-designed to preserve the selected four-finger kinematic topology. This correspondence enables joint-space command transfer and reduces reliance on task-space IK-based retargeting. The system synchronously records task-specific visual observations, four fingertip visuotactile streams, robot-hand proprioception, and exoskeleton-derived action commands. We evaluate MILE through a four-task teleoperation benchmark against representative glove-based and vision-based interfaces, and through imitation-learning experiments that compare policies trained with and without fingertip tactile input. The project page is available at https://sites.google.com/view/mile-system.

16.
medRxiv (Medicine) 2026-06-22

Nutrient Composition of Foods Represented in the U.S. Food and Nutrient Database for Dietary Studies, 2013-2023

Background: The U.S. Food and Nutrient Database for Dietary Studies (FNDDS) is updated across NHANES dietary cycles and is central to U.S. nutrition surveillance. However, multi-cycle food-code-level changes in nutrient composition have not been comprehensively characterized across the full WWEIA nutrient panel. Objective: To characterize ten-year temporal patterns in nutrient composition across five FNDDS cycles, evaluate pandemic-period food-code compositional stability, and distinguish exploratory mean-level signals from distributional heterogeneity that may reflect reformulation, database coverage, or food-code definition changes. Methods: We analyzed five consecutive FNDDS biennial releases: 2013-14, 2015-16, 2017-18, 2019-20, and 2021-23. Nutrient values were extracted from the public FNDDS/FoodData Central release files and standardized to per-100-g food-code-level records. Cycle midpoints, 2013.5, 2015.5, 2017.5, 2019.5, and 2022.0, served as the independent variable in an exploratory ordinary least squares (OLS) regression. Mann-Kendall testing assessed monotonic rank trends, Welch's ANOVA assessed food-code-level distributional heterogeneity, and pairwise Welch comparisons with Cohen's d summarized pre-pandemic, pandemic-period, and post-pandemic differences. Equivalence testing using TOST with +/-10% bounds was restricted to the 2019-20 versus 2021-23 stability comparison. OLS sensitivity analyses were repeated after excluding the structurally atypical 2017-18 cycle. Results: Sixty-three nutrients were analyzed. Eight nutrients showed nominal OLS trends, p < 0.05, but none remained significant after Bonferroni correction. Mann-Kendall testing identified two nominal monotonic signals, and none after adjustment. Welch's ANOVA detected cycle-level distributional differences for 61 of 63 nutrients at nominal p < 0.05 and 57 of 63 after adjustment. Pairwise pandemic-period analyses showed many adjusted differences when the pre-pandemic baseline was compared with 2019-20 or 2021-23, but standardized effects were small, with all absolute Cohen's d values < 0.20. No nutrient differed after adjustment between 2019-20 and 2021-23, and 39 of 48 primary analytes met +/-10% TOST equivalence criteria for that comparison. Slope estimates were directionally stable after excluding 2017-18, but nominal significance status remained sensitive to the short time series. Conclusions: FNDDS food composition varied across cycles, but there was no clear decade-long linear trend for most nutrients. The main signal was a possible increase in total PUFA and linoleic acid, which may reflect changes in fat quality. The 2021-23 cycle was very similar to 2019-20, suggesting no major post-pandemic shift in the foods represented. These findings should be interpreted as food-database signals, not as direct estimates of what people consumed.

17.
bioRxiv (Bioinfo) 2026-06-18

Bayesian modeling of longitudinal metatranscriptomes of broiler meat spoilage microbiomes shows shared predictive signature associated with spoilage at refrigerated temperatures

Microbial spoilage of packaged meat is driven by complex microbial succession and related metabolic activity, yet conventional shelf-life assessment is mainly based on shelf-life studies relying on culturing and sensory analysis. In routine quality assurance, results are obtained retrospectively, and they are only indirectly linked to the metabolic activity related to sensory deterioration. Functional, time informative approaches that capture the active metabolic state of the spoilage microbiome and predict the rate of spoilage are lacking. We developed a censoring-aware Gaussian process (CAGP) framework to model longitudinal pathway expression profiles from broiler meat metatranscriptomes collected over consecutive storage days at 4 or 6{degrees}C. Samples were annotated using odor-based sensory scores defining fresh, early-spoilage, and late-spoilage phases. Because observed zeros in pathway-level data may reflect non-detection rather than true absence, the model treats low values as left-censored observations below a detection threshold while estimating smooth temporal trajectories with uncertainty. In leave-one-out prediction within the 4{degrees}C time series, predicted sampling days differed from the true days by an average of 0.43 days, and predicted spoilage phases agreed with the sensory classification. Trajectories learned at 4{degrees}C also transferred to an independent 6{degrees}C time series at the spoilage-phase level, suggesting that shared functional spoilage programs are preserved despite temperature-dependent changes in spoilage rate. Cross-entropy ranking further identified pathway modules carrying time- and phase-informative signals across temperatures. Overall, this framework provides a probabilistic approach for linking metatranscriptomic functional dynamics to sensory spoilage progression, supporting shelf-life assessment beyond retrospective microbial enumeration.

18.
arXiv (CS.CV) 2026-06-15

Aligned but Stereotypical? How System Prompts Shape Demographic Bias in LLM-Based Text-to-Image Models

Text-to-image (T2I) systems increasingly rely on Large Language Model (LLM)-based text conditioning to interpret and expand user prompts. While this improves prompt understanding and text-image alignment, we find that it can also introduce implicit demographic assumptions, even when demographic attributes are unspecified. To systematically investigate this behavior across varying levels of prompt ambiguity and complexity, we construct a comprehensive benchmark covering diverse prompt settings. Evaluations on eight recent T2I models show that LLM-based systems consistently exhibit stronger demographic skew than non-LLM-based baselines. We further analyze system prompts, a component unique to LLM-based T2I systems that guides prompt interpretation and expansion. Our analyses show that these instructions strongly influence text embeddings, which subsequently leads to biased image generations. Motivated by these findings, we propose FairPro, a training-free debiasing framework that adaptively generates fairness-aware instructions while preserving user intent. Experiments demonstrate that FairPro substantially reduces demographic disparities while maintaining prompt fidelity.

19.
arXiv (CS.AI) 2026-06-11

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

arXiv:2605.23243v2 Announce Type: replace-cross Abstract: We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex~5.3, Claude Opus~4.6, Sonnet~4.6, Gemini~3.1~Pro and Gemini~3~Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1)~every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2)~in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3)~structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4)~a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces end-to-end request/response sequences, failure-heavy data, and multi-step attack chains as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

20.
arXiv (CS.LG) 2026-06-12

Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training

arXiv:2601.17654v2 Announce Type: replace Abstract: The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive and contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus on optimizing either dynamic or static energy consumption. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time-energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.

21.
arXiv (CS.CV) 2026-06-11

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: https://github.com/elmma/mllm-reroute/

22.
arXiv (math.PR) 2026-06-11

Delta-Epsilon-Common Knowledge and Quantitative Agreement Theorems

arXiv:2606.11902v1 Announce Type: cross Abstract: Aumann defined common knowledge mathematically and established his now famous Agreement Theorem. We present a novel approach to quantifying how close individuals are to commonly knowing events, $(\delta,\epsilon)$-common knowledge, which is defined for any (and not just countable) probability spaces, and provide quantitative versions of the key results in this field. Specifically, we do this for Aumann's Agreement Theorem and Nielsen's extension thereof to random variables, as well as for the setting in which posteriors are communicated back and forth between individuals. Our results apply in particular to noisy communication settings.

23.
arXiv (math.PR) 2026-06-18

Evolution of Conditional Entropy for Diffusion Dynamics on Graphs

arXiv:2510.19441v2 Announce Type: replace-cross Abstract: The modeling of diffusion processes on graphs is the basis for many network science and machine learning approaches. Entropic measures of network-based diffusion have recently been employed to investigate the reversibility of these processes and the diversity of the modeled systems. While results about their steady state are well-known, very few exact results about their finite-time evolution exist. Here, we introduce the conditional entropy of heat diffusion in graphs, and outline a mathematical framework that contextualizes diffusion and conditional entropy within the theories of continuous-time Markov chains and information theory. In particular, we highlight that this entropic measure satisfies an information-theoretical version of the second law of thermodynamics, thereby providing a parallelism between diffusion dynamics on networks and their physical counterparts. Furthermore, we obtain explicit results for its evolution on complete, path, and circulant graphs, as well as a mean-field approximation for Erdös-Rényi graphs. We also obtain asymptotic results for general networks and provide bounds for the evolution of conditional entropy. Finally, we experimentally demonstrate several properties of conditional entropy for diffusion over random graphs, such as the Watts-Strogatz model.

24.
arXiv (CS.CL) 2026-06-16

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate detections in a few languages, and recovering culture-specific items that accuracy-based baselines miss.

25.
arXiv (CS.LG) 2026-06-15

Uncertainty Estimation and Generalization Bounds for Modern Deep Learning

arXiv:2606.13818v1 Announce Type: new Abstract: This thesis investigates how Bayesian principles can deepen our understanding of modern deep learning systems. While neural networks achieve remarkable predictive performance, their ability to generalize and to quantify uncertainty remains only partly understood. This thesis approaches this challenge from both methodological and theoretical angles: unifying Bayesian inference, function-space modeling, and large-deviation theory under a common probabilistic perspective. On the methodological side, the thesis introduces the Deep Variational Implicit Process (DVIP), a scalable Bayesian framework that extends implicit processes to deep architectures. Complementing this, two post-hoc methods – the Variational Linearized Laplace Approximation (VaLLA) and the Fixed-Mean Gaussian Process (FMGP) – are proposed to equip pretrained deterministic networks with calibrated uncertainty estimates. The theoretical contributions focus on one of the central open questions in modern machine learning: why do large, over-parameterized neural networks generalize so well? To address this, the thesis develops a unified probabilistic framework that connects three key mechanisms – diversity, smoothness, and stochasticity – within the language of PAC-Bayesian and large-deviation theory.