Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

Smoothness Errors in Dynamics Models and How to Avoid Them

arXiv:2602.05352v3. Modern neural networks have shown promise for solving partial differential equations over surfaces, often by discretizing the surface as a mesh and learning with a mesh-aware graph neural network. However, graph neural networks suffer from oversmoothing, where a node's features become increasingly similar to those of its neighbors. Unitary graph convolutions, which are mathematically constrained to preserve smoothness, have been proposed to address this issue. Despite this, in many physical systems, such as diffusion processes, smoothness naturally increases and unitarity may be overconstraining. In this paper, we systematically study the smoothing effects of different GNNs for dynamics modeling and prove that unitary convolutions hurt performance for such tasks. We propose relaxed unitary convolutions that balance smoothness preservation with the natural smoothing required for physical systems. We also generalize unitary and relaxed unitary convolutions from graphs to meshes. In experiments on PDEs such as the heat and wave equations over complex meshes and on weather forecasting, we find that our method outperforms several strong baselines, including mesh-aware transformers and equivariant neural networks.
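The smoothing effect described in this abstract can be illustrated with a small sketch. It uses the Dirichlet energy (sum of squared differences across edges) as a smoothness measure and one explicit Euler step of the graph heat equation as the smoothing operation. Both are standard textbook choices assumed here for illustration; the function names are hypothetical and this is not the paper's relaxed unitary convolution.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def dirichlet_energy(edges: Sequence[Edge], x: Sequence[float]) -> float:
    """Sum of squared feature differences over edges; lower means smoother."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)


def diffusion_step(edges: Sequence[Edge], x: Sequence[float], tau: float) -> List[float]:
    """One explicit Euler step of the graph heat equation: x <- x - tau * L x.

    L is the unnormalized graph Laplacian, applied edge by edge.
    """
    lx = [0.0] * len(x)
    for i, j in edges:
        lx[i] += x[i] - x[j]
        lx[j] += x[j] - x[i]
    return [xi - tau * li for xi, li in zip(x, lx)]
```

On a four-node path graph with signal [1, 0, 0, 0], one step with tau = 0.25 lowers the energy from 1.0 to 0.3125. A layer constrained to keep this quantity fixed could not reproduce that decay, which is the tension the abstract points to for diffusion-like dynamics.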

02.
bioRxiv (Bioinfo) 2026-06-15

Multi-platform reassessment of human mitochondrial DNA methylation reveals signals consistent with technical artifacts

The existence and functional relevance of mitochondrial DNA methylation remain controversial. Here, we systematically profiled cytosine methylation and hydroxymethylation across human brain and blood tissues spanning healthy and malignant states using orthogonal sequencing approaches that avoid chemical conversion during library preparation. While nuclear DNA exhibited canonical methylation patterns, mitochondrial DNA consistently showed negligible signal, indistinguishable from background technical noise. By mapping cytosine-guanine sites between mitochondrial DNA and nuclear-embedded mitochondrial sequences, we demonstrate the potential of these nuclear counterparts to confound not only cytosine methylation but also hydroxymethylation measurements, corroborating and extending prior findings implicating nuclear contamination as a potential source of apparent mitochondrial epigenetic signals. Additional technical factors that inflate apparent mtDNA methylation signals were identified, including sequence context biases, flow cell chemistries, and coverage-dependent discrepancies between the heavy and light strands. Collectively, these results provide convergent evidence against the presence of biologically meaningful cytosine methylation or hydroxymethylation in mitochondrial DNA. These findings caution against interpreting apparent mtDNA methylation signals in human adult tissues as meaningful without rigorous orthogonal validation and comprehensive consideration of technical and analytical confounding factors.

03.
arXiv (quant-ph) 2026-06-12

Continuum Neural Momentum Eigenstate for Variationally Solving Quasiparticles

arXiv:2606.12928v1. We design the first neural quantum state for continuum particles that, for any chosen allowed momentum $\mathbf{k}$, is by construction an exact eigenstate of total momentum with eigenvalue $\mathbf{k}$. Our architecture, EVE, enables off-the-shelf VMC to solve for momentum-sector ground states. We test EVE on 2D bosons with mutual $1/r$ interactions, finding that a single unified ansatz is capable of describing four qualitatively different states: superfluid, roton, crystal, and phonon. At different densities, we extract the underlying phase of matter from the dispersion's shape. At $r_s = 20.0$, we see the roton minimum at finite $k$ expected of a superfluid. At $r_s = 100.0$, we see striking zone folding indicative of crystalline order, with periodically spaced minima representing floating crystals connected by phonon arcs in between. Using density-density correlation functions, we confirm the phase diagnoses and probe the excitations' correlation structures. Finally, we analyze the roton's phase texture and find unexpected multi-particle phase strings, formed when several vortex dipoles merge, leaving two vortices connected by a phase slip.

04.
Nature (Science) 2026-06-10

A prognostic human brain network for diffuse midline glioma

Diffuse midline gliomas (DMGs) are near-universally lethal tumours of the childhood central nervous system [1,2]. In animal models, DMGs form brain-wide integrated networks through neuron-to-glioma synapses [3–6] and glioma-to-glioma gap junctional coupling [3]. This extensive connectivity robustly promotes the growth and invasion of DMG [3–9] and other glial malignancies [10–12] through paracrine mechanisms and direct neuron-to-glioma synapses. However, the organization and clinical implications of these connections in the living human brain remain to be elucidated. Here, we develop tumour network mapping to compute the brain-wide connectivity profile of DMG, defining a conserved brain network across pontine and thalamic DMG associated with patient short-term survival (DMG network). Tumour functional connectivity with the DMG network was independently predictive of patient overall survival across two external validation cohorts. Tumour growth mapped to DMG network-specific trajectories and peak in-network neurometabolic changes across development spatiotemporally aligned with the peak age incidence of DMG. Analyses of single-nucleus RNA sequencing data confirmed diverse synaptic gene enrichment in high-connectivity DMG. Strikingly, incidental surgical resection of high-connectivity thalamic DMG tissue conferred a significant survival advantage. Collectively, these data define a conserved and prognostically important brain network in children with DMG, consistent with the hypothesis that DMGs exploit otherwise healthy brain circuits to promote tumour growth.

05.
arXiv (CS.AI) 2026-06-19

How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech

arXiv:2606.20532v1. Style-captioned text-to-speech systems use natural language to control voice characteristics, but how individual words influence acoustic output remains unclear. Understanding this is critical for diagnosing failure modes and improving controllability in expressive TTS. We propose cross-attention attribution for speech diffusion models, adapting the DAAM framework to the speech domain for the first time, and apply it to CapSpeech-TTS. Our method extracts per-token heatmaps across 25 layers and 24 ODE steps. We analyze 3,600 (style caption, text transcript) combinations comprising 120 style captions conditioning the generation of 30 text transcripts each, revealing how caption tokens shape waveforms. Results show: (1) style tokens have lower temporal variance than content/function tokens, confirming global conditioning; (2) style attention correlates with F0 and energy; (3) style conditioning peaks in early steps and deep layers; (4) attention entropy reaches its minimum at layer 17, co-occurring with the style importance peak, indicating maximal network selectivity at the most style-critical stage. This is the first study of how natural language influences cross-attention in speech diffusion models.

06.
arXiv (CS.CL) 2026-06-19

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

07.
arXiv (CS.CV) 2026-06-11

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

08.
arXiv (CS.CL) 2026-06-17

Algorithmic Prompt Generation for Diverse Human-like Teaming and Communication with Large Language Models

Understanding how humans collaborate and communicate in teams is essential for improving human-agent teaming and AI-assisted decision-making. However, relying solely on data from large-scale user studies is impractical due to logistical, ethical, and practical constraints, necessitating synthetic models of multiple diverse human behaviors. Recently, agents powered by Large Language Models (LLMs) have been shown to emulate human-like behavior in social settings. Obtaining a large set of diverse behaviors, however, requires manual effort in the form of designing prompts. On the other hand, Quality Diversity (QD) optimization has been shown to be capable of generating diverse Reinforcement Learning (RL) agent behavior. In this work, we combine QD optimization with LLM-powered agents to iteratively search for prompts that generate diverse team behavior in a long-horizon, multi-step collaborative environment. We first show, through a human-subjects experiment, that humans exhibit diverse coordination and communication behavior in this domain. We then present a series of experiments showing that our approach captures behaviors that are difficult to observe without large-scale data collection, and a follow-up user study to show that these generated behaviors are human-like. Our findings highlight the combination of QD and LLM-powered agents as an effective tool for studying teaming and communication strategies in multi-agent collaboration.

09.
bioRxiv (Bioinfo) 2026-06-10

HOMED enables hierarchical and multimodal optimization of DNA methylation deconvolution across tissues

Cellular heterogeneity is a major confounder in bulk DNA methylation data for epigenome-wide association studies. Existing reference-based DNAm deconvolution methods often ignore hierarchies among related cell types and may generalize poorly across datasets due to limited variability in reference profiles. We developed HOMED (Hierarchically Optimized Methylation Deconvolution), a framework that integrates cell-lineage hierarchies, single-cell RNA sequencing-guided deconvolution, and paired bulk RNA-seq/DNAm data for CpG signature optimization. Across simulated and real peripheral blood mononuclear cell, lung, and placental datasets, HOMED consistently yielded the highest PCCs and lowest RMSEs, outperforming existing scRNA-seq-guided DNAm deconvolution methods, improving accuracy, resolution, and cross-tissue generalizability.

10.
bioRxiv (Bioinfo) 2026-06-10

Folding the unfoldable 2: using AlphaFold and ESMFold to explore spurious proteins

Motivation: Spurious protein sequences, resulting from gene prediction errors, theoretically should not yield folded structures. AlphaFold2 was previously shown to predict short spurious sequences with high pLDDT scores and was therefore unlikely to distinguish between real proteins and spurious proteins, which are usually short. We evaluate whether newer structure prediction methods (ESMFold and AlphaFold3) similarly predict short sequences with high pLDDT or if they better discriminate between spurious and real proteins. Results: All three structure prediction methods (ESMFold, AlphaFold2, and AlphaFold3) predict short spurious sequences from AntiFam with unexpectedly high pLDDT scores; however, the discrimination between spurious and real proteins improves beyond 100 amino acids. By analysing sequences with disparate pTM and pLDDT scores, we identified two likely spurious shadow ORFs in Swiss-Prot and one potentially non-spurious AntiFam entry. Using the structure prediction scores, we developed a Gaussian Process Model and evaluated its performance on AlphaFold DB, identifying potential spurious proteins at scale. While limited on its own, this model can increase confidence in spurious protein identification when combined with other methods.

11.
arXiv (CS.AI) 2026-06-12

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

arXiv:2606.12945v1. Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency – both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function $V(m)=\sum_i w_i f_i(m)$ over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at $\approx 0.98$ – this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains $0.770 \pm 0.011$ of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable – reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
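The linear value function in this abstract, V(m) = sum_i w_i f_i(m), can be sketched in a few lines. The factor names below follow the seven factors the abstract lists, but the identifiers, the dictionary layout, and the top-`budget` retention rule are assumptions made for illustration. The paper's learned weights and its gradient-free optimiser are not reproduced here.

```python
from typing import Dict, List

# Identifiers chosen here to mirror the seven factors named in the abstract.
FACTORS = [
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
]


def memory_value(factors: Dict[str, float], weights: Dict[str, float]) -> float:
    """V(m) = sum_i w_i * f_i(m); factors missing from a memory count as 0."""
    return sum(weights[name] * factors.get(name, 0.0) for name in FACTORS)


def retain(memories: List[Dict[str, float]],
           weights: Dict[str, float],
           budget: int) -> List[int]:
    """Return the indices of the `budget` highest-value memories (assumed rule)."""
    ranked = sorted(
        range(len(memories)),
        key=lambda i: memory_value(memories[i], weights),
        reverse=True,
    )
    return sorted(ranked[:budget])
```

Because a single scalar ranks every memory, the same score could in principle drive encoding depth, forgetting, and retrieval order, which is the property the abstract emphasises.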

12.
arXiv (quant-ph) 2026-06-17

Quantum Chip Paradigm Framework

arXiv:2606.17899v1. Quantum Electronic Design Automation (Q-EDA) is emerging as quantum chips move from laboratory prototypes to scalable engineering systems. This paper argues that superconducting quantum chip design is approaching a "SPICE moment" similar to early classical EDA, where growing qubit scale, control complexity, frequency planning, packaging, process variation, and cryogenic measurement feedback require a shift from experience-based design to model-driven engineering. We propose a Quantum Chip Paradigm Framework that treats Q-EDA not only as software, but as part of the quantum chip development paradigm. Unlike classical HDL-first design, quantum chip design must begin with physical structures such as Josephson junctions, resonators, couplers, readout elements, control lines, and packaging environments. The framework emphasizes PCell-based modeling, SPICE-Q simulation, Quantum PDKs, and design-technology-measurement co-optimization. We further outline a hierarchical Q-EDA system spanning physical structures, qubit PCells, logical qubits, quantum arithmetic, functional quantum IP, and Quantum SoC systems. The key goal is to turn physical models, layout rules, simulation results, fabrication data, and measurement feedback into reusable and auditable engineering objects for large-scale quantum processors and fault-tolerant quantum computing.

13.
arXiv (CS.LG) 2026-06-15

Adaptive Oscillatory-State Alignment for Time Series Forecasting

arXiv:2606.06010v2. Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: around a nominal cycle, oscillatory behavior often exhibits non-rigid periodicity (NRP), where cycle magnitude, cycle alignment, and local cycle duration vary over time. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNet, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNet extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight public benchmarks and two cloud workload traces demonstrate leading or highly competitive accuracy with a compact model size and low inference latency, supporting repeated forecasting settings such as capacity planning and autoscaling. Controlled synthetic studies that isolate cycle-magnitude and cycle-alignment variation and combine them with cycle-duration changes show that the advantage of oscillatory-state alignment increases as NRP intensifies.

14.
arXiv (CS.CL) 2026-06-12

X-MADAM-RAG: Diagnosing and Handling Chinese-English Evidence Conflict in Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) systems may receive evidence that is not merely noisy but mutually contradictory. This issue becomes particularly salient in multilingual settings, where retrieved Chinese and English evidence may support incompatible answer candidates. We study this problem through X-RAMDocs-ZHEN, a controlled Chinese-English benchmark derived from RAMDocs for diagnosing evidence conflict in RAG. The benchmark contains 300 examples across six balanced conditions, including monolingual support, bilingual agreement, reversed conflict directions, and conflict with optional noise. We further examine X-MADAM-RAG, an interpretable pipeline that decomposes evidence handling into per-document candidate extraction, visible-evidence repair, deterministic candidate grouping, and conflict-aware aggregation. On the original controlled benchmark with Qwen2.5-7B-Instruct, X-MADAM-RAG achieves 0.9667 strict accuracy and 0.9767 conflict-aware success, outperforming an evidence-normalized single-call baseline. However, a zero-call rule-only extractor reaches 1.0000 on the same benchmark, revealing strong template regularity. To probe this limitation, we construct a deterministic naturalized stress test that removes explicit answer templates while preserving candidate strings. On its 100-sample subset, rule-only extraction falls to 0.0000, but X-MADAM-RAG also drops to 0.3000 strict accuracy, below both naive and evidence-normalized baselines. A privileged oracle remains perfect, indicating that document-level extraction is the main bottleneck. These findings position X-RAMDocs-ZHEN and X-MADAM-RAG as diagnostic tools for controlled evidence conflict rather than as evidence of general hallucination detection or robustness to natural retrieval.

15.
arXiv (CS.CL) 2026-06-19

Closing the Calibration Gap in Semantic Caching

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
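The idea of measuring precision across cache utilization levels can be sketched as follows. The abstract does not give the exact definition of P-CHR AUC, so this is one plausible reading rather than the paper's metric: queries are ranked by similarity score, the top k are served from cache, and precision is recorded at each resulting hit ratio k/n. The function names and the trapezoidal area are assumptions.

```python
from typing import List, Sequence, Tuple


def precision_hit_ratio_curve(scores: Sequence[float],
                              correct: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (hit_ratio, precision) pairs as the cache serves more queries.

    Queries are served from cache in decreasing score order; ties are broken
    arbitrarily. `correct[i]` says whether the cached answer for query i is right.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    true_hits = 0
    for rank, i in enumerate(order, start=1):
        true_hits += bool(correct[i])
        points.append((rank / len(scores), true_hits / rank))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under precision as a function of hit ratio."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

Unlike a ranking metric, a curve of this kind can be read at the hit ratio a deployment actually targets, which is the operational view the abstract argues for.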

16.
arXiv (CS.AI) 2026-06-15

LLM-Powered AI Agent Systems and Their Applications in Industry

arXiv:2505.16120v3. The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.

17.
arXiv (CS.CV) 2026-06-18

The Market in the Model: Latent Diffusion as Neural Economy

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

18.
arXiv (CS.CV) 2026-06-24

Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods

WordArt (artistic text) features highly customized fonts, textures, and layouts, making WordArt-oriented scene TExt Recognition (WATER) substantially more challenging than general Scene Text Recognition (STR). Existing STR datasets and methods, typically built around regular scene text and fixed-template inputs, struggle to scale to WATER. Thus, we aim to advance this task from both data and model perspectives. On the data side, we construct a 2M synthetic dataset, WATER-S, with the scale improved by hundreds of times compared to existing artistic text data. WATER-S consists of two complementary subsets. One is rendered by an upgraded rendering pipeline (SynthWordArt), which provides highly accurate and controllable synthetic WordArt data. The other is generated by combining Qwen3-VL for prompt mining and Z-Image for image synthesis, which improves the coverage of realistic and diverse data. On the model side, we propose WATERec. It adopts a visual encoder supporting arbitrary-shaped inputs and an autoregressive decoder to model complex layouts, structurally breaking the bottleneck of fixed-template STR on WordArt. Experiments show that this architecture outperforms prior STR methods, achieving state-of-the-art performance on irregular texts such as WordArt. Together with WATER-R, carefully reorganized from existing real STR data, our strong baseline with the new synthetic data and model design reaches 90.40% accuracy on WordArt-Bench, surpassing both general-purpose and OCR-specialized vision-language models by a large margin. Code and data are available at https://github.com/YesianRohn/WATER.

19.
arXiv (math.PR) 2026-06-16

Collapsibility in Multiparametric Models of Random Simplicial Complexes

arXiv:2606.15276v1. We study collapsibility in the multiparametric models of random simplicial complexes, namely the lower and upper models. In the upper model, we improve upon a result of Farber and Nowik, and assert that the homology is a.a.s. concentrated in a single dimension by proving that the complex collapses to that dimension. In the lower model, we prove that the complex a.a.s. collapses to the dimension with maximal non-trivial cohomology. We then compare this threshold to the ones derived previously for the special cases of the clique complex (by Kahle) and the Linial-Meshulam model.

20.
bioRxiv (Bioinfo) 2026-06-10

Pseudoperplexity Probes Memorization in Protein Language Models

Protein Language Models (pLMs) have significantly advanced computational biology. Yet their scale and reliance on redundant training data raise a fundamental question: do pLMs generalize the statistical grammar of proteins, or do they simply memorize their training data? To investigate this, we used pseudoperplexity as a probe for sequence-level memorization, comparing ProtT5's pseudoperplexity on a pre-training proxy dataset against a post-training holdout of genuinely novel sequences. To ensure a valid comparison, we matched the datasets by sequence length, cluster size, and taxonomic family. As a statistical baseline, we trained n-gram language models; analysis of higher-order n-gram composition and a statistically significant divergence in perplexity confirmed that the post-training sequences were genuinely novel at the local sequence level. ProtT5 showed a statistically significant difference in pseudoperplexity between seen and unseen sequences, though further analysis revealed this memorization signal to be modest. These findings suggest that ProtT5 exhibits detectable but limited memorization of its training data as measured by a pseudoperplexity-based probe.
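For orientation, pseudoperplexity is commonly defined for masked language models as the exponential of the negative mean log-probability of each true token when that token alone is masked. The sketch below assumes that common definition and takes the per-position log-probabilities as input. How the study obtains them from ProtT5 is not stated in the abstract, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pseudoperplexity(masked_log_probs: Sequence[float]) -> float:
    """exp(-(1/L) * sum_i log P(x_i | sequence with position i masked)).

    `masked_log_probs[i]` is the natural-log probability the model assigns to
    the true residue at position i when only that position is masked.
    """
    if not masked_log_probs:
        raise ValueError("need at least one position")
    return math.exp(-sum(masked_log_probs) / len(masked_log_probs))
```

A model that assigns probability 0.25 to every true residue scores a pseudoperplexity of 4, and lower values on training-like sequences than on novel ones would be the kind of gap the probe looks for.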

21.
arXiv (CS.AI) 2026-06-24

AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models

arXiv:2606.24616v1. Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.

22.
arXiv (CS.CL) 2026-06-16

The Value Axis: Language Models Encode Whether They're on the Right Track

We investigate whether language models internally track the value of their current trajectory, defined as the likelihood that their ongoing strategy will achieve their goals. Using synthetic, in-context reinforcement learning data, we construct a "value" axis for Qwen3-8B. We find that activations along this axis distinguish between high vs. low verbalized confidence, rollouts without and with backtracking, and correct vs. corrupted code. Steering towards high value causally suppresses self-correction and reduces explanatory verbosity, while steering towards low value induces backtracking and exploration. We demonstrate that direct preference optimization (DPO) can increase the internal value of rewarded behaviors (e.g. use a certain word), causing the model to act more confidently after exhibiting them. Finally, we apply the value axis to study in-the-wild settings. For example, we find that Qwen assigns low value to politically sensitive chat queries after post-training and that supervised fine-tuning increases internal confidence within the training domain. Our results suggest that language models linearly encode an estimate of expected goal success that modulates their confidence in pursuing a direction.

23.
arXiv (CS.CL) 2026-06-11

Context-Aware Multimodal Claim Verification in Spoken Dialogues

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

24.
arXiv (CS.AI) 2026-06-11

Sustainability assessment using multimodal AI agents

arXiv:2507.17012v2. Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi-agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life-cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This reduces data gaps and data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprint within 19% of expert LCAs with zero proprietary data (typical of the variation between human LCAs). We also show that by encoding domain-specific knowledge, environmental impact estimation can be reframed as a data-driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.
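The idea of representing an unknown product as a weighted combination of similar ones with known emissions can be sketched as a normalized weighted average. Everything concrete below is hypothetical: the similarity weights, the emission figures, the reference value, and the function names are invented for illustration, and the abstract does not describe how the authors compute similarity or weights.

```python
# Illustrative sketch only: normalized weighted average over similar products.
# All weights and emission values below are made-up numbers, not paper data.

def weighted_estimate(neighbors):
    """Estimate emissions from (similarity_weight, known_emission) pairs."""
    total_weight = sum(weight for weight, _ in neighbors)
    if total_weight <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    return sum(weight * emission for weight, emission in neighbors) / total_weight


def relative_error(estimate, reference):
    """Absolute relative deviation of an estimate from a reference value."""
    return abs(estimate - reference) / abs(reference)


# Hypothetical neighbors: (similarity weight, known footprint in kg CO2e).
similar_products = [(5, 60.0), (3, 80.0), (2, 100.0)]

estimate = weighted_estimate(similar_products)
error = relative_error(estimate, reference=80.0)  # hypothetical expert LCA value
print(estimate, error)
```

With these toy numbers the estimate is 74.0 kg CO2e, a 7.5% deviation from the hypothetical reference, which would fall inside the 19% band the abstract reports for the real system.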

25.
arXiv (CS.CL) 2026-06-24

EXPO-SQL: Execution-based Clause-level Policy Optimization for Text-to-SQL

Text-to-SQL enables users to query databases using natural language by generating executable SQL queries. Recent methods have increasingly adopted Large Language Model-based reinforcement learning (RL) to leverage execution feedback for training. However, existing RL methods assign uniform query-level rewards to all clauses in a SQL query, treating correct and incorrect clauses equally. This coarse-grained reward design leads to insufficient learning signals for correct SQL generation. To address this issue, we propose EXPO-SQL (EXecution-based clause-level Policy Optimization for Text-to-SQL), which provides fine-grained supervision through clause-level rewards. To assign clause-level rewards, our method identifies erroneous clauses by analyzing execution results, including error messages and clause-wise incremental execution. Experiments on widely used Text-to-SQL benchmarks demonstrate that EXPO-SQL significantly outperforms existing supervised fine-tuning, prompting, and RL-based methods through fine-grained clause-level learning. Our code is available at https://github.com/jhn25/EXPO-SQL.
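To make the contrast between query-level and clause-level rewards concrete, the sketch below executes a query against an in-memory SQLite database and, on failure, penalizes only the clause that mentions the identifier named in the error message. This is a deliberately simple stand-in: the error-message heuristic, the reward values of +1 and -1, the clause dictionary, and the function names are assumptions for illustration and are not taken from the EXPO-SQL code, which also uses clause-wise incremental execution.

```python
# Illustrative sketch only: a toy clause-level reward based on SQLite errors.
# The heuristic and reward values are assumptions, not the EXPO-SQL method.
import sqlite3


def query_level_rewards(clauses, reward):
    """Baseline: every clause receives the same query-level reward."""
    return {name: reward for name in clauses}


def clause_level_rewards(conn, clauses):
    """Penalize only clauses that contain the identifier named in the error."""
    sql = " ".join(clauses.values())
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # SQLite messages such as "no such column: nme" end with the culprit.
        culprit = str(exc).rsplit(": ", 1)[-1]
        flagged = {name for name, text in clauses.items() if culprit in text}
        if not flagged:
            # Could not localize the error: fall back to a uniform penalty.
            return query_level_rewards(clauses, -1.0)
        return {name: (-1.0 if name in flagged else 1.0) for name in clauses}
    # Query executed; this sketch does not check result correctness.
    return query_level_rewards(clauses, 1.0)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")

# Hypothetical model output with a misspelled column in the SELECT clause.
bad_query = {"SELECT": "SELECT nme", "FROM": "FROM users", "WHERE": "WHERE age > 30"}
good_query = {"SELECT": "SELECT name", "FROM": "FROM users", "WHERE": "WHERE age > 30"}

bad_rewards = clause_level_rewards(conn, bad_query)
good_rewards = clause_level_rewards(conn, good_query)
print(bad_rewards, good_rewards)
```

Under a uniform query-level scheme the correct FROM and WHERE clauses of the failing query would be penalized along with the faulty SELECT clause; the clause-level variant keeps their reward positive.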