Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (quant-ph) 2026-06-16

Optimizing Wigner Negativity in Scattering Processes Using Energetic Cost Functions

arXiv:2606.15101v1 Announce Type: new Abstract: Wigner negativities (WNs) are key signatures of non-Gaussian bosonic states and essential resources for quantum technologies. We study their generation in the scattering of coherent pulses by a two-level atom coupled to a one-dimensional reservoir, a unitary and energy-preserving platform. Optimization in this multimode setting is hindered by the complexity of evaluating Wigner functions. We overcome this challenge by introducing energetic cost functions that identify output modes most likely to host large negativities. First using incoherent energy and then isolating a genuinely non-Gaussian contribution, we demonstrate a strong correlation between these quantities and WNs. This correlation extends beyond short, intense pulses to encompass pulses of finite energy, where photons are scattered while the two-level atom is driven. Focusing on the energy-efficiency of the process, we show that maximally efficient generation takes place for one input photon, on average, spectrally mode-matched with the atom.

02.
arXiv (CS.AI) 2026-06-17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv:2606.17199v1 Announce Type: cross Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling fail as they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.

03.
arXiv (CS.CL) 2026-06-16

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

04.
arXiv (CS.CV) 2026-06-25

StyleFusion360: View-Consistent Head Stylization via Adaptive Style Modulation

3D head stylization enables expressive reimagining of human faces for creative visual experiences in digital media. Existing 3D-aware methods often require computationally intensive optimization or per-style fine-tuning, limiting flexibility and user control. To overcome these challenges, we introduce StyleFusion360, a diffusion-based framework for multi-view consistent, identity-preserving 3D head stylization from a single style reference image, without per-style training. Our approach enhances the Style Fusion Attention mechanism with a style-conditioned key modulation mechanism that aligns content and style representations for fine-grained and controllable stylization. We further provide a user-controllable slider for adjusting stylization intensity. In addition, StyleFusion360 supports local multi-edit stylization, enabling targeted edits such as modifying hair or eyes independently. Extensive experiments on FFHQ and RenderMe360 demonstrate that StyleFusion360 produces high-quality, controllable, and visually compelling stylizations, outperforming state-of-the-art GAN- and diffusion-based methods across diverse style domains.

05.
arXiv (CS.CV) 2026-06-16

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%.

06.
arXiv (CS.CV) 2026-06-24

Neural Particle Automata: Learning Self-Organizing Particle Dynamics

We introduce Neural Particle Automata (NPA), a Lagrangian generalization of Neural Cellular Automata (NCA) from static lattices to dynamic particle systems. Unlike classical Eulerian NCA where cells are pinned to pixels or voxels, NPA model each cell as a particle with a continuous position and internal state, both updated by a shared, learnable neural rule. This particle-based formulation yields clear individuation of cells, allows heterogeneous dynamics, and concentrates computation only on regions where activity is present. At the same time, particle systems pose challenges: neighborhoods are dynamic, and a naive implementation of local interactions scale quadratically with the number of particles. We address these challenges by replacing grid-based neighborhood perception with differentiable Smoothed Particle Hydrodynamics (SPH) operators backed by memory-efficient, CUDA-accelerated kernels, enabling scalable end-to-end training. Across tasks including morphogenesis, point-cloud classification, and particle-based texture synthesis, we show that NPA retain key NCA behaviors such as robustness and self-regeneration, while enabling new behaviors specific to particle systems. Together, these results position NPA as a compact neural model for learning self-organizing particle dynamics.

07.
arXiv (CS.AI) 2026-06-16

A Security Analysis of Long-Horizon Agentic AI Systems: Threats, Evaluation, and Framework Development

arXiv:2606.14816v1 Announce Type: cross Abstract: This paper presents a structured analysis of security challenges in long-horizon agentic AI systems. The study reviews existing threats, evaluation approaches, attack propagation mechanisms, and security frameworks. A taxonomy of security threats and a framework for analyzing attack propagation are proposed to support future research in agentic AI security

08.
arXiv (CS.CV) 2026-06-16

Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.

09.
arXiv (CS.CV) 2026-06-16

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity label as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.

10.
arXiv (CS.LG) 2026-06-24

Adversarial dynamical systems characterize when data-driven learning succeeds or fails

arXiv:2407.06312v2 Announce Type: replace-cross Abstract: Many systems resist analytical modeling, making data-driven inference of dynamics important. Yet data-driven methods can fail to converge or generalize, leaving open a central question: When can system behavior be learned reliably from data, and when is such learning impossible? We answer this question using adversarial dynamical systems to identify the boundary between accessible and inaccessible regimes. In Koopman operator learning, a leading framework for representing nonlinear dynamics through linear spectral objects, we design optimal data-driven spectral algorithms with convergence and certification guarantees under conditions arising broadly in physical systems. This yields a convergence theory for Koopman-operator approximations and resolves a longstanding open problem in Koopman spectral analysis. Conversely, by constructing adversarial systems, we prove matching impossibility results: without these conditions, no single-sequence limiting procedure can guarantee learning, regardless of data quality. These results sharply characterize when data-driven spectral learning can succeed and when it must fail. We validate the framework on oscillators, chaotic fluid flows and Arctic sea ice concentration forecasting. In the latter, we uncover hidden modes of Arctic sea ice decline, deliver long-range forecasts with geographic error bounds, and outperform state-of-the-art dynamical and deep learning models at substantially lower computational cost, enabling real-time deployment on standard CPUs.

11.
arXiv (CS.LG) 2026-06-15

Where Black-box Drug-Target Interaction Prediction Models Look: Cross-Method Explainability

arXiv:2606.14245v1 Announce Type: new Abstract: Drug-target interaction (DTI) and affinity (DTA) predictors increasingly achieve strong benchmark scores, yet their internal use of sequence, fingerprint, and graph features often remains opaque. We present an interpretability audit of BridgeDPI architecture on three different datasets including Gao, Human, and C.elegans. This study combines gradient-based attributions – integrated gradients, saliency, layer-wise relevance propagation, SmoothGrad, and SmoothGrad-IG – with feature-wise occlusion ablation and strict intersection consensus across methods to reduce single-explainer bias. We summarize sensitivity and signed effects at raw inputs, at the bridge similarity scaffold, and through the graph convolution, including edge-level sensitivities and targeted edge removals. The results show that explainability is most informative when treated as model criticism: it reveals modality dominance, padding and special-token artifacts, dataset-dependent cooperative versus suppressive effects across layers, and chemistry-consistent fragment and composition motifs where methods agree. These analyses do not substitute for structural or experimental ground truth, yet they can provide testable hypotheses for downstream validation in computational drug discovery pipelines. More broadly, applying modern XAI to contemporary DTI/DTA models is still an early pass over the rich structure implicit in trained weights and data – yet even this first layer of scrutiny already helps researchers relate predictions to drug- and target-side representations and to prioritize external validation.

12.
arXiv (CS.AI) 2026-06-25

Taxonomy of Risks on Automated Fact-Checking Systems Considering its Propagation

arXiv:2606.25645v1 Announce Type: cross Abstract: In recent years, the posting of fake news including disinformation and misinformation on social networking services (SNS) has become a social problem. To combat this fake news, fact-checking that is the process of assessing the veracity of posts on SNS has become increasingly important. While fact-checking is currently performed by fact-checking organizations, it is difficult to fact-check all posts on SNS. Therefore, the use of automated fact-checking systems is effective. Recent automated fact-checking systems utilize artificial intelligence and large language models, so there are risks of incorrect judgments and posting incorrect results on social media which can lead to the spread of misinformation or to engage in defamation. In this paper, as a first step toward enabling the safe use of automated fact-checking systems, we categorize the specific risks on automated fact-checking systems. In this categorizing, we consider a three-stage risk propagation: risk factors, hazardous situations, and harm. Our analysis revealed that 32 specific risks exist in automated fact-checking systems. In this paper, we utilize the categorized risks as analytical cues (guide words) to present the risk assessment of the automated fact-checking system DEFAME. This assessment result indicates that risks that cannot be derived using STRIDE, a conventional IT security risk assessment method can be derived using our guide words.

13.
arXiv (CS.LG) 2026-06-16

GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

arXiv:2606.16771v1 Announce Type: new Abstract: As LLMs advance, post-training reinforcement learning (RL) increasingly relies on multi-dimensional rewards to cultivate comprehensive capabilities. This shift demands new algorithms capable of optimizing diverse and potentially competing objectives simultaneously. To address this, existing methods such as Group reward-Decoupled Policy Optimization (GDPO) decompose the overall score into independent reward groups, then compute the RL loss separately within each group. However, this strategy still encounters multi-reward conflicts: a single rollout can yield positive advantages on certain reward dimensions but negative ones on others, causing opposing signals to cancel each other out during aggregation, further hindering RL training efficiency. Inspired by Dynamic sAmpling Policy Optimization (DAPO), which improves RL training efficiency by filtering out ineffective rollouts with near-zero advantages, we propose Group-Dynamic reward-Decoupled Policy Optimization (GD$^2$PO). Specifically, GD$^2$PO employs a conflict-aware filtering mechanism to mask out rollouts suffering from severe reward-wise disagreement. By preventing conflicting signals from canceling each other out, this masking strategy preserves and enhances the magnitude of effective RL advantages, thereby significantly accelerating learning efficiency. Furthermore, we introduce query-level reweighting to dynamically adjust the update intensity of each query based on its overall reward consensus. Experiments on various multi-reward scenarios, including tool calling and human preference alignment, demonstrate that GD$^2$PO consistently and significantly outperforms existing baselines. The code is available at https://github.com/Qwen-Applications/GD2PO.

14.
bioRxiv (Bioinfo) 2026-06-24

Generative Modeling of Mouse Embryogenesis for Fate and Disease Prediction

Embryonic development is orchestrated by complex gene regulatory networks, and learning regulatory dynamics from developmental data could allow us to understand, predict, and ultimately engineer cell fates. Here we introduce Navigo (https://github.com/aristoteleo/Navigo-release), a biologically grounded generative modeling framework that learns a developmental vector field by integrating flow matching at the population level with RNA kinetics modeling at the molecular level. Navigo accurately maps developmental trajectories across lineages on a mouse embryogenesis scRNA-seq atlas spanning 43 time points and comprising 12.4 million cells. Applied to cardiac development, Navigo enables disease modeling by mechanistically resolving regulatory networks that distinguish congenital heart disease subtypes. Navigo also predicts perturbation effects in a zero-shot manner, as validated on independent in vivo data from six knockout genotypes without perturbation-specific training, uncovering lineage-specific gene-compensation mechanisms. Moreover, Navigo guides rational cell-fate engineering, exemplified by fibroblast reprogramming analyses, including identifying pro-fibrotic barriers to cardiac fates and evaluating hundreds of pairwise transcription factor combinations for neuronal fate, each consisting of one bHLH factor and one POU factor. Overall, Navigo provides a generalizable AI platform for perturbation-effect prediction, disease modeling, and rational cell-fate engineering, advancing toward AI-based virtual embryos for developmental biology and regenerative medicine.

15.
arXiv (CS.AI) 2026-06-25

Reliability-Asymmetric Spacecraft Autonomy: Co-Designing a Capable Learned GNC Stack with a Verified, Adaptation-Aware Runtime Shield

arXiv:2606.25366v1 Announce Type: cross Abstract: Deep-space missions need onboard autonomy that is both capable and certifiable. Rule-based autonomy is certifiable but brittle, while learned autonomy is capable but hard to verify. We present AMPLE-GNC, a three-tier guidance, navigation, and control stack. Its capability path combines a small foundation-model commander that maps natural language to PDDL+, a constraint-screening verifier, and a fault-adaptive controller. All three are bounded by a runtime shield with nine linear-temporal-logic invariants whose predictor soundness is machine-checked by the Kind 2 model checker. On a 6-DOF Basilisk testbed, we make three contributions. First, we deploy an edge commander. Fine-tuning a pretrained 360M model with grammar-constrained decoding gives a hard output-validity guarantee and 84% planner-executable actions. On a de-leaked test, novel-phrasing generalization is 38% exact and 51% action, rising to 48% exact after phrasing-diversity re-finetuning; we separate syntactic validity from semantic accuracy. Second, we introduce a fault-adaptive controller. Rapid Motor Adaptation infers latent actuator faults online and recovers 97.8% of actuator-sign faults and 94.4% of continuous-gain faults within the training randomization envelope. Fault-unaware PD and from-scratch end-to-end RL both score 0%, while the strongest classical-adaptive baseline reaches 55% on continuous gain. Beyond the envelope, a split-conformant retrain scores 57-67%, and adding 4x more in-regime data worsens performance, showing that randomization breadth, not data volume, drives generalization. Robustness is flat under star-tracker noise to 0.005. Third, we show that a latching safe-hold shield can suppress even a capable controller. A split-conformal recovery-deadline certificate with adaptation-aware engagement reconciles safety and recovery, keeping the controller 94.5% autonomous while still catching non-recovery.

16.
arXiv (CS.CL) 2026-06-15

Token-Level LLM Collaboration via FusionRoute

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

17.
arXiv (CS.AI) 2026-06-12

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

arXiv:2606.12699v1 Announce Type: cross Abstract: Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66\% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08\% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

18.
medRxiv (Medicine) 2026-06-17

Postoperative Cognitive Decline in Older Patients with Cardiovascular Disease and Preoperative Mild Cognitive Impairment

Objective. Older adults undergoing cardiac surgery may be vulnerable to postoperative cognitive decline. However, no studies have examined postoperative cognitive outcomes in older patients with cardiovascular disease (CVD) according to preoperative mild cognitive impairment (MCI). This study examined 12-month postoperative cognitive outcomes in older CVD patients according to preoperative MCI diagnosis and explored predictors of postoperative cognitive decline. Method. Twenty-two older CVD patients ([≥]65 years) and twenty-five controls were included. Neuropsychological assessment was conducted at baseline in both groups and repeated 12 months after surgery in the CVD group. MCI was diagnosed using current clinical criteria. Postoperative cognitive change was examined across preoperative MCI groups. Results. Fifty percent of patients met criteria for postoperative MCI, showing high diagnostic stability relative to preoperative frequency (45.5%). The preoperative CVD-MCI group showed a decline in working memory, executive functions, visual memory, and naming, whereas CVD-nMCI group declined only in verbal memory. Furthermore, CVD-MCI showed more heterogeneous postoperative cognitive trajectories of change than CVD-nMCI, who showed stability. Estimated IQ, APACHE-II score, and postoperative frailty were important variables in predicting the postoperative pattern. Conclusions. MCI frequency remained high and stable in older CVD patients across the preoperative and one-year postoperative period. However, this apparent diagnostic stability masks subclinical cognitive decline, particularly among patients with preoperative MCI, who showed greater susceptibility to further impairment. Estimated IQ, APACHE-II score, and postoperative frailty may be considered relevant predictors of outcome. These results highlight the value of preoperative neuropsychological assessment for characterizing postoperative cognitive risk in older CVD patients.

19.
arXiv (CS.LG) 2026-06-19

Reversible Residual Normalization Alleviates Spatio-Temporal Distribution Shift

arXiv:2604.15838v2 Announce Type: replace Abstract: Distribution shift severely degrades the performance of deep forecasting models. While this issue is well-studied for individual time series, it remains a significant challenge in the spatio-temporal domain. Effective solutions like instance normalization and its variants can mitigate temporal shifts by standardizing statistics. However, distribution shift on a graph is far more complex, involving not only the drift of individual node series but also heterogeneity across the spatial network where different nodes exhibit distinct statistical properties. To tackle this problem, we propose Reversible Residual Normalization (RRN), a novel framework that performs spatially-aware invertible transformations to address distribution shift in both spatial and temporal dimensions. Our approach integrates graph convolutional operations within invertible residual blocks, enabling adaptive normalization that respects the underlying graph structure while maintaining reversibility. By combining Center Normalization with spectral-constrained graph neural networks, our method captures and normalizes complex Spatio-Temporal relationships in a data-driven manner. The bidirectional nature of our framework allows models to learn in a normalized latent space and recover original distributional properties through inverse transformation, offering a robust and model-agnostic solution for forecasting on dynamic spatio-temporal systems.

20.
medRxiv (Medicine) 2026-06-15

Artificial Intelligence-Based Detection of Airway Mucus Plugs on CT and Associations With Clinical Outcomes in COPDGene

RATIONALE: Airway mucus plugging is a clinically relevant manifestation of airway pathology in chronic obstructive pulmonary disease (COPD) and is associated with increased mortality even in early disease; however, visual computed tomography (CT) assessment is subjective and labor intensive. OBJECTIVES: To develop an AI-based quantitative CT method for automated detection of airway mucus plugging and evaluate associations with physiologic impairment and clinical outcomes. METHODS: Inspiratory CT scans from 8,971 COPDGene Phase 1 (GOLD 0-4 and PRISm) participants were analyzed. An AI-based framework combining 3D airway segmentation discontinuities and convolutional neural network classification identified mucus plug obstructions, yielding mucus plug burden (total plug count). Associations with outcomes were evaluated using covariate-adjusted models. MEASUREMENTS AND MAIN RESULTS : Higher mucus plug burden was associated with lower post-bronchodilator FEV % predicted ({rho} = -0.41; P < 0.001), greater air trapping (LAA < -856 HU; {rho} = 0.33; P < 0.001), worse health status (SGRQ; {rho} = 0.31; P < 0.001), and shorter 6-minute walk distance ({rho} = -0.26; P < 0.001). Among GOLD 1-4 participants, mucus plug presence was independently associated with increased all-cause mortality (adjusted hazard ratio, 1.28; P < 0.005) and exacerbation frequency (adjusted incidence rate ratio, 1.32; P < 0.005). Plug presence was also associated with increased respiratory mortality across GOLD categories and cardiovascular mortality in GOLD 1-2. CONCLUSIONS: AI-based quantitative CT assessment of airway mucus plugging provides a scalable, reproducible measure associated with physiologic impairment and adverse outcomes in COPD, supporting its role in risk stratification and future therapeutic studies.

21.
arXiv (CS.CV) 2026-06-16

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

22.
arXiv (CS.CL) 2026-06-17

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16–23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10–11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10–17pp despite 10–15pp lower absolute performance.

23.
arXiv (CS.CL) 2026-06-11

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Interpreting language models remains challenging due to the existence of residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework which maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.

24.
arXiv (CS.CL) 2026-06-15

SciDef: Datasets and Tools for Automated Definition Extraction from Scientific Literature with LLMs

Scientific concepts are often defined inconsistently across papers, making it difficult to compare findings, reuse terminology, and build reliable downstream resources. We present SciDef, a resource suite for scientific definition extraction. The suite contains DefExtra, a benchmark of 268 human-validated author-stated definitions from 75 academic papers; DefSim, 60 human-labeled definition-pair similarity judgments; and an open LLM-based pipeline for PDF preprocessing, chunking, definition extraction, prompt optimization, and evaluation. We validate the resources by benchmarking 16 language models across prompting strategies and chunking schemes. The strongest set-level configuration achieves a score of 0.397, while the highest-coverage configuration matches at least one prediction to 86.4% of gold definitions but over-generates candidate definitions. We further show that an NLI-based matching metric agrees strongly with human DefSim judgments. These results position SciDef as a reusable benchmark and tooling layer for definition-centric literature analysis, while highlighting relevance-aware filtering as the key bottleneck for fully automatic definition extraction. Code & datasets are available at https://github.com/Media-Bias-Group/SciDef.

25.
arXiv (CS.AI) 2026-06-16

Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build

arXiv:2605.21629v2 Announce Type: replace-cross Abstract: How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for investigating time-on-task, complemented by ALEKS PPL placement-assessment data for examining proctoring and learning outcomes, with a quasi-experimental design exploiting variation in tasks that are more susceptible to AI (text-based word problems) and less susceptible to AI (interactive graph-based problems). Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. Among college students, the post-ChatGPT divergence vanishes entirely under proctoring, ruling out broad efficiency gains as the likely explanation. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase – inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build – the population-level indicator of cognitive surrender, with direct implications for educational research, assessment governance, and AI policy.