Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

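AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature briefings.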

01.
arXiv (quant-ph) 2026-06-11

Quantum thermodynamics of the Caldeira-Leggett model with non-equilibrium Gaussian reservoirs

arXiv:2405.00215v5 Announce Type: replace Abstract: We introduce a non-equilibrium version of the Caldeira-Leggett model in which a quantum particle is strongly coupled to a set of engineered reservoirs. The reservoirs are composed of collections of squeezed and displaced thermal modes, in contrast to the standard case in which the modes are assumed to be at equilibrium. The model proves to be very versatile. Strongly displaced/squeezed reservoirs can be used to generate an effective time dependence in the system Hamiltonian and can be identified as sources of pure work. In the case of squeezing, the time dependence is stochastic and breaks the fluctuation-dissipation relation; this can be reconciled with the second law of thermodynamics by correctly accounting for the energy used to generate the initial non-equilibrium conditions. To go beyond the average description and compute the full heat statistics, we treat squeezing and displacement as generalized Hamiltonians on a modified Keldysh contour. As an application of this technique, we show the quantum-classical correspondence between the heat statistics in the non-equilibrium Caldeira-Leggett model and the statistics of a classical Langevin particle under the action of squeezed and displaced colored noises. Finally, we discuss thermodynamic symmetries of the heat generating function, proving a fluctuation theorem for the energy balance and showing that the conservation of energy at the trajectory level emerges in the classical limit.

02.
arXiv (CS.CV) 2026-06-16

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

03.
arXiv (CS.LG) 2026-06-12

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

arXiv:2508.12681v3 Announce Type: replace-cross Abstract: Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².

05.
arXiv (CS.AI) 2026-06-17

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

arXiv:2606.18168v1 Announce Type: cross Abstract: Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.

06.
arXiv (CS.AI) 2026-06-16

Guiding Federated Graph Recommendation with LLM-encoded knowledge

arXiv:2606.15277v1 Announce Type: cross Abstract: Graph-based recommender systems are highly effective at extracting collaborative signals from user–item interactions, and federated learning (FL) allows these models to be trained while preserving user privacy. However, aggregating graph representations across distributed, non-IID clients remains a challenge; structural embeddings learned locally often misalign, and naive averaging fails to capture meaningful cross-client relationships. Most existing federated graph methods rely exclusively on structural aggregation, neglecting the rich, global semantic context available in large language models (LLMs). In this paper, we propose a novel framework that uses LLM-encoded knowledge to guide federated graph recommendation. Specifically, clients learn structural representations from local graphs while simultaneously summarizing their typical interaction patterns into compact semantic vectors via a frozen LLM. The central server then uses these LLM-encoded semantic signals to discover related preference patterns across clients, guiding the selective aggregation of their structural representations. This enables semantically informed cross-client collaboration without exposing raw data. Extensive experiments on standard benchmarks show that guiding structural alignment with LLM-encoded knowledge consistently improves recommendation accuracy over existing federated graph baselines.

07.
medRxiv (Medicine) 2026-06-22

Virtual Responsive Neurostimulation Implantation: From Intracranial Connectivity to Optimized Lead Placement

Responsive neurostimulation (RNS) is an implanted device that delivers direct brain stimulation for drug-resistant focal epilepsy. Individual responses are highly variable, and no validated framework exists to predict outcome or guide lead placement before implantation. We hypothesized that this variability is partly explained by lead placement in relation to patterns of functional connectivity in brain networks. Forty-nine patients with drug-resistant focal epilepsy who underwent pre-implantation intracranial EEG (iEEG) and RNS implantation across three independent epilepsy centers were retrospectively studied. We developed a composite functional connectivity score, based on simple Spearman correlation, combining the standard deviation and kurtosis of interictal iEEG connectivity distributions to predict the response outcome in a training cohort (HUP, n=18) and validated it in two independent cohorts (NYU, n=17; UCSF, n=14). We accounted for a spatial mismatch between iEEG and RNS electrodes with a distance-based correction. The score was extended to generate patient-specific 3D maps of predicted RNS efficacy across 200 simulated (virtual) RNS lead configurations. Accuracy of the score in predicting clinical outcome was 72% at the group level, 61% at the individual patient level, and, after distance-based optimization, 100% in patients with RNS electrodes placed close to the location of iEEG electrodes. Applied to the validation cohort, the same score reached 68% accuracy (71% balanced accuracy, 55% sensitivity, 88% specificity). Spatially combining the scores at the different SEEG contact locations gives a spatial score for each patient. Responders showed significantly higher spatial scores than non-responders, supporting that actual RNS lead placement in responders was located in map-identified favorable regions.
Interictal iEEG functional connectivity predicts individual RNS response across independent epilepsy centers, and patient-specific 3D maps derived from this biomarker could prospectively guide lead implantation toward favorable network regions, opening a promising avenue toward network-informed RNS surgical planning.

08.
arXiv (CS.AI) 2026-06-11

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

arXiv:2509.11575v3 Announce Type: replace Abstract: Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

09.
arXiv (quant-ph) 2026-06-15

Dissipation-induced superradiance in matter coupled to a self-interacting cavity

arXiv:2606.14526v1 Announce Type: new Abstract: Light-matter interactions are often modeled via the Dicke model, namely, by two-level systems coupled to a cavity mode. Alas, the threshold for superradiance is often experimentally inaccessible or hindered by light's diamagnetic term. Here, within the Dicke setting, we consider self-interacting light in a cavity, modeled by a photonic Kerr nonlinearity. We show that negative Kerr nonlinearity gives rise to a low-threshold superradiant phase with spin inversion. While unstable in a closed system, cavity dissipation stabilizes this lit phase, opening avenues for lasing and bath-engineered phases.

10.
arXiv (CS.AI) 2026-06-11

MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

arXiv:2606.11416v1 Announce Type: cross Abstract: Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks; evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organised around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent that filters raw pull requests through three cryptographic layers with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally-passing patches rejected for cryptographic or numerical-fidelity violations.
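The idea of differential testing against a plaintext oracle can be sketched generically. Everything below is a hypothetical illustration rather than the MPC Verifier itself: `fixed_point_mean` is a plain-Python stand-in for a secure computation (no cryptography is involved), and the tolerance, trial count, and input generator are assumed values.

```python
import random


def differential_test(candidate, oracle, gen_input, trials=200, tol=1e-3, seed=0):
    """Run candidate and oracle on random inputs; return the mismatches."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        x = gen_input(rng)
        got, want = candidate(x), oracle(x)
        if abs(got - want) > tol:
            failures.append((x, got, want))
    return failures


def plaintext_mean(xs):
    """Plaintext oracle: ordinary floating-point mean."""
    return sum(xs) / len(xs)


def fixed_point_mean(xs, frac_bits=16):
    """Hypothetical stand-in for an MPC routine using fixed-point encoding."""
    scale = 1 << frac_bits
    total = sum(int(round(x * scale)) for x in xs)
    return (total // len(xs)) / scale


def random_vector(rng):
    """Assumed input distribution: five values in [-10, 10]."""
    return [rng.uniform(-10, 10) for _ in range(5)]
```

With 16 fractional bits the stand-in stays within the tolerance on every trial, whereas dropping to 4 fractional bits produces mismatches that the harness reports, which is the kind of numerical-fidelity violation a functional test alone might not catch.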

11.
arXiv (CS.LG) 2026-06-16

High-Dimensional Random Projection for Activation Steering in Language Models

arXiv:2606.15092v1 Announce Type: new Abstract: Activation steering has emerged as a key methodology for controlling the behavior of large language models (LLMs). Existing difference-in-means based methods, however, are fundamentally limited: they capture only mean differences between class activations and fail to recover discriminative signals that naturally exist in the nonlinear feature subspace under the superposition hypothesis. Motivated by that, we propose High-Dimensional Random-projection for Activation Steering (HiDRA), a training-free approach that integrates seamlessly with existing activation steering methods. By performing activation addition in the projected high-dimensional space, HiDRA can provably capture a better discriminative structure beyond the reach of linear methods. Experiments across diverse LLM families and benchmarks demonstrate that HiDRA consistently outperforms baseline counterparts, achieving stronger behavioral control without significant computational overhead.
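For readers unfamiliar with the baseline the abstract critiques, a minimal sketch of difference-in-means steering follows. It illustrates only that baseline, not HiDRA, whose projection step is not specified in the abstract; the function names and the scaling parameter `alpha` are assumptions for illustration.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_in_means(pos_acts, neg_acts):
    """Steering direction: mean of positive-class minus mean of negative-class activations."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation, direction, alpha=1.0):
    """Activation addition: shift a hidden state along the steering direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]
```

Because the direction depends only on the two class means, any class structure that leaves those means unchanged is invisible to it, which is the limitation the abstract points to.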

12.
arXiv (CS.CV) 2026-06-19

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1–100x, pseudo-labels across thresholds 0.90–0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

13.
arXiv (CS.AI) 2026-06-16

Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation

arXiv:2502.11201v3 Announce Type: replace-cross Abstract: NoSQL databases are core data infrastructure, yet natural-language access to them remains underdeveloped: correct query generation must recover how a non-relational data model represents entities, nested paths, arrays, missing fields, and dynamic keys. This paper studies Text-to-NoSQL, translating natural-language requests into executable NoSQL queries, instantiated with MongoDB aggregation pipelines over schema-less document stores. We present TEND, short for Text-to-NoSQL Dataset, an execution-verified benchmark with 1,210 MongoDB-native tasks across 11 databases. To our knowledge, TEND is the first Text-to-NoSQL benchmark whose database worlds are MongoDB-native by design: experts manually define collection boundaries, nested arrays, optional and sparse paths, polymorphic shapes, and dynamic-key conventions; these worlds are populated with real data and verified through frozen MongoDB execution, so TEND evaluates schema-less document reasoning rather than SQL-to-MQL transfer. We further introduce SAG, a Schema-as-Data Grounding solver that induces path and value grounding from stored-document evidence before bounded MQL generation, execution-grounded repair, and result-consistency selection. Evaluation uses bounded column-tolerant execution accuracy (EXC) as the headline metric, complemented by a graded result-set F1 and a mutually exclusive execution-outcome decomposition. Experiments show that LLMs with strong NL2SQL performance degrade substantially on TEND, validating Text-to-NoSQL as a distinct schema-less document reasoning problem.

14.
arXiv (CS.CV) 2026-06-19

3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

Digital Subtraction Angiography (DSA) is one of the gold standards for vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive blood flow information and can be utilized to reconstruct 3D vessel structures for medical assessment. Current commercial DSA systems typically require hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. In this study, we propose a neural rendering-based optimization framework tailored for high-quality sparse-view DSA reconstruction to reduce radiation dosage. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the time-independent vessel probability field. Functioning as a foreground mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism enables self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the discrepancy between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training for better geometry and (2) temporal perturbed rendering loss for temporal consistency. Experimental results have demonstrated high-quality 3D vessel reconstruction and 2D DSA image synthesis.

15.
arXiv (CS.AI) 2026-06-16

GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

arXiv:2606.14865v1 Announce Type: cross Abstract: Adversarial Training (AT) improves neural network robustness, but most methods train a fixed parameter space from the start. This paper asks whether the order in which parameters become optimizable can affect the final robust solution, even when the final architecture or computation budget is controlled. We propose GRAPE, Guided Parameter-Space Evolution, a training framework for compact adversarial robustness. GRAPE combines parameter-space stabilization with progressive hidden expansion: it stabilizes robust optimization in the currently exposed space, gradually releases new optimizable dimensions, and uses an adversarial spectral utilization score to guide newly released capacity toward high-pressure modules. In contrast to fixed-structure AT, GRAPE treats robust model learning as a process of progressive parameter-space exposure and evolution. Under the standard $\ell_\infty$ threat model on CIFAR-10, with fixed-structure ResNet-18 AT as a controlled reference, GRAPE improves PGD-20 robust accuracy from 51.70% to 56.94% at a nearly matched computation budget with a FLOPs ratio of 1.009x, while reducing parameter count by about 21.4%. A sequential grow variant with the same final ResNet-18 architecture reaches 56.52% PGD-20 robust accuracy, indicating that the gain is not only due to final architecture differences but also to the parameter-space exposure path. These results suggest that guided parameter-space evolution can yield compact and robust parameter configurations under matched computation.

16.
arXiv (CS.CV) 2026-06-16

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: https://github.com/Rong2026/work-1.

17.
arXiv (CS.AI) 2026-06-12

A Three-Layer Framework for AI in Scientific Discovery

arXiv:2606.13566v1 Announce Type: new Abstract: Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execution through optimization, simulation, and automation. Both are important, but neither fully captures the central act of discovery: the formation and evolution of models. This paper proposes a three-layer view of AI in discovery. Layer 1 is search and retrieval by large language models. Layer 2, as the main innovation of this paper, is model formation through qualitative reasoning: the capacity to recognize when a current framework is structurally inadequate and to understand the problem within a broader representational space, not through trial and error, but through structural insight into what is missing and where it can be found. Layer 3 is execution, optimization, and refinement. The main claim is that Layer 2 is both the most important and the least developed. Search without model formation remains confined to inherited frameworks, while execution without conceptual revision only amplifies an existing formulation. We illustrate Layer 2 reasoning through three case studies: S. S. Chern's intrinsic proof of the Gauss-Bonnet theorem, the resolution of the Nesterov Accelerated Gradient convergence problem via Lyapunov functions, and the autonomous disproof of the Erdos unit distance conjecture by OpenAI in 2026. Each case exhibits the same structural signature: a framework that had become inadequate, a missing conceptual object, and a resolution found in an unexpected neighboring field.

18.
arXiv (CS.CL) 2026-06-17

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style reinforcement learning (RL) post-training, framing it as a training-time credit-assignment problem rather than merely a decoding-time stopping problem. In rollouts sampled at the onset of GRPO training, we observe that successful trajectories can exhibit a slightly higher degree of overthinking than unsuccessful trajectories for the same prompts. This early imbalance provides a starting point for an undesirable feedback loop: because GRPO assigns sequence-level credit, it cannot distinguish the solution-reaching prefix from the unnecessary continuation that lengthens a successful trajectory. Both receive positive update signal, allowing the initial imbalance to grow into more severe overthinking during training. To address this issue, we introduce Dynamic Rollout Editing (DRE), a training-time intervention for successful trajectories that continue thinking after answer emergence. DRE preserves the accepted verified prefix, edits the remaining thinking, and prefers the edited trajectory within the same RL group, weakening the preference signal for unnecessary thinking without penalizing the reasoning needed to reach the answer. Experiments across diverse tasks show the effectiveness of DRE.
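The sequence-level credit assignment described above can be made concrete with a short sketch of group-relative advantages. It assumes the commonly used normalization (reward minus group mean, divided by group standard deviation); the population standard deviation, the `eps` constant, and the function names are illustrative choices, not details taken from the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_credit(advantage, num_tokens):
    """Sequence-level credit: every token in a rollout gets the same advantage."""
    return [advantage] * num_tokens
```

Under this scheme a successful rollout whose answer emerges at token 200 but which continues to token 900 assigns the same positive advantage to all 900 tokens, so the unnecessary continuation is reinforced along with the solution-reaching prefix.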

19.
arXiv (math.PR) 2026-06-11

Sharp log-Sobolev inequalities on finite cyclic groups

arXiv:2606.02847v2 Announce Type: replace-cross Abstract: Let $\mathbb Z_n$ be the cyclic group equipped with the uniform probability measure $\pi$, and let $A_{\psi_n}$ be the Laplacian with word length \[ \psi_n(k) = \min(k,n-k). \] We prove the sharp log-Sobolev inequality \[ Ent_{\pi}(f^2) \le 2\pi(f A_{\psi_n} f), \qquad f:\mathbb Z_n \to [0,\infty), \] for every $n \ge 4$. The proof is inspired by the recent work of Frank and Ivanisvili [FrankIvanisvili2026] on a sharp log-Sobolev inequality for nearest-neighbor simple random walk. We use their cubic-majorant reduction, which turns the problem into a 3rd moment estimate; the new point is a blockwise 3rd moment estimate adapted to the word-length multiplier. The same 3rd moment argument also recovers the log-Sobolev inequality for the Poisson semigroup on the circle, first proved by Weissler [Weissler1980]. The same sharp inequalities were also obtained recently by Yao [Yao2026] by a different method.
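As a reading aid, the entropy functional in the inequality is assumed here to be the standard one for a probability measure (the abstract does not spell it out), and the word length is evaluated for the small case $n = 6$:

```latex
% Standard entropy functional, with the convention 0 log 0 = 0 (assumed definition)
\[
  \mathrm{Ent}_{\pi}(g) = \pi(g \log g) - \pi(g)\,\log \pi(g),
  \qquad
  \pi(g) = \frac{1}{n} \sum_{k \in \mathbb{Z}_n} g(k).
\]
% Word length on Z_6, listed for k = 0, 1, ..., 5
\[
  \psi_6(k) = \min(k,\, 6-k): \qquad 0,\ 1,\ 2,\ 3,\ 2,\ 1.
\]
```

The inequality in the abstract applies this functional to $g = f^2$.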

20.
arXiv (CS.CL) 2026-06-17

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16–23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10–11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10–17pp despite 10–15pp lower absolute performance.
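Embedding-based shortlisting, as mentioned in the abstract, can be sketched as a top-k cosine-similarity filter applied before the model sees the catalog. The tool names, the two-dimensional vectors, and the value of `k` below are all hypothetical; the abstract does not state which embedding model or shortlist size was used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def shortlist(query_vec, tool_vecs, k=3):
    """Return the k tool names whose embeddings are closest to the query."""
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy catalog with made-up embeddings.
tools = {
    "calendar": [1.0, 0.0],
    "email": [0.0, 1.0],
    "chat": [0.7, 0.7],
}
print(shortlist([1.0, 0.1], tools, k=2))
```

Only the shortlisted tools would then be passed to the router, which shrinks the candidate set the model has to discriminate among as the catalog grows.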

21.
bioRxiv (Bioinfo) 2026-06-12

PeptiDIA: A Machine Learning Framework for Enhanced Peptide Identification in Fast-Gradient Data-Independent Acquisition Proteomics

Data-independent acquisition (DIA) mass spectrometry has become increasingly prevalent in proteomics as advances in instrumentation, chromatography, and computational analysis have enabled robust proteome identification across complex biological samples. However, the analytical depth achieved with fast chromatographic gradients remains lower than that obtained using long gradients, reflecting a throughput-depth trade-off. Here, we present PeptiDIA, a machine learning framework that enhances peptide identification in fast-gradient DIA data by leveraging paired fast and long-gradient acquisitions from identical samples. PeptiDIA processes DIA-NN outputs generated at relaxed false discovery rate thresholds to obtain expanded candidate peptide pools and trains gradient-boosted decision tree models using long-gradient identifications as reference labels. The model integrates DIA-NN features with engineered peptide descriptors and applies isotonic regression to calibrate probabilities, enabling controlled peptide recovery relative to the long-gradient reference. Applied to human and murine datasets spanning six tissues acquired on an Orbitrap Exploris 480, PeptiDIA increased peptide identifications by 25-34% at 1% target reference-discordance rate (RDR) and increased the number of protein groups containing at least one rescued peptide by 15-17%. Overall, PeptiDIA improves the identification depth of fast-gradient DIA-NN workflows without altering acquisition strategies. The framework is available as a web application and command-line tool at https://github.com/Jordano700/PeptiDIA.

22.
arXiv (CS.CV) 2026-06-19

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether color fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aims to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants in the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (on average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from R² = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. They also aligned with the morphometric variations observed. Several saliency-based scores differed significantly between incident-AD cases and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful, risk-related structural changes that mirror potential AD vulnerability.

23.
arXiv (CS.LG) 2026-06-19

Evaluating Universal Machine Learning Force Fields Against Experimental Measurements

arXiv:2508.05762v2 (announce type: replace-cross). Universal machine learning force fields (UMLFFs) promise to revolutionize materials science by enabling rapid atomistic simulations across the periodic table. However, their evaluation has been limited to computational benchmarks that may not reflect real-world performance. We introduce UniFFBench, a comprehensive evaluation framework featuring the MinX dataset, a diverse collection of 1,500+ mineral systems spanning 85 elements, extreme thermodynamic conditions (0–5000 K, 0–1000 GPa), and structural complexity, including partial occupancy and disorder. This diversity, combined with experimental reference values for validation, enables assessment of UMLFF generalization across chemical space and conditions substantially beyond typical training scenarios. Our systematic evaluation of six state-of-the-art UMLFFs reveals a substantial "reality gap": models achieving impressive performance on computational benchmarks often fail when confronted with experimental complexity. Even the best-performing models exhibit higher density prediction error than the threshold required for practical applications. We observe disconnects between simulation stability and mechanical property accuracy, with prediction errors correlating with training data representation rather than the modeling method.

24.
arXiv (CS.LG) 2026-06-16

Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

arXiv:2606.15064v1 (announce type: new). Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere. We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.
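
The dilution mechanism described in the abstract can be seen in a toy example. The sketch below is not the paper's pipeline: the demonstration IDs, phase names, scores, and both helper functions are invented for illustration, and the defect-informative metric is represented simply by the scores in the one phase where the defect shows up.

```python
# Hypothetical setup: six demonstrations, three of which carry the defect.
DEFECTIVE = {"d3", "d4", "d5"}

# Higher score = better demonstration according to that phase's metric.
# Only "grasp" separates clean from defective demos; the other two phases
# carry scores that are unrelated to the defect.
SCORES = {
    "reach": {"d0": 0.35, "d1": 0.4, "d2": 0.1, "d3": 0.9, "d4": 0.8, "d5": 0.3},
    "grasp": {"d0": 0.9, "d1": 0.8, "d2": 0.7, "d3": 0.3, "d4": 0.2, "d5": 0.1},
    "place": {"d0": 0.3, "d1": 0.2, "d2": 0.1, "d3": 0.8, "d4": 0.9, "d5": 0.4},
}

def ranks(scores):
    """Map each demo to its rank within one phase (1 = best)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {demo: pos for pos, demo in enumerate(ordered, start=1)}

def rank_aggregate(per_phase, keep):
    """Sum per-phase ranks and keep the demos with the lowest totals."""
    totals = {}
    for phase_scores in per_phase.values():
        for demo, rank in ranks(phase_scores).items():
            totals[demo] = totals.get(demo, 0) + rank
    return sorted(totals, key=totals.get)[:keep]

def single_metric(scores, keep):
    """Keep the top demos according to one metric only."""
    return sorted(scores, key=scores.get, reverse=True)[:keep]

aggregated = rank_aggregate(SCORES, keep=3)
informative_only = single_metric(SCORES["grasp"], keep=3)
defects_aggregated = len(DEFECTIVE & set(aggregated))
defects_informative = len(DEFECTIVE & set(informative_only))
print(aggregated, defects_aggregated)
print(informative_only, defects_informative)
```

In this toy data the two uninformative phases outvote the informative one, so the rank-aggregated subset keeps two defective demonstrations while the single informative metric keeps none.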

25.
arXiv (CS.CL) 2026-06-12

C-QUERI: Congressional Questions, Exchanges, and Responses in Institutions Dataset

Questions in political interviews and hearings serve strategic purposes beyond information gathering, including advancing partisan narratives and shaping public perceptions. However, these strategic aspects remain understudied due to the lack of large-scale datasets for studying such discourse. Congressional hearings provide an especially rich and tractable site for studying political questioning: interactions are structured by formal rules, witnesses are obliged to respond, and members with different political affiliations are guaranteed opportunities to ask questions, enabling comparisons of behaviors across the political spectrum. We develop a pipeline to extract question-answer pairs from unstructured hearing transcripts and construct a novel dataset of committee hearings from the 108th–117th Congresses. Our analysis reveals systematic differences in questioning strategies across parties, showing that the party affiliation of questioners can be predicted from their questions alone. Our dataset and methods not only advance the study of congressional politics but also provide a general framework for analyzing question-answering across interview-like settings.