Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-16

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arXiv:2606.15862v1 Announce Type: new Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

02.
arXiv (quant-ph) 2026-06-12

The Pound-Drever-Hall Method for Superconducting-Qubit Readout

arXiv:2512.03138v3 Announce Type: replace Abstract: Scaling quantum computers to large sizes requires the implementation of many parallel qubit readouts. Here we present an ultrastable superconducting-qubit readout method using the multi-tone self-phase-referenced Pound-Drever-Hall (PDH) technique, originally developed for use with optical cavities. In this work, we benchmark PDH readout of a single transmon qubit, using room-temperature heterodyne detection of all tones to reconstruct the PDH signal. We demonstrate that PDH qubit readout is insensitive to microwave phase drift, displaying $0.73^\circ$ phase stability over 2 hours, and capable of single-shot readout in the presence of phase errors exceeding the phase shift induced by the qubit state. We show that the PDH sideband tones do not cause unwanted measurement-induced state transitions for a transmon qubit, leading to a potential signal enhancement of at least $14$~dB.

03.
medRxiv (Medicine) 2026-06-16

Adherence to Red Reflex and Vision Screening Recommendations: A Deep Dive into Primary Care Implementation Gaps

Introduction: Early childhood vision screening is critical for detecting amblyopia and other vision-threatening conditions. Despite screening recommendations during well-child visits, rates remain low. Red reflex assessment is recommended to identify serious ocular pathology, yet its use in primary care is not well described. We examined rates and drivers of vision screening in pediatric primary care. Methods: We conducted a retrospective review of electronic health records for children 3 to 5 years attending well-child visits in 2022 in one of three representative primary care clinics within a university health system. Outcomes were documented red reflex and functional vision tests. We evaluated associations with patient demographics and clinic site using multivariable logistic regression Results: Among 1,003 visits, 21.1% (n=212) had a documented red reflex assessment, and 60.8% (n=610) a functional vision test. Younger children (ages 3 and 4 vs. 5 years) had higher odds of red reflex assessment [adjusted odds ratio (aOR) 9.00 and 8.64], and lower odds of a functional vision (aOR 0.47 and 0.59) test. Females had higher odds of red reflex assessment (aOR 1.53). Other/Multiracial children had lower odds of red reflex assessment than Non-Hispanic White children (aOR 0.48). Screening rates varied significantly by clinic site Conclusions: Visual function and red reflex assessment are inconsistently performed in pediatric primary care, with particularly low rates of red reflex documentation. Screening rates varied between clinics and were affected by age. These findings highlight missed opportunities for early detection of vision-threatening conditions and identify targets for improving adherence to pediatric vision screening recommendations

04.
arXiv (CS.AI) 2026-06-17

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

arXiv:2606.17073v1 Announce Type: cross Abstract: While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.

05.
arXiv (CS.AI) 2026-06-24

The Measurable Majority

arXiv:2606.23853v1 Announce Type: cross Abstract: This paper studies strict majority reasoning in finite electorates using so-called $social decision frames$: finite sets of voters equipped with distinguished families of coalitions interpreted as those voting blocs evaluated to form a strict majority. A coherence criterion for qualitative majority judgments is identified and shown to give an exact characterization for representability of strict majorities by finitely additive measures. In addition, a minimal natural logic for reasoning about strict majorities is shown to be sound and complete. These developments motivate examination of associated combinatorial questions concerning incoherence in finite families of sets; partial results and a conjecture are given. Finally, the results of this paper are applied to correct a classical representation theorem for weak qualitative probability structures due to Patrick Suppes and to establish a May-type characterization for ordinary strict majority rule for social decision frames.

06.
arXiv (CS.AI) 2026-06-11

Characterizing Software Aging in GPU-Based LLM Serving Systems

arXiv:2606.11916v1 Announce Type: cross Abstract: This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

07.
arXiv (CS.CV) 2026-06-11

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.

08.
arXiv (CS.CV) 2026-06-16

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

Authors:

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.

09.
PLOS Computational Biology 2026-06-01

Supervised deep learning with gene functional annotation for cell classification

Authors:

by Zhexiao Lin, Yuanyuan Gao, Wei Sun Gene-by-gene differential expression analysis is a widely used supervised approach for interpreting single-cell RNA-sequencing (scRNA-seq) data. However, modern scRNA-seq datasets often contain large numbers of cells, leading to the identification of many differentially expressed genes with extremely small p-values but negligible effect sizes, thus making biological interpretation difficult. To overcome this challenge, we developed Supervised Deep learning with gene functional ANnotation (SDAN), a method that integrates gene functional annotation information (e.g., protein-protein interaction) with gene-expression profiles through a graph neural network. SDAN identifies functionally coherent gene sets that optimally classify cells, and the resulting cell-level classification scores can be aggregated to make individual-level predictions. We evaluated SDAN alongside three representative existing methods in three real-data applications aimed at identifying gene sets associated with severe COVID-19, dementia, and cancer immunotherapy response. Across all applications, SDAN consistently outperformed the alternative approaches by achieving two objectives simultaneously: accurate outcome classification and clear assignment of genes to functionally related gene sets.

10.
arXiv (CS.AI) 2026-06-24

Social Structure Matters in 3D Human-Human Interaction Generation

arXiv:2606.24255v1 Announce Type: cross Abstract: Although text-to-motion generation has achieved strong progress in synthesizing realistic single-person motions from language, extending it to text-driven 3D human-human interaction (HHI) remains non-trivial, as HHI requires modeling the underlying social structure that governs phase progression, actor roles, and inter-actor coordination. In this paper, we formulate HHI generation as a social structure modeling and grounding problem: the model must first infer how an interaction unfolds and how the two actors coordinate their roles, and then realize this structure as continuous, physically plausible, and partner-aware 3D motion. To study how such structure should be modeled, we first examine the capability boundary of large language models (LLMs) for HHI generation. Our analysis shows that LLMs can think by recovering phase decompositions and partner-aware roles, but cannot directly move, as they fail to generate dynamic, physically plausible, and interaction-aware motion. This motivates our planner-executor paradigm, Think with LLM, Move with Motion Skill. The LLM planner converts implicit interaction semantics into motion-aligned social supervision by decomposing interactions into phases, assigning partner-aware actor roles, and aligning them with motion sequence. The motion executor then grounds the planned social structure into coordinated two-person motion by adapting a pretrained solo motion model with LoRA, previous-phase self-conditioning, and ego-relative partner conditioning. Together, our Solo-to-Social framework bridges social organization and motion realization, producing 3D HHI with improved phase consistency, role alignment, and partner-aware coordination.

11.
PLOS Computational Biology 2026-06-04

CIPHER: An end-to-end framework for designing optimized aggregated spatial transcriptomics experiments

by Zachary Hemminger, Haley De Ocampo, Fangming Xie, Zhiqian Zhai, Jingyi Jessica Li, Roy Wollman Motivation Most imaging-based spatial transcriptomics methods measure individual genes, which limits scalability and typically requires integration with scRNA-seq to recover full cellular states. Recent approaches such as CISI, FISHnCHIPs, and ATLAS address this limitation by measuring aggregate transcriptional signatures, where multiple genes are pooled into each channel to increase throughput. While aggregate measurements improve scalability, they shift the problem from gene selection to feature design. For effective integration with scRNA-seq, these signatures must be not only discriminative in transcriptional space but also straightforward to measure, with balanced signal, sufficient dynamic range, and robustness to experimental noise. By optimizing decoding accuracy in isolation, existing methods leave substantial performance on the table. Results We present CIPHER (Cell Identity Projection using Hybridization Encoding Rules), a neural-network framework that jointly optimizes the experimental encoding matrix, i.e., the way that genes are aggregated to signatures, and the downstream cell embedding. CIPHER integrates the physical limits of imaging assays directly into its loss function, shaping the latent space to maximize discriminability while maintaining robustness to measurement noise and signal constraints. Using a large-scale mouse brain scRNA-seq reference, we show that CIPHER-designed encodings yield latent spaces with improved cell-type separability, uniform signal utilization, and greater resilience to hybridization variability, resulting in higher decoding accuracy from both simulated and experimental data. Conclusion CIPHER formulates aggregate signature design as a joint optimization problem over decoding accuracy and experimental measurability. This enables systematic, scRNA-seq-aligned feature design for scalable spatial transcriptomics based on aggregate measurements. Availability Code and documentation are available at https://github.com/wollmanlab/Design/.

12.
arXiv (CS.AI) 2026-06-16

Intrinsic Computational Functionalism and Simulated Consciousness

arXiv:2606.15348v1 Announce Type: cross Abstract: A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.

13.
arXiv (CS.AI) 2026-06-25

A Hybrid CNN-LSTM Intrusion Detection Framework for Cybersecurity in Smart Renewable Energy Grids

arXiv:2606.25200v1 Announce Type: cross Abstract: The accelerated digitalization of renewable energy smart grids through IoT sensors, AMI, and SCADA systems has significantly expanded the attack surface for sophisticated cyberattacks, FDI attacks that stealthily distort state estimation and DoS/DDoS attacks that flood communication channels. Current IDS, however, exhibit three inherent limitations: inadequate modeling of the temporal progression of multi-step attacks, degraded scalability under extremely skewed class distributions of standard benchmark datasets, and restricted generalization across heterogeneous network environments. In this study, we present a Hybrid CNN-LSTM IDS that jointly exploits CNN-based spatial feature extraction and LSTM-based temporal sequence modeling, enabling the detection of instantaneous volumetric anomalies and gradually evolving low and slow-attack campaigns in real time. The model was trained using a seven-step preprocessing workflow comprising missing-value imputation, min-max normalization, one-hot encoding, SMOTE class balancing, mutual-information feature selection, causal temporal sequence construction (T=10), and stratified partitioning. LSTM (96.1%), Random Forest (93.5%), SVM (91.2%) and KNN (89.7%); in NSL-KDD, it reaches 98.2% precision versus 96.4% (LSTM), 95.2% (CNN), 92.7% (Random Forest) and 90.8% (SVM), with margins of 2-9 percentage points in all measures. An ablation analysis identified SMOTE balancing as the most influential design choice (-3.7~pp F1 without it). The model achieves a real-time inference throughput of 27,800 flows/s on GPU and 0.082 ms/sample CPU latency in FP32,, with INT8 quantization providing an additional 3.1 x speedup at 0.3% accuracy loss, confirming deployment feasibility on resource-constrained IEDs with

14.
medRxiv (Medicine) 2026-06-22

AI-driven Multimodal Representation Learning for Latent Mediation Structure Discovery of Socioeconomic Disadvantage, Psychosocial Factors, and Cardiometabolic Multimorbidity

Authors:

Social disadvantage is associated with multimorbidity, but the pathways linking social conditions to disease burden remain poorly understood. We developed an AI-driven multimodal mediation framework that integrates socioeconomic, psychosocial, clinical, laboratory, behavioral, and genomic data from the All of Us Research Program. Modality-specific variational autoencoders were used to derive latent representations of each data domain, and mediation analyses were subsequently performed in latent space to evaluate indirect associations between socioeconomic disadvantage, psychosocial factors, and multimorbidity. The final analytic cohort included 20,804 participants with complete multimodal data. Across 800 exposure–mediator–outcome combinations, mediation signals were concentrated within a small number of latent dimensions. The strongest indirect association linked a socioeconomic disadvantage dimension, a psychosocial vulnerability dimension, and a cardiometabolic multimorbidity dimension (NIE = 0.002517). The psychosocial dimension was characterized by poorer mental health, greater loneliness, lower social well-being, and lower health literacy, whereas the outcome dimension was associated with hypertension, diabetes, hyperlipidemia, obesity, chronic kidney disease, and heart disease. Bootstrap analyses supported the stability of the leading pathway. These findings suggest that psychosocial vulnerability may contribute to the association between socioeconomic disadvantage and cardiometabolic multimorbidity. More broadly, the proposed framework illustrates how AI-based representation learning can be used to investigate complex relationships across high-dimensional multimodal health data.

15.
arXiv (CS.CV) 2026-06-25

Benchmarking the Alignment of Data-Quality Metrics, Human Judgment and Land-Cover Segmentation Performance for Earth Observation

Volume and quality of datasets are crucial for deep learning model training, yet they are often constrained by availability and data acquisition costs. Synthetic data augmentation can extend existing datasets with realistic images, and the quality of these images is generally assessed through fidelity metrics such as FID, KID, IS, LPIPS and SSIM that measure structural or distributional similarity. However, such metrics, including the widely used FID, focus on visual fidelity without reflecting downstream utility, and can diverge from human perception under perturbations that are imperceptible to human observers. In this work, we systematically evaluate Earth observation datasets alongside synthetic counterparts generated by deep generative models, comparing automatic metrics against human perception and downstream tasks. Our results reveal a stark misalignment: semantics-preserving perturbations such as rotation drastically alter metric scores while leaving human recognition unaffected, and synthetic samples that score poorly on automatic metrics achieve comparable or higher perceived realism, and can improve downstream performance when combined with real data. By benchmarking semantic segmentation models trained on mixed real-synthetic datasets, we demonstrate that quality metrics rooted in ImageNet-pretrained feature spaces are unreliable indicators for geospatial data. Our findings underscore that automatic quality evaluation of synthetic datasets should be grounded in downstream task performance and human evaluation.

16.
arXiv (CS.AI) 2026-06-17

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

arXiv:2606.18111v1 Announce Type: cross Abstract: Fairness is an important aspect of decision-making in multi-objective reinforcement learning (MORL), where policies must ensure both optimality and equity across multiple, potentially conflicting objectives. While single-policy MORL methods can learn fair policies for fixed user preferences using welfare functions such as the generalized Gini welfare function (GGF), they fail to provide the diverse set of policies necessary for dynamic or unknown user preferences. To address this limitation, we formalize the fair optimization problem in multi-policy MORL, where the goal is to learn a set of Pareto-optimal policies that ensure fairness across all possible user preferences. Our key technical contributions are threefold: (1) We show that for concave, piecewise-linear welfare functions (e.g., GGF), fair policies remain in the convex coverage set (CCS), which is an approximated Pareto front for linear scalarization. (2) We demonstrate that non-stationary policies, augmented with accrued reward histories, and stochastic policies improve fairness by dynamically adapting to historical inequities. (3) We propose three novel algorithms, which include integrating GGF with multi-policy multi-objective Q-Learning (MOQL), state-augmented multi-policy MOQL for learning non-statoinary policies, and its novel extension for learning stochastic policies. We evaluate our algorithms across various domains and compare our methods against the state-of-the-art MORL baselines. The empirical results show that our methods learn a set of fair policies that accommodate different user preferences.

17.
arXiv (CS.LG) 2026-06-16

Active Learning with Low-Rank Structure for Data Selection

arXiv:2606.16045v1 Announce Type: new Abstract: In the data selection problem, the objective is to choose a small, representative subset of data that can be used to efficiently train a machine learning model. Sener and Savarese [ICLR 2018] showed that, given an embedding representation of the data and suitable geometric assumptions, heuristics based on $k$-center clustering can be used to perform data selection. This perspective was further explored by Axiotis et. al. [ICML 2024], who proposed a data selection approach based on $k$-means clustering and sensitivity sampling. However, these methods rely on the assumption that the dataset exhibits intrinsic geometric structure that can be effectively captured by clustering, whereas many modern datasets instead possess global algebraic structure that is better exploited by low-rank approximation or principal component analysis. In this paper, we introduce a new data selection framework based on low-rank approximation and residual-based sampling, formulated through the lens of row subset selection and loss-preserving coreset construction. Given an embedding representation of the data satisfying mild regularity conditions, which can be interpreted as algebraic or angular notions of Lipschitz continuity, we show that it is possible to select a weighted subset of $\tilde{O}\left(k + \frac{1}{\varepsilon^2}\right)$ data points whose average loss approximates the average loss over the full dataset within a $(1+\varepsilon)$ relative error, up to an additive $\varepsilon \Phi_k$ term, where $\Phi_k$ denotes the optimal rank-$k$ approximation cost of the embedding matrix. We complement these theoretical guarantees with empirical evaluations, demonstrating that on a range of real-world datasets, our data selection approach achieves improved performance over prior strategies based on uniform sampling or clustering-based sensitivity sampling.

18.
arXiv (CS.LG) 2026-06-16

DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

arXiv:2506.20668v3 Announce Type: replace-cross Abstract: We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8\% average success rate, compared to 13.8\% for the pre-trained policy and 52.5\% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/

19.
arXiv (CS.LG) 2026-06-19

The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions

arXiv:2606.19799v1 Announce Type: cross Abstract: Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.

20.
arXiv (CS.AI) 2026-06-19

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

arXiv:2605.29483v2 Announce Type: replace Abstract: Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 25% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

21.
arXiv (CS.AI) 2026-06-16

The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

arXiv:2606.16541v1 Announce Type: new Abstract: Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by faithfulness: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce Bidirectional Provability Fingerprinting (\bpf{}), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) Counterfactual Probe Generation (\cpg{}), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the Equivalence Spectrum, a continuous faithfulness score that replaces brittle binary verdicts; (iii) Adaptive Probe Budget Allocation (\apba{}), an information-theoretic budget router; and (iv) Faithfulness-Guided Decoding (\fgd{}), which uses \bpf{} signals as a reward during autoformalization. We prove a drift detection theorem and a PAC-faithfulness result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release \driftbench{}, a benchmark of $2{,}183$ NL/Lean~4 pairs with controlled drift labels across six subfields of mathlib4. \bpf{}\,+\,\cpg{} detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate-against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and \fgd{} reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/

22.
Nature Biotechnology 2026-06-23

Efficient generation of epitope-targeted antibodies with Germinal

Obtaining antibodies to specific protein targets is a widely important yet experimentally laborious process. Meanwhile, computational methods for antibody design have been limited by low success rates that require resource-intensive screening. Here we introduce Germinal, a broadly enabling generative pipeline that designs antibodies against specific epitopes with nanomolar binding affinities while requiring only low-n experimental testing. Our method co-optimizes antibody structure and sequence by integrating a structure predictor with an antibody-specific protein language model to perform de novo design of functional complementarity-determining regions onto a user-specified structural framework. When tested against four diverse protein targets, Germinal designed functional antibodies across all targets and binder formats, testing only 43–101 designs for each antigen. Validated designs also exhibited robust expression in mammalian cells and high sequence and structural novelty. We provide open-source code and full computational and experimental protocols to facilitate wide adoption. Germinal achieves epitope-targeted, de novo complementarity-determining region design with high experimental success rates.

23.
arXiv (CS.LG) 2026-06-25

Limitations of SGD for Multi-Index Models Beyond Statistical Queries

arXiv:2602.05704v2 Announce Type: replace Abstract: Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: Existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of standard vanilla SGD, for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.

24.
Nature (Science) 2026-06-22

Stereoretentive decarbonylative C(sp<sup>3</sup>)-C(sp<sup>3</sup>) cross-coupling

Authors:

While C(sp3)–C(sp3) bond-forming cross-coupling methods have become more common, stereocontrolled bond-formation remains a challenge,1 despite its importance for drug discovery, where there is a emerging demand for molecules with increased sp3 character.2-4 Enantiospecific cross-coupling approaches would complement advances in enantioselective coupling,5-8 but have been limited to specialized substrates with lower availability5,9 because stereospecific oxidative addition of more abundant chiral alkyl electrophiles is unknown.10 Inspired by the classic, stereoretentive Curtius rearrangement,11 herein we disclose a catalytic strategy that proceeds by an analogous stereoretentive decarbonylation step to form a versatile chiral alkylnickel intermediate from easily-available chiral amino-acid and α-hydroxy-acid derivatives. The chiral alkylnickel intermediates decompose and/or racemize on the order of minutes, but are sufficiently stable to enable stereoretentive cross-electrophile coupling12 with alkyl radicals (derived from alkyl iodides) at relatively low temperature (22-40 °C). This mechanistic strategy provides a straightforward approach to stereocontrolled C(sp3)–C(sp3) bond formation, including diastereomers that are inaccessible by stereoselective radical mechanisms. The “metallo-Curtius” strategy described in this study lays a mechanistic foundation for the development many new stereospecific cross-coupling reactions.

25.
arXiv (CS.AI) 2026-06-24

Red-Teaming the Agentic Red-Team

arXiv:2606.24496v1 Announce Type: cross Abstract: The use of agentic systems to perform offensive security operations has moved from a theoretical possibility to a commoditized capability. However, while the community has focused on creating more and more capable agents, less attention has been allocated to assessing the security of those systems. In this work, we present the first in-depth security analysis of the most widely used agentic systems for offensive security operations. We show that most of these tools share common design flaws that enable an active adversary to exfiltrate API keys, establish persistent footholds, and fully compromise the operator's machine, even when the agent operates inside a sandboxed container. To support our analysis, we introduce a full cyber kill chain for such agentic systems, capturing the progression from initial LLM manipulation to lateral movement, persistence, guardrail bypass, and sandbox escape. Building on our security analysis, we derive a robust architecture for agentic offensive-security tools and propose actionable, broadly applicable design principles that mitigate the disclosed attack paths at the architectural level.