Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-11

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

arXiv:2509.20241v2 Announce Type: replace Abstract: As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating widely cited estimates are overstated by 4-20x. In test-time scaling scenarios 15x longer than typical queries, the median energy rises 13x to 3.91 Wh (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate 8-20x line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.

02.
arXiv (CS.LG) 2026-06-16

Enhancing Visual Feature Attribution via Weighted Integrated Gradients

arXiv:2505.03201v4 Announce Type: replace-cross Abstract: Integrated Gradients (IG) is a widely used attribution method in explainable AI, particularly in computer vision applications where reliable feature attribution is essential. A key limitation of IG is its sensitivity to the choice of baseline (reference) images. Multi-baseline extensions such as Expected Gradients (EG) assume uniform weighting over baselines, implicitly treating all baseline images as equally informative. In high-dimensional vision models, this assumption often leads to noisy or unstable explanations. This paper proposes Weighted Integrated Gradients (WG), a principled approach that evaluates and weights baselines to enhance attribution reliability. WG introduces an unsupervised criterion for baseline suitability, enabling adaptive selection and weighting of baselines on a per-input basis. The method preserves the core axiomatic properties of IG in a generalized weighted-baseline form. Under an expected, proxy-based fitness–relevance monotonicity assumption, WG provides a probabilistic justification for assigning larger weights to more informative baselines. Experiments on commonly used image datasets and models show that WG improves over EG under our protocol, with up to 36% gains across evaluated convolutional and Transformer architectures. These gains come with additional fitness-evaluation cost, so WG should be viewed as an attribution-fidelity trade-off rather than a faster alternative to EG. By moving beyond the assumption that all baselines contribute equally, Weighted Integrated Gradients offers a clearer and more reliable approach to explaining computer-vision models, improving both understanding and practical usability in explainable AI.

03.
medRxiv (Medicine) 2026-06-18

Diabetes is associated with increased nocturnal respiratory rate

Background and Objective: Diabetes mellitus (DM) causes autonomic neuropathy, which may alter nocturnal respiratory rate (NRR). To test the association between DM and NRR, we analyzed elective polysomnograms of four large observational cohorts. Research Design and Methods: We performed cross-sectional analysis of over 25,000 individuals with polysomnograms (PSGs) from the Sleep Heart Health Study (SHHS), Hispanic Community Health Study/Study of Latinos (HCHS/SOL), Osteoporotic Fractures in Men Study (MrOS), and Wisconsin Sleep Cohort (WSC). Patient-level NRRs were derived from inductance plethysmography waveforms. DM status was determined by self-report, physician diagnosis, medication use, or laboratory values, depending on the cohort. We related DM and NRR (continuous and dichotomized) using logistic regression models and adjusted for potential confounders. Cohort-specific results were combined using random-effects meta-analysis. Results: Meta-analysis of unadjusted models showed a pooled odds ratio (OR) of 1.10 (95% CI:1.04-1.17) for each breath-per-minute (brpm) increase in NRR. This association remained significant after multivariable adjustment (OR:1.06, 95% CI:1.02-1.11). Dichotomized analyses similarly showed higher odds of DM across dichotomization thresholds ranging from 15 to 21 brpm. At a threshold of 18 brpm, the unadjusted pooled OR was 1.77 (95% CI:1.23-2.55, P=0.0022), and the adjusted OR was 1.49 (95% CI:1.10-2.02, P=0.0098). Conclusions: Clinically stable outpatients with elevated NRR have an increased prevalence of DM. Additional studies are needed to investigate whether the mechanism is autonomic neuropathy and whether monitoring NRR can detect early complications of DM.

04.
arXiv (CS.CL) 2026-06-24

PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation

Petroleum-engineering search exposes a supervision gap for strong general retrievers: relevant evidence exists in public web text, but domain relevance labels are scarce. To address this gap, we propose PETRA, a large-scale Petroleum Engineering Text for Retrieval Adaptation dataset and pipeline that converts noisy public web data into a curated domain corpus and synthetic supervision for dense retrieval and reranking. PETRA contains 1.36M curated chunks, approximately 2B token equivalents, $\approx$859k, embedding training rows from $\approx$224k anchors, and roughly 400k teacher-scored reranker candidate rows. Its construction combines high-recall energy-domain curation, an energy-domain classifier with 98.4% test accuracy, chunk-grounded query generation, LLM-written hard negatives, and retrieval-mined candidate lists. PETRA improves first-stage in-domain Normalized Discounted Cumulative Gain (nDCG) from 0.703 to 0.763 through score fusion. Reranker adaptation improves the public Earth Science benchmark by 44% relative and a six-task reasoning-intensive panel by 23%. Failed training recipes show that high train-holdout accuracy on synthetic labels does not predict retrieval gains; retrieval-mined data helps only after being repackaged as teacher-scored candidate lists sampled from the inference-time candidate distribution.

05.
arXiv (quant-ph) 2026-06-17

Helical Dirac Current with Local Coupling to a Chiral Potential

arXiv:2606.17618v1 Announce Type: new Abstract: We show that exact Dirac eigenstates in cylindrical confinement carry a definite helical conserved-current texture even in the zero orbital angular momentum channel l = 0. For the lowest confined mode, the Dirac current contains a nonvanishing azimuthal component together with longitudinal transport and exhibits opposite handedness in the two spin-resolved sectors. The structure also persists into the evanescent region. We further derive the channel-resolved matrix-element kernel generated by a static chiral scalar potential acting on the confined l = 0 Dirac modes. The resulting spin-selective coupling arises from the Dirac current texture and the scalar chiral potential, and yields a geometric selection rule in which diagonal channels vanish while off-diagonal conversion channels survive. The coupling strength is governed by an internal sampled-current overlap Jchi(k), defined as the integral from 0 to R of f(rho) times jphi_up(rho, k) times rho d rho. This quantity measures the spatial overlap between the chiral radial profile and the spin-up azimuthal Dirac-current density. The mechanism is fully local and texture-based, without external magnetic fields or spin-orbit coupling. Within standard Dirac theory, this work identifies the minimal static Dirac-geometric kernel underlying spin-selective response, establishing a baseline structure from which dynamical-medium, scattering, and transport formalisms can be systematically developed toward a complete description of spin-polarization phenomena such as CISS.

06.
PLOS Medicine 2026-05-26

Requiring code sharing to strengthen transparency and trust in research

Authors:

by Helen Lumbard, Lauren Cadwallader, Devin Soper, on behalf of the PLOS Medicine Staff Editors PLOS Medicine has always championed open science and data transparency. Now, recognizing that code is as essential a research artifact as the data it analyzes, we are strengthening our code sharing policy to further ensure reproducibility and trust in the scientific record. Recognizing that code is as essential a research artifact as the data it analyzes, this Editorial outlines how PLOS Medicine is strengthening its code sharing policy to further ensure reproducibility and trust in the scientific record.

08.
arXiv (CS.LG) 2026-06-15

Learning the Context of Errors: Black-Box Online Adaptation of Time Series Foundation Models

arXiv:2606.14222v1 Announce Type: new Abstract: The rapid evolution of Time Series Foundation Models (TSFMs) has advanced zero-shot forecasting across diverse domains. Inspired by the current form of Large Language Models, future TSFMs may be offered as commercialized, closed-source API services. However, many existing online adaptation methods still rely on white-box access for parameter fine-tuning or gradient backpropagation. This paradigm mismatch raises a question: In black-box online adaptation for TSFMs, what should we learn? We answer this with an insight: the predictive errors of the base model are conditioned on both the input and output of the base model (i.e., the context of errors). To validate this insight, we propose ORCA (Online Residual Contextual Adaptation). We conduct extensive experiments across 5 state-of-the-art TSFMs and 8 datasets to demonstrate the effectiveness of our approach. Furthermore, through ablation studies, we quantitatively analyze the impact of different adapter learning hypotheses on the final adaptation performance in black-box online adaptation. Code available at https://github.com/Fifthky/ORCA.

09.
arXiv (CS.AI) 2026-06-15

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

arXiv:2606.08881v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

10.
arXiv (CS.CV) 2026-06-11

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2\% vs.\ 40.95\%) while sustaining smooth, reactive 100\,Hz control. Project website: \href{https://intuitive-robots.github.io/DAM-VLA/}{intuitive-robots.github.io/DAM-VLA/}

11.
PLOS Computational Biology 2026-06-02

Data-driven model reveals increased stability of CAG-expanded <i>huntingtin</i> RNA due to MID1 binding

Authors:

by Yuhong Liu, Annika Reisbitzer, Domagoj Dorešić, Jan Hasenauer, Sybille Krauß, Tatjana Tchumatchenko RNA-binding proteins (RBP) are important regulators of RNA metabolism. In neurodegenerative disorders such as Huntington’s Disease (HD), disrupted RBP-RNA interactions contribute to neuronal dysfunction. One such RBP, Midline 1 (MID1), has been shown to aberrantly associate with mutant huntingtin (Htt) RNA, enhancing its translation, yet the mechanism driving this effect remains unknown. Here, we develop a computational model to understand the role of MID1. Based on previously published data, our model predicts that MID1 increases the stability of the Htt RNA. We experimentally validate this prediction, showing that overexpression of MID1 significantly prolongs the half-life of mutant Htt RNA. Furthermore, we evaluate model refinements, including clustering of MID1-bound RNA, which allow capturing all key observations in the data. Together, we provide a data-driven framework that underlines the importance of RBP-RNA interaction in post-transcriptional regulation. This framework also shows how individual molecular reactions jointly determine RNA stability and protein levels in HD.

12.
arXiv (CS.LG) 2026-06-24

Natural Identifiers for Privacy and Data Audits in Large Language Models

arXiv:2606.24408v1 Announce Type: new Abstract: Assessing the privacy of large language models (LLMs) presents significant challenges. In particular, most existing methods for auditing differential privacy require the insertion of specially crafted canary data during training, making them impractical for auditing already-trained models without costly retraining. Additionally, dataset inference, which audits whether a suspect dataset was used to train a model, is infeasible without access to a private non-member held-out dataset. Yet, such held-out datasets are often unavailable or difficult to construct for real-world cases since they have to be from the same distribution (IID) as the suspect data. These limitations severely hinder the ability to conduct scalable, post-hoc audits. To enable such audits, this work introduces natural identifiers (NIDs) as a novel solution to the above-mentioned challenges. NIDs are structured random strings, such as cryptographic hashes and shortened URLs, naturally occurring in common LLM training datasets. Their format enables the generation of unlimited additional random strings from the same distribution, which can act as alternative canaries for audits and as same-distribution held-out data for dataset inference. Our evaluation highlights that indeed, using NIDs, we can facilitate post-hoc differential privacy auditing without any retraining and enable dataset inference for any suspect dataset containing NIDs without the need for a private non-member held-out dataset.

13.
bioRxiv (Bioinfo) 2026-06-23

Measuring peptide-MHC generalization to unseen alleles across both HLA classes

Authors:

Reported peptide-MHC (pMHC) AUROCs of 0.85-0.95 overstate generalization to unseen alleles: because immunopeptidome data are dense on a few well-studied alleles and sparse on the rest, training and test sets come to share near-identical alleles, so the numbers partly reflect interpolation rather than extrapolation to new MHC grooves. This is a property of the data, not of any one method. We assembled an open, harmonized corpus of 5.8 million experimental measurements across both HLA classes and use it to control the leakage explicitly: alleles held out at the sequence and cluster level, peptide-disjoint splits, and provenance-matched negatives. On strictly novel alleles, generalization is in the high 0.7s rather than the 0.9s a conventional split returns. Against this benchmark we trained a predictor that spans both classes in one model and factors presentation into a peptide-only ligand-likeness term and an allele-specific term; it exceeds eight published predictors by per-allele {Delta}AUROC = +0.22 to +0.37 (p < 10-9), most on the least-studied genes. Corpus, benchmark, and model are released.

14.
arXiv (CS.CL) 2026-06-15

Coping in Crisis: Computational Modeling of Coping Styles in Digital Crisis Discourse During the 2023 Turkiye Earthquake

How do people cope when disaster strikes and can we detect it at scale, in real time, from what they write? This study addresses that question using over one million Turkish-language tweets posted in the aftermath of the February 6, 2023 earthquake in Turkiye, which unfolded in a deeply polarized political context just months before a national election. Drawing on Lazarus and Folkman's (1984) coping theory, we develop a multi-label BERTurk classifier to detect three coping styles (problem-focused, emotion-focused, and meaning-making) across four theoretically motivated crisis phases. BERTurk achieves a macro F1 of 0.693, substantially outperforming a zero-shot mDeBERTa baseline (macro F1 = 0.324). Applied to the full corpus, the classifier reveals a clear temporal trajectory: problem-focused coping dominates the urgency phase and declines sharply, emotion-focused coping rises and stabilizes, and meaning-making increases monotonically. Anger correlates most strongly with meaning-making (Spearman r = 0.387), suggesting it functions as a mobilizing force toward blame attribution rather than practical action. These findings demonstrate that coping theory can be reliably operationalized in real-world digital crisis data and that doing so can help humanitarian organizations tailor their responses to where a population actually is.

15.
arXiv (CS.CL) 2026-06-19

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

16.
arXiv (quant-ph) 2026-06-15

The Bilateral Efficiency of Ethernet: Recalibrating Metcalfe and Boggs After Fifty Years

Authors:

arXiv:2603.19406v2 Announce Type: replace-cross Abstract: In July 1976, Metcalfe and Boggs published their foundational paper on Ethernet in Communications of the ACM. Their efficiency model – E = (P/C)/(P/C + W*T) – measures the fraction of Ether time carrying good forward packets under contention. For fifty years this model has framed how the community thinks about Ethernet performance. We argue it is silent on the question that matters for modern intra-rack interconnect: bilateral transaction efficiency – the fraction of link time that produces committed agreements between sender and receiver. Metcalfe and Boggs themselves planted the seed in their EFTP "end-dally" protocol (Section 7.2.2), and the deeper anchor is older still: Abramson's Alohanet carried positive acknowledgments at the link layer – a bilateral mechanism Metcalfe consciously removed in 1973 to obtain Ethernet's simple, ACK-free packet format. The result is a fifty-year bilateral zigzag: Aloha (bilateral) to Ethernet (unilateral) to the EFTP end-dally (bilateral) to TCP (unilateral-with-bilateral-above). We formalize bilateral efficiency, connect it to the back-to-back Shannon channel with Perfect Information Feedback, and – scoping the claim explicitly to intra-rack distances of one meter or less – describe how the Open Aethernet link recovers mutual knowledge at the link layer. The correction to Table 1 is not a different set of numbers. It is a different question.

17.
arXiv (CS.LG) 2026-06-19

QMaxCal: Path-Space Regularization for Open Quantum Control via Girsanov's Theorem

arXiv:2606.19947v1 Announce Type: cross Abstract: Reliable quantum control in the presence of decoherence requires policies that combat the effect of environmental noise on the controlled dynamics. Open quantum systems under continuous monitoring generate classical measurement records whose drift depends on the noise experienced by the system; the records of two evolutions sharing the same decoherence channels differ only in this drift, so Girsanov's theorem yields a closed-form, differentiable estimator of the KL divergence between their trajectory distributions. We instantiate this estimator with two physically motivated reference measures, yielding two regularizers that both drive the system toward states where the effects of decoherence are minimal: the Wiener KL (KL_W), which is empirically more effective under certain conditions on the noise model, and the drift-variance regularizer (R_DV), which works for all noise models. Both are qualitatively distinct from existing penalties on control fluence or smoothness: they penalize the observable consequences of control on the decoherence channels rather than the control amplitude itself. The regularizers outperform unregularized gradient-based and reinforcement-learning baselines across a range of open quantum systems – including single- and multi-qubit benchmarks and a multi-qubit chain calibrated to a published snapshot of the IBM Kingston processor – along several axes of evaluation: final-state fidelity, robustness to mismatch in the assumed noise model (gains grow from +17 pp at training noise to +27 pp under 2.5x noise mismatch), and occupation of forbidden states. The regularizers reduce infidelity by up to 50%, with ~16% gains on the calibrated IBM Kingston chain.

18.
medRxiv (Medicine) 2026-06-22

Referral pathways, ETAT triage acuity, and inpatient outcomes among children presenting to a national tertiary paediatric emergency unit in Ghana: a prospective cohort study

Emergency referral systems in sub-Saharan Africa are fragmented, and children reaching tertiary facilities through different referral pathways often arrive in advanced clinical states. Prospective data simultaneously characterising referral patterns, triage acuity at presentation, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit are lacking from West Africa. This prospective cohort study enrolled 675 consecutively presenting children aged one month to 12 years at the Paediatric Emergency Unit of Korle Bu Teaching Hospital, Accra, Ghana, from February to December 2019. The primary outcome was all-cause inpatient mortality. Key variables collected included referral status and facility tier, Emergency Triage Assessment and Treatment (ETAT) triage category, ICD-10 diagnostic classification, Oyedeji socioeconomic classification, and time from symptom onset to PEU registration. Crude odds ratios were computed for all candidate predictors. Multivariable logistic regression was conducted using complete case analysis (n = 613). Of 675 children, 63.0% (n = 425) were referred from another health facility; referred children had higher ETAT emergency triage category rates than self-presenting children (32.7% vs 27.6%, p < 0.001). Overall inpatient mortality was 9.9% (67/675). Mortality varied by referral source: 16.7% among secondary/regional hospital referrals, 11.0% among lower-tier facility referrals (district, municipal, CHAG, polyclinic, private, health centre, and maternity home facilities combined, n = 356), 7.6% among self-presenting children, and 7.4% among tertiary referrals. Overall, 30.8% of children were classified as ETAT emergencies on arrival, with case fatility rate of 21.6%. The three most common diagnostic domains were respiratory conditions (17.2%), blood and haematological disorders (17.0%), and digestive presentations (16.4%). Inpatient mortality was highest in neoplastic disease (33.3%, n = 30) and circulatory presentations (31.0%, n = 29). In the primary multivariable analysis (n = 613, 51 events; events-per-variable ratio 4.2), no referral tier was independently associated with inpatient mortality after adjustment. Referral from secondary/regional hospitals showed a borderline non-significant association (adjusted odds ratio 3.09, 95% CI 0.96 to 9.90, p = 0.058). School going children (60-119 months) had higher odds of inpatient death than infants (adjusted odds ratio 5.56, 95% CI 1.16 to 26.53, p = 0.032), as did adolescents (adjusted odds ratio 10.01, 95% CI 2.15 to 46.69, p = 0.003). ETAT emergency category and lower socioeconomic status were not independently significant in this model. A pre-specified sensitivity analysis using the full analytic cohort (n = 674, events-per-variable ratio 6.7) with collapsed referral categories did not confirm any referral tier association; ETAT emergency category and lower SES were independently associated in the sensitivity model. All multivariable estimates should be regarded as exploratory. This prospective cohort provides simultaneous characterisation of referral patterns, ETAT triage acuity, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit in West Africa. The referral-mortality gradient and high ETAT emergency category proportion document the severity of illness arriving through different referral pathways at this facility. The association between secondary/regional hospital referral and inpatient mortality is hypothesis-generating and requires replication in an adequately powered multicentre study before any service-level conclusions can be drawn.

19.
bioRxiv (Bioinfo) 2026-06-16

FlowBench: separating planning, fault recovery and interpretation in agentic bioinformatics

Agentic large language model (LLM) systems are being deployed in bioinformatics faster than they are understood, and single-metric evaluations conflate capabilities that fail independently. We introduce FlowBench, a benchmark that decomposes agentic bioinformatics performance into planning, fault recovery, biological interpretation, and end-to-end output-fidelity. Existing systems achieve high plan completeness, but their closed, single-provider designs prevent attribution of performance to scaffolding versus the underlying model. We therefore built FlowAgent, a modular, provider-agnostic framework whose components can be selectively disabled and whose backbone model can be swapped across providers on a shared harness, and used it to evaluate 23 models from three main providers. Three findings emerge. First, generating a valid workflow plan from a named toolchain is largely solved, whereas inferring an appropriate toolchain from biological intent alone is uniformly difficult regardless of model tier, compressing all models into a narrow 44-57% pass-rate band. Second, ablation shows that the dependency-structured plan and a completeness-reflection step drive performance, while adding a same-context validator-driven retry makes structural quality worse. Third, fault recovery and data-grounded interpretation remain unsolved. Models frequently propose fixes that force a clean exit while leaving the underlying data invalid, and data-grounded interpretation lags internal-knowledge recall by a consistent margin. Safety does not emerge from capability, and reasoning-tier models were among the least reliable at recognising unrecoverable faults. Once planning saturates, agent architecture and refusal calibration, not model scale, are the productive frontier.

20.
arXiv (CS.LG) 2026-06-12

Thermodynamic assessment of machine learning models for solid-state synthesis prediction

arXiv:2602.04075v2 Announce Type: replace-cross Abstract: Machine learning models have recently emerged to predict whether hypothetical solid-state materials can be synthesized. These models aim to circumvent direct first-principles modeling of solid-state phase transformations, instead learning from large databases of successfully synthesized materials. Here, we assess the alignment of several recently introduced synthesis prediction models with material and reaction thermodynamics, quantified by the energy with respect to the convex hull and a metric accounting for thermodynamic selectivity of enumerated synthesis reactions. A dataset of successful synthesis recipes was used to determine the likely bounds on both quantities beyond which materials can be deemed unlikely to be synthesized. With these bounds as context, thermodynamic quantities were computed using the CHGNet foundation potential for thousands of new hypothetical materials generated using the Chemeleon generative model. Four recently published machine learning models for synthesizability prediction were applied to this same dataset, and the resultant predictions were considered against computed thermodynamics. We find these models generally overpredict the likelihood of synthesis, but some model scores do trend with thermodynamic heuristics, assigning lower scores to materials that are less stable or do not have an available synthesis recipe that is calculated to be thermodynamically selective. In total, this work identifies existing gaps in machine learning models for materials synthesis and introduces a new approach to assess their quality in the absence of extensive negative examples (failed syntheses).

21.
arXiv (CS.CV) 2026-06-16

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present $K-Prism$, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) $semantic priors$ learned from annotated datasets, (ii) $in-context knowledge$ from few-shot reference examples, and (iii) $interactive feedback$ from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining $what$ to segment and 2-D dense prompts indicating $where$ to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.

22.
arXiv (quant-ph) 2026-06-24

Monitoring Beam Splitter Entanglement using Quantumness

arXiv:2606.24242v1 Announce Type: new Abstract: We report on an experiment in which two independent squeezed vacuum states get entangled by mixing them with a balanced beam splitter. We follow standard practice and use an inseparability criterion to quantify their entanglement. However, this only allows us to witness the entanglement, but not to determine the deleterious effects of experimental imperfections due to the beam splitter mixing and the associated mode-mismatch and detection imperfections. We therefore introduce an alternative framework suitable for continuous variable systems using the states' quantumness, $\Xi$. We show that, under ideal circumstances, $\Xi$ is a conserved quantity under beam mixing. This allows us to benchmark the experiment's performance by comparing the states' quantumness $\Xi$ after the beam splitter mixing with $\Xi$ before. Such a comparison is not possible with entanglement witnesses, as the input states are unentangled. This highlights the main strength of our approach: its ability to generally quantify the quantumness of multi-mode continuous variable states and use this to probe different stages in an experiment.

23.
arXiv (CS.LG) 2026-06-12

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

arXiv:2606.13194v1 Announce Type: new Abstract: Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.

24.
arXiv (CS.LG) 2026-06-16

Incentives and Evidence in Learned Service Orchestration

arXiv:2606.16555v1 Announce Type: cross Abstract: Reinforcement learning for service orchestration has been the subject of sustained research for over a decade, yet it is not used in production at scale. The usual explanation is that learned controllers degrade under delayed and noisy telemetry, workload shifts, and uncontrolled tenants. We test whether existing evidence supports that explanation. We evaluate three highly influential RL-based orchestration systems spanning resource allocation, DAG scheduling, and autoscaling, using pre-registered predictions about comparative degradation under production-relevant perturbations and paired inference with family-wise error correction. Across the tests, most predicted performance reversals do not occur. Diagnostic analyses show that these outcomes often reflect comparator collapse, artefact limitations, or evaluation choices rather than evidence that learned controllers tolerate the perturbations. One apparent advantage under observation lag is roughly fortyfold compared to a Kubernetes HPA-equivalent controller. Another widely cited result cannot be reconstructed from its released artefact, and the strongest reproducible margin is far smaller than the published results. Conclusions also reverse under changes in perturbation magnitude and evaluation mode. Based on these results and broader patterns in the literature, we identify an institutional problem. Publication and review incentives favour benchmark gains against convenient comparators, even when those gains provide little evidence of deployment performance. We argue that the problem is not solely technical. Rather, it is institutional, so learned orchestration needs production-grade comparators, registered perturbation models, separate operational metrics, and publication criteria that reward reproducible operational evidence. Without these changes, the literature can grow without establishing whether learning improves orchestration.

25.
arXiv (math.PR) 2026-06-15

Stationary measures for higher spin vertex models on a strip

Authors:

arXiv:2309.04897v2 Announce Type: replace-cross Abstract: We introduce a higher spin vertex model on a strip with fused vertex weights. This model can be regarded as a generalization of both the unfused six-vertex model on a strip arXiv:2212.09111 and an 'integrable two-step Floquet dynamics' model introduced in arXiv:1711.08884. We solve for the stationary measure using a fused version of the matrix product ansatz and then characterize it in terms of the Askey-Wilson process. Using this characterization, we obtain the limits of the mean density along an arbitrary down-right path. It turns out that all these models share a common phase diagram, which, after an appropriate mapping, matches the phase diagram of open ASEP. This provides evidence for the universality of this phase diagram.