Academic Intelligence · Curated Daily

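Explore the Frontiers of Global Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.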

01.
arXiv (CS.CV) 2026-06-16

Kairos: A Native World Model Stack for Physical AI

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
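To make the local and mid-range attention patterns described above concrete, here is a minimal sketch of the two window masks. The function names, the causal convention, and the parameters `window` and `dilation` are illustrative assumptions; the abstract does not specify the actual masks used in Kairos, and the gated linear attention component is not modeled here.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumed convention: causal attention over the last `window` positions,
    including the current one.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def dilated_window_mask(seq_len, window, dilation):
    """Causal window that keeps every `dilation`-th past position.

    It covers a span of `window * dilation` steps while attending to at most
    `window` keys, which is how a dilated window reaches further back at the
    same cost as a dense window of the same size.
    """
    return [[0 <= i - j < window * dilation and (i - j) % dilation == 0
             for j in range(seq_len)]
            for i in range(seq_len)]


# The last query of a length-6 sequence under each pattern.
local_row = sliding_window_mask(6, window=2)[5]
dilated_row = dilated_window_mask(6, window=2, dilation=2)[5]
```

With these settings the local mask lets position 5 see positions 4 and 5, while the dilated mask lets it see positions 3 and 5.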

02.
arXiv (CS.LG) 2026-06-16

PromptShift-CRC: Drift-Aware Conformal Risk Control for Foundation Models Under Prompt and Domain Shift

arXiv:2606.15964v1 Announce Type: cross Abstract: Foundation models are now used in settings where the prompts they receive can change quickly. Users change, topics change, policies change, and the model may suddenly face a kind of request that was rare in the calibration data. This makes fixed calibration risky. Conformal prediction and conformal risk control give model-agnostic ways to control error, but they work best when the calibration data still look like the future data. This paper develops PromptShift-CRC, a drift-aware conformal risk control method for foundation-model outputs under prompt and domain shift. The method embeds prompts and responses, measures how far the current prompt stream has moved from the calibration pool, gives more weight to relevant or recent calibration examples, and updates the risk level online after observed violations. It reports three practical diagnostics: realized risk error, prompt drift, and effective calibration size. We give conditions under which the method controls risk up to terms for distribution mismatch and weighted quantile uncertainty. In a synthetic prompt-shift benchmark, static conformal risk control fails sharply after drift, while PromptShift-CRC gives the best coverage among the adaptive baselines considered. We then evaluate the same calibration layer on public benchmark-derived streams for question answering, toxicity, summarization factuality, and long-context hallucination risk
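As a rough illustration of the weighting idea in the abstract above, the sketch below computes a weighted quantile of calibration scores, with weights that decay with the age of each calibration example. The function names, the exponential decay, the `half_life` parameter, the unit weight given to the test point, and the use of the Kish formula for effective calibration size are all assumptions made for illustration, not the paper's method.

```python
import math


def recency_weights(ages, half_life=50.0):
    """Hypothetical weighting: halve an example's weight every `half_life` steps."""
    return [0.5 ** (age / half_life) for age in ages]


def weighted_conformal_quantile(scores, weights, alpha):
    """Smallest calibration score whose cumulative normalized weight reaches 1 - alpha.

    One unit of weight is reserved for the unseen test point (an assumption),
    so with uniform weights this reduces to the usual split-conformal quantile.
    Returns math.inf when the calibration weight is too small to reach 1 - alpha.
    """
    total = sum(weights) + 1.0  # +1 accounts for the test point itself
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative / total >= 1.0 - alpha:
            return score
    return math.inf


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

Heavier down-weighting of old examples tracks drift more closely but shrinks the effective sample size, which in turn makes the quantile estimate noisier.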

03.
arXiv (CS.CV) 2026-06-18

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-$\lambda$ Mix Interaction ($L\lambda MI$) block. Comprising Local-$\lambda$ and Interactive-$\lambda$ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2\% of the parameters (0.22B vs. 11.9B) while delivering a $>15\times$ acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

04.
medRxiv (Medicine) 2026-06-16

Non-invasive Detection of Fasciculation Using Surface EMG with a Wavelet-Based Analytical Method (DEWCS)

Objective: Needle electromyography (nEMG) is essential for diagnosing neuromuscular disorders but is invasive and often painful. We employed single-channel bipolar surface EMG (sEMG) analyzed with a novel wavelet-based analytical approach, Detecting and Extracting Elemental Wave Components based on a Wavelet Coefficient Set (DEWCS), and investigated whether fasciculation-related activity could be identified. Methods: In this prospective study, 28 patients undergoing nEMG for suspected neuromuscular disorders and 13 healthy controls were included. Resting-state sEMG was recorded from selected muscles using single-channel bipolar active electrodes at a high sampling rate. DEWCS was used to extract indices reflecting fast- and slow-type motor unit (MU)-related activity. These standardized indices were evaluated against nEMG-detected fasciculation potentials using generalized estimating equation logistic regression to account for within-subject clustering. Diagnostic performance was assessed by receiver operating characteristic analysis. Results: A total of 67 muscles from 38 participants were analyzed. Indices of fast- and slow-type MU-related activity were significantly associated with fasciculation potentials (slow: OR 5.10, p = 0.0041; fast: OR 2.38, p = 0.0162). The combined model showed excellent discrimination (area under the curve = 0.97), outperforming either index alone. Muscle region had no significant effect. Conclusions: A single-channel bipolar sEMG setup combined with DEWCS detected fasciculation-related activity with promising accuracy. This method may serve as a non-invasive surrogate marker of lower motor neuron involvement. Further validation in larger cohorts is warranted. Significance: This non-invasive sEMG approach may help detect fasciculation-related activity and complement nEMG in neuromuscular diagnostics.

05.
arXiv (CS.CL) 2026-06-11

Teaching Diffusion to Speculate Left-to-Right

Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
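The left-to-right asymmetry described above can be sketched in a few lines: verification keeps draft tokens only up to the first mismatch, so an early error costs more than a late one. The sketch assumes greedy (exact-match) verification and independent per-position acceptance probabilities. Both are simplifications chosen for illustration, and the function names are hypothetical rather than taken from the paper.

```python
def accepted_prefix_length(draft_tokens, target_tokens):
    """Count draft tokens accepted under greedy verification.

    Acceptance stops at the first position where the draft disagrees
    with the target model's token.
    """
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    return accepted


def expected_accepted_length(accept_probs):
    """E[L] = sum over k of prod_{i <= k} p_i.

    Assumes acceptance events at different positions are independent.
    """
    expected = 0.0
    prefix_prob = 1.0
    for p in accept_probs:
        prefix_prob *= p      # probability that the whole prefix up to here survives
        expected += prefix_prob
    return expected


# Same per-position accuracies, different order: the weak position hurts more early.
weak_late = expected_accepted_length([0.9, 0.5])
weak_early = expected_accepted_length([0.5, 0.9])
```

Under these assumptions, placing the weaker position first gives a shorter expected accepted length, which is the intuition behind weighting early positions and the first error more heavily during training.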

06.
arXiv (CS.CV) 2026-06-11

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination (DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.

07.
arXiv (CS.CL) 2026-06-11

Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation

While large language models (LLMs) are increasingly used as automatic judges for question answering (QA) and other reference-conditioned evaluation tasks, little is known about their ability to adhere to a provided reference. We identify a critical failure mode of such reference-based LLM QA evaluation: when the provided reference conflicts with the judge model's parametric knowledge, the resulting scores become unreliable, substantially degrading evaluation fidelity. To study this phenomenon systematically, we introduce a controlled swapped-reference QA framework that induces reference-belief conflicts. Specifically, we replace the reference answer with an incorrect entity and construct diverse pairings of original and swapped references with correspondingly aligned candidate answers. Surprisingly, grading reliability drops sharply under swapped references across a broad set of judge models. We empirically show that this vulnerability is driven by judges' over-reliance on parametric knowledge, leading judges to disregard the given reference under conflict. Finally, we find that this failure persists under common prompt-based mitigation strategies, highlighting a fundamental limitation of LLM-as-a-judge evaluation and motivating reference-based protocols that enforce stronger adherence to the provided reference.

08.
arXiv (CS.LG) 2026-06-16

Phase-Localized Curation Does Not Help: A Negative Result on Per-Phase Metric Selection for Demonstration Filtering

arXiv:2606.15064v1 Announce Type: new Abstract: Manipulation demonstrations have temporal phase structure, and a natural hypothesis is that demonstration-curation metrics should be applied within phases rather than globally. The idea is to segment each trajectory into phases, score each phase with the metric that is locally most informative, and then aggregate. This follows directly from prior work showing that a single global metric can be the best detector of a defect and yet the worst curator of the resulting policy. We test the per-phase hypothesis on three contact-rich LIBERO pick-and-place tasks with a controlled early-release structural defect, comparing phase-gated curation against the same metrics applied uniformly and against a strong single global metric. Across all three tasks and five random seeds per condition, phase-gated curation is never the best curation strategy, and it is the worst of the three on two of the three tasks (Task 1: 86.0 vs. 92.0 for global; Task 3: 22.7 vs. 48.0 for uniform). We trace the failure to a concrete mechanism. When the defect signal is concentrated in a single phase, rank-aggregating across phases dilutes that signal with uninformative scores from defect-free phases, selecting a worse demonstration subset than simply applying the defect-informative metric everywhere. We further show that the per-phase metric selection does not transfer across tasks, since no phase shares a winning metric between any two tasks, so the selection cannot be reused and must be re-derived per task from a noisy sweep. These results bound a plausible and previously untested method, and they argue that practitioners should prefer identifying a single defect-informative metric over decomposing curation by phase. We release the full pipeline, all metric implementations, and per-seed results.

09.
arXiv (CS.CV) 2026-06-15

One Layer's Trash is Another Layer's Treasure: Adaptive Layer-wise Visual Token Selection in LVLMs

Large Vision-Language Models (LVLMs) have achieved remarkable success across diverse multimodal tasks, yet their practical deployment remains constrained by the computational burden arising from lengthy visual tokens. While visual token pruning has emerged as a promising solution, existing methods suffer from a fundamental limitation: once tokens are pruned at a specific layer, they become inaccessible to all subsequent layers, leading to premature information loss that can compromise model performance. Through empirical studies, we observe that different layers exhibit distinct visual region focus, indicating a varying optimal token subset across layers. Motivated by this insight, we propose Adaptive Layer-wise Visual Token Selection (ALVTS), a novel framework that breaks away from the conventional static token pruning paradigm. ALVTS incorporates a lightweight token selector to identify and route important tokens for further processing, while allowing less important tokens to skip the layer, thus minimizing computational redundancy. These two streams of tokens are seamlessly reintegrated before being fed into subsequent layers, facilitating adaptive compression across the entire model. Grounded in our importance consistency constrained low-rank approximation, the proposed token selection module closely emulates the full attention mechanism, effectively capturing its essential patterns without requiring model retraining. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL validate the effectiveness of our method. With an 89% token compression ratio, ALVTS retains 96.7% of the original model's accuracy, achieving a superior efficiency-accuracy trade-off for LVLM inference.
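The select, skip, and reintegrate flow described above can be sketched as follows. This is a deliberately simplified illustration: the hypothetical `layer_fn` transforms each selected token independently, whereas a real transformer layer processes the selected tokens jointly through attention, and the scores here are given as inputs rather than produced by the low-rank selector the abstract describes.

```python
def route_tokens(tokens, scores, keep, layer_fn):
    """Send the `keep` highest-scoring tokens through `layer_fn`; the rest skip the layer.

    Skipped tokens are passed through unchanged and stay in their original
    positions, so later layers can still select them.
    """
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:keep])
    return [layer_fn(tok) if i in selected else tok
            for i, tok in enumerate(tokens)]


# Toy run: numbers stand in for token embeddings, and the "layer" scales by 10.
routed = route_tokens([1, 2, 3, 4], [0.1, 0.9, 0.2, 0.8], keep=2,
                      layer_fn=lambda tok: tok * 10)
```

Unlike static pruning, no token is discarded here: positions 0 and 2 bypass this layer but remain available for selection at the next one.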

10.
arXiv (CS.LG) 2026-06-11

Least-Action-Guided Diffusion for Physical Extrapolation

arXiv:2606.11277v1 Announce Type: new Abstract: Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce a least-action-principle-guided diffusion, LAPG, a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.

11.
arXiv (CS.LG) 2026-06-16

A Decision-Theoretic View of Test-Time Training: When, How Far, and Which Directions to Adapt

arXiv:2606.15569v1 Announce Type: new Abstract: Test-time training (TTT) adapts a pretrained model to each prompt via parameter updates, improving accuracy under pretraining-to-test distribution shifts. Yet, its performance often suffers from instability and sensitivity to hyperparameters such as update steps and subspace. We explain this behavior through a decision-theoretic lens, treating TTT as implicit Bayesian inference in the kernel regime. Under a Gaussian process benchmark, we show that TTT reduces prediction error when updates are spectrally matched to the prompt's signal-to-noise ratio and aligned with query-relevant eigen-directions. This perspective underpins the following results: (1) we show when fixed update steps and subspaces fail under distribution shifts, motivating adaptive strategies; (2) we prove that selecting update steps via prompt evidence admits a PAC-Bayes guarantee against overfitting; and (3) we characterize the Bayes-optimal update subspace under a linear-Gaussian correction model, yielding a scoring rule for selecting Transformer blocks and heads. Our theory helps explain the empirical instability of TTT, taking a step toward principled guidance for when, how far, and which directions to adapt.

12.
medRxiv (Medicine) 2026-06-15

Iron deficiency testing among people with incident heart failure in primary care

Background: Given around 50% of people with heart failure have a degree of iron deficiency, guidelines recommend screening. It is uncertain to what extent this is done in primary care and whether testing is equitable. Aim: To report the proportion of people with incident heart failure who undergo a ferritin test within 12 months. Design and setting: Retrospective primary care cohort study using Clinical Practice Research Datalink Aurum data, between 2016 and 2021. Methods: We report the proportion of adults with an incident diagnosis of heart failure who received a ferritin test within 12 months. Multivariable logistic regression was used to examine the odds of testing based on key demographic covariates and co-morbidities. Results: Among 105,749 individuals with an incident diagnosis of heart failure (mean age 71.6 years, SD 14.3), only 35,688 (33.7%) received a ferritin test within the subsequent year. Increasing age (odds ratio 1.25 per 10-year increase, 95% CI: 1.24-1.27), female sex (male sex OR 0.86, 0.84-0.89) and Asian ethnicity (OR 1.70, 1.59-1.80) were all associated with increased odds of testing as were diagnoses of coeliac disease (OR 1.86, 1.58-2.21), type 1 diabetes (OR 1.82, 1.51-2.19) and cirrhosis (OR 1.64, 1.43-1.87). There was geographic variation in testing, even in adjusted analyses. Conclusion: In a large primary care dataset, two thirds of people with incident heart failure did not receive a ferritin test for iron deficiency within a year of diagnosis demonstrating a gap in current practice and an opportunity for improvements in service delivery.

13.
arXiv (quant-ph) 2026-06-19

Matrix-product state skeletons in Onsager-integrable quantum chains

arXiv:2511.07212v2 Announce Type: replace Abstract: Matrix-product state (MPS) skeletons are connected networks of Hamiltonians with exact MPS ground states that underlie a phase diagram. Such skeletons have previously been found in classes of free-fermion models. For the translation-invariant BDI and AIII free-fermion classes, it has been shown that the underlying skeleton is dense, giving an analytic approach to MPS approximation of ground states anywhere in the class. In this paper, we partially expose the skeleton in certain interacting spin chains: the $N$-state Onsager-integrable chiral clock families. We construct MPS that form a dense MPS skeleton in the gapped regions surrounding a sequence of fixed-point Hamiltonians (the generators of the Onsager algebra). Outside these gapped regions, these MPS remain eigenstates, but no longer give the many-body ground state. Rather, they are ground states in particular sectors of the spectrum. Our methods also allow us to find further MPS eigenstates; these correspond to low-lying excited states within the aforementioned gapped regions. This set of MPS excited states goes beyond the previous analysis of ground states on the $N=2$ free-fermion MPS skeleton. As an application of our results, we find a closed form for the disorder parameter in a family of interacting models. Finally, we remark that many of our results use only the Onsager algebra and are not specific to the chiral clock model representation.

14.
PLOS Medicine 2026-06-18

Association between initial benzodiazepine prescribing patterns and time to benzodiazepine discontinuation: A population-based retrospective cohort study

by Nikki Bozinoff, Tanya S. Hauck, Robert A. Kleinman, Matthew E. Sloan, Beth A. Sproule, Simone N. Vigod, Jennifer Wyman, Priscila Pequeno, Tara Gomes

Background: Long-term benzodiazepine use has been associated with increased risk of morbidity and mortality. Preventing long-term use through safer prescribing practices has received little attention to date. We sought to better understand associations between initial prescription characteristics and duration of benzodiazepine use. Methods and findings: This was a retrospective population-based cohort study of 1,820,808 adults in Ontario with incident benzodiazepine prescriptions between January 1, 2013 and December 31, 2020, with follow-up to December 31, 2021. The primary exposure was duration of the index prescription (≤7 days as the referent group, 8–14 days, 15–30 days, or >30 days). Secondary exposures were: (a) duration of action of index benzodiazepine(s) prescription (short-acting, long-acting or both); (b) number of benzodiazepines dispensed on index (1 or 2+); and (c) mean daily dose of the index prescription in Diazepam Milligram Equivalents (DMEs). The primary outcome was time to benzodiazepine discontinuation in days. Multivariable models were adjusted for age, sex, anxiety, insomnia, and substance use disorders as well as other important comorbidities and socio-demographic characteristics. The median age at index was 53 years (Interquartile Range (IQR) 38–67), and 62.6% were women. The median time to discontinuation in women was 16 days (IQR: 6–29) while the median time to discontinuation in men was 19 days (IQR: 6–29). Lorazepam was the most commonly prescribed benzodiazepine on index (63.9%), followed by clonazepam (17.3%) and diazepam (5.8%). In multivariable Cox Proportional Hazards Models, longer index prescriptions were associated with a lower likelihood of benzodiazepine discontinuation (adjusted Hazard Ratio (aHR) 0.54 (95% Confidence Interval (CI) [0.54,0.54]) for 8–14 days; aHR 0.26 (95% CI [0.25,0.26]) for 15–30 days; and aHR 0.14 (95% CI [0.14,0.14]) for >30 days, compared to ≤7 days, respectively). Being prescribed two or more benzodiazepines versus one was also associated with a reduced likelihood of discontinuation (aHR 0.59 (95% CI [0.57,0.61])), as was being prescribed long-acting benzodiazepines (aHR 0.80 (95% CI [0.80,0.80])) or a combination of short- and long-acting benzodiazepines (aHR 0.84 (95% CI [0.80,0.88])) versus short-acting benzodiazepines alone. Mean daily doses of >5 to ≤10 DME and >10 to ≤20 DME were associated with an increased likelihood of discontinuation (aHR 1.03 (95% CI [1.03,1.03]); aHR 1.03 (95% CI [1.03,1.04])), whereas doses >20 DME were associated with a reduced likelihood of discontinuation (aHR 0.98 (95% CI [0.97,0.98])) compared with ≤5 DME. Findings may be subject to bias from unmeasured confounding. Conclusion: This large population-based cohort study found that prescribing shorter courses of benzodiazepines, use of a single benzodiazepine, and use of a short-acting agent were associated with reduced likelihood of long-term benzodiazepine use. Findings suggest that simple changes to prescribing practices could reduce prolonged benzodiazepine use and the morbidity and mortality associated with long-term use of these medications.

15.
arXiv (math.PR) 2026-06-19

Hermite trace polynomials and chaos decompositions for the Hermitian Brownian motion

arXiv:2207.13180v4 Announce Type: replace Abstract: For a non-zero parameter $q$, we define Hermite trace polynomials, which are multivariate polynomials indexed by permutations. We prove several combinatorial properties for them, such as expansions and product formulas. The linear functional determined by these trace polynomials is a state for $q = \frac{1}{N}$ for $N$ a non-zero integer. For such $q$, Hermite trace polynomials of different degrees are orthogonal. The product formulas extend to the closure with respect to the state. The state can be identified with the expectation induced by the $N \times N$ Hermitian Brownian motion. Hermite trace polynomials are martingales for this Brownian motion, while the elements in the closure can be interpreted as stochastic integrals with respect to it. Using the grading on the algebra, we prove several chaos decompositions for such integrals, as well as analyze corresponding creation and annihilation operators. In the univariate, pure trace polynomial case, trace Hermite polynomials can be identified with the Hermite polynomials of matrix argument.

16.
arXiv (math.PR) 2026-06-16

Testing for a Hidden Geometry in Random Graphs

arXiv:2606.16715v1 Announce Type: cross Abstract: We study the problem of detecting a faint geometric signal hidden in an otherwise random graph. Formally, we consider a hypothesis testing problem in which, under the null, the observed graph is an Erdős–Rényi random graph $\mathcal{G}(n,q)$, while under the alternative a random geometric graph $\mathcal{G}(k,q,d)$ is planted on $k\le n$ vertices. The planted subgraph is generated from independent random points on the unit sphere $\mathbb{S}^{d-1}$, with edges determined by latent geometric proximity and calibrated to have edge density $q$. Our goal is to characterize the statistical and computational limits of detecting this hidden geometry. We derive sharp information-theoretic lower bounds that identify regimes where detection is impossible and provide algorithms that achieve these limits whenever detection is feasible. We further investigate the computational complexity of the problem and determine when efficient polynomial-time tests exist. The model exhibits an easy–hard–impossible phase transition: some regimes allow efficient detection, others permit detection only with computationally intractable procedures, and still others render detection impossible even with unlimited computational power. As evidence for the computational barrier, we prove that all low-degree polynomial algorithms fail throughout the conjecturally hard regime, demonstrating a sharp gap between statistical and computational feasibility.

17.
arXiv (CS.CV) 2026-06-18

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

18.
arXiv (quant-ph) 2026-06-15

Simultaneous Estimation of Partial-Transpose Moments with Active Memory Independent of the Moment Order

arXiv:2606.14204v1 Announce Type: new Abstract: We study the simultaneous estimation of partial-transpose moments $p_j(\rho_{AB})=\mathrm{Tr}[(\rho_{AB}^{T_B})^j]$, $j=2,\ldots,K$, of an unknown bipartite $n$-qubit state from independent copies under an explicit active-memory constraint. We give a sequential qubit-reuse realization of the partial-transpose permutation that uses at most $2n+1$ active qubits, independent of $K$, and estimates all moments $p_2,\ldots,p_K$ to uniform additive error $\epsilon$ with total copy complexity $O(K\log K/\epsilon^2)$. We also prove two converse bounds. First, any uniformly accurate simultaneous estimator requires $\Omega(K/\epsilon^2)$ copies in the worst case. Second, the same scaling holds on an explicit isospectral two-qubit negative-partial-transpose (NPT) family whose ordinary moments are constant while the partial-transpose moments vary. These results characterize the copy complexity of the partial-transpose moment hierarchy up to a logarithmic factor and extend simultaneous nonlinear-functional estimation from ordinary state powers to partial-transpose spectral data under active quantum memory independent of the target moment order.
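For readers unfamiliar with the quantity being estimated, the sketch below evaluates $p_j(\rho_{AB})=\mathrm{Tr}[(\rho_{AB}^{T_B})^j]$ directly from a known density matrix. This is a classical reference calculation, not the copy-efficient quantum estimator the abstract describes; the helper names and the two-qubit Bell-state example are assumptions chosen for illustration.

```python
def partial_transpose_b(rho, dim_a=2, dim_b=2):
    """Transpose subsystem B by swapping the B index of the row and the column.

    Rows and columns are indexed as a * dim_b + b.
    """
    size = dim_a * dim_b
    out = [[0.0] * size for _ in range(size)]
    for a in range(dim_a):
        for b in range(dim_b):
            for a2 in range(dim_a):
                for b2 in range(dim_b):
                    out[a * dim_b + b][a2 * dim_b + b2] = rho[a * dim_b + b2][a2 * dim_b + b]
    return out


def matmul(x, y):
    """Plain square-matrix product for nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pt_moments(rho, max_order, dim_a=2, dim_b=2):
    """Return {j: Tr[(rho^{T_B})^j]} for j = 2..max_order via repeated multiplication."""
    rho_pt = partial_transpose_b(rho, dim_a, dim_b)
    power = rho_pt
    moments = {}
    for j in range(2, max_order + 1):
        power = matmul(power, rho_pt)
        moments[j] = sum(power[i][i] for i in range(len(power)))
    return moments


# Density matrix of the Bell state (|00> + |11>)/sqrt(2).
bell = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
bell_moments = pt_moments(bell, max_order=4)
```

For this state the partially transposed matrix has eigenvalues 1/2, 1/2, 1/2, and -1/2, so the moments come out as $p_2 = 1$, $p_3 = 1/4$, and $p_4 = 1/4$; the negative eigenvalue is what the odd moment picks up.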

19.
arXiv (CS.LG) 2026-06-15

A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems

arXiv:2606.14601v1 Announce Type: new Abstract: This study presents a statistical and machine learning framework for characterizing a hydrogen-based multi-energy system (H-MES) using one year of high-resolution operational data. Statistical analysis revealed a binary operation driven by renewable surplus, with solar irradiance explaining 45.7% of rank-based variance in hydrogen production, a large effect by conventional standards. Only high-irradiance periods triggered meaningful electrolyzer engagement, while electricity demand exerted a weaker inverse suppression effect ($\epsilon^2 = 0.126$). Multiple regression confirmed electrolyzer power as the dominant linear predictor, with a synergistic solar-wind interaction. Notably, Random Forest analysis ranked wind output first in predictive importance despite its weak bivariate correlation (r = 0.167), revealing non-linear dynamics invisible to parametric methods. A sequence model exploited strong 24-hour autocorrelation (r = 0.845) for operational forecasting, while a reinforcement learning agent optimized hydrogen revenue dispatch. The core contribution is demonstrating that statistical and machine learning approaches are complementary for H-MES modeling and control.

20.
bioRxiv (Bioinfo) 2026-06-18

segSHAPE: RNA secondary structure prediction from nanopore direct RNA sequencing

RNAs adopt complex structures that regulate key biological processes, making accurate structure prediction essential. Chemical probing coupled with Nanopore direct RNA sequencing (DRS) offers a route to single-molecule structural inference, but current tools are limited by inaccurate signal-to-sequence alignment, which degrades modification-rate estimation and downstream structure prediction. Here we introduce segSHAPE for RNA secondary structure prediction from Nanopore DRS data (both RNA002 and RNA004 chemistries), a probe-agnostic framework that improves signal alignment using prior information from basecalling and per-read signal baseline shift correction, learns position-specific k-mer raw signal parameters, and estimates per-nucleotide modification rates with an unsupervised anomaly detector. On three public RNA002 DRS datasets spanning different chemical probes (AcIm, NAI-N3) and RNAs from 421 to 1552 nt, segSHAPE achieves the highest F1 score and Matthews correlation coefficient (MCC) on all RNAs, exceeding the strongest baseline by 3.4 to 5.8 percentage points in MCC. It additionally captures the ligand-induced conformational change of the thiamine pyrophosphate (TPP) riboswitch RNA directly from RNA002 DRS data using the DEPC probe. On a public RNA004 DRS dataset, segSHAPE improves over the sm-PORE-cupine baseline by 17 ROC-AUC points in modification-rate estimation and by 6.7 MCC points in structure prediction. These results establish segSHAPE as a unified, probe-agnostic pipeline for RNA structure prediction from Nanopore DRS data.

21.
arXiv (CS.AI) 2026-06-16

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

arXiv:2606.14715v1. LLM agents are increasingly used to simulate real-world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

22.
arXiv (CS.CV) 2026-06-16

VisualClaw: A Real-Time, Personalized Agent for the Physical World

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

23.
bioRxiv (Bioinfo) 2026-06-12

Evaluating cell type annotations in single-cell omics in the absence of ground truth

Accurate cell type annotation is essential for single-cell transcriptomics, directly shaping downstream analyses and biological interpretations. Yet, objective evaluation of annotation quality remains a major challenge. Here, we argue that a cell type or cell state label has practical utility only if it captures a molecular pattern that is reproducible across biological replicates. Based on this principle, we introduce inter-sample consistency (ISC), a quantitative framework to assess annotation quality in single-cell RNA-seq datasets. Unlike existing cluster validation approaches, ISC distinguishes annotations that generalize across samples and individuals from those driven by technical or unwanted variation, thereby providing principled criteria for annotation quality and transferability. When applied to published single-cell atlases, ISC reveals widespread reproducibility gaps and provides actionable guidance for repairing inconsistent annotations. Notably, ISC enables benchmarking of automated cell type annotation tools even when ground-truth labels are unavailable, providing interpretable metrics to guide their development and evaluation. Implemented as the scTypeEval Bioconductor package, this framework offers a broadly applicable resource for evaluating and improving cell type annotations in single-cell RNA-seq experiments.

24.
arXiv (CS.CL) 2026-06-16

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries ("Does this string belong to the target language?") and (2) equivalence queries ("Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
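The two query types described above can be sketched as a small oracle around a hidden DFA. This is an illustrative sketch only: the class names, the query counters, and the choice to return a shortest counterexample from equivalence queries are assumptions, not details taken from the paper's testbed.

```python
from collections import deque


class DFA:
    """Complete deterministic finite automaton over a fixed alphabet."""

    def __init__(self, alphabet, transitions, start, accepting):
        self.alphabet = alphabet          # iterable of single-character symbols
        self.transitions = transitions    # dict: (state, symbol) -> state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Answers membership and equivalence queries about a hidden target DFA."""

    def __init__(self, target):
        self.target = target
        self.membership_queries = 0
        self.equivalence_queries = 0

    def membership(self, word):
        """Does this string belong to the target language?"""
        self.membership_queries += 1
        return self.target.accepts(word)

    def equivalence(self, hypothesis):
        """Return None if the hypothesis matches the target, else a shortest counterexample."""
        self.equivalence_queries += 1
        start = (self.target.start, hypothesis.start)
        queue = deque([(start, "")])
        seen = {start}
        # Breadth-first search over the product automaton.
        while queue:
            (p, q), word = queue.popleft()
            if (p in self.target.accepting) != (q in hypothesis.accepting):
                return word
            for symbol in self.target.alphabet:
                nxt = (self.target.transitions[(p, symbol)],
                       hypothesis.transitions[(q, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        return None


# Hidden target: strings over {a, b} with an even number of 'a's.
target = DFA("ab", {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
# A naive first hypothesis that accepts every string.
accept_all = DFA("ab", {(0, "a"): 0, (0, "b"): 0}, 0, {0})

oracle = Oracle(target)
print(oracle.membership("aab"))        # True: two 'a's
print(oracle.equivalence(accept_all))  # "a": accepted by the hypothesis, rejected by the target
```

Counting queries in the oracle is one simple way to measure interaction efficiency, which lets an agent's query budget be compared with that of a classic learner on the same target.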