Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (quant-ph) 2026-06-11

Compressed minimum-purity time evolution for late-time quantum dynamics

arXiv:2606.11392v1 Announce Type: cross Abstract: Unitary time evolution of initially simple quantum many-body states rapidly generates entanglement and complex correlations, which limits direct numerical simulations. The late-time dynamics of physical observables, however, typically exhibits an effective simplicity in the form of hydrodynamics or kinetic theory. This leads to the question whether microscopic equations of motion can remain accurate and tractable up to long time scales by discarding irrelevant information in a controlled manner. Here, we introduce compressed minimum-purity time evolution (CoMPuTE) as an approach to keep track of a consistent set of reduced local density matrices, closing the hierarchical equations of motion using a minimum-purity principle. In benchmark applications we demonstrate (i) accurate description of energy diffusion in the one-dimensional mixed-field Ising model, (ii) the applicability to genuinely out-of-equilibrium Floquet dynamics starting from a pure state, and (iii) the limitations of the local reduced density matrix approximation when describing transport in the XXZ chain at $\Delta=1$ that is governed by increasingly non-local integrals of motion. The CoMPuTE method enhances computational efficiency in comparison to the closely related local-information time evolution algorithm, opening a possible route towards an extension to systems in higher spatial dimensions.

02.
arXiv (CS.LG) 2026-06-25

Conformal Orbit-Valid Trust Horizons for Equivariant World Models

Authors:

arXiv:2606.24946v1 Announce Type: new Abstract: Learned world models are useful only over horizons on which their rollout error remains controlled. We study trust-horizon certification for latent world models with known group symmetries. Given a one-step latent residual and a finite-time expansion estimate, we form a raw horizon curve and calibrate it with a split-conformal multiplicative factor. On the reproducible audit set, the conformal factor is $\gamma_\alpha=1.0$: the raw certificate is already conservative under the audit protocol. Across 50 stable audits, we observe zero anti-conservative violations, corresponding to an exact-binomial 95% upper bound of 5.8% on the violation rate. Our main structural result is that exact equivariance transports a calibrated trust-horizon curve over the group orbit: when the environment dynamics, encoder, predictor, action transform, and latent metric satisfy the stated equivariance/invariance conditions, rollout errors and trust horizons are orbit-constant. Empirically, the implemented models exhibit small orbit-transport residuals, with median 1.1% and maximum 4.1% over 14 orbit audits. The certificate is also non-vacuous (median certified-to-measured horizon ratio 0.67). A certificate-level calibration-cost study shows two complementary regimes. On a symmetric 2D substrate, equivariant, plain, and augmented models are all orbit-valid from a single calibration sector – no separation, because the substrate already makes non-equivariant baselines approximately orbit-robust. A 3D yaw audit shows the other regime: the equivariant model obtains a one-sector safe and non-vacuous orbit-valid certificate, while healthy non-equivariant baselines pay violation, slack, sharpness, or additional-sector cost. The certificate is a conservative, distributional audit rather than a global reachability guarantee, and certificate-guided subgoal spacing is not confirmed in the current 3D CEM-MPC behavior layer.

03.
arXiv (quant-ph) 2026-06-15

Tamed Feynman-Kac diffusion processes: Killing-branching intertwine

arXiv:2605.07824v2 Announce Type: replace-cross Abstract: Relaxation to equilibrium of a drifted Brownian motion is quantified by a transition probability density function, whose main (multiplicative) entry is an inferred Feynman-Kac kernel of the Schr\"{o}dinger semigroup operator. Although seemingly devoid of a natural probabilistic significance (except for its explicit path integral definition), the pertinent kernel relaxes to equilibrium as well. The implicit Feynman-Kac potential ${\cal{V}}(x)$, continuous, confining and bounded from below, may take negative values. If positive, ${\cal{V}}(x)$ can be interpreted as the killing rate of the decaying diffusion process. In case of relaxing F-K kernels the killing effects are tamed (often overcompensated). The taming inavoidably appears in conjunction with the existence of the negativity subdomains of ${\cal{V}}(x)$ in $R$. If locally ${\cal{V}}(x) < 0$, its sign inversion $- {\cal{V}}(x)$ can be interpreted as the branching (cloning, alternatively bifurcation) rate in the course of the other wise free random motion. The arising killed diffusion processes with branching, we interpret as the possible path-wise background of tamed (relaxing) Feynman-Kac diffusions. We present acomputer-assisted path-wise arguments, towards a consistency of the killing/branching taming scenario, for a number of nonlinear model systems in one space dimension. Special attention is paid to Feynman-Kac potential shapes in the double well form, where an analytic access to eigenvalues and eigenfunctions is scarce. Throughout the paper the dynamics refers to the positive real time. Since the Newton-type equations of motion for admissible classical trajectories have a Euclidean form (due to the sign inverted force term), we give a brief resume of a couple of their explicit solutions, without recourse to the Euclidean time intuitions, and the instanton lore of related quantum model systems.

04.
arXiv (CS.AI) 2026-06-11

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

arXiv:2606.07537v1 Announce Type: cross Abstract: Large language models hallucinate–producing fluent, confident, factually wrong outputs–with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding's permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies–long-tail deficiencies, training bias, and synthetic pollution–amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.

05.
arXiv (CS.AI) 2026-06-11

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

arXiv:2602.05746v2 Announce Type: replace-cross Abstract: Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attack across production models, with statistically significant improvements under McNemar's test with p

06.
arXiv (CS.CL) 2026-06-15

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on $pivots$-deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO

07.
bioRxiv (Bioinfo) 2026-06-24

RNabel-A Standalone Software Tool for Annotating Tandem Mass Spectra of Modified Ribonucleic Acids

Ribonucleic acid (RNA) modifications, with over 170 identified types, play diverse roles in cellular processes. The past decade has witnessed surging demand for accurate identification and localization of RNA modifications in both endogenous and synthetic therapeutic RNAs. With accurate spectral annotation for RNA, tandem mass spectrometry (MS/MS) can meet this demand. Here we present RNabel, a user-friendly software tool for in-depth annotation of MS/MS spectra of RNA oligonucleotides. RNabel considers a full set of backbone-cleavage ions (a, b, c, d, a-B, w, x, y, z) in which the ribonucleotide unit could be A, U, C, G, Y (pseudouridine), or I (Inosine). Additionally, RNabel considers 196 modifications on the base, the phosphoribose linkage, the 5' or the 3' terminus, or detachment of a sub-nucleotide fragment as a neutral or charged group. Users can create new components if needed, including ribonucleotides, modifications, neutral or charged groups that could detach from a ribonucleotide. RNabel efficiently processes large datasets in four acceptable formats including .mgf, .raw, .txt from msConvert, and RNabel batch files. Multiple statistical metrics are provided for quality assessment of spectral annotation. To accelerate RNA modification analysis, RNabel is made freely available for Mac and Windows users at https://github.com/songge1111/RNabel/releases.

08.
PLOS Medicine 2026-05-06

Pathways of emergency care for severely ill children in Nigerian and Ugandan hospitals: A process mapping study

Authors:

by Rami Subhi, Abiodun Sogbesan, Dan Muramuzi, Mikael Burhin, Ayobami A. Bakare, Adegoke G. Falade, Freddy E. Kitutu, Freddie Ssengooba, Carina King, Sumit Kane, Belinda Dawson-McClaren, Hamish R. Graham, the MOXY-Implementation Research Collaboration Background Child mortality remains high in countries with weak emergency care systems. Facility organisation for paediatric emergency care is heterogeneous and under-described. We examined how hospitals in Uganda and Nigeria are organised to deliver emergency care for neonates and children. Methods and findings We conducted a qualitative, multi-method study in 26 purposively selected secondary and tertiary facilities in Uganda and Nigeria from October 2023 to December 2024. Embedded researchers documented patient pathways, resources for care, and care processes for severely ill children (

09.
arXiv (CS.CL) 2026-06-24

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

Authors:

We present Harmonic, a hierarchical state space model (SSM) for language modeling. The architecture stacks three recurrent levels at progressively slower timescales; each level receives the prediction error of the level below as input, rather than its raw hidden state. On enwiki8 with equal token budgets, Harmonic outperforms a comparable Transformer (28M params) by +1.4% at 1K tokens, +6.7% at 8K tokens, and +11.4% at 32K tokens (bpt, lower is better). It also outperforms Mamba at every tested length by 0.7–1.8%. At 64K tokens, both Mamba and Transformer run out of memory on an 80GB H100; Harmonic trains successfully, reaching 6.169 bpt. Results replicate on WikiText-103 (H-TF gap +1.7% to +7.2% across 1K–32K). At 1B parameter scale, replacing all attention layers in TinyLlama 1.1B with HarmonicBlock eliminates the RoPE positional encoding limit: the resulting Hallamonic model maintains stable loss across sequence lengths 1K–8K on two independent clean benchmarks (Lambada and fineweb-edu held-out), while TinyLlama degrades catastrophically past its 2K-token RoPE limit (gap: +9.4 bpt at seq=8K on Lambada). Compute is O(L) per forward pass vs. O(L^2) for attention. Logs: https://github.com/Omibranch/harmonic-logs.

10.
arXiv (CS.AI) 2026-06-15

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

arXiv:2606.14346v1 Announce Type: cross Abstract: Unstructured pruning produces sparse weight tensors, but the standard implementation keeps tensor shapes unchanged so the deployed model is no smaller than before pruning. We present an exact structural rewrite, which we call minimization, that converts a masked network into a smaller dense network with the same forward function up to floating-point rounding. The Squeeze-Release cycle iterates pruning and minimization with an intermediate release step that re-enables the exact-zero positions inside the compacted tensors as small calibrated noise, turning otherwise wasted capacity back into trainable parameters. Successive cycles use that capacity to find structural redundancy a single pass cannot reach. We additionally introduce CompensatedLayerNorm, a function-preserving replacement for LayerNorm that extends minimization to channel reduction across LayerNorm-equipped residual streams. Squeeze-Release compresses the deployable network to 39x smaller than the unpruned model on a fully-connected model network and 14.8x smaller on modern CNN (ConvNeXt-Tiny), at comparable accuracy. In addition we prove that the rewrite can be extended to transformer architectures.

11.
arXiv (CS.CV) 2026-06-25

Cross-Modality Structural Guidance in 3D Latent Diffusion for Robust FLAIR Super-Resolution

High-resolution (HR) MRI acquisition is often hampered by scan time constraints, resulting in anisotropic or low-resolution scans (e.g., thick-slice FLAIR) that limit diagnostic accuracy. While deep learning-based super-resolution (SR) methods show promise, they often hallucinate anatomical details, which can compromise brain structural integrity. To mitigate this limitation, we introduce MR-DiffuSR, a Multi-Resolution Diffusion-based Super-Resolution framework that incorporates HR T1w structural image priors to guide the restoration of thick-slice FLAIR scans and operates in the 3D latent space. Our architecture introduces cross-modality structural swin-attention, which derives structural attention maps from the HR T1w and applies them to the low-resolution FLAIR latent features. This design disentangles anatomical structure from modality-specific contrast, effectively preventing hallucinations. Furthermore, we employ a mixed-scale degradation strategy, training the model on a continuum of downsampling factors to ensure robustness to varying slice thicknesses, while optimizing with a DINOv3-based perceptual loss to preserve high-frequency semantic details. Evaluated on the ADNI-4 dataset, MR-DiffuSR surpasses both CNN and 2D diffusion approaches, achieving an average PSNR of 32.46dB, SSIM of 0.97, and LPIPS of 0.07 across all downsampling factors. In downstream white matter hyperintensity segmentation, our model demonstrates exceptional robustness. While baseline performance collapses at 10x down-sampling (Dice: 0.51), MR-DiffuSR maintains a Dice score of 0.63, preserving utility even at 7mm equivalent slice thickness.

12.
arXiv (CS.LG) 2026-06-12

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

arXiv:2606.12610v1 Announce Type: new Abstract: Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.

13.
arXiv (CS.CV) 2026-06-25

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.

14.
arXiv (quant-ph) 2026-06-25

Constraint-Aware Quantum Optimization of Defect Configurations in Doped ZrO2: XY-Mixer QAOA and Grover Adaptive Search

Authors:

arXiv:2606.24922v1 Announce Type: new Abstract: Quantum optimization offers a route to searching the large defect-configuration spaces that arise in materials design. We develop an end-to-end, constraint-aware quantum optimization workflow for composition-defect search in a doped ZrO2 thermal-barrier-coating (TBC) material system, using a MACE-MPA-0 energy dataset to fit a 24-variable QUBO over 8 cation-occupation and 16 oxygen-vacancy variables with exactly two rare-earth substitutions and one oxygen vacancy, yielding 448 feasible configurations. The QUBO surrogate reproduces the MACE energies with held-out R2 = 0.997 (full-data R2 = 0.999, RMSE = 17 meV). We validate two complementary quantum pathways against exact enumeration: a constraint-preserving XY-mixer QAOA that confines sampling to the feasible subspace and places 86% of probability mass within 1 meV of the MACE optimum at depth p = 3, and a fault-tolerant constrained Grover Adaptive Search oracle with explicit fixed-point arithmetic, branch-safe comparison, feasibility checking, and phase kickback. Across threshold cases, the validated oracle uses 324 high-level logical qubits, or 352 to 358 with conservative clean-ancilla v-chain accounting, and requires 3.6 to 4.3 x 104 Toffoli gates per Grover (GAS) iteration. An idealized feasible-space amplification estimate suggests up to a 240x reduction in total Toffoli cost relative to the full 224 occupation space, providing a resource-estimation bridge between materials-informed QUBO modeling, constraint-aware QAOA, and fault-tolerant threshold search.

15.
arXiv (CS.AI) 2026-06-16

Training and Evaluating Diffusion Policies with Long Context Lengths

arXiv:2606.16447v1 Announce Type: cross Abstract: Imitation learning has enabled highly-dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

16.
arXiv (CS.AI) 2026-06-18

A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors

arXiv:2606.19026v1 Announce Type: cross Abstract: Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

17.
arXiv (CS.AI) 2026-06-15

Active Inference for Adaptive Traffic Signal Control in Noisy Nonstationary IoT Environments

arXiv:2606.13698v1 Announce Type: cross Abstract: Urban traffic signal control at IoT-instrumented intersections must remain effective under sensor occlusion, weather attenuation, and nonstationary demand. Conventional controllers degrade under these conditions, and learned policies remain difficult to audit. To address these challenges, we propose an active inference controller for a four-arm signalized intersection that dynamically selects phases by minimizing expected free energy (EFE) over Gaussian beliefs about per-direction congestion levels, yielding a fully traceable decision pipeline. We benchmark the controller in a SUMO traffic simulator against a rule-based heuristic and a deep Q-network (DQN) across four scenarios that progressively increase noise and nonstationarity, spanning sensor occlusion, adverse weather, and stochastic accidents. Across 100 independent random evaluations per scenario, active inference attains the lowest idle times and CO2 emissions in the noisiest scenarios (56,977 s and 29.12 kg vs. 71,741 s and 30.56 kg for DQN). These gains come at a modest cost in bus priority service rate and phase switch frequency.

18.
arXiv (CS.AI) 2026-06-17

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

arXiv:2606.17767v1 Announce Type: cross Abstract: Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

20.
arXiv (CS.CV) 2026-06-15

Memento: Reconstruct to Remember for Consistent Long Video Generation

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

21.
arXiv (CS.CL) 2026-06-11

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidances) only during output generation. Existing proposals apply guidances extracted from certain aligned models without properly assessing their reliability. Nonetheless, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidances lead to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent and up to 50% performance improvement on challenging model pairs. Our code is available at: https://github.com/DecayingSeart/BlendIn.

22.
arXiv (CS.CV) 2026-06-15

A Pragmatic VLA Foundation Model

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8$\times$ (depending on the relied VLM base model) speedup over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.

23.
arXiv (CS.CV) 2026-06-11

EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.

24.
Nature Biotechnology 2026-06-11

Large-scale, spatially resolved panoramic CRISPR screening in native tissue environments using Perturb-DBiT

Authors:

Spatially resolved CRISPR screening in vivo has been limited to small perturbation panels and subsets of protein-coding RNAs. We present Perturb-DBiT, a method for co-sequencing of spatial total RNA whole transcriptomes and single guide RNAs (sgRNAs) on the same tissue section in situ. In a human cancer metastatic colonization model, we applied large (80,000+) sgRNA panels across tumor colonies in multiple consecutive tissue sections alongside their corresponding total RNA transcriptomes. We linked perturbations affecting long noncoding RNA covariation, microRNA–mRNA interactions and distinct amino acid-specific tRNA alterations to tumor migration and growth. By integrating transcriptional pseudotime trajectories, we further observed the impact of perturbations on clonal dynamics and cooperation. In an immune-competent syngeneic mouse model, investigation of the tumor immune microenvironment indicated distinct, synergistic effects on immune infiltration and suppression. Perturb-DBiT provides a spatially resolved comprehensive view of perturbation responses in complex tissues, including small and large RNA regulation, tumor proliferation, migration, metastasis and immune interactions. In vivo CRISPR genetic perturbations are spatially mapped at scale.

25.
arXiv (CS.CV) 2026-06-15

Naive Visual Memory is Not Enough: A Failure-Mode Study of GUI Agents

Graphical User Interface (GUI) agents are increasingly used to automate complex computer tasks across applications, websites, and operating systems. To improve their reliability, recent work has introduced experiential memory, where agents retrieve prior trajectories to guide decision-making in similar states. More recent approaches further extend this idea to visual memory by storing and retrieving screenshots from past interactions, providing agents with richer contextual information than text-only memories. However, the effect of visual memory in GUI agents remains insufficiently understood: it is unclear which failures visual memory mitigates, or which failures it exacerbates. To systematically analyze the effect of visual memory, we introduce a taxonomy of four GUI agent failures (i.e., cognitive failure, visual state misunderstanding, hidden operation blindness, and grounding error) that map to distinct stages of the perception-reasoning-action pipeline. We find that prepending full-image memory has a divergent effect on the failure distribution: it reduces state-level failures but worsens action-level ones, and increases hidden operation blindness and grounding error. Motivated by this finding, we propose Action-Grounded Visual Memory (AGMem), an action-grounded memory framework for GUI agents. The core idea of AGMem is to store image crops that capture the local GUI region closely related to a successful action or a recovery, rather than storing full screenshots. Experiments on OSWorld show that AGMem improves task success rates by 33.3 % over full-image memory. These results demonstrate that AGMem is an effective representation for visual memory in GUI agents.