Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-24

ActiveScope: Actively Seeking and Correcting Perception for MLLMs

Multimodal Large Language Models (MLLMs) have demonstrated impressive vision-language understanding, yet still struggle with fine-grained perception in high-resolution images. While existing training-free methods typically rely on attention-based localization or coarse-to-fine search, they are often misled by distractors and fail to locate multiple targets. Our investigation attributes these failures to Contextual Dominance, where salient distractors overwhelm target attention and cause inaccurate localization, and Semantic Bias, where global semantics cause the model to fixate on the most salient concept, resulting in incomplete localization in multi-object scenarios. Built on these insights, we propose ActiveScope, a training-free framework that enhances MLLMs by actively seeking and correcting perception. ActiveScope features two modules. The Semantic Anchor Localization (SAL) utilizes fine-grained semantic anchors to independently localize key targets, thereby mitigating semantic bias. The Interference-Suppressed Refinement (ISR) refines localization by suppressing attention on salient distractions to overcome contextual dominance. Extensive experiments on high-resolution image understanding benchmarks demonstrate that ActiveScope outperforms existing training-free methods (e.g., 96.34 percent accuracy on $V^{*}$ Bench), validating the superiority of the active search and self-correction paradigm. Our code is available at https://github.com/jasmine-ww/ActiveScope.

02.
arXiv (quant-ph) 2026-06-19

Proposal of quantum arrival-time measurement with a Bose-Einstein condensate

arXiv:2606.20278v1 Announce Type: new Abstract: This work shows how a Bose-Einstein condensate of ultracold atoms could be used to address a long-standing question in quantum theory: how much time does it take for a particle to reach a detector? To this end, we propose a realistic experimental setup, whose key idea is not to measure arrival times directly, but the arrival flux on the detector as a function of its position. This novel approach not only solves practical issues with having a detector close to the system, but also results in signals that allow to unambiguously distinguish different theoretical predictions. This proposal raises prospects for resolving the decades-old debate on this fundamental issue.

03.
PLOS Computational Biology 2026-06-02

Linking reduced prefrontal microcircuit inhibition in schizophrenia to EEG biomarkers in silico

by Sana Rosanally, Frank Mazza, Heng Kang Yao, Faraz Moghbel, Hannah Seo, Etay Hay Reduced cortical inhibition by parvalbumin-expressing (PV) interneurons in schizophrenia is thought to be associated with impaired processing in the prefrontal cortex and altered EEG signals such as oddball mismatch negativity (MMN). Recent studies also suggest loss of somatostatin (SST) interneuron inhibition. However, establishing the link between reduced interneuron inhibition and reduced MMN experimentally in humans is currently not possible. To overcome these challenges, we simulated spiking activity and EEG during baseline and oddball response in detailed models of human prefrontal microcircuits in health and schizophrenia, with reduced PV and SST interneuron inhibition as constrained by postmortem patient data. We showed that reduced PV interneuron inhibition can account for the decreased MMN amplitude seen in schizophrenia, with a threshold below which the amplitude effect was low as seen in at-risk patients. In contrast, reduced SST interneuron inhibition did not affect the MMN amplitude. We further showed that both types of inhibition loss were necessary to account for changes in resting EEG in schizophrenia, with reduced SST interneuron inhibition increasing broadband power, and reduced PV and SST interneuron inhibition both leading to a right shift from alpha to beta frequencies. Our study thus links reduced PV and SST interneuron inhibition in schizophrenia to distinct EEG biomarkers that can serve to improve stratification and early detection using non-invasive brain signals.

04.
arXiv (CS.CL) 2026-06-16

VeriGraph: Towards Verifiable Data-Analytic Agents

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving a 87.61\% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.

05.
arXiv (CS.CL) 2026-06-24

Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

Non-autoregressive generation offers a powerful paradigm for iterative refinement, allowing models to recursively critique, erase and regenerate arbitrary subsets of tokens. However, existing non-autoregressive models fail to realize this potential. Masked Diffusion Models (MDMs) suffer from factorization error, causing sample quality to collapse when generating multiple tokens simultaneously. Flow Map Language Models (FMLMs) circumvent this bottleneck via joint sequence transport for excellent few-step generation, but sacrifice the inference-time flexibility of MDMs. We introduce FMLM+, a framework that bridges this gap by equipping FMLM with masking-style noise schedules. While generating the full sequence in a single step, FMLM+ simultaneously scores the global consistency of each token a posteriori. We leverage this to introduce Posterior Refinement, a novel inference-time refinement strategy that enables the model to adaptively self-correct its outputs, matching the performance of discrete baselines with 32x fewer NFEs. Across diverse benchmarks, we demonstrate that FMLM+ with Posterior Refinement improves the speed–quality tradeoff over both MDM and FMLM families, providing a scalable foundation for high-fidelity language modeling.

06.
arXiv (CS.CV) 2026-06-16

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

07.
arXiv (CS.CL) 2026-06-19

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at \href{https://github.com/EIT-NLP/Supervision-in-Latent-CoT}{this repository}.

08.
arXiv (CS.CV) 2026-06-18

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

09.
arXiv (CS.CL) 2026-06-11

Neural FOXP2 – Language Specific Neuron Steering for Targeted Language Improvement in LLMs

LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, that makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English to target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

10.
arXiv (quant-ph) 2026-06-12

Global Control with the Tavis-Cummings Interaction

arXiv:2606.12906v1 Announce Type: new Abstract: We study the controllability of a system of qubits under global control, where control pulses act identically on all qubits. Specifically, we consider a collection of qubits identically coupled to a single bosonic mode, or harmonic oscillator, via the Jaynes-Cummings interaction. This collective coupling, known as the Tavis-Cummings (TC) interaction, has been realized in several quantum computing platforms, including superconducting and atomic qubit systems. Although the qubits do not interact directly with one another, they can become entangled through their common coupling to the bosonic mode. We characterize the group of unitaries that can be implemented on the joint Hilbert space of the qubits and bosonic mode using the TC interaction together with a global $z$ field $J_z$, corresponding to identical z rotations on all qubits. We show that for n>2 qubits the set of realizable unitaries is restricted by an "accidental" symmetry of the TC Hamiltonian, distinct from its "standard" U(1) and permutational symmetries. On the other hand, we find that the Hamiltonian $J_z^2$ breaks this accidental symmetry and, together with the TC interaction and $J_z$, achieves semi-universality: it allows the implementation of arbitrary unitaries that respect permutational and U(1) symmetry, up to certain constraints on the center of the group. In a companion paper, we further analyze this remarkable accidental symmetry and show that it can be understood through Schwinger's bosonic model of angular momentum.

11.
arXiv (CS.LG) 2026-06-16

Spectral Adaptive Conformal Prediction for Structured Non-Exchangeable Data

arXiv:2606.15950v1 Announce Type: cross Abstract: Conformal prediction gives prediction intervals with finite-sample coverage when the data are exchangeable. Many time-indexed datasets are not exchangeable. They have seasons, recurring regimes, changing frequencies, or other forms of structured dependence. This paper studies a simple way to use that structure. We propose spectral adaptive conformal prediction, a method that forms weighted conformal quantiles using local spectral similarity and then updates the target miscoverage level online. The spectral weights choose calibration residuals that look relevant to the current test point. The adaptive update corrects the long-run miss rate when uncertainty changes over time. We give an approximate coverage result for the fixed spectral weighted quantile and a deterministic long-run calibration result for the adaptive update. Simulations with recurring regimes and slowly changing frequencies, together with three U.S. real-data examples, show that the hybrid method can improve on fixed spectral weighting, while also showing that spectral weighting must be monitored through effective sample size diagnostics.

12.
arXiv (CS.CV) 2026-06-17

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.

13.
medRxiv (Medicine) 2026-06-18

Comparative Evaluation of Pretrained Large Language Models for Suicide Risk Prediction from Clinical Notes in U.S. Veterans

Background: Suicide remains a significant and potentially preventable cause of death among United States veterans. Predictive models based on structured electronic health record (EHR) data, including the U.S. Department of Veterans Affairs' Recovery Engagement and Coordination for Health-Veterans Enhanced Treatment (REACH-VET) program, aim to identify individuals at elevated risk for enhanced monitoring and follow-up. Increasing evidence suggests that unstructured clinical narratives contain additional psychosocial information that may enhance risk prediction when analyzed using natural language processing (NLP). However, optimal approaches for representing clinical text remain uncertain. Recent advances in large language models (LLMs) enable contextual text representations that capture complex semantic relationships beyond traditional lexical methods. Methods: We compared the predictive performance of pretrained LLMs with classical bag-of-words (BoW) representations for suicide risk prediction using clinical notes from 27,241 veterans receiving care in the Veterans Health Administration. Patients were stratified by REACH-VET risk tier (low, moderate, high), and models were evaluated across prediction windows defined by note look-back periods (

14.
arXiv (CS.CV) 2026-06-25

VPA-Guard: Defending and Benchmarking Image-to-Video Generation Against Visual Prompt Attacks

Recent advancements in Image-to-Video (I2V) generation have transformed input images from simple appearance references into interactive control interfaces where visual cues such as arrows, sketches, and emojis orchestrate complex video dynamics with unprecedented controllability. However, these seemingly innocuous static cues can be interpreted by models as executable temporal instructions, unfolding into harmful actions in the generated videos. Despite the severity of this threat, existing safety benchmarks remain predominantly focused on text-based and content-only image-based jailbreaks, leaving implicit visual prompt attacks insufficiently explored. To bridge this gap, we present VVA-Bench, the first systematic benchmark for evaluating video generation safety under categorized vision-centric prompt attacks. Extensive experiments on VVA-Bench demonstrate that state-of-the-art models are highly susceptible to such attacks, with Attack Success Rates (ASR) reaching 100.0\% on Wan 2.7 and 74.8\% on Veo 3.1. To mitigate these risks, we propose VPA-Guard, a retrieval-augmented and self-evolving defense framework. By leveraging few-shot reasoning to identify latent malicious intents, our method reduces the attack ASR by 44.2\% and the harmfulness score by 73.4\% on average, while maintaining the model's utility for legitimate user edits. Our work provides both a rigorous benchmark and an effective defense strategy to advance safe and socially responsible multimodal generation.

15.
arXiv (CS.CL) 2026-06-24

AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages

Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present \texttt{AfriqueLLM}, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models and code have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm) and [Github](https://github.com/McGill-NLP/AfriqueLLM).

16.
arXiv (quant-ph) 2026-06-24

Uncovering Latent Structures in Robust Pulse Sequences: A Model-Based Reinforcement Learning Approach for Adaptable Quantum Control

arXiv:2606.24507v1 Announce Type: new Abstract: Real-time adaptive control of quantum systems requires rapid generation of robust, high-fidelity pulses across a continuous range of operating conditions. Standard optimization algorithms such as gradient-ascent pulse engineering (GRAPE) solve each instance independently, discarding information between runs and requiring costly reinitialization when parameters change. We present an approach to robust optimal quantum control based on model-based reinforcement learning, in which a single neural network – embedding the Hamiltonian directly into the training pipeline – generates robust gates across an entire family of gate configurations, without pre-computed training data. Demonstrated on a single-spin (two-level) system, the trained networks produce pulses for arbitrary rotation angles over a range of pulse durations, detunings, and field inhomogeneities in milliseconds, at fidelities comparable to multi-seed GRAPE. The framework is inherently adaptable: any parameter entering the Hamiltonian can serve as a network input, extending the approach to different systems and control settings. Beyond speed, the network reveals structure in the control landscape: it discovers the same structured phase profiles that appear in GRAPE solutions – made identifiable through fidelity-invariant symmetry transformations – but more consistently than independent optimization. This consistency enables smooth interpolation across the entire trained parameter space.

17.
arXiv (CS.CV) 2026-06-18

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.

18.
arXiv (CS.CL) 2026-06-11

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in planning practice, there is still no systematic framework for testing whether they can reason with the contextual sensitivity, value awareness, and institutional literacy central to planning expertise. This paper introduces Urban Planning Bench (UPBench), a domain-specific evaluation framework that assesses LLM reasoning through a 4x5 matrix of four knowledge pillars and five cognitive levels adapted from Bloom's revised taxonomy. Evaluating 25 LLMs with automated scoring and expert review, we find a non-monotonic cognitive curve: models perform better on higher-order analytical tasks than on factual recall and integrative judgment. This suggests that planning knowledge often treated as lower-order is deeply shaped by institutional, jurisdictional, and temporal context, making it hard for LLMs to generalize. We summarize these limits as four epistemic diagnostics: regulatory hallucination, conceptual conflation, wickedness paralysis, and phronetic deficit. Takeaway for Practice: The findings support differential delegation in planning. LLMs can assist with cross-disciplinary synthesis, literature review, scenario generation, and preliminary policy analysis. However, they remain unreliable for jurisdiction-specific regulation, normative conflict resolution, and context-sensitive procedure. Agencies should require verification for AI-assisted regulatory analysis, while planning education should emphasize institutional literacy, normative judgment, and contextual sensitivity.

19.
arXiv (quant-ph) 2026-06-16

Systematic Construction of Time-Dependent Hamiltonians for Microwave-Driven Josephson Circuits

arXiv:2512.20743v4 Announce Type: replace Abstract: Time-dependent electromagnetic drives are fundamental for controlling complex quantum systems, including superconducting Josephson circuits. In these devices, accurate time-dependent Hamiltonian models are imperative for predicting their dynamics and designing high-fidelity quantum operations. Existing numerical methods, such as black-box quantization (BBQ) and energy-participation ratio (EPR), excel at modeling the static Hamiltonians of Josephson circuits. However, these techniques do not fully capture the behavior of driven circuits stimulated by external microwave drives, nor do they include a generalized approach to account for the inevitable noise and dissipation that enter through microwave ports. Here, we introduce numerical techniques that leverage classical microwave simulations, efficiently executable in finite-element solvers, to obtain the time-dependent Hamiltonian of microwave-driven superconducting circuits with arbitrary geometries under charge, flux, or mixed electromagnetic modulation. Importantly, our techniques do not rely on a lumped-element description of the superconducting circuit, in contrast to previous approaches to tackling this problem. We demonstrate the versatility of our approach by characterizing the driven properties of realistic circuit devices in complex electromagnetic environments, including coherent dynamics due to charge and flux modulation, as well as drive-induced relaxation and dephasing. Our techniques offer a powerful toolbox for optimizing circuit designs and advancing practical applications in superconducting quantum computing.

20.
medRxiv (Medicine) 2026-06-24

Evaluation of Corneal Subbasal Nerve Plexus Alterations in ARSACS and SPG7 by In Vivo Corneal Confocal Microscopy

Purpose: To investigate corneal subbasal nerve plexus alterations using in vivo corneal confocal microscopy (IVCM) in patients with Autosomal Recessive Spastic Ataxia of Charlevoix-Saguenay (ARSACS) and Spastic Paraplegia Type 7 (SPG7). Methods: This cross-sectional pilot study included eight ARSACS patients, five SPG7 patients, and twenty age- and sex-matched healthy controls. All participants underwent neurological and ophthalmological examination followed by central corneal imaging using IVCM. Quantitative corneal nerve parameters were analyzed with automated software, and correlations with clinical severity scales were assessed. Results: The mean age was 34.2 +/- 3.4 years in controls, 34.5 +/- 0.7 years in the ARSACS group, and 38.2 +/- 3.5 years in the SPG7 group. Corneal nerve branch density (CNBD) and corneal nerve total branch density (CTBD) were significantly lower in ARSACS and SPG7 patients compared with healthy controls. CNFD, CNFL, CNFA, CNFW, and CNFrD were lower in ARSACS and SPG7 patients compared with healthy controls; however, these differences did not reach statistical significance. No statistically significant differences in IVCM parameters were detected between ARSACS and SPG7 patients. Spearman correlation analysis did not show significant correlations between corneal nerve parameters and FARS, SARA, ADL scores, or disease duration. Conclusion: IVCM revealed reduced corneal nerve branching parameters in patients with ARSACS and SPG7. These findings indicate involvement of the corneal subbasal nerve plexus and support the potential role of corneal confocal microscopy as a non-invasive ocular imaging modality for evaluating peripheral neural alterations in hereditary spastic ataxias.

21.
arXiv (CS.AI) 2026-06-15

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

arXiv:2506.14202v4 Announce Type: replace-cross Abstract: End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $DiffusionBlocks$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks .

22.
arXiv (CS.CV) 2026-06-25

TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction

Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from the attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments show that TTSA3R achieves competitive performance on standard short-sequence benchmarks and provides substantially stronger robustness on extended sequences. On NRGBD, as sequences extend from 50 to 250 frames, TTSA3R exhibits only a 1.33x error increase, compared with over 4x degradation for CUT3R. This highlights the practical value of temporal-spatial adaptive updates for long-term reconstruction stability. Our code is available at https://github.com/anonus2357/ttsa3r.

23.
arXiv (CS.CV) 2026-06-25

Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a $3\times$ speedup in text-to-video generation convergence and a 10\% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.

24.
arXiv (CS.CV) 2026-06-25

Learning Action Priors for Cross-embodiment Robot Manipulation

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage~1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage~2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage~1 yields a more generalizable action prior that directly improves downstream VLA performance.

25.
arXiv (CS.LG) 2026-06-19

Predictability as a Fine-Grained Measure for Privacy

arXiv:2606.20546v1 Announce Type: new Abstract: Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.