Paper Plaza - AcademicHub

01.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2602.13344

FireRed-Image-Edit-1.0 Technical Report

Authors:

Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, …

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/ .

02.

arXiv (quant-ph) 2026-06-19 DOI: arXiv:2606.20020

Effects of interaction range on the mean-field dynamics of Bose polarons

Authors:

Piotr Wysocki, Ubaldo Cavazos Olivas, Marek Tylutki, Krzysztof Jachymski

We consider the three-dimensional Bose polaron problem in the regime of finite-range interactions and competing length scales. Working in the reference frame of the impurity, we study both static and out-of-equilibrium properties of the system, in particular the transfer of momentum between the impurity and the host gas. We find that relaxation dynamics can occur via damped oscillations of the impurity velocity with simple dependence on the interaction strength. Furthermore, the equilibration process is sensitive to the type of the impurity-bath interaction. Specifically, interatomic forces describing ion-atom systems lead to much longer timescales and more pronounced oscillations in the strong-coupling regime with respect to local interaction potentials. We also find that the effective masses can differ by a large amount between the two scenarios, even if the number of atoms in the polaron cloud remains similar for both cases.

03.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13022

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

Authors:

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

Adversarial attacks on skeletal human action recognition have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and thereby are inherently perceptible with recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception on real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

04.

arXiv (math.PR) 2026-06-11 DOI: arXiv:2209.03999

Consensus on Dynamic Stochastic Block Models: Fast Convergence and Phase Transitions

Authors:

Haoyu Wang, Jiaheng Wei, Zhenyuan Zhang

We introduce two models of consensus following a majority rule on time-evolving stochastic block models (SBM), in which the network evolution is Markovian or non-Markovian. Under the majority rule, in each round, each agent simultaneously updates their opinion according to the majority of their neighbors. Our network has a community structure and randomly evolves with time. In contrast to the classic setting, the dynamics is not purely deterministic, and reflects the structure of SBM by resampling the connections at each step, making agents with the same opinion more likely to connect than those with different opinions. In the Markovian model, connections between agents are resampled at each step according to the SBM law and each agent updates their opinion via the majority rule. We prove a power-of-one type result, i.e., any initial bias leads to a non-trivial advantage of winning in the end, uniformly in the size of the network. In the non-Markovian model, a connection between two agents is resampled according to the SBM law only when at least one of them changes opinion and is otherwise kept the same. We identify the phase-transition threshold, up to the second-order leading term, between halting and fast convergence to consensus. We also give sufficient initial-lead conditions for consensus to occur within one, two, or three rounds.
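
The Markovian model described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the function names, the edge probabilities `p_same` and `p_diff`, and the tie-breaking rule (an agent with a tie or no neighbors keeps its opinion) are all assumptions made for illustration.

```python
import random


def majority_round(opinions, p_same, p_diff, rng):
    """One synchronous round: resample the whole graph, then apply the majority rule.

    opinions is a list of +1/-1 values. An edge between two agents appears with
    probability p_same if they currently agree and p_diff otherwise (assumed
    parameterization of the SBM law).
    """
    n = len(opinions)
    # field[i] accumulates the sum of the opinions of i's sampled neighbors.
    field = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            p = p_same if opinions[i] == opinions[j] else p_diff
            if rng.random() < p:
                field[i] += opinions[j]
                field[j] += opinions[i]
    # Assumption: on a tie (or with no neighbors) the agent keeps its opinion.
    return [1 if f > 0 else -1 if f < 0 else o for f, o in zip(field, opinions)]


def run(n=200, bias=0.55, p_same=0.10, p_diff=0.05, rounds=10, seed=0):
    """Return the history of the opinion sum, starting from an i.i.d. biased state."""
    rng = random.Random(seed)
    opinions = [1 if rng.random() < bias else -1 for _ in range(n)]
    history = [sum(opinions)]
    for _ in range(rounds):
        opinions = majority_round(opinions, p_same, p_diff, rng)
        history.append(sum(opinions))
    return history
```

Calling `run()` returns the signed lead of opinion +1 after each round; consensus corresponds to an entry equal to `n` or `-n`. The non-Markovian variant would instead keep an edge unless one of its endpoints changed opinion, which this sketch does not implement.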

05.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2512.12737

Communication-Efficient Neural Tangent Kernels for Heterogeneous Decentralized Federated Learning

Authors:

Li Xia

Decentralized federated learning (DFL) enables collaborative model training without a central server, but converges slowly under statistical heterogeneity. Recent work has shown that neural tangent kernel (NTK) methods achieve faster convergence than gradient-based updates in DFL, while momentum has proven effective for accelerating gradient-based FL. However, applying momentum to NTK updates can destabilize training under heterogeneous data. We propose SPARK, which addresses this instability with a stage-wise annealed soft-label regularizer evaluated on neighborhood-aggregated data, so that momentum can accelerate NTK updates stably. Under high heterogeneity, SPARK converges about 3$\times$ faster than baselines and lowers the total communication to a target accuracy by up to about 70\%, and it attains higher accuracy across heterogeneity levels. We further study random projection as an optional Jacobian-compression strategy for bandwidth-constrained settings. We validate the approach across multiple datasets, network topologies, and heterogeneity levels.

06.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17722

GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

Authors:

Fangyi Li, Xiaoyuan Yang, Yixiao Li, Zongyang Sui, Kangqing Shen, Gemine Vivone

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.

07.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.16749

Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

Authors:

Xiaoqi Guo, Birui Chen, Xinquan Yang, Chaoyun Zhang, Xuefen Liu, Mianjie Zheng, Kun Tang, Xuguang Li, Wen Ma, Yanhua Xu, Linlin Shen

The Zygomaticomaxillary Suture is a key circummaxillary structure that connects the zygomatic bone and the maxilla, which serves as a primary site of resistance during maxillary advancement, and its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/galaxygxq1116/SKMamba.

08.

arXiv (CS.AI) 2026-06-15 DOI: arXiv:2606.08881

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Authors:

Yi Yu, Xinchuan Qiu

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

09.

arXiv (CS.LG) 2026-06-11 DOI: arXiv:2606.12077

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

Authors:

Yifan Wang, Lifeng Shen, Shuyin Xia, Yi Wang

Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance computations, while deep learning-based approaches typically rely on costly iterative training and a large number of trainable parameters. In this paper, we propose MSRGC-Net, an efficient time-series clustering framework that integrates multiscale reservoir computing, granular-ball-based anchoring graph construction, and consensus learning. MSRGC-Net adopts a training-free reservoir computing paradigm to extract multiscale temporal representations from raw time series without backpropagation, significantly reducing computational overhead. To capture the intrinsic structure of the resulting representations, granular-ball computing is employed to adaptively model data distributions via density-consistent regions, yielding compact and robust anchor graph representations. Furthermore, a consensus-based anchoring graph optimization strategy is introduced to effectively align multiscale reservoir representations and integrate complementary information across temporal scales. Extensive experiments on widely used univariate and multivariate benchmark datasets demonstrate that MSRGC-Net consistently outperforms state-of-the-art methods in clustering performance while maintaining superior computational efficiency.

10.

bioRxiv (Bioinfo) 2026-06-19 DOI: HASH:309baee803313d032a196fb7dad80cea

FeatureMSEA: Metabolic Feature-based Metabolite Set Enrichment Analysis

Authors:

Liu, Wang, Huan, Shen

Liquid chromatography-mass spectrometry (LC-MS) untargeted metabolomics detects thousands of metabolic features, but converting these chemical signals into metabolite set-level biological knowledge remains challenging. This is because most features lack unambiguous metabolite identities. Conventional metabolite set enrichment analysis (MSEA) generally requires identified metabolites and metabolite-level ranked inputs, leaving much of the untargeted feature space unused. Here, we present FeatureMSEA, a feature rank-based framework for metabolite set enrichment directly from metabolic features with ambiguous annotations. FeatureMSEA integrates multi-evidence feature-to-metabolite annotation, feature rank-based enrichment scoring, permutation-based inference, and iterative leading-edge-guided annotation refinement, with an optional LLM-assisted module for post-enrichment interpretation. In null comparisons of randomly split healthy samples, FeatureMSEA detected no significant metabolite sets, whereas metabolite-set spike-in simulations showed recovery of implanted signals. In a cerebrospinal fluid metabolomics study of Huntington's disease, FeatureMSEA identified dysregulated metabolite sets related to amino acid metabolism, mitochondrial energy metabolism, and neuroactive signaling. MS/MS-based annotation analysis further showed that FeatureMSEA refinement reduced annotation ambiguity and prioritized chemically consistent candidate metabolites. In summary, FeatureMSEA provides a general framework for extracting metabolite set-level biological insights from LC-MS untargeted metabolomics in which confident metabolite identification remains incomplete.

11.

arXiv (CS.LG) 2026-06-12 DOI: arXiv:2606.13178

Loss-Shift Transfer via Bayes Quotients

Authors:

Vasileios Sevetlidis

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called loss shift. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
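
The log-loss identity mentioned in this abstract can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's construction: the joint law, the representation `represent`, and all function names are hypothetical. It uses the standard fact that, for a representation $Z=f(X)$, the Bayes log-loss risk is $H(Y\mid Z)$, so the excess risk over using $X$ itself is $H(Y\mid Z)-H(Y\mid X)=I(X;Y\mid Z)$.

```python
import math

# Toy joint law (assumed for illustration): X uniform on {0, 1, 2, 3}, Y binary.
P_X = [0.25, 0.25, 0.25, 0.25]
P_Y1_GIVEN_X = [0.1, 0.4, 0.6, 0.9]  # P(Y = 1 | X = x)


def represent(x):
    """Coarse representation that keeps only the Bayes decision under 0-1 loss."""
    return int(P_Y1_GIVEN_X[x] > 0.5)


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    return -sum(q * math.log(q) for q in (p, 1 - p) if q > 0)


def posterior_given_z():
    """Return P(Z = z) and P(Y = 1 | Z = z) for the representation above."""
    mass, hit = {}, {}
    for x, px in enumerate(P_X):
        z = represent(x)
        mass[z] = mass.get(z, 0.0) + px
        hit[z] = hit.get(z, 0.0) + px * P_Y1_GIVEN_X[x]
    return mass, {z: hit[z] / mass[z] for z in mass}


def bayes_log_loss_given_x():
    """H(Y | X): the best achievable expected log loss when X is observed."""
    return sum(px * binary_entropy(py) for px, py in zip(P_X, P_Y1_GIVEN_X))


def bayes_log_loss_given_z():
    """H(Y | Z): the best achievable expected log loss when only Z is observed."""
    mass, post = posterior_given_z()
    return sum(mass[z] * binary_entropy(post[z]) for z in mass)


def conditional_information():
    """I(X; Y | Z) in nats, computed directly from the joint law."""
    _, post = posterior_given_z()
    total = 0.0
    for x, px in enumerate(P_X):
        pz1 = post[represent(x)]
        for pyx, pyz in ((P_Y1_GIVEN_X[x], pz1), (1 - P_Y1_GIVEN_X[x], 1 - pz1)):
            if pyx > 0:
                total += px * pyx * math.log(pyx / pyz)
    return total


excess = bayes_log_loss_given_z() - bayes_log_loss_given_x()
```

Here `represent` loses nothing for classification, since the 0-1 Bayes decision is unchanged, yet `excess` is strictly positive and agrees with `conditional_information()` up to floating-point error.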

12.

arXiv (CS.LG) 2026-06-19 DOI: arXiv:2606.20128

The Correctness Illusion in LLM-Generated GPU Kernels

Authors:

Dipankar Sarkar

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
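
The gap between a fixed-shape check and a seeded sweep over shapes can be shown with a pure-Python stand-in. This is a toy sketch under loud assumptions: `buggy_sum`, `BLOCK`, and both check functions are hypothetical, the bug (an unprocessed tail block) is chosen for illustration rather than taken from the paper's corpus, and no GPU or Triton code is involved.

```python
import math
import random

BLOCK = 4  # hypothetical tile width of the stand-in kernel


def reference_sum(xs):
    """High-precision CPU reference."""
    return math.fsum(xs)


def buggy_sum(xs):
    """Stand-in kernel with a seeded bug: elements past the last full block are skipped."""
    total = 0.0
    full = len(xs) // BLOCK * BLOCK
    for start in range(0, full, BLOCK):
        total += sum(xs[start:start + BLOCK])
    return total


def fixed_shape_check(kernel, n=8, seed=0, tol=1e-9):
    """Allclose-style check on a single shape, here a multiple of BLOCK."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.5, 1.5) for _ in range(n)]
    return math.isclose(kernel(xs), reference_sum(xs), abs_tol=tol)


def shape_sweep_check(kernel, max_n=32, seed=0, tol=1e-9):
    """Check every length up to max_n; return failing (n, case_seed) pairs for replay."""
    failures = []
    for n in range(1, max_n + 1):
        case_seed = seed * 1000 + n  # stored so a failure can be regenerated exactly
        rng = random.Random(case_seed)
        xs = [rng.uniform(0.5, 1.5) for _ in range(n)]
        if not math.isclose(kernel(xs), reference_sum(xs), abs_tol=tol):
            failures.append((n, case_seed))
    return failures
```

With these definitions `fixed_shape_check(buggy_sum)` passes because length 8 has no tail, while `shape_sweep_check(buggy_sum)` reports every length that is not a multiple of `BLOCK`, each with the seed needed to replay it.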

13.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.16253

Learned Image Compression for Vision-Language-Action Models

Authors:

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

14.

arXiv (quant-ph) 2026-06-11 DOI: arXiv:2606.11530

Locally Acting Grover Mixers for Constraint-Preserving QAOA

Authors:

Minjin Choi, Dongkeun Lee, Junghee Ryu

The Grover mixer quantum alternating operator ansatz (GM-QAOA) employs the Grover mixer to confine the quantum evolution to the feasible subspace defined by the problem. Its mixing unitary, however, requires a global multi-controlled phase-shift gate acting on all qubits, resulting in substantial circuit overhead on near-term quantum devices. In this work, we propose locally acting Grover mixers tailored to initial states that admit a product structure over disjoint qubit subsystems, which may be obtained by encoding only a subset of problem constraints into the initial state preparation. The proposed method preserves the search space defined by the initial state while significantly lowering implementation cost, as the global multi-controlled phase-shift gate is replaced with local operations on disjoint subsystems. Numerical simulations on the exact-cover problem and the traveling salesman problem (TSP) demonstrate that the proposed method achieves convergence behavior comparable to that of the original GM-QAOA, while using shallower circuits with fewer gates. We further compare two constraint encoding strategies for the TSP, encoding only a subset of constraints versus all constraints into the initial state preparation, and show that the former combined with the proposed mixer yields markedly more compact circuits at the point where comparable solution quality is achieved.

15.

arXiv (math.PR) 2026-06-11 DOI: arXiv:2606.11389

Instability of a nonlinear oscillator with small friction and small additive noise

Authors:

Peter H Baxendale

Let $\lambda = \lambda(\beta,\sigma,a,b)$ denote the top Lyapunov exponent for the linearization along trajectories of the noisy damped non-linear oscillator $\ddot{x}+\beta \dot{x} + ax+bx^3 = \sigma \dot{W}_t$, where $a$, $b$ and $\beta$ are all positive and $\sigma \neq 0$. In 2004 Arnold, Imkeller and Sri Namachchivaya stated without proof that $\lambda(\varepsilon^2 \beta,\varepsilon \sigma,a,b) \sim \overline{\lambda} \varepsilon^{2/3}$ as $\varepsilon \to 0$ with $\overline{\lambda} > 0$. This paper contains a proof of this assertion.

16.

arXiv (CS.LG) 2026-06-15 DOI: arXiv:2601.11626

Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering

Authors:

Maksym Shamrai

Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank-$r$ SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Building on these results, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.

17.

arXiv (CS.LG) 2026-06-17 DOI: arXiv:2606.17500

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

Authors:

Gram Koski, Sean Lipps, Zhenghua Ma, G. Abarajithan, Ryan Kastner

Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at https://github.com/KastnerRG/particle_transformer_aie.

18.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.19641

Scaling Self-Play for End-to-End Driving

Authors:

Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

19.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2512.07212

Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

Authors:

Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Haipeng Zhang, Jingya Wang, Ye Shi

Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing methods typically treat observations only as high-level conditions to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, the sampling is forced to begin from random noise, weakening the coupling between perception and control and often yielding suboptimal performance. We propose BridgePolicy, a generative visuomotor policy that directly integrates observations into the stochastic dynamics via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich and informative prior rather than random noise, substantially improving precision and reliability in control. A key difficulty is that a diffusion bridge normally connects distributions of matched dimensionality, while robotic observations are heterogeneous and not naturally aligned with actions. To overcome this, we introduce a semantic aligner to unify the visual and state inputs and align the observations with action representations, making the diffusion bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and 5 real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies. Our code is available at https://jianghcsr.github.io/BridgePolicy_page/.

20.

arXiv (math.PR) 2026-06-15 DOI: arXiv:2606.14450

Universality for Products of Random Matrices with i.i.d. Entries and the Fuss–Catalan Number

Authors:

Yanjin Xiang, Kun Chen, Zhihua Zhang

Let $(w_{ij})_{i,j\ge1}$ be a single infinite array of independent identically distributed real- or complex-valued entries of mean zero, variance $\sigma^2$, and finite fourth moment. Set $W_n=(w_{ij})_{1\le i,j\le n}$ and $X_n=n^{-1/2}W_n$. For every fixed $k\ge1$, we identify the almost sure limiting operator norm of several fixed products built from this family. Define the $k$-th freeness coefficient by \[ \gamma_k:=\sqrt{\frac{(k+1)^{k+1}}{k^k}}. \] Then we prove \[ \|X_n^k\|\to\sigma^k\gamma_k \qquad \text{almost surely}. \] The same limit holds for products sampled with replacement from any fixed finite pool of independent copies of $X_n$; in particular, it holds for the product of $k$ independent copies. Thus, the freeness coefficient captures the non-commuting characteristic between large random matrices under the finite fourth moment assumption. The improvement of the classical Bai–Yin-type power estimate from the scale $\sigma^k(k{+}1)$ to $\sigma^k \sqrt{k{+}1}$ is a direct corollary of our result. The main technical challenge is to prove the upper bound using a high-moment expansion of $\mathbb{E}\,\mathrm{Tr}((X_n^kX_n^{*k})^m)$. The leading zero-defect trace words are tree-like and are counted by the Fuss–Catalan number \[ F_{k,m}= \frac1{km+1}\binom{(k+1)m}{m}. \] The combinatorial tool helps to devise a defect-sensitive global enumeration: if $L=km$ and \[ r=(L+1-v)+(L-q), \] then the number of admissible word classes with defect $r$ is at most $F_{k,m}(Cm)^{Dr}$. This polynomial-in-$m$ loss, with degree proportional to the defect, is summable in the logarithmic moment range.
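
The two closed-form quantities in this abstract, $\gamma_k$ and $F_{k,m}$, are easy to evaluate numerically. The sketch below is only an illustration with hypothetical function names. It also checks the standard Stirling-type fact that $F_{k,m}^{1/m}\to(k+1)^{k+1}/k^k=\gamma_k^2$ as $m\to\infty$, which is one way to see why the Fuss–Catalan count is compatible with the stated norm limit.

```python
import math


def fuss_catalan(k, m):
    """F_{k,m} = binom((k+1)m, m) / (km + 1), computed exactly in integers."""
    return math.comb((k + 1) * m, m) // (k * m + 1)


def gamma(k):
    """Freeness coefficient gamma_k = sqrt((k+1)^(k+1) / k^k) from the abstract."""
    return math.sqrt((k + 1) ** (k + 1) / k ** k)


def growth_rate(k, m):
    """F_{k,m}^(1/m), evaluated through log-gamma to avoid huge integers."""
    log_f = (math.lgamma((k + 1) * m + 1) - math.lgamma(m + 1)
             - math.lgamma(k * m + 1) - math.log(k * m + 1))
    return math.exp(log_f / m)
```

For $k=1$, `fuss_catalan(1, m)` reduces to the Catalan numbers and `gamma(1)` equals 2. For larger `m`, `growth_rate(k, m)` approaches `gamma(k) ** 2` from below, with a gap that shrinks slowly because of the polynomial prefactor in the asymptotics of $F_{k,m}$.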

21.

arXiv (CS.LG) 2026-06-12 DOI: arXiv:2606.12478

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

Authors:

Gilhan Kim, Daniel K. Park

Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query–key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
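The role of the pairwise couplings can be illustrated with a brute-force toy, assuming binary attention variables $s_i \in \{0, 1\}$ and a distribution $p(s) \propto \exp(\sum_i h_i s_i + \sum_{i<j} J_{ij} s_i s_j)$. The spin convention, the way fields would be derived from query-key scores, and all names below are assumptions for illustration, not the paper's implementation.

```python
import itertools
import math


def boltzmann_marginals(fields, couplings):
    """Exact P(s_i = 1) by enumerating all 2^n states (toy sizes only)."""
    n = len(fields)
    weights = {}
    for state in itertools.product((0, 1), repeat=n):
        score = sum(h * s for h, s in zip(fields, state))
        # Pairwise term: only the upper triangle of `couplings` is read.
        score += sum(
            couplings[i][j] * state[i] * state[j]
            for i in range(n)
            for j in range(i + 1, n)
        )
        weights[state] = math.exp(score)
    z = sum(weights.values())
    return [sum(w for s, w in weights.items() if s[i]) / z for i in range(n)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


fields = [0.5, -1.0, 0.0]  # assumed local fields, one per position
no_coupling = [[0.0] * 3 for _ in range(3)]
cooperative = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

independent = boltzmann_marginals(fields, no_coupling)
coupled = boltzmann_marginals(fields, cooperative)
```

With all couplings at zero the marginals collapse to independent sigmoids of the local fields. A positive coupling between positions 0 and 1 raises both of their marginals and leaves the uncoupled position 2 unchanged.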

22.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2602.22673

Forecasting Bacterial Antimicrobial Resistance Trends Using Machine Learning on WHO GLASS Surveillance Data: A Retrieval-Augmented Generation Approach for Policy Decision Support

Authors:

Md Tanvir Hasan Turja

Background: Antimicrobial resistance (AMR) is a global health threat. While the WHO Global Antimicrobial Resistance and Use Surveillance System (GLASS) provides standardized data, population-level machine learning forecasting of resistance trends remains limited. Translating computational forecasts into policy requires transparent interpretation mechanisms. Methods: Surveillance data (2021-2023) comprising 5,909 observations across 44 countries and five WHO regions were processed. A rigorous temporal split prevented data leakage. Six models (Naive, Linear, Ridge, XGBoost, LightGBM, LSTM) were benchmarked to forecast one-year-ahead resistance rates using features including prior-year resistance and antibiotic consumption. Evaluation metrics (MAE, RMSE, sMAPE) were computed, with 95% bootstrap confidence intervals for MAE. A local Retrieval-Augmented Generation (RAG) system utilizing Gemma 4 was implemented to translate forecast findings into policy guidance grounded in retrieved WHO documents. Results: XGBoost achieved the best performance (test MAE = 6.13% [95% CI: 5.83-6.44]), an 85.3% error reduction versus the naive baseline (MAE = 41.79%). SHAP analysis identified prior-year resistance as the dominant predictor (50.5% gain), confirming strong autoregressive behavior. Regional forecast error tracked closely with surveillance coverage, ranging from 3.65% in the European Region to 8.61% in South-East Asia. The RAG pipeline generated accurate, source-attributed policy responses without fabricated citations. Conclusion: Short-term AMR rates exhibit strong temporal autocorrelation and can be accurately forecast using gradient boosting. Coupling these forecasts with a hallucination-resistant RAG system provides a scalable, evidence-based decision-support framework for AMR governance.
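The evaluation quantities named in the abstract can be sketched as follows. This is a generic illustration with toy numbers, not the study's pipeline: the sMAPE variant (several definitions are in use), the percentile bootstrap, and every function name are assumptions. Only the two MAE values in the last line come from the abstract.

```python
import math
import random


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; pairs where both values are zero count as 0."""
    terms = [
        0.0 if t == 0 and p == 0 else 2.0 * abs(p - t) / (abs(t) + abs(p))
        for t, p in zip(y_true, y_pred)
    ]
    return 100.0 * sum(terms) / len(terms)


def bootstrap_mae_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for MAE, resampling observation pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(mae([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def relative_error_reduction(baseline_error, model_error):
    return 100.0 * (baseline_error - model_error) / baseline_error


# Toy resistance rates in percent (made-up values for illustration).
observed = [12.0, 35.5, 48.0, 7.5, 60.0, 22.0]
forecast = [10.0, 40.0, 45.0, 9.0, 55.0, 25.0]
interval = bootstrap_mae_ci(observed, forecast)

# Reported MAEs from the abstract: naive 41.79, XGBoost 6.13.
print(round(relative_error_reduction(41.79, 6.13), 1))  # → 85.3
```

The last line reproduces the 85.3% figure, which confirms that the reported reduction is relative to the naive baseline's MAE.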

23.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19474

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

Authors:

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.

24.

medRxiv (Medicine) 2026-06-22 DOI: HASH:bb44de6a401d7ccf28fb0601b05fa944

The Protective Role of Belonging and Socioeconomic Status in Dropout Intent Among Minority Ethnic Students: A Mixed Methods Study

Authors:

Vaportzis, Khan, George, K. K

Improving minority ethnic student retention is a global higher education priority. This mixed-methods study investigated how institutional belonging and socioeconomic status interact to shape dropout intentions among minority university students in the UK (N = 182). Quantitative results revealed that perceived course difficulty and lower subjective socioeconomic status were the strongest predictors of dropout intent. While the interaction between socioeconomic status and difficulty was non-significant, qualitative accounts showed distinct structural vulnerabilities. Financial strain restricted social integration, turning socioeconomic disparities into campus isolation. Conversely, representative curricula, diverse peer networks, and stable cultural in-groups (e.g., religious affiliations, living in the parental home) functioned as essential psychological buffers against academic exhaustion and alienation. Universities must shift from transactional models to sustained structural equity to protect vulnerable student groups.

25.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.13714

TSA: Temporal Slot Activation for Persistent Object-Centric Video Representation

Authors:

Duc Nguyen, Sieu Tran, Hao Vo, Khoa Vo, Duy Minh Ho Nguyen, Nghi D. Q. Bui, Anh Nguyen, Long Mai, Ngan Le

Unsupervised video object-centric learning aims to decompose dynamic scenes into temporally persistent entity representations. Existing recurrent video slot-attention methods propagate a fixed set of slots across frames, but typically assume unconditional slot propagation: every slot is updated and decoded at every frame, regardless of whether its corresponding object is visible. We show that this design violates a basic lifecycle requirement for persistent slots: when an object is absent or fully occluded, its slot should preserve its previous state and avoid explaining unrelated visible content. Instead, unconditional propagation creates two failure pathways: update-induced state drift, where current-frame evidence overwrites the absent object's representation, and decoder-induced reconstruction interference, where the inactive slot remains coupled to reconstruction through decoder attention. We propose Temporal Slot Activation (TSA), a mechanism that learns a per-slot, per-frame activation score $\alpha_{k,t} \in (0, 1)$ without visibility supervision. TSA uses this activation as a shared latent control variable for slot lifecycle modeling. When a slot is inactive, TSA anchors its state to the previous slot via activation-gated updating and suppresses its decoder participation through an activation-dependent additive bias on attention logits before softmax normalization. This jointly reduces state drift and reconstruction-driven interference. To improve decisions under partial occlusion and gradual reappearance, TSA further conditions activation prediction on a per-slot temporal memory produced by a Temporal Context Encoder. We evaluate TSA on MOVi-C/E, YT-VIS, and OVIS benchmarks using both standard and tracking-based metrics (FG-ARI, mBO, IDF1, HOTA). TSA consistently improves object decomposition and temporal identity preservation, with large gains on long, heavily occluded videos.
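The two mechanisms described above, activation-gated updating and an additive bias on decoder attention logits, can be sketched in a few lines. The specific forms used here (a convex blend for the update and a $\log \alpha$ bias) are assumptions chosen for illustration; the abstract does not state the exact parameterization, and the function names are hypothetical.

```python
import math


def gated_slot_update(prev_slot, candidate_slot, alpha):
    """Blend the previous and candidate slot states with activation alpha.

    Assumed form: alpha near 0 keeps the previous state (object absent),
    alpha near 1 accepts the current-frame update.
    """
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_slot, candidate_slot)]


def biased_softmax(logits, alphas, eps=1e-8):
    """Softmax over slots after adding log(alpha) to each slot's logit.

    Assumed form: adding log(alpha) before softmax scales each slot's
    unnormalized weight by alpha, so inactive slots receive little attention.
    """
    biased = [logit + math.log(a + eps) for logit, a in zip(logits, alphas)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Three slots with equal logits; the third is almost inactive.
weights = biased_softmax([0.0, 0.0, 0.0], [0.99, 0.99, 0.001])
frozen = gated_slot_update([1.0, 2.0], [5.0, -3.0], alpha=0.0)
```

In this toy setting the nearly inactive slot receives a negligible share of the attention, and a slot with zero activation keeps its previous state unchanged, which mirrors the two failure pathways the abstract aims to close.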
