Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-16

Poster: EdgeCitadel – Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

arXiv:2606.14710v1 Announce Type: cross Abstract: Edge-resident AI agents increasingly span home servers, IoT hubs, laptops, and phones, yet their coordination stacks still assume cloud-style transports or a central relay. We present EdgeCitadel, an edge multi-agent orchestration platform built around a single NATS 2.10 server with the built-in MQTT adapter. The design combines MQTT connectivity for heterogeneous agents, JetStream-backed persistence and replay for backend services, direct peer delegation over a shared subject namespace, and a passive aggregator that visualizes and stores traffic without sitting on the delivery path. Our poster highlights the migration from MQTT relay prototypes (common in IoT communication) to the current hybrid architecture and demonstrates a working cross-device testbed spanning ARM64, x64, and Android clients.

02.
arXiv (CS.AI) 2026-06-17

Beyond the Sampled Token: Preserving Candidate Support in RLVR

arXiv:2510.14807v3 Announce Type: replace Abstract: We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR), from the perspective of the candidate distribution for next-token prediction. We formally show that as probability concentrates on the top-$1$ candidate, the expected number of distinct responses collapses to one regardless of the sampling budget $K$. This theoretical implication is further verified by our empirical tracking of top-$N$ candidate probabilities during training, where the top-$1$ candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: preserving non-negligible probability mass on the top-$N$ candidates. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among top-$N$ candidates for correct responses, and applies a stronger penalty to the top-$1$ candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@$K$ at the cost of pass@1, CaSP improves pass@$K$ across the full $K$ spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scales to 32B-parameter models and sampling budgets up to $K=1024$, positioning it as a principled, candidate-level approach for RLVR exploration.

03.
arXiv (CS.LG) 2026-06-16

Learning Topological Representations for Molecular Dynamics

arXiv:2606.14737v1 Announce Type: cross Abstract: Molecular dynamics (MD) simulations generate trajectories in a high-dimensional configuration space whose analysis critically depends on molecular descriptors, typically handcrafted observables or learned kinetic embeddings. Designing descriptors that are both expressive and broadly applicable, however, remains challenging. We study persistent homology (PH) as a general-purpose representation for MD and introduce the masked Flood complex, a protein-tailored modification of a recently introduced simplicial complex construction that emphasizes inter-residue structure at low computational cost. Vectorized persistence diagrams then provide information-rich, geometry-aware summaries of protein conformations, which we evaluate on protein class prediction, frame-level observable regression, and Markov state model (MSM) estimation from learned low-dimensional coordinates in a single shared representation space. Results on the mdCATH dataset show that PH-based descriptors are competitive across tasks, with masked Flood PH yielding the most consistent overall performance. Further, when using topologically-informed MSMs as a drop-in replacement within the recent MarS-FM framework for generative modeling of protein conformations, we obtain consistently better ensemble statistics than MSMs based on physical observables. Finally, we explore the transferability of the generative model to qualitatively different, fast folding, proteins.

04.
arXiv (CS.CL) 2026-06-12

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

05.
arXiv (CS.CL) 2026-06-19

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

06.
arXiv (quant-ph) 2026-06-11

On the Addressability Problem on CSS Codes

arXiv:2502.13889v4 Announce Type: replace Abstract: Recent discoveries in asymptotically good quantum codes have intensified research on their application in quantum computation and fault-tolerant operations. This study focuses on the addressability problem within CSS codes: we ask what circuits might implement logical gates on strict subsets of logical qubits. With some notion of fault-tolerance, we prove several impossibility results: for CSS codes with non-zero rate, one cannot address a logical $H$, $HS$, $SH$, or $\mathsf{CNOT}$ to any non-empty strict subset of logical qubits using a circuit made only from 1-local Clifford gates. Furthermore, we show that one cannot permute the logical qubits in a code purely by permuting the physical qubits, if the rate of the code is (asymptotically) greater than 1/3 and the distance is at least 3. We can show a similar no-go result for $\mathsf{CNOT}$s and $\mathsf{CZ}$s between two such high-rate codes, albeit under a more restrictive assumption on the circuit, which we call "global" (though recent addressable CCZ gates use global circuits). This work pioneers the study of distance-preserving addressability in quantum codes, mainly by considering automorphisms of the code. This perspective offers new insights and potential directions for future research. We argue that studying this trade off between addressability and efficiency of the codes is essential to understand better how to do efficient quantum computation.

07.
arXiv (CS.CV) 2026-06-17

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.

08.
arXiv (CS.LG) 2026-06-17

Meta-classification of one-class classification models using ranking correlation and nearest neighbor

arXiv:2606.17858v1 Announce Type: new Abstract: Machine Learning (ML) techniques have been applied to various problems. However, applying ML to ML models is an unexplored direction. For this purpose, this paper considers a meta-classification of one-class classification (OCC) models, because all ML models could be approximated as OCC models. The proposal represents OCC models as normality rankings and classifies them using nearest-neighbor and ranking-correlation metrics. The experiment classifies OCC models, where classes correspond to training datasets, algorithms, and hyperparameters. The proposal achieves high accuracy when class labels are datasets. Moreover, it can classify algorithms when the training datasets contain the same class. In addition, the discussion highlights that the classification of OCC models is essentially the classification of datasets that treats multiple samples as a single input. The experiment demonstrates the classification of datasets using sleeping records. The proposed method can provide a unified solution for classifying OCC models, datasets, and rankings. Source code is uploaded to the public repository https://github.com/ToshiHayashi/ClassOCC.

09.
arXiv (CS.CL) 2026-06-16

Bridging Passive and Active: Enhancing Conversation Starter Recommendation via Active Expression Modeling

Large Language Model (LLM)-driven conversational search is shifting information retrieval from reactive keyword matching to proactive, open-ended dialogues. In this context, Conversation Starters are widely deployed to provide personalized query recommendations that help users initiate dialogues. Conventionally, recommending these starters relies on a closed "exposure-click" loop. Yet, this feedback loop mechanism traps the system in an echo chamber where, compounded by data sparsity, it fails to capture the dynamic nature of conversational search intents shaped by the open world. As a result, the system skews towards popular but generic suggestions. In this work, we uncover an untapped paradigm shift to shatter this harmful feedback loop: harnessing user "free will" through active user expressions. Unlike traditional recommendations, conversational search empowers users to bypass menus entirely through manually typed queries. The open-world intents in active queries hold the key to breaking this loop. However, incorporating them is non-trivial: (1) there exists an inherent distribution shift between active queries and formulated starters. (2) Furthermore, the "non-ID-able" nature of open text renders traditional item-based popularity statistics ineffective for large-scale industrial streaming training. To this end, we propose Passive-Active Bridge (PA-Bridge), a novel framework that employs an adversarial distribution aligner to bridge the distributional gap between passively recommended starters and active expressions. Moreover, we introduce a semantic discretizer to enable the deployment of popularity debiasing algorithms. Online A/B tests on our platform, demonstrate that PA-Bridge significantly boosts the Feature Penetration Rate by 0.54% and User Active Days by 0.04%.

10.
arXiv (CS.CV) 2026-06-16

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

End-to-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary task-level supervision, but applying RL to real-world autonomous driving is challenging in offline settings without interactive simulators, where datasets are dominated by expert actions and provide limited behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples imitation and reward optimization into separate actors to alleviate objective conflicts, uses imagined future rollouts for long-horizon reward estimation, and introduces a competition mechanism that selectively transfers beneficial behaviors while keeping RL anchored to expert-like driving. Experiments on the nuScenes benchmark show that CoIRL-AD consistently improves robustness over strong IL-based baselines, with especially large gains in cross-city generalization and long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

11.
arXiv (CS.CL) 2026-06-24

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional texts, such as recipes, it has insufficiently addressed the complex logical structures, including conditional branching and parallel execution, that are pervasive in real-world regulatory and administrative documents. Furthermore, existing benchmarks are limited by simplistic schemas and shallow logical dependencies, restricting progress toward logic-aware large language models.To bridge this Logic Gap, we introduce BREX, a carefully curated benchmark comprising 409 real-world business documents and 2,855 expert-annotated rules. Unlike prior datasets centered on narrow service scenarios, BREX spans over 30 vertical domains, covering scientific, industrial, administrative, and financial regulations. We further propose ExIde, a structure-aware reasoning framework that investigates five distinct prompting strategies, ranging from implicit semantic alignment to executable grounding via pseudo-code generation. This enables explicit modeling of rule dependencies and provides an out-of-the-box framework for different business customers without finetuning their own large language models. We benchmark ExIde using 13 state-of-the-art large language models. Our extensive evaluation reveals that executable grounding serves as a superior inductive bias, significantly outperforming standard prompts in rule extraction. In addition, reasoning-optimized models demonstrate a distinct advantage in tracing long-range and non-linear rule dependencies compared to standard instruction-tuned models.

12.
arXiv (quant-ph) 2026-06-16

Quantum Energy Teleportation under Equilibrium and Nonequilibrium Environments

arXiv:2511.01518v3 Announce Type: replace Abstract: Quantum energy teleportation (QET), implemented via local operations and classical communication, enables carrier-free energy transfer by exploiting quantum resources. While QET has been extensively studied theoretically and validated experimentally in various quantum platforms, enhancing energy output for mixed initial states, as the system inevitably interacts with environments, remains a significant challenge. In this work, we study QET performance in a two-qubit system coupled to equilibrium or nonequilibrium reservoirs. We derive an analytical expression for the energy output in terms of the system Hamiltonian eigenstates, enabling analysis of energy output for mixed states. Using the Redfield master equation, we systematically examine the effects of qubit detuning, nonequilibrium temperature difference, and nonequilibrium chemical potential difference on the energy output. We find that the energy output for mixed states often follows that of the eigenstate with the highest population, and that nonequilibrium environments can enhance the energy output in certain parameter regimes.

13.
arXiv (CS.AI) 2026-06-17

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

arXiv:2606.17649v1 Announce Type: cross Abstract: The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

14.
arXiv (CS.CV) 2026-06-16

Through-Foliage Surface-Temperature Reconstruction for Early Wildfire Detection

We present a method to reconstruct surface temperatures through forest vegetation by combining signal processing and machine learning, enabling fully automated aerial wildfire monitoring with drones for early fire detection. Synthetic aperture (SA) sensing reduces canopy occlusion but introduces thermal blur. To overcome this, we train a visual state space model to recover subtle thermal signals of partially occluded soil and fire hotspots from blurred data. To address limited real-world training data, we generate realistic surface temperature simulations using a latent diffusion model, temperature augmentation, and procedural thermal forest modeling. On simulated datasets, our method reduces RMSE by 2-2.5 versus conventional thermal and uncorrected SA imaging; in field experiments on hotspots, RMSE improved by 12.8-fold and 2.6-fold, respectively. Our approach also generalizes to other thermal signals, including human signatures, capturing morphology and extent – critical where simple thresholding fails – while conventional imaging struggles with partial occlusion.

15.
arXiv (CS.CV) 2026-06-11

Information-Theoretic Decomposition for Multimodal Interaction Learning

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.

16.
arXiv (CS.CV) 2026-06-15

Dual Cross-Attention Siamese Transformer for Rectal Tumor Regrowth Assessment in Watch-and-Wait Endoscopy

Increasing evidence supports watch-and-wait (WW) surveillance for patients with rectal cancer who show clinical complete response (cCR) at restaging following total neoadjuvant treatment (TNT). However, accurate methods to early detect local regrowth (LR) from follow-up endoscopy images during WW are essential to manage care and prevent distant metastases. Hence, we developed a Siamese Swin Transformer with Dual Cross-Attention (SSDCA) to combine longitudinal endoscopic images at restaging and follow-up and distinguish cCR from LR. SSDCA leverages pretrained Swin Transformers to extract domain agnostic features and enhance robustness to imaging variations. Dual cross attention is implemented to emphasize features from the paired scans without requiring any spatial alignment to predict response. SSDCA as well as Swin-based baselines were trained using image pairs from 135 patients and evaluated on a held-out set of image pairs from 62 patients. SSDCA produced the best balanced accuracy (81.76% $\pm$ 0.04), sensitivity (90.07% $\pm$ 0.08), and specificity (72.86% $\pm$ 0.05). Robustness analysis showed stable performance irrespective of artifacts including blood, stool, telangiectasia, and poor image quality. UMAP clustering of extracted features showed maximal inter-cluster separation (1.45 $\pm$ 0.18) and minimal intra-cluster dispersion (1.07 $\pm$ 0.19) with SSDCA, confirming discriminative representation learning. Code and weights available at: https://github.com/Jotanator/SSDCA

17.
arXiv (CS.CL) 2026-06-12

Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

18.
arXiv (CS.AI) 2026-06-18

The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace

arXiv:2606.00182v2 Announce Type: replace-cross Abstract: Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.

19.
arXiv (CS.AI) 2026-06-18

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

arXiv:2606.18897v1 Announce Type: cross Abstract: Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high information density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. It contains personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, consistently outperforming state-of-the-art baselines while providing human-understandable explanations.

20.
arXiv (CS.LG) 2026-06-17

Learn from Your Mistakes: Self-Correcting Masked Diffusion Models

arXiv:2602.11590v3 Announce Type: replace Abstract: Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models, enabling parallel token generation while achieving competitive performance. Despite these advantages, MDMs face a fundamental limitation: once tokens are unmasked, they remain fixed, leading to error accumulation and ultimately degrading sample quality. We address this by proposing a framework that trains a model to perform both unmasking and correction. By reusing outputs from the MDM denoising network as inputs for corrector training, we train a model to recover from potential mistakes. During generation we apply additional corrective refinement steps between unmasking ones in order to change decoded tokens and improve outputs. We name our training and sampling method Progressive Self-Correction (ProSeCo) for its unique ability to iteratively refine an entire sequence, including already generated tokens. We conduct extensive experimental validation across multiple conditional and unconditional tasks, demonstrating that \method~yields better quality-efficiency trade-offs (up to ~4x faster sampling) and enables inference-time compute scaling to further increase sample quality beyond standard MDMs (up to ~1.2x improvement on benchmarks).

21.
arXiv (quant-ph) 2026-06-15

Experimental violation of a Bell-like inequality for causal order

arXiv:2506.20516v2 Announce Type: replace Abstract: Quantum mechanics is compatible with scenarios where physical processes happen in an indefinite order. In theory, this feature could be detected through violations of inequalities on the observed correlations, analogous to Bell inequalities. However, experimental demonstrations of such violations have been missing until recently due to the complexity of the required setup. Here we report an experimental violation of a Bell-like inequality involving the correlations of four parties, one of which is spacelike separated from the others. Our demonstration employs 3 km fiber spools to simulate spacelike separation, and achieves high-speed operations in photonic time-bin encoding, nanosecond synchronization, and accurate temperature stabilization. These experimental advances enable a violation by 5.7 standard deviations and open a path towards a certification of indefinite order in conditions that guarantee spacelike separation with existing state-of-the-art devices. However, the certification is not device-independent, as it relies on knowledge about the setup to exclude bidirectional signaling–a loophole inherent to implementations in classical acyclic spacetimes, which may be resolved in future quantum-spacetime tests.

22.
arXiv (CS.CV) 2026-06-16

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

23.
arXiv (CS.AI) 2026-06-16

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

arXiv:2606.01602v2 Announce Type: replace-cross Abstract: Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

24.
arXiv (CS.CV) 2026-06-16

Context-Aware RL for Agentic and Multimodal LLMs

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query–answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query–context–answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.

25.
arXiv (CS.AI) 2026-06-17

Timestamp-Aware Spatio-Temporal Graph Contrastive Learning for Network Intrusion Detection

arXiv:2606.17109v1 Announce Type: cross Abstract: Given their effectiveness in modeling the relational structure among network traffic flows, graph neural networks (GNNs) have been widely adopted in network intrusion detection systems (NIDSs). However, most existing GNN-based NIDS approaches focus on the relational structure of traffic flows, and treat them as temporally independent, which limits their ability to cope with evolving attack behaviors. Moreover, their reliance on supervised or semi-supervised learning often restricts generalization to unseen attacks. To address these limitations, we propose a novel self-supervised GNN-based framework. To the best of our knowledge, the proposed model is among the first self-supervised GNN-based NIDS models to explicitly leverage real timestamps, which provides faithful temporal dependencies for representation learning. We first construct a series of temporal graphs from network traffic flows according to their timestamps, and then employ an E-GraphSAGE and LSTM based encoder to fully extract temporal information and spatial dependencies of network traffic, without introducing time-costly attention mechanisms. A multi-view graph contrastive learning (GCL) scheme is introduced, where temporal, spatial, and feature contrasts are jointly performed to capture temporal continuity, preserve structural consistency, and improve the generalization and robustness of the learned representations, respectively. In addition, a gradient-norm-based adaptive weighting strategy is designed to optimize the contrastive loss weights. Experimental results on four representative NIDS datasets with real timestamps demonstrate that our method significantly outperforms existing self-supervised approaches and achieves performance comparable to the supervised state-of-the-art GNN method, while maintaining high computational efficiency.