Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-19

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

arXiv:2606.19371v1 Announce Type: cross Abstract: Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.

02.
arXiv (CS.LG) 2026-06-15

Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

arXiv:2606.14334v1 Announce Type: new Abstract: High-dimensional datasets often concentrate near low-dimensional structures, but estimating their geometry from samples typically relies on graphs and kernels that scale poorly with dataset size and dimension. We propose Riemannian metric matching: a denoising probabilistic framework for learning the Riemannian geometry of data using neural networks. Specifically, we learn the carré du champ operator, which, using diffusion geometry, gives us access to the Riemannian geometry toolkit for downstream machine learning and statistical tasks. Our key observation is that the carré du champ operator can be formulated as a conditional expectation over random perturbations of the data, which can be exploited for sample-wise training and constant cost, amortized inference without explicit kernel construction. Empirically, metric matching rivals or improves the accuracy of $k$-NN-based diffusion geometry estimators, while enabling amortized inference that is up to $400\times$ faster, and supports graph-free geometric analysis on high-dimensional images where nearest neighbors break down.

03.
arXiv (CS.CL) 2026-06-18

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.

04.
arXiv (CS.CV) 2026-06-16

Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.

05.
arXiv (CS.CL) 2026-06-11

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.

07.
arXiv (CS.LG) 2026-06-11

Higher-Order Token Interactions via Quantum Attention

arXiv:2606.11673v1 Announce Type: cross Abstract: Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce Quantum Higher-Order Attention (QHA), a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp=o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $\Omega(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically – while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k\le6$ from disjoint inputs, whereas the larger classical attention head collapses past order~2; consistent with theory, the size of the advantage tracks the target's Fourier degree - largest for parity and shrinking when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains - genetic epistasis, learning-parity-with-noise, and graph triangle detection - reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.

08.
arXiv (CS.LG) 2026-06-16

Auditing Machine Unlearning: A Systematic Research on Whether Models Truly Forget

arXiv:2606.16110v1 Announce Type: new Abstract: Machine unlearning has been extensively studied in response to growing privacy concerns and regulatory requirements. However, auditing whether unlearning algorithms have truly erased the influence of specific data remains an open challenge. The lack of reliable and practical auditing mechanisms can lead to critical privacy risks, such as residual information leakage. This paper initiates a systematic investigation into whether existing unlearning algorithms can truly forget the designated data. We propose the first practical and general-purpose auditing framework for machine unlearning, inspired by the concept of proof of ignorance. Our framework addresses the key practicality limitations of existing methods by eliminating the need for retraining-from-scratch baselines, avoiding the training of large numbers of shadow models, and requiring no intrusive intervention in the original training process. To evaluate the effectiveness of our framework, we first conduct validation experiments to verify its soundness and completeness. We then perform comprehensive experiments across six datasets and ten representative unlearning methods. The results demonstrate that our framework reliably distinguishes between successful and failed unlearning. In particular, we observe that retraining-based and fine-tuning-based methods can achieve effective unlearning, even when the target data remain in the original dataset. In contrast, de-optimization-based methods fail to achieve true unlearning and instead degrade the model's performance. Fisher/Hessian-based methods also fail to unlearn requested data, even formal certification is provided. Moreover, we show that our framework is robust against fake unlearning attempts and generalizes well to large language models.

09.
arXiv (quant-ph) 2026-06-12

Quantum charge pumping in helical systems: A comparative study of short- and long-range hopping

arXiv:2606.12914v1 Announce Type: cross Abstract: Using the Keldysh non-equilibrium Green's function approach, we investigate charge pumping through a single-stranded helical structure described by a tight-binding model that includes either short-range hopping (SRH) or long-range hopping (LRH). While quantum pumping has been studied in various low-dimensional systems, the detailed behavior of the spectral current and the pumped dc current in helical geometries in the presence of higher-order electron hopping (beyond nearest neighbors) has not yet been systematically explored. Here, we focus on the interplay between helicity and extended hopping ranges, analyzing how they jointly control the energy-resolved and dc pumped currents under time-periodic end potentials. For LRH, the pumped dc current exhibits pronounced plateau-like regions as a function of chemical potential when energy levels are sparsely spaced – consistent with adiabatic transport – whereas SRH yields more parameter-sensitive currents without clear plateaus. The plateau stability is controlled by the drive frequency: at higher frequencies, Floquet side-band mixing destroys the plateaus, leading to oscillatory currents. The phase dependence remains nearly sinusoidal, and the current vanishes at zero phase lag, confirming the necessity of out-of-phase potentials. Crucially, in helical systems, the decay exponent $(\ell_c)$ acts as an effective structural parameter that can tune both the magnitude and sign of the pumped current, offering a geometric knob for controlling quantum pumping. Our findings not only fill a gap in the understanding of spectral and pumped currents in helical systems with extended hopping but also provide tools that can be applied to analyze similar phenomena in other chiral or quasi-one-dimensional systems.

10.
arXiv (CS.CL) 2026-06-24

Quantifying Prior Dominance in RAG Systems

作者:

Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from ''epistemic blindness'' - failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that ``Prior Dominance'' correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.

11.
arXiv (CS.AI) 2026-06-24

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

arXiv:2606.24450v1 Announce Type: cross Abstract: Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body's pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot's proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: https://soham2560.github.io/no-contact-no-worries/

13.
bioRxiv (Bioinfo) 2026-06-11

A systematic imputation framework for sparse, multimodal space biology datasets: application to retinal imaging and omics from the RR9 mission

Space biology experiments are expensive, logistically complex, and inherently limited in sample size, resulting in datasets that are frequently incomplete and highly heterogeneous (2). Missing data is a fundamental barrier to building reliable computational models of how the human body responds to spaceflight. This work introduces a systematic framework for addressing missing data through imputation. We developed a validated four-stage framework for imputation specifically designed to preserve biological signal needed for digital twin development, while quantifying trade-offs in downstream analyses. Using retinal imaging and omics data from the NASA RR9 mission as a case study (9), we demonstrate how to diagnose why data is missing(10), select and optimize appropriate imputation strategies (5,10), and rigorously evaluate whether imputed data remains biologically meaningful. A key finding of this work is that while imputation substantially improves the performance of predictive models, it can simultaneously obscure subtle biological patterns; a critical trade-off that researchers must understand before applying these methods (11). This framework provides practical, actionable guidance for space biologists and data scientists working with sparse, multimodal datasets in space biology, and represents a foundational step toward more complete and reliable data-driven models of human physiology in extreme environments.

14.
arXiv (quant-ph) 2026-06-11

Quantum iterative approach to the Traveling Salesman Problem

arXiv:2606.11843v1 Announce Type: new Abstract: The Traveling Salesman Problem (TSP) is a classical NP-hard problem in combinatorial optimization, where determining the shortest route among a set of cities becomes computationally prohibitive as the problem size increases. This work explores quantum computing as an alternative approach to address this complexity. Unlike existing methods that primarily rely on quantum annealing, we propose a quantum iterative framework integrating Quantum Phase Estimation (QPE) and Grover's search algorithm. Route costs are encoded as quantum phases, enabling QPE to efficiently evaluate them, while Amplitude Amplification, implemented via the Grover-Long algorithm, iteratively refines the solution space toward the optimal route. A proof-of-concept case study on a small-scale TSP instance demonstrates the feasibility of this approach and its potential for scaling to larger optimization problems. Furthermore, under an expectation-based analysis, the algorithm exhibits an expected computational complexity of $O(\frac{m^2\log_2(m)\log_2(1/\epsilon)}{\sqrt{\epsilon}})$ which depends on the error tolerance parameter $\epsilon$. This estimation omits the initialization term, which we expect future refinements to render subdominant to Phase Estimation.

15.
arXiv (CS.CL) 2026-06-16

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology structured memory bank via explicit, non destructive consolidation, avoiding the irreversible information loss typical of summarization based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence scored topology edits executed with gating using three operators: Split, Merge, and Update, while preserving immutable evidence for traceability. At query time, typed links enable hop bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LoCoMo and LongMemEval-s show improved retrieval and QA over representative baselines. The code is available at https://github.com/LvCan926/All-Mem.

16.
arXiv (CS.CL) 2026-06-17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

17.
arXiv (CS.LG) 2026-06-19

Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection

arXiv:2606.19411v1 Announce Type: new Abstract: Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning – data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (\operatorname{DPP} s) give a principled, well-calibrated notion of diversity for this task, but their MAP objective – pick a size-$k$ subset $S$ maximizing $\logdet(L_S)$ – is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size $n$. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where $n$ ranges over millions to billions of candidate examples, features, or embeddings. We recast \operatorname{DPP}-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (\operatorname{NEP}v) of a previously unstudied form. This \operatorname{NEP}v\ admits a self-consistent field (\operatorname{SCF}) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm, \OurMethod, requires only matrix-vector products with the kernel and runs in time $O\!\big((ndk+nk^2)\,t\big)$ for a small number of iterations $t$, scaling near-linearly in $n$ and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.

18.
arXiv (quant-ph) 2026-06-16

On-chip semi-device-independent quantum random number generator exploiting contextuality

arXiv:2601.08392v2 Announce Type: replace Abstract: We present a semi-device-independent quantum random number generator (QRNG) based on the violation of a contextuality inequality, implemented by the integration of two silicon photonic chips. Our system combines a heralded single-photon source with a reconfigurable interferometric mesh to implement qutrit state preparation, transformations, and measurements suitable for testing a KCBS contextuality inequality. This architecture enables the generation of random numbers from the intrinsic randomness of single-photon interference in a complex optical network, while simultaneously allowing a quantitative certification of their security without requiring entanglement. We observe a contextuality violation exceeding the classical bound by more than 10{\sigma}, unambiguously confirming non-classical behavior. From this violation, we certify a conditional min-entropy per experimental round of Hmin = 0.077 +- 0.002, derived via a tailored semidefinite-programming-based security analysis. Each measurement outcome therefore contains at least 0.077 +- 0.002 bits of extractable genuine randomness, corresponding to an asymptotic generation rate of 21.7 +- 0.5 bits/s. These results establish a viable route towards general-purpose, untrusted quantum random number generators compatible with practical integrated photonic quantum networks.

19.
arXiv (CS.CL) 2026-06-17

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.

20.
arXiv (quant-ph) 2026-06-17

Engineering entanglement and transport in interacting quantum walks with tailored potentials

arXiv:2606.17825v1 Announce Type: new Abstract: Controlling the interplay between particle propagation and quantum correlation generation is a central challenge in quantum transport. Here, we investigate two distinguishable continuous-time quantum walkers evolving on parallel one-dimensional lattices, interacting via distance-dependent potentials. While on-site interactions reproduce the typical bosonic behaviour, extending the interaction to a linear potential over multiple neighbors introduces controlled Bloch-like oscillations and shifts the bound-pair regime to stronger couplings. More generally, we explore a Coulomb-like interaction parameterized by strength, spatial scaling, and decay rate. This reveals a rich phase diagram including four distinct dynamical regimes: (i) a high-entropy, oscillatory regime akin to a linear potential; (ii) a strongly localized, bound-pair regime; (iii) a novel intermediate regime combining near-ballistic spreading with strong correlations; and (iv) a weakly interacting, free-propagation regime. Notably, regime (iii) achieves concurrent optimization of transport efficiency and entanglement, offering a sweet spot for correlated quantum dynamics. Our results provide a tool for designing interaction-engineered quantum walks with potential applications in quantum information processing and simulations.

22.
arXiv (CS.CL) 2026-06-11

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.

23.
arXiv (CS.CL) 2026-06-24

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text to Speech(TTS) introduces high variance due to linguistic diversity and multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preference using SHAP analysis and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.

24.
arXiv (CS.CL) 2026-06-16

SimSiam Naming Game: A Unified Approach for Emergent Communication and Representation Learning

Emergent Communication (EmCom) investigates how agents develop symbolic communication through interaction without predefined language. Recent frameworks, such as the Metropolis–Hastings Naming Game (MHNG), formulate EmCom as the learning of shared external representations negotiated through interaction under joint attention, without explicit success or reward feedback. However, MHNG relies on sampling-based updates that suffer from high rejection rates in high-dimensional perceptual spaces, making the learning process sample-inefficient for complex visual datasets. In this work, we propose the SimSiam Naming Game (SSNG), a feedback-free EmCom framework that replaces sampling-based updates with a symmetric, self-supervised representation alignment objective between autonomous agents. Building on a variational inference–based probabilistic interpretation of self-supervised learning, SSNG formulates symbol emergence as an alignment process between agents' latent representations mediated by message exchange. To enable end-to-end gradient-based optimization, discrete symbolic messages are learned via a Gumbel–Softmax relaxation, preserving the discrete nature of communication while maintaining differentiability. Experiments on CIFAR-10 and ImageNet-100 show that the emergent messages learned by SSNG achieve substantially higher linear-probe classification accuracy than those produced by referential games, reconstruction games, and MHNG. These results indicate that self-supervised representation alignment provides an effective mechanism for feedback-free EmCom in multi-agent systems.

25.
arXiv (CS.CV) 2026-06-15

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter to incorporate and align metric depth into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust depth density and distribution invariant feature extraction. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion – without the necessity of finetuning. Most importantly, we achieve 56.05 mIoU in the SUN-RGBD segmentation, while outperforming SOTA depth-aware and multi-modal encoders in our experiments. When no depth is present, one can activate our depth adapter with an empty map, use single pixel depth clues, or monocular depth estimation to include the depth aware feature extraction into subsequent downstream tasks.