Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-24

Governed Shared Memory for Multi-Agent LLM Systems

arXiv:2606.24535v1 Announce Type: new Abstract: Multi-agent LLM environments require robust mechanisms for shared knowledge management. This paper formalizes the fleet-memory problem and identifies four foundational failure modes: unauthorized leakage, stale propagation, contradiction persistence, and provenance collapse. To address these, we define explicit systems-level primitives: scoped retrieval, temporal supersession, provenance tracking, and policy-governed memory propagation. These primitives are implemented in MemClaw, a production multi-tenant memory service, and evaluated via ArgusFleet, a reproducible harness testing four governance dimensions. Rather than a baseline comparison, this study measures a live production service, emphasizing real-world architectural insights and negative results. Key Evaluation Results Provenance: Successfully reconstructed 100% of depth-four derivation chains with correct writer identity at sub-second per-hop latency. Propagation: Demonstrated high intra-fleet visibility with zero cross-fleet leakage. Under strong write mode, write-to-visible latency was optimized to a single search round-trip. Production Architectural Issues Discovered Asymmetric Scope Enforcement: Tenant isolation held, but sub-tenant scope was initially bypassed on direct GET-by-id requests for agent-scoped credentials (disclosed and remediated during the study). Pipeline Ordering Conflict: While contradiction supersession works for admitted writes, a synchronous near-duplicate gate can prematurely reject contradictory writes before the asynchronous contradiction detector can evaluate them. Conclusion: Long-context retrieval alone is insufficient for production multi-agent memory. Governed shared memory demands explicit systems-level abstractions, and live evaluation is vital to expose enforcement and pipeline-ordering failures missed by design-only treatments.

02.
arXiv (CS.AI) 2026-06-19

Human-like autonomy emerges from self-play and a pinch of human data

arXiv:2606.19370v1 Announce Type: cross Abstract: Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.

03.
arXiv (CS.AI) 2026-06-18

Scaling Learning-based AEB with Massive Unlabeled Data

arXiv:2606.18864v1 Announce Type: cross Abstract: This paper studies how to scale learning-based automatic emergency braking (AEB) with massive unlabeled fleet data under production constraints. Our approach is based on meta-feedback semi-supervised learning (MF-SSL), where a teacher generates pseudo labels for unlabeled driving data and is updated using a small labeled anchor set as safety-critical feedback. In production, anchor ambiguity and labeled-unlabeled mismatch can amplify systematic pseudo-label errors, leading to spurious triggers. We propose a stabilized MF-SSL framework with (i) Noise-Aware Decoupling, which removes ambiguity-prone anchors from the teacher's supervised update path, and (ii) kinematics-gated pseudo-labeling with a teacher conflict penalty to suppress mismatch-induced risk hallucinations on unlabeled data while maintaining broad coverage. Extensive experiments show consistent gains as unlabeled data scale from 1M to 1B windows, improving safety while keeping comfort stable. The 1B-trained student model is deployed to hundreds of thousands of vehicles and validated over \$10^9$ km of driving, achieving a positive-to-false activation ratio exceeding 100:1 and a 35% improvement in accident-free driving mileage over a production rule-only baseline.

04.
arXiv (CS.CL) 2026-06-19

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

05.
arXiv (math.PR) 2026-06-24

Deep numerical schemes for systems of Ergodic BSDEs with applications to regime-switching forward utilities

arXiv:2606.24271v1 Announce Type: cross Abstract: In this paper, we introduce two neural-network-based numerical schemes for solving systems of coupled ergodic Backward Stochastic Differential Equations (eBSDEs), motivated by the approximation of optimal strategies within the framework of forward utilities in a regime-switching stochastic factor model. Our approach builds on the representation of such models through systems of eBSDEs introduced in [HLT20]. We first establish a link between the solution of the system of ergodic BSDEs and that of an associated multidimensional BSDE with random terminal time, given by the hitting time of the positive recurrent stochastic factor. Building on this representation, we introduce a locally additive deep learning scheme obtained by minimizing aggregated local error terms. We then present a new Deep Galerkin Method (DGM) inspired algorithm that minimizes the residual of the associated ergodic PDE system, relying on a representation of the ergodic cost. Finally, we apply this framework to regime-switching forward utilities in a stochastic factor model. We first derive a general consistency SPDE that characterizes regime-switching forward utilities and retrieve their representation with systems of ergodic BSDEs in the homothetic case. Numerical experiments demonstrate the performance of the proposed methods, with a particular focus on the impact on forward preferences of taking into account regime switches.

06.
arXiv (CS.AI) 2026-06-24

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

arXiv:2606.24551v1 Announce Type: new Abstract: Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents and skill-mediated CLI agents receive identical goals, states, and final-state verifiers while being restricted to modality-native actions. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%; however, verifier-guided skill augmentation raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by reliable grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.

07.
arXiv (CS.CV) 2026-06-24

D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

Accurate brain tumor segmentation using multi-parametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher-order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce and FLAIR feature representations in latent space, and probability-space decision refinement to mitigate dominant-class overconfidence and improve delineation of underrepresented tumor subregions. We evaluate the proposed D3Seg model on BraTS 2023 Glioma as the primary benchmark and further test it on a subset of the external BraTS 2023 Meningioma cohort to assess generalization across tumor pathologies. The results are compared with the state-of-the-art models under different missing-modality conditions. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing-modality configurations compared to the current state-of-the-art model on BraTS Glioma dataset. Cross-cohort evaluation on BraTS Meningioma dataset demonstrates the generalizability of the proposed model, showing consistent improvements in the challenging TC and ET regions, with approximately 1.5-3.0% and 1.5-6.5% gains respectively across several missing-modality configurations.

08.
arXiv (CS.CV) 2026-06-16

MapDream: Task-Driven Map Learning for Vision-Language Navigation

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

09.
arXiv (CS.LG) 2026-06-24

Similarity of Neural Network Representations in Superposition

arXiv:2604.00208v2 Announce Type: replace Abstract: Comparing internal representations is a central goal in neuroscience and machine learning, but standard linear alignment metrics (Representational Similarity Analysis, Centered Kernel Alignment, and linear regression) are frequently applied to neural activity coordinates rather than on the underlying features. We show this matters when neural systems operate in superposition, encoding more features than they have neurons via linear compression. Closed-form derivations prove that these metrics depend on the Gram matrices of each system's projection, not on the latent features themselves: alignment thus combines what a system represents with how it is encoded. For those interested in what features two systems share, this is a problem: Two networks can have identical feature content yet appear more dissimilar than networks exhibiting partial feature overlap. This apparent misalignment need not reflect lost information as compressed sensing guarantees sparse features remain recoverable from the compressed activity. We confirm this by training supervised TopK sparse autoencoders that realize solvable compressed sensing by construction, finding alignment on recovered latents restored even when raw-activation alignment remains deflated. We extend the result to unsupervised SAEs trained without ground-truth latents, and to pretrained vision and language model SAEs, where SAE-latent alignment exceeds raw-activation alignment, consistent with superposition in real systems.

10.
arXiv (quant-ph) 2026-06-16

On-chip semi-device-independent quantum random number generator exploiting contextuality

arXiv:2601.08392v2 Announce Type: replace Abstract: We present a semi-device-independent quantum random number generator (QRNG) based on the violation of a contextuality inequality, implemented by the integration of two silicon photonic chips. Our system combines a heralded single-photon source with a reconfigurable interferometric mesh to implement qutrit state preparation, transformations, and measurements suitable for testing a KCBS contextuality inequality. This architecture enables the generation of random numbers from the intrinsic randomness of single-photon interference in a complex optical network, while simultaneously allowing a quantitative certification of their security without requiring entanglement. We observe a contextuality violation exceeding the classical bound by more than 10{\sigma}, unambiguously confirming non-classical behavior. From this violation, we certify a conditional min-entropy per experimental round of Hmin = 0.077 +- 0.002, derived via a tailored semidefinite-programming-based security analysis. Each measurement outcome therefore contains at least 0.077 +- 0.002 bits of extractable genuine randomness, corresponding to an asymptotic generation rate of 21.7 +- 0.5 bits/s. These results establish a viable route towards general-purpose, untrusted quantum random number generators compatible with practical integrated photonic quantum networks.

11.
arXiv (math.PR) 2026-06-17

Large deviation principle for friendship-biases in Galton–Watson trees

arXiv:2606.17381v1 Announce Type: new Abstract: In this paper we consider the friendship-bias of the vertices in an infinite rooted Galton–Watson tree. The friendship-bias of a vertex is the difference between the average degree of the neighbours of the vertex and the degree of the vertex itself. A vertex is said to be of type $\chi \in S$, with $S = \{-,0,+\}$, when its friendship-bias is, respectively, strictly negative, zero or strictly positive. We consider the fractions $f_l^\chi$ of vertices of type $\chi \in S$ along a random downward path up to branching depth $l \in \mathbb{N}$ and derive a large deviation principle (LDP) for the triple $(f_l^\chi)_{\chi \in S}$ as $l\to\infty$. The branching depth of a vertex counts the number of branchings that occur along the path that connects the vertex to the root of the tree. The rate in the LDP is $l$, while the rate function in the LDP is identified in terms of a variational formula minimising a relative entropy under a linear constraint. We focus on the case of binary branching, for which the rate function is already quite involved. We identify the qualitative properties of the rate function and show how it can be computed numerically. We briefly indicate how to proceed for more general branching and for vertex types along a tree consisting of a finite number of random downward paths. Our paper is the first to consider large deviations of vertex types.

12.
arXiv (CS.AI) 2026-06-15

Position: AI Must Become Planet-Centered, Not Just Human-Centered

arXiv:2606.13704v1 Announce Type: cross Abstract: This position paper argues that contemporary AI paradigms are insufficient for supporting complex global goals and introduces Planet-Centered AI (PCAI) as a design philosophy and research agenda that reorients AI toward planetary-scale socio-ecological systems and their long-term trajectories. A planet-centered approach is grounded in systems thinking, treating Earth as an interconnected whole of which humans are part. We diagnose recurring limitations across AI frameworks, many of which remain human-centered, and show why these become especially consequential under current planetary conditions characterized by systemic risk, non-stationarity, and deep uncertainty. We then articulate how PCAI reshapes the AI lifecycle, from problem formulation and model design to evaluation and deployment, by emphasizing alignment with global agendas, developing system-aware AI foundations, trajectory-oriented evaluation, and monitorability. Finally, we advance a falsifiable claim: AI systems optimized without explicit consideration of systemic consequences are more likely to exacerbate systemic instability than to mitigate it.

13.
medRxiv (Medicine) 2026-06-18

Hospital-Level Variation in Antenatal Corticosteroids for Late Preterm Births

Objective: To determine whether and to what extent hospitals across the United States vary in their use of late-preterm steroids using a novel data set in which the timing of steroid administration relative to delivery can be observed. Methods: This was a retrospective cohort study of singleton births with known gestational ages identified in the Premier Healthcare Database from 2015 to 2022. The primary variable of interest was hospital-level adoption of antenatal corticosteroids for late-preterm singleton deliveries, calculated as the proportion of late-preterm singleton births (34-36 completed weeks of gestation) with any betamethasone exposure during the same late-preterm period. Hospital adoption was defined as the weighted average rate of ALPS administration among late-preterm infants across the entire post-period. Hospitals were ranked by their late-preterm steroid adoption rates and categorized by quartile based on the empirical distribution. Temporal trends were assessed using annual hospital-level adoption rates and visualized using time-series plots and distributional plots. A logistic regression model was constructed to determine hospital characteristics associated with being a highest-quartile adopting hospital. Results: The analysis cohort included 728 hospitals and 5,452,791 births, of which 361,006 (6.6%) were singleton late preterm births. Hospital steroid exposure rates ranged from 0 to 82% and were categorized into quartiles based on overall exposure rate, with cutoffs at 20.6%, 29.8%, and 40.1%. Median exposure rates increased progressively across quartiles from 14.1% (IQR 9.3-17.4%) in the lowest adopting hospitals (Q1) to 47.6% (IQR 43.7-53.2%) in the highest adopting hospitals (Q4), with substantial within-quartile variation. In the multivariable model, urban location was a strong predictor of high adoption after adjustment (aOR 2.05; 95% CI 1.11-3.83, p=0.02). Compared to Midwest hospitals, Southern hospitals had significantly lower odds of being high adopters (aOR 0.37; 95% CI 0.20-0.69, p

14.
arXiv (CS.CV) 2026-06-17

Adversarial Attacks Leverage Interference Between Features in Superposition

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition - the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.

15.
arXiv (CS.CL) 2026-06-16

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: Error Propagation, where new tokens absorb toxic information from erroneous context, and Local Error Reinforcement, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted Anchor Tokens, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4\% while accelerating inference throughput by up to 7.2$\times$.

16.
arXiv (CS.CL) 2026-06-16

Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

17.
arXiv (quant-ph) 2026-06-11

Quantum repeater segment with free-space coupled co-trapped ions using telecom photon interference

arXiv:2606.12313v1 Announce Type: new Abstract: A quantum repeater segment is a basic building block of a quantum repeater, generating buffered entanglement of quantum memories to connect quantum repeater cells. It also enables the connection between quantum computers. In the implementation we present here, photons emitted from two co-trapped free-space coupled $^{40}$Ca$^+$ ions are converted to the telecom-C band and interfered after transmission over 440$\,$m of optical fiber (220$\,$m per arm), where a photonic Bell measurement is performed to create entanglement between the memories. With this scheme we generate an entangled $\left|\Psi^+\right\rangle$ Bell state with $\ge 68(8)\,$% fidelity, highlighting trapped $^{40}$Ca$^+$ ions as a promising quantum repeater hardware platform.

18.
arXiv (CS.AI) 2026-06-19

PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms

arXiv:2601.03040v2 Announce Type: replace-cross Abstract: A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through the physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.

19.
arXiv (quant-ph) 2026-06-15

Note on the local calculation of decoherence of quantum superposition in the static black holes

arXiv:2606.14178v1 Announce Type: cross Abstract: We investigate the decoherence of a quantum spatial superposition of a static particle in Schwarzschild and Reissner-Nordstr\"{o}m black holes. By treating the particle as a localized classical source coupled to a quantum scalar field, we reformulate the decoherence process in the Danielson-Satishchandran-Wald (DSW) gedankenexperiment through coherent state generation and derive the local expression for the decoherence functional in terms of the Wightman function. In the long-time limit, the decoherence rate is shown to be characterized by the low-frequency behavior of the Wightman function. We then employ the asymptotic matching method to calculate the analytical expressions of the Wightman functions in the Boulware, Unruh, and Hartle-Hawking vacua. We show that the decoherence behavior depends on the quantum state of the environmental field. While the Boulware vacuum gives vanishing decoherence for a static superposition, the thermal effects associated with Hawking radiation in the Unruh and Hartle-Hawking vacua can induce nonvanishing decoherence.

20.
arXiv (CS.CV) 2026-06-18

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing data, demonstrating significant application value in land use monitoring and urban sustainable development. However, literature MMCD approaches exhibit limitations in both cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts multimodal feature importance based on semantic priors obtained from visual foundation models. Finally, we introduce the novel Delta-SN6 dataset, the first openly-accessible multiclass MMCD benchmark consisting of very-high-resolution fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan datasets demonstrate that our method outperforms the state-of-the-art by 3.21%, 0.87%, and 1.32% in mIoU, respectively.

21.
arXiv (CS.CV) 2026-06-17

EventDrive: Event Cameras for Vision-Language Driving Intelligence

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

22.
arXiv (CS.CL) 2026-06-16

Data-Driven Decoding of Russell's Circumplex Model of Affect

Affective computing increasingly relies on deep learning to represent emotions, yet latent spaces often remain opaque, high-dimensional black boxes. This paper investigates whether Transformers' embeddings recover the geometric regularities of Russell's circumplex model. We unify two complementary experiments testing the hypothesis that, after training models on text and speech, their resulting latent spaces encode a topology consistent with valence-arousal and reproduce human-like neighborhood relations. Specifically, we evaluate deep representations extracted from Transformer-based text (RoBERTa) and speech (wav2vec 2.0) encoders, along with a multimodal Transformer fusion architecture, across naturalistic datasets like MSP-Podcast and controlled LLM-generated stimuli. Our analysis reveals that multimodal fusion of text and audio yields perfect topological alignment with Russell's primary emotion ordering. Furthermore, in a zero-shot setting using generic text embeddings, projected fine-grained emotion terms fall close to their established human-mapped coordinates. Our contribution is a novel, data-driven framework for validating emotion models, demonstrating that Russell's circumplex structure is intrinsically encoded in the embeddings of these modalities rather than being solely an artifact of human labeling, thereby bridging the gap between psychological theory and representation learning.

23.
Nature (Science) 2026-06-17

Mapping the neuronal building blocks of human language with language models

作者:

Humans can convey new and highly diverse information through language. This ability to form and combine words into elaborate phrases and sentences enables us to express inexhaustible meanings and is fundamental to human cognition1–5. However, understanding the microscopic cellular building blocks and cortical landscape that precisely underlie human language has remained a challenge. Here we used wide-scale single-neuronal recordings combined with natural language processing models to identify fine-grained linguistic representations across the human frontotemporal cortex during language production. We find that, whereas certain neurons represented the detailed grammatical relationships between words or their parts of speech, others tracked the sentences’ higher-order syntactic structure, their phrase transitions and sequence. Collectively, these neurons reliably captured the words’ syntactic and semantic properties but also dynamically incorporated their specific sentence contexts, therefore enabling them to encode information combinatorially and at highly granular levels of detail. We show how these cell populations were locally organized and how their microscale representations differed from that of their wider field potential patterns. We also show how these neurons were distributed broadly across the frontotemporal cortex, but how their ability to encode linguistic information was left-lateralized and varied between cortical regions. Together, these findings identify some of the most basic cellular building blocks by which linguistic information is encoded in humans and begin to define the cortical landscape of language at a combined micro (cellular), meso (local population) and macro (regional) scale. Wide-scale recordings reveal neurons in the human brain that encode fundamental components of language such as the grammatical relationships between words, their parts of speech and the higher-order syntactic structure of phrases and sentences.

24.
arXiv (CS.CV) 2026-06-16

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

25.
arXiv (CS.CL) 2026-06-17

MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT-style models, various routing regimes, zero-shot downstream reasoning benchmarks, and continual pre-training adaptation of DeepSeek model show that MoSE matches or improves standard MoE at full width and consistently shifts the compute-quality frontier toward lower inference FLOPs. The code can be found at: https://github.com/tnurbek/mose.