Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

Learned Image Compression for Vision-Language-Action Models

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

02.
bioRxiv (Bioinfo) 2026-06-18

Population-associated molecular variation in histologically normal breast tissue is context-dependent and associated with distinct transcriptional states

Population-associated molecular variation in breast tissue may contribute to differences in tissue biology and disease susceptibility, yet the extent to which such variation is shaped by underlying tissue states remains unclear. Here, we performed RNA-seq and lipidomic profiling of histologically normal breast tissue samples from African American (AA) and Caucasian White (CW) individuals, followed by conceptual integration of the resulting transcriptomic and lipidomic patterns. Unsupervised analysis revealed two distinct baseline transcriptional states (G1 and G2) that defined the primary axis of molecular variation across the cohort and corresponded to epithelial-enriched (G1) and vascular-enriched (G2) tissue contexts as determined by cell-type deconvolution. Global comparisons between AA and CW samples showed minimal transcriptomic differences, with only a single gene reaching significance after multiple testing correction. However, when stratified by baseline tissue state, 191 genes were differentially expressed within G1, with coordinated upregulation of extracellular matrix organization and proliferative/cytoskeletal processes in AA samples. These patterns were consistently supported across multiple enrichment approaches. No comparable population-associated differences were observed within G2. Lipidomic analyses showed partial but non-significant trends consistent with transcriptomic structure, suggesting that lipid variation provides complementary but limited support for baseline molecular differences, likely reflecting constraints of bulk tissue composition. Together, these findings suggest that population-associated molecular differences in normal breast tissue are context-dependent and emerge within specific baseline transcriptional states, where distinct biological programs can coexist and be differentially modulated. These findings highlight the importance of tissue heterogeneity in shaping molecular variation and its potential relevance to disease-associated tissue states.

03.
arXiv (CS.AI) 2026-06-19

cAPM: Continual AI-Assisted Pace-Mapping with Active Learning

arXiv:2606.19373v1 Announce Type: cross Abstract: Ventricular tachycardia is a life-threatening rhythm disorder and a major cause of sudden cardiac death. Pace-mapping is a clinical procedure for identifying the intervention target during catheter ablation of VT. It requires clinicians to pace different sites in the ventricles and rapidly interpret the resulting electrocardiograms to determine where to pace next or whether a target site has been identified. Active learning AI models have been proposed to guide clinicians to the next pacing site, showing promise in reducing the number of pacing sites and improving the efficiency of pace-mapping. Existing methods require retraining each target without the ability to transfer knowledge across multiple VTs within the same patient or across patients. We introduce cAPM for continuous AI-assisted pace-mapping to capture and transfer knowledge accumulated from past pace-mapping data to reduce the number of pace-mapping data needed for future target VTs. This is made possible by a task-agnostic surrogate neural network that learns the mapping from pacing sites to 12-lead ECG morphology, an active-learning strategy that refines this surrogate model by selecting the most informative pacing site for each target, and a continual learning strategy to do so sequentially while retaining knowledge from prior targets. Evaluated on an in-silico testbed consisting of sequentially-presented localization tasks across different physiological conditions and ventricular geometries, cAPM with and without replay of past data samples achieved an 81% probability of localizing within clinical tolerance (5 mm accuracy) using 4.5 pace-mapping sites, compared to the state-of-the-art active-learning method achieving 38% probability using 13.7 pacing sites. These results provide a strong basis for preparing cAPM towards in-vivo preclinical and clinical studies where it can be used to guide pace-mapping.

04.
arXiv (CS.LG) 2026-06-12

Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems

arXiv:2606.12806v1 Announce Type: cross Abstract: Short-term load forecasting is essential for reliable energy management, but practical deployment on edge devices requires models that remain accurate under limited memory, finite measurement budgets, and hardware noise. This work proposes a hardware-efficient Quantum Reservoir Computing (QRC) framework for energy load forecasting, where a fixed quantum reservoir transforms temporal input windows into high-dimensional features and only a classical Elastic Net readout is trained. To reduce deployment cost, the trained readout is compressed using post-training fixed-point quantization at bit widths from 8 to 2 bits. The framework is evaluated on the Tetouan and Spain energy load datasets under exact statevector simulation, 512-shot finite sampling, and realistic hardware-noise models from IBM FakeTorino and IBM FakeMarrakesh. Results show that 6-bit readout precision preserves full-precision forecasting performance while reducing readout memory by 81.2%. Below this point, degradation becomes dataset dependent, with Tetouan showing stronger sensitivity and Spain degrading more gradually. Hardware-noise validation further shows that the trained readout transfers to noisy reservoir states without retraining. These findings support quantized QRC as a resource-aware forecasting approach for near-term quantum time-series applications.

05.
arXiv (CS.CV) 2026-06-15

$\mu_0$: A Scalable 3D Interaction-Trace World Model

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.

06.
arXiv (quant-ph) 2026-06-15

Quantum Horizon: An evaluation of quantum computing as a threat to Bitcoin and Ethereum

arXiv:2606.14484v1 Announce Type: new Abstract: Quantum computing poses a real, broad-based, but bounded and substantially mitigable threat to Bitcoin and Ethereum. We separate the two quantum algorithms that public discussion routinely conflates: Shor's algorithm breaks the elliptic-curve signatures (ECDSA over secp256k1, BLS over BLS12-381) that authorize spending, whereas Grover's algorithm does not meaningfully threaten proof-of-work mining, which is protected by a merely quadratic speedup, fault-tolerant per-operation costs, a square-root parallelization wall, and difficulty adjustment. Folding hardware scaling, the falling resource requirement, a fault-tolerance readiness lag, and expert surveys into a single Monte-Carlo forecast yields a wide, bimodal arrival distribution for a cryptographically relevant quantum computer: about a one-in-six chance by 2035, near 30% by 2040, and about 60% by 2050. Exposure is concentrated and mostly migratable: of Bitcoin's roughly six million quantum-exposed coins only about 2.3 million are irreducibly at risk, while 50 to 65% of Ether sits at key-revealed accounts that can adopt post-quantum signatures. A timely migration beats even an optimistic 2035 machine, so the binding constraint is governance, not technology. A survey of the top twenty cryptocurrencies finds none fully post-quantum. Reproducible models accompany every quantitative claim.

07.
arXiv (CS.AI) 2026-06-19

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

arXiv:2606.20041v1 Announce Type: cross Abstract: We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

08.
arXiv (CS.CV) 2026-06-18

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs.We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

09.
arXiv (CS.CV) 2026-06-16

HemExp: Clinically-Guided Latent Diffusion for Modeling Hematoma Expansion

Hematoma expansion (HE) after spontaneous intracerebral hemorrhage (ICH) is a major determinant of acute triage and treatment decisions in neurosurgical care. However, most existing methods provide either a binary expansion risk or a single follow-up volume, limiting uncertainty-aware decisions. We introduce HemExp, a clinically-guided latent diffusion model that generates patient-specific follow-up non-contrast CT images, along with segmentations of intraparenchymal and intraventricular hemorrhage. Generation is conditioned on baseline imaging, clinical variables, and an explicit expansion indicator, enabling controllable simulation of realistic clinical scenarios. HemExp uses a hemorrhage-aware multi-head variational autoencoder and models progression as the difference between baseline and follow-up latent representations with a conditional diffusion model. The model is trained on paired scans from 450 patients across multiple centers and evaluated on 107 patients from a held-out institution. HemExp produces spatial HE probability maps by generating multiple synthetic follow-up images per patient to estimate distributions of plausible follow-up hematoma volumes. Perturbing clinical inputs such as symptom-onset-to-imaging time or anticoagulant status shifts the predicted follow-up volume distribution. HemExp extends binary predictors and demonstrates robust estimation of clinically relevant outcomes in the imaging space, such as hematoma volume, intraventricular involvement, and mass effects. Overall, our results support controllable latent diffusion as a promising direction for uncertainty-aware modeling of early ICH progression.

10.
medRxiv (Medicine) 2026-06-18

Automated Airways Characterization and Assessment of Cystic Fibrosis from CT Imaging

Background Advancements in medical imaging have enabled non-invasive diagnosis and staging of cystic fibrosis (CF) using CT scans, revealing dilated airways, an increased number of visible airways, and airway generation splits in these patients. However, manual characterization of airways remains time-consuming and challenging due to the numerous structural changes, thereby limiting clinical feasibility. This study aims to develop an automated algorithm to characterize airways from segmented lung CT scans and apply this to a retrospective population. This approach reduces the time required to analyze images and obtain disease-staging results. Methods This framework consists of two stages. The first stage extracts and skeletonizes the airway tree from lung CTs, while the second stage measures lung features, including airway volumes, branch counts, generation splits, diameters, and cross-sectional areas. This permits comprehensive characterization for use in clinical assessment. Results The airways analysis was performed on 169 CT volumes ranging in age from 6 to 18 years of age, revealing substantial differences in detected airway branches, generation splits, and normalized airway volume between the control and CF groups. The framework also measures airway diameters and cross-sectional areas, revealing an increase in the number of small airways in cystic fibrosis patients, due to early bronchiectasis. These findings align with previous research and demonstrate the framework's ability to accurately quantify airway changes in patients with CF. Discussion The framework extracts entire airway trees, facilitating measurements of volume, branch count, diameters, and cross-sectional areas, which change with CF severity and/or treatment. However, partial lung atelectasis can limit the accuracy of airway detection in moderate-to-severe cases. Funding NIA U54 AG054345 and NIA R21 AG07857501

11.
arXiv (CS.AI) 2026-06-18

Why SWAVE May Not Be All You Need:A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

arXiv:2606.18324v1 Announce Type: cross Abstract: SWave is a complex-valued recurrent language model (169.26M parameters, D=384, L=16, T=2048) trained on FineWeb-Edu using 2xH100 NVL. It was designed around three founding premises: that representing language as complex waves rather than real-valued numbers enables richer information encoding; that a Cayley-parameterised unitary transition provides a mathematical guarantee against state decay or explosion; and that a hidden state which rotates rather than shrinks preserves signal integrity over arbitrarily long contexts. The core of SWave evolved substantially across three development phases. The Resonance Head was found to structurally admit imaginary-channel collapse as a global loss minimum (a failure mode we term cos-domination collapse) and was superseded by an untied head with independent real and imaginary embedding tables from the Phase-Associative Memory (PAM) architecture. This resolved the degenerate minimum and enabled stable 200,000-step training (best-step PPL 22.0 at step 89,861). ComplexNorm and the Wave Propagation Scan proved load-bearing throughout all three phases and were retained to the final architecture. ProtectGatedScan was reframed as a structural prior rather than a learned behaviour. The four multi-scale retention concepts showed no measurable improvement under controlled evaluation and were found non-load-bearing. The ComplexGatedUnit was superseded by a real-valued squared-ReLU channel mixer with fewer parameters. The auxiliary training objectives showed no benefit once structural constraints were resolved. The investigation yields a formal characterisation of cos-domination collapse, a parallel scan with a log-space backward pass for numerical stability, six transferable engineering principles for complex-valued recurrent training, and a plan-to-code traceability methodology for catching structural divergences that conventional test suites miss.

12.
arXiv (CS.AI) 2026-06-17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

作者:

arXiv:2606.17929v1 Announce Type: new Abstract: Computer-using agents drive real software through the screen – clicking and typing – but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program-states that check the screen, transitions that act-and on later runs replays it directly instead of invoking the agent 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task-catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.

13.
arXiv (CS.AI) 2026-06-18

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

arXiv:2506.14126v2 Announce Type: replace-cross Abstract: Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.

14.
arXiv (CS.CL) 2026-06-16

Generative AI and the future of scientometrics: current topics and future questions

In this paper, we contribute to the debate on generative artificial intelligence (GenAI) in scientometrics. We argue that moving from a trial-and-error approach to an explainable and actionable use requires a principled understanding of strengths and weaknesses of GenAI as compared with other techniques and with human judgment. To this end, we introduce a conceptual framework based on the distinction between the semantic dimensions of texts, i.e. the meanings attributed to words, and their pragmatic dimension, i.e. their embedding within communicative situations. We leverage this framework to interpret the results of applications of GenAI in scientometrics and to provide guidance to users. Specifically, we conclude that key parameters to be considered are the nature of the task, the level of granularity of the analysis and whether the goal was descriptive, inferential or evaluative. These parameters lead to different strategies for using GenAI and human-machine integration. Finally, we suggest that, by generating large amounts of scientific language, GenAI might affect textual characteristics used to measure science, such as authors, words, and references. We argue that careful empirical work and theoretical reflection will be essential to remain capable of interpreting the evolving patterns of knowledge production in the age of AI.

15.
arXiv (CS.CV) 2026-06-18

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

16.
arXiv (CS.AI) 2026-06-18

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

arXiv:2606.18265v1 Announce Type: cross Abstract: As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

17.
arXiv (CS.CV) 2026-06-18

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

18.
arXiv (CS.CV) 2026-06-12

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.

19.
arXiv (CS.CL) 2026-06-12

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.

20.
arXiv (CS.CV) 2026-06-11

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

21.
arXiv (CS.AI) 2026-06-16

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

arXiv:2606.15186v1 Announce Type: cross Abstract: Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/

22.
arXiv (CS.AI) 2026-06-12

The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale

作者:

arXiv:2606.12835v1 Announce Type: cross Abstract: The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems of reasoning, communication, and action. This paper develops the vision of the Internet of Agentic AI (IoAI): an open ecosystem in which heterogeneous agents discover one another, negotiate responsibilities, exchange context, invoke tools, and execute workflows across cloud, edge, device, organizational, and cyber-physical environments. We synthesize foundations from single-agent agentic AI, multi-agent systems, distributed computing, communication networks, game theory, and security engineering to characterize the architectures and mechanisms required for scalable agent ecosystems. The paper examines agent deployment models, workflow lifecycles, communication protocols, interoperability layers, resource-management challenges, and trust architectures, with case studies in adaptive manufacturing and distributed operational coordination. The resulting framework highlights the central research challenges of controlled emergence, semantic interoperability, secure identity, incentive-compatible coordination, resource-aware orchestration, and governance for large-scale networks of autonomous agents.

23.
arXiv (CS.CV) 2026-06-19

How Fragile Are Training-Free AI-Generated Image Detectors? A Controlled Audit of Score Direction, Preprocessing, and Compression

Training-free detectors of AI-generated images promise generator-agnostic deployment without classifier training, yet their reported numbers are rarely compared under a single controlled protocol. We audit two representative training-free scores – an autoencoder-reconstruction score (AEROBLADE-style) and a noise-perturbation feature-similarity score (RIGID-style) – plus a naive feature-kNN control, on a common 1,500-image GenImage-derived benchmark spanning seven generators and JPEG compression at quality 70 and 50. The audit yields three cautionary findings. (i) Implementation details masquerade as method differences: replacing the LPIPS backbone (AlexNet -> VGG-16) changes overall AUROC by +0.085, and switching between resize-to-512 and native-resolution preprocessing flips per-generator conclusions by up to 0.38 AUROC. (ii) Score direction is not a property of the method but of its hyperparameters: the RIGID-style score is inverted (AUROC < 0.5) on SD1.5 and Wukong at noise level sigma=0.05, recovers to >0.5 for every generator at sigma=0.01, and collapses to 0.15 at sigma=0.3. (iii) Dataset format bias inflates robustness claims: without unified re-encoding, AUROC under JPEG-50 exceeds the clean condition for the AlexNet-backbone reconstruction score; after bias correction the residual anomaly localizes to a single generator (BigGAN). The audited scores have complementary per-generator failure sets, but naive z-score fusion does not beat the best single score, indicating that exploiting complementarity requires direction-aware combination.

24.
arXiv (CS.CL) 2026-06-16

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.

25.
arXiv (CS.LG) 2026-06-15

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

arXiv:2606.14560v1 Announce Type: cross Abstract: Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded $p$-th central moments, $p \in (1,2]$. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for $m \times n$ matrices, Muon finds an $\varepsilon$-stationary point in nuclear norm within $\mathcal{O}\left(\min\{m, n\} \frac{\Delta_1 L}{\varepsilon^2} \left(\frac \sigma \varepsilon \right)^{\frac p {p-1}}\right)$ samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.