Academic Intelligence · Curated Daily

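Explore the frontiers of global academic research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate analysis briefs on cross-disciplinary literature.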

01.
arXiv (CS.LG) 2026-06-18

Stealthy World Model Manipulation via Data Poisoning

arXiv:2606.18697v1 Announce Type: new Abstract: Model-based learning agents use learned world models to predict future states, plan actions, and adapt to new environments. However, the process of updating world models from collected experience creates a training-time attack surface: adversarially poisoned fine-tuning trajectories can manipulate the learned dynamics and thereby corrupt downstream planning. In this paper, we propose SWAAP, the first two-stage data poisoning framework for learned world models. In the first stage, SWAAP identifies a harmful target world model that induces low-return behavior under planning while remaining close to clean dynamics, using first-order bilevel optimization enabled by a transition-gradient theorem. In the second stage, SWAAP realizes this target through stealth-constrained gradient matching, modifying only a limited fraction of fine-tuning transition targets so that the induced training gradients steer the victim model toward the adversarial target, while a prediction-error regularizer encourages the poisoned targets to remain close to the world model's natural approximation error. To assess attack stealthiness, we evaluate defenses and detectability across three stages of the poisoning pipeline: pre-training detection of poisoned transitions, robust training during fine-tuning, and test-time monitoring of the resulting world model. Across diverse continuous-control tasks, SWAAP causes substantial performance degradation while keeping poisoned transitions close to clean data and evading the evaluated non-adaptive residual/CUSUM/TRIM-style defenses. These results reveal a practical vulnerability in world-model adaptation pipelines and highlight the need for robustness methods that protect both world-model training data and learned dynamics.
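The abstract lists CUSUM-style monitoring among the defenses that the attack evades. As a rough illustration of why small, persistent perturbations are hard to catch, here is a minimal one-sided CUSUM sketch over prediction residuals. The function name, the `drift` and `threshold` values, and the sample residuals are all illustrative assumptions, not the configuration evaluated in the paper.

```python
def cusum_alarm(residuals, drift=0.5, threshold=4.0):
    """One-sided CUSUM: return the index of the first alarm, or None.

    drift and threshold are hypothetical tuning values chosen for this
    example; they are not taken from the paper.
    """
    s = 0.0
    for t, r in enumerate(residuals):
        # Accumulate only the part of the residual that exceeds the drift.
        s = max(0.0, s + r - drift)
        if s > threshold:
            return t
    return None


# Hypothetical residual magnitudes from a clean world model.
clean = [0.1, 0.3, 0.2, 0.4, 0.1, 0.3]

# A large shift accumulates quickly and trips the alarm.
large_shift = clean + [1.5] * 6

# A small shift just above the drift accumulates too slowly to alarm here.
small_shift = clean + [0.6] * 6

print(cusum_alarm(clean), cusum_alarm(large_shift), cusum_alarm(small_shift))
```

In this toy setting only the large shift is flagged, which mirrors the abstract's point that poisoned targets kept close to the model's natural approximation error can slip past a non-adaptive detector.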

02.
arXiv (CS.CV) 2026-06-16

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.

03.
arXiv (CS.AI) 2026-06-12

The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance

arXiv:2606.12415v1 Announce Type: cross Abstract: The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise dedicated to AI that the market has addressed in a fragmented manner. Data protection officers extend their remit beyond data protection law; privacy lawyers reposition themselves toward AI; compliance officers add AI chapters to their existing manuals. This paper argues that none of these adaptive responses adequately covers the professional space opened by the emerging global AI regulatory landscape, of which the EU Artificial Intelligence Act (Regulation (EU) 2024/1689) is the most comprehensive instance, alongside the Council of Europe Framework Convention on AI, the United States executive and sectoral framework, and analogous initiatives in the United Kingdom, Canada, Brazil, China, Japan, Singapore, and beyond. A distinct professional profile is required: the AI Legal Specialist, conceived as a jurist – understood broadly to encompass any professional with advanced legal training – operating at the intersection of legal interpretation and AI governance. The profile is juridically autonomous: it derives its existence from the structure of regulatory obligations generated wherever AI is subject to substantive regulation, rather than from any technical standard or the extension of adjacent roles. The paper provides a juridically grounded definition of the profile, argues for its autonomy from adjacent figures and international standards, proposes a reference competence architecture aligned with the European e-Competence Framework (e-CF, EN 16234-1) as a methodological choice, and articulates the conditions for its operational measurement through key performance indicators. The contribution is intended as a foundation for international standardization of the profile and as a reference for practice, curricula, and adoption across jurisdictions.

04.
arXiv (math.PR) 2026-06-18

A Unified Approach to Beta Moments, Combinatorial Identities, and Random Walks

arXiv:2605.05420v2 Announce Type: replace Abstract: The study of random walks has become increasingly popular across diverse disciplines such as statistics, mathematics, and quantum physics, where they are used to model paths consisting of successive random steps in a mathematical space. A fundamental quantity of interest is the probability that a simple symmetric random walk returns to the origin after 2n steps. In this paper, we develop a unified probabilistic approach that connects the return probabilities in arbitrary dimensions with moment representations. Using this framework, we provide probabilistic proofs of several combinatorial identities involving beta and gamma functions, and derive new combinatorial identities in general dimensions.
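To make the link between return probabilities and beta moments concrete, the sketch below checks a classical one-dimensional identity numerically: the probability that a simple symmetric walk on the integers is back at the origin after 2n steps, C(2n, n) / 4^n, equals the n-th moment of a Beta(1/2, 1/2) random variable. This is a standard textbook fact used here for illustration; it is not claimed to be the paper's own construction, and the function names are assumptions.

```python
import math


def return_probability_1d(n):
    """P(S_{2n} = 0) for the simple symmetric random walk on the integers."""
    return math.comb(2 * n, n) / 4 ** n


def beta_half_half_moment(n):
    """E[X^n] for X ~ Beta(1/2, 1/2), written with gamma functions."""
    return math.gamma(n + 0.5) / (math.sqrt(math.pi) * math.gamma(n + 1))


for n in range(6):
    p = return_probability_1d(n)
    m = beta_half_half_moment(n)
    # The two columns should agree up to floating-point error.
    print(n, p, m)
```

In two dimensions the return probability of the simple symmetric walk is the square of the one-dimensional value, which hints at how moment representations can extend to higher dimensions.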

05.
arXiv (CS.AI) 2026-06-19

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

arXiv:2606.19627v1 Announce Type: cross Abstract: The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

06.
arXiv (CS.AI) 2026-06-17

LLM-Powered Multi-Agent System for Automated Crypto Portfolio Management

arXiv:2501.00826v3 Announce Type: replace-cross Abstract: Cryptocurrency portfolio management requires the fusion of heterogeneous multi-modal signals, including structured price and on-chain time series, unstructured news text, and technical indicators, under high-volatility and real-time constraints. While deep learning approaches show predictive capability, their opacity limits practical adoption, and single large language model (LLM) agents struggle to process the breadth of modality-specific inputs needed for robust decision-making. We propose a multi-agent system (MAS) framework in which three modality-specialised agents, a Crypto Agent for market dynamics, a News Agent for weekly news sentiment, and a Trading Agent for signal fusion and portfolio execution, decompose the task across three communication architectures: hierarchical, collaborative, and debate. We evaluate four capability configurations: zero-shot, chain-of-thought (CoT), retrieval-augmented generation (RAG), and skill-augmented. In a 52-week backtest over calendar year 2025 across the top 15 L1 blockchain native cryptocurrencies by market capitalisation as of January 2025, the best configuration, Hierarchical (Skill), achieves a cumulative return of 133.52% and a Sharpe ratio of 1.502, outperforming single-agent variants, passive benchmarks, and deep learning baselines. An ablation study identifies the Crypto Agent as the most critical component, with its removal reducing cumulative return by 42.57 percentage points. A cross-model comparison further shows that MAS outperforms the single-agent baseline under GPT-4o, GPT-5, and Claude Sonnet 4.5, suggesting that the benefit of multi-agent coordination is model-agnostic. Unlike black-box deep learning models, every portfolio decision is traceable to explicit agent reasoning, offering an interpretable and effective approach to multi-modal cryptocurrency portfolio management.
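The headline metrics in this abstract, cumulative return and Sharpe ratio, can be computed from a series of weekly portfolio returns as sketched below. The abstract does not state the exact Sharpe convention, so the annualization by the square root of 52 and the zero risk-free rate are assumptions, and the input series is made up for illustration.

```python
import math
import statistics


def cumulative_return(weekly_returns):
    """Compound simple weekly returns into one cumulative return."""
    growth = 1.0
    for r in weekly_returns:
        growth *= 1.0 + r
    return growth - 1.0


def annualized_sharpe(weekly_returns, weekly_risk_free=0.0, periods_per_year=52):
    """Mean excess return over sample standard deviation, annualized.

    The sqrt(periods_per_year) scaling is a common convention and an
    assumption here; the paper may use a different one.
    """
    excess = [r - weekly_risk_free for r in weekly_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


# Hypothetical weekly returns, not data from the paper.
sample = [0.02, -0.01, 0.03, 0.00, 0.015]
print(cumulative_return(sample), annualized_sharpe(sample))
```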

07.
arXiv (CS.AI) 2026-06-12

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

arXiv:2606.13316v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

08.
arXiv (CS.CV) 2026-06-17

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

09.
arXiv (CS.LG) 2026-06-17

Price of metric universality in vector quantization is at most 0.11 bit

arXiv:2602.05790v2 Announce Type: replace-cross Abstract: Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is that of using a low-precision approximation $\widehat W$ in place of the true $W$ ("weight-only quantization"). Information theory demonstrates that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires a careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension in the case when $W$ is Gaussian. Such a universal codebook would be an ideal candidate for the low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly-optimal covering of a sphere simultaneously with respect to all Hilbert norms.

10.
arXiv (CS.LG) 2026-06-17

Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere

arXiv:2606.17603v1 Announce Type: new Abstract: In Self-Supervised Learning (SSL), preventing representation collapse by explicitly enforcing a uniform distribution on the unit hypersphere has proven to be effective. However, current frameworks typically rely on sliced statistical regularizers such as SIGReg (used in LeJEPA) and SUSReg (used in SPHERE-JEPA), which approximate this continuous objective via Monte Carlo sampling along random 1D directions. This stochasticity injects projection variance into the training gradients, destabilizing optimization, and hindering convergence. In this work, we first show that analytically integrating out these random projections natively yields a deterministic Maximum Mean Discrepancy (MMD), bypassing the variance of sliced methods. Motivated by this equivalence, we formulate full-dimensional objectives for MMD, Kernel Stein Discrepancy (KSD), and Kullback-Leibler (KL) divergence directly on the sphere to enforce a uniform distribution. To prevent spatial bias, we equip these tests with rotationally invariant kernels constructed via spectral theory, systematically evaluating two canonical families: smooth exponential decay (Heat) and strict frequency cutoff (Bandlimited) filters. Empirically, removing projection-induced noise results in more stable optimization, faster convergence, and consistent improvements over stochastic sliced regularizers on ImageNet and Galaxy10. Furthermore, we reveal that the choice of the statistical test shapes the geometry of the learned latent space: MMD and KSD favor locally clustered organization suitable for object-centric domains, whereas the continuous KDE-based KL divergence promotes fine-grained instance separation, yielding the strongest results on unclustered procedural texture retrieval.

11.
arXiv (CS.CL) 2026-06-12

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

12.
arXiv (CS.CL) 2026-06-15

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

Large language models (LLMs) are now widely used as automatic judges for open-ended instruction-following evaluation. This practice is convenient, scalable, and often more semantically aware than reference-based metrics, but it also introduces a new reliability question: does a judge evaluate the quality of an answer, or does it also react to the language in which the comparison is presented? We propose Judge-LS, a lightweight meta-evaluation protocol that transforms LLMBar response-pair items into English, Chinese, and Chinese-English language-switched variants. A reliable judge should preserve its preference under label-preserving language transformations and should not prefer a language when two answers are translation-equivalent. We evaluate four API-accessible judges on the full 419-item LLMBar benchmark, producing 13,408 successful pairwise judgments. Across models, Chinese and language-switched presentations induce 10.7–14.4% preference flips relative to English, and all judges achieve their highest accuracy in English. However, translation-equivalent tie probes do not reveal a systematic English preference: most probes are judged as ties, and non-tie decisions more often favor Chinese. We add confidence intervals, paired significance tests, and an automatic transformation audit with a sensitivity analysis that excludes mechanically flagged high-risk variants. The experiment requires no model training, uses only API calls, and is feasible on modest local hardware.
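The preference-flip statistic reported above can be read as the fraction of items on which a judge's choice under a transformed presentation differs from its choice on the English original. A minimal sketch of that computation follows; the function name and the label encoding ("A", "B") are assumptions for illustration, and the sample judgments are invented, not benchmark data.

```python
def flip_rate(english_prefs, variant_prefs):
    """Fraction of paired items whose preferred answer changed.

    Both lists hold one label per item (here "A" or "B"), aligned by
    item index. The label scheme is a hypothetical encoding.
    """
    if len(english_prefs) != len(variant_prefs):
        raise ValueError("preference lists must be aligned item by item")
    if not english_prefs:
        raise ValueError("at least one item is required")
    flips = sum(e != v for e, v in zip(english_prefs, variant_prefs))
    return flips / len(english_prefs)


# Invented judgments for eight items under two presentations.
english = ["A", "B", "A", "A", "B", "A", "B", "B"]
chinese = ["A", "B", "B", "A", "B", "A", "B", "A"]
print(flip_rate(english, chinese))
```

A judge that is invariant to label-preserving language transformations would score 0.0 on this measure.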

13.
arXiv (CS.AI) 2026-06-16

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

arXiv:2606.15753v1 Announce Type: new Abstract: Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and a causal disconnection between the reasoning trajectory and the final answer, with these problems further amplified in multi-view scenarios due to cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought, a structured reasoning paradigm that pins every reasoning step to visual evidence. Pinned Chain-of-Thought introduces the concept of a reasoning anchor, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct a high-quality reasoning dataset in the Pinned Chain-of-Thought format. We then train RoboPIN through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, RoboPIN with only 4B parameters consistently outperforms 7B-level open-source embodied models, achieving a 12% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that Pinned Chain-of-Thought improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.

14.
arXiv (CS.AI) 2026-06-12

When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics

arXiv:2601.06227v3 Announce Type: replace-cross Abstract: Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model's temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB with 84.7% reduction and takes 21 ms per inference on the device. These results support a practical smaller wins observation that a small model can match or exceed a large teacher for edge-based prognostics with proper supervision and selection. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
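Pareto-guided selection under joint error-cost objectives keeps only those candidate models that no other candidate beats on both error and cost at once. The sketch below shows that filtering step on made-up candidates and also rechecks the size-reduction arithmetic quoted in the abstract. The candidate names and their error and size values are hypothetical, and the paper's actual cost measure may differ.

```python
def pareto_front(candidates):
    """Return names of non-dominated candidates.

    candidates: list of (name, error, cost) tuples, lower is better for
    both objectives. A candidate is dominated if another one is no worse
    on both objectives and strictly better on at least one.
    """
    front = []
    for name, err, cost in candidates:
        dominated = any(
            e2 <= err and c2 <= cost and (e2 < err or c2 < cost)
            for _, e2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical student models: (name, validation error, size in kB).
students = [
    ("s1", 0.010, 50),
    ("s2", 0.007, 90),
    ("s3", 0.009, 120),  # worse than s2 on both error and size
    ("s4", 0.006, 300),
]
print(pareto_front(students))

# Size reduction reported in the abstract: 616 kB down to 94 kB.
reduction_percent = round(100 * (1 - 94 / 616), 1)
print(reduction_percent)
```

The front leaves a small set of trade-off points from which a deployable student can be chosen according to the device budget.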

15.
arXiv (CS.CV) 2026-06-12

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

16.
arXiv (CS.LG) 2026-06-18

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

arXiv:2606.18923v1 Announce Type: new Abstract: Programmability is a missing first-class interface in fixed-tensor neural networks: editing a relation, freezing a subgraph, auditing a local function, or changing the execution backend should be an operation on the neural program rather than ad-hoc parameter surgery. GrapNet studies this graph-as-network setting. The graph is the architecture and executable program, not an input data graph. Each compute node owns its next-layer child references and a trainable allocation vector aligned with those references; deleting a relation physically removes both the child reference and the corresponding allocation coordinate. Structural rules and execution policies live outside the node core, so the same child-owned graph can be grown, frozen, structurally edited, grouped into trainable family blocks, routed by attention over active relations, or lowered to dense snapshots after topology stabilizes. GrapNet composes with conventional modules through a vector-valued parent interface: dense layers, CNN encoders, ResNet feature extractors, attention blocks, and transformer representations can all feed one sensory GrapNode per coordinate. The evaluation is organized as a programmability stress suite rather than as a new replay benchmark. In a matched ten-seed Split Fashion-MNIST study, a plastic GrapNet+ER head reaches 63.16 percent seen-class accuracy versus 51.08 percent for a parameter-larger dense MLP+ER under the same seen-class loss and replay memory, with paired delta 12.08 points and p=1.3e-5. On Split CIFAR-10 with a frozen ImageNet ResNet-18 encoder, the same substrate improves the online head over MLP-256 by 3.81 points, with p=0.0026. These results support GrapNet as an editable neural graph substrate whose core value is structural programmability with faithful execution views.

17.
medRxiv (Medicine) 2026-06-15

Supporting people to access social security payments through the Special Rules for End of Life: a qualitative study of the perspectives of patients, carers and health care professionals

Background: People living with terminal illness face a double financial burden from additional costs and loss of earnings for themselves and their carers. Social security benefits are intended to help alleviate some of this financial pressure, and in the UK and other countries people are eligible for fast-tracked access to financial support via the Special Rules for End of Life. One in 3 people who are eligible miss out on this support, yet there is limited evidence on the reasons for this take-up deficit. Objectives: The aim of this study is to understand the barriers and facilitators to claiming benefits for terminally ill people from the perspectives of patients, carers, and health care professionals. Methods: This is a qualitative study combining i) focus groups with healthcare professionals recruited via professional networks and social media, and ii) interviews with patients and carers recruited in hospital and hospice settings. We analysed the data using Practical Thematic Analysis. Results: Fifty-five multidisciplinary healthcare professionals participated in 11 focus groups, and we interviewed 10 patients and carers. We constructed five descriptive themes to summarise the data: navigating priorities and uncertainty; positive impacts alongside a sense of shame and stigma; talking about money, difficulties and dividends; everybody's, yet nobody's, responsibility; and sticking points in the system. Conclusion: The themes reveal several challenges that may contribute to people not taking up this financial support. However, discussions about access to benefits were also seen as a core part of holistic care, a positive way to offer support and a gateway to other discussions about end-of-life care preferences and decisions. Recommendations for policy and practice include evaluating the adoption of a diagnostic rather than a prognostic eligibility criterion, integrating discussions about benefits into existing processes such as advance care planning, and improving education and support for clinicians.

18.
arXiv (CS.CV) 2026-06-16

Towards UAV Image Dehazing: A UAV Atmospheric Scattering Model, Benchmark, and Geometry-Aware Deep Unfolding Network

In UAV applications, haze significantly obscures distant details and weakens structural information, hindering the recovery of details. Current UAV scenarios still face two key challenges: (i) paired hazy/clean images from the real world are unobtainable, while the classical atmospheric scattering model is inadequate for modeling the spatially non-uniform haze in UAV imagery; (ii) existing dehazing methods struggle to remove the heavy haze accumulated in the upper regions of UAV images. To address these issues, we first propose a UAV Atmospheric Scattering Model (UASM), which explicitly incorporates flight altitude, viewing pitch, and extinction to characterize the non-uniform haze distribution in UAV imaging. Based on UASM, we develop a physics-driven dehazing framework, termed Geometry-aware Proximal Deep Unfolding Network (GP-DUN). Specifically, GP-DUN consists of three key modules: a Latent Geometry Estimator (LGE) that infers transmittance consistent with UAV imaging geometry, a Geometry-aware Gradient Descent Module (GeoGDM) that embeds UASM into the data-fidelity term and performs physics-consistent closed-form updates, and a Pooling-Expert Proximal Mapping Module (PE-PMM) that learns an implicit prior to restore textures and structures beyond the capability of explicit physical modeling. In addition, we construct UASM-HazeSet, which provides controllable paired synthetic data together with 2,285 real UAV haze images for testing. Extensive experiments show that GP-DUN consistently outperforms existing methods on both UASM-HazeSet and real UAV haze benchmarks.

19.
arXiv (CS.CV) 2026-06-17

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

20.
arXiv (CS.CV) 2026-06-17

Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.
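
The order-dependence argument above can be illustrated with a toy sketch. This is not the paper's refinement module: it is a single-head self-attention with identity projections and no positional encoding, and the names `attention` and `cells` are hypothetical. It shows only that unmasked (non-causal) attention gives each cell the same feature regardless of the order in which cells are listed, while a causal mask does not.

```python
import numpy as np

def attention(x, causal):
    """Toy single-head self-attention with identity Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    if causal:
        # Position i may attend only to positions <= i (autoregressive mask).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so each row max is finite.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 8))      # 5 hypothetical cell features, dim 8
perm = np.array([4, 3, 2, 1, 0])     # same cells, reversed generation order

# Non-causal: reordering the input just reorders the output rows.
noncausal_same = np.allclose(attention(cells[perm], causal=False),
                             attention(cells, causal=False)[perm])

# Causal: each cell's feature depends on which cells came before it.
causal_same = np.allclose(attention(cells[perm], causal=True),
                          attention(cells, causal=True)[perm])

print(noncausal_same, causal_same)
```

Because the unmasked variant computes all rows from the full set of cells at once, it is also naturally parallel, which is consistent with the parallel inference the abstract describes.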

21.
arXiv (CS.AI) 2026-06-11

LaQual: An Automated Framework for LLM App Quality Evaluation

arXiv:2508.18636v2 Announce Type: replace-cross Abstract: Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

22.
arXiv (quant-ph) 2026-06-11

Holographic Complexity, Extremality, and Cosmic Censorship

arXiv:2604.20170v2 Announce Type: replace-cross Abstract: We propose a holographic complexity origin for the third law of black-hole mechanics and weak cosmic censorship. In both complexity equals action and complexity equals volume prescriptions, the relative complexity between subextremal and extremal AdS black holes diverges logarithmically. For overcharged RN-AdS, explicit calculations in both prescriptions show that the near-singularity action terms are power-law divergent or finite, while the maximal-volume contribution is finite. Thus, the extremal-to-naked relative complexity also diverges, obstructing finite-time transitions.
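
For orientation, the two prescriptions named in the abstract are usually stated in the literature as below. These are the standard forms, not equations taken from this paper. As assumptions of this sketch, Σ ranges over bulk slices anchored on the boundary time slice, ℓ is a conventional length scale (often the AdS radius), and I_WdW is the gravitational action evaluated on the Wheeler-DeWitt patch.

```latex
\mathcal{C}_V = \max_{\Sigma}\,\frac{\mathrm{Vol}(\Sigma)}{G_N\,\ell}, \qquad \mathcal{C}_A = \frac{I_{\mathrm{WdW}}}{\pi\hbar}
```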

23.
arXiv (CS.CV) 2026-06-16

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require all predefined concepts to be scored at inference time and during manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

24.
medRxiv (Medicine) 2026-06-16

Language fMRI lateralization success and head motion in pediatric epilepsy patients with ADHD, and improvements based on fMRI task training

Introduction Language functional MRI (fMRI) is a valuable tool for presurgical planning in epilepsy. Functional MRI can be challenging in children, and head motion can compromise its utility. The candidacy of patients with ADHD for fMRI is sometimes questioned because of concerns about possible head motion. In 2020, we implemented an fMRI task training program, via telehealth and/or mock MRI. We aimed to determine whether training increased language lateralization success and/or reduced head motion in all patients, and in those with ADHD. We also aimed to determine whether patients with ADHD exhibited more head motion during fMRI than those without ADHD. Methods We retrospectively identified 223 epilepsy (85%) and other neurosurgery patients (241 scans including repeats) with language fMRI at Royal Children's Hospital, Melbourne, Australia, 2016-2024. There were 24 individuals with ADHD listed in the Electronic Medical Record, five of whom had diagnoses of both ADHD and autism; and nine with autism. Language lateralization success was determined by clinician description recorded as left/right/bilateral in the medical record. Ninety-nine patients received the training, including fMRI task practice. Head motion was quantified by maximum Framewise Displacement (FDmax; mm). Results ADHD was associated with lower language lateralization success. Training was associated with greater language lateralization success, across all patients, and in those with ADHD. Regarding ADHD and head motion, outliers in FDmax were seen in 5 young patients with ADHD. Data were trimmed to allow separate investigation of FDmax for the sample with and without extremes of head motion. In untrimmed data, FDmax was significantly higher in patients with ADHD than in those without. In trimmed data, FDmax was on average lower in patients with ADHD than in those without; however, this was not statistically supported.
Regarding training and head motion, across all patients, FDmax was significantly lower for scans with training than without. In patients with ADHD, FDmax was on average lower for scans with training; however, training was not associated with FDmax. Conclusions Language fMRI training was associated with higher language lateralization success, particularly in patients with ADHD. Training was associated with reduced head motion across all patients. Although some young patients with ADHD had substantial head motion, most in our sample did not move more than those without ADHD. We conclude that the training program increases success of language fMRI, and that an ADHD diagnosis should not be a contraindication to language fMRI.
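
As a rough illustration of the head-motion metric, the sketch below computes framewise displacement and its maximum from six rigid-body realignment parameters. The abstract does not say which FD definition was used; this assumes the common formulation in which absolute frame-to-frame changes in the three translations (mm) and three rotations (radians, converted to arc length on a 50 mm sphere) are summed. The function names, the radius default, and the example values are hypothetical.

```python
import numpy as np

def framewise_displacement(motion, radius_mm=50.0):
    """Per-frame FD from an (n_volumes, 6) array.

    Columns 0-2: translations in mm. Columns 3-5: rotations in radians.
    The 50 mm radius is an assumed convention, not taken from the study.
    """
    motion = np.asarray(motion, dtype=float)
    deltas = np.abs(np.diff(motion, axis=0))   # frame-to-frame change
    deltas[:, 3:] *= radius_mm                 # radians -> mm of arc length
    return deltas.sum(axis=1)

def fd_max(motion, radius_mm=50.0):
    """Maximum FD over a scan (FDmax, in mm)."""
    return float(framewise_displacement(motion, radius_mm).max())

# Three made-up volumes: a 0.1 mm shift, then a 0.2 mm shift plus 0.002 rad.
params = np.array([
    [0.0, 0.0, 0.0, 0.000, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.000, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.002, 0.0, 0.0],
])

print(round(fd_max(params), 3))  # 0.3
```

Because FDmax is a maximum, a single large movement determines the value for the whole scan, which is consistent with the abstract analysing the data both with and without the extreme cases.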

25.
arXiv (CS.CV) 2026-06-15

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).