Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-11

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/

02.
arXiv (CS.CL) 2026-06-24

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

Self-supervised and supervised speech models are increasingly used to investigate which linguistic information their internal representations encode, and at what level of abstraction they encode it. One underexplored phenomenon is consonant cluster reduction (CCR) in African American English (AAE), a widespread phonological process and a source of automatic speech recognition (ASR) disparity. To examine how CCR is represented, we conduct speaker-independent layer-wise probing of wav2vec2-base and Whisper-small using two tasks: segmental reduction detection and segmental restoration of underlying cluster identity. Both models distinguish reduced and canonical forms with high accuracy. Crucially, reduced segments retain cues to their underlying stops, indicating that CCR is encoded as structured gradient phonological variation rather than simple segmental deletion. These results demonstrate structured phonological encoding of AAE CCR patterns in modern speech models.

03.
arXiv (CS.AI) 2026-06-16

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

arXiv:2605.21312v2 Announce Type: replace-cross Abstract: Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.

04.
arXiv (math.PR) 2026-06-25

Three non-Hermitian random matrix universality classes of complex edge statistics: Spacing ratios and distributions

arXiv:2603.28457v2 Announce Type: replace-cross Abstract: The conjectured three generic local bulk statistics amongst all non-Hermitian random matrix symmetry classes have recently been extended to three generic local edge statistics. We study analytically and numerically complex spacing ratios and nearest-neighbour (NN) spacing distributions that characterise such local statistics. We choose the three simplest representatives of these universality classes, given by the Gaussian ensembles of complex Ginibre, complex symmetric and complex self-dual matrices, denoted by class A, AI$^\dag$ and AII$^\dag$. In the first part, we analytically study the complex spacing ratio in class A, at finite matrix size $N$. Introducing a conditional point process, we simplify existing expressions and show why an uncontrolled approximation introduced earlier converges well in the large-$N$ limit in the bulk. When specifying to the elliptic Ginibre ensemble, we present a parameter-dependent $N=3$ surmise for the complex spacing ratio, interpolating to that of the Gaussian unitary ensemble (GUE), where such a surmise is very accurate. In the second numerical part, we compare complex spacing ratios, its moments, and NN spacing distributions for all three ensembles with that of uncorrelated points, the two-dimensional (2D) Poisson process, both in the bulk and at the edge. The varying degree of repulsion within these different edge universality classes can be well understood in terms of an effective 2D Coulomb gas description, at different values of inverse temperature $\beta$. We find indications that the complex spacing ratio does not fully unfold the local statistics at the edge. Finally we verify that for small argument, in all three symmetry classes the NN spacing distributions in the bulk and at the edge are consistent with the universal cubic repulsion.

05.
arXiv (CS.AI) 2026-06-18

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

arXiv:2606.18874v1 Announce Type: new Abstract: AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

06.
arXiv (CS.LG) 2026-06-11

Robustness of Mixtures of Experts to Feature Noise

arXiv:2601.14792v2 Announce Type: replace Abstract: Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.

07.
arXiv (CS.CL) 2026-06-18

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

作者:

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

08.
arXiv (quant-ph) 2026-06-16

Excited-State Quantum Chemistry on Qumode-Based Processors via Variational Quantum Deflation

arXiv:2604.13457v3 Announce Type: replace Abstract: Variational quantum algorithms on bosonic quantum processors are an emerging paradigm for quantum chemistry calculations, exploiting the natural alignment between molecular structure and harmonic oscillator-based hardware. We introduce the qumode-based variational quantum deflation framework (QumVQD) for finding both electronic and vibrational excited state energies on qumode-based architectures. We validate the approach through electronic structure calculations on H$_{2}$ and linear H$_{4}$, where we introduce Hamming-weight filtering of the Fock basis to enforce particle number conservation and eliminate spurious eigenstates by reducing the required Hilbert space, which reduces the required number of qumodes in turn. We achieve agreement with full configuration interaction (FCI) using the STO-3G basis set within the chemical accuracy threshold at most points along the potential energy surfaces. Extending to the vibrational structure, we combine QumVQD with an existing Hamiltonian fragmentation approach based on Cartan subalgebra, allowing us to compute the vibrational eigenenergies of CO$_{2}$ and H$_{2}$S to spectroscopic accuracy with per-fragment circuits that scale as $O(N)$ in single-qumode gates and $O(N^2)$ in beam-splitter gates for $N$ qumodes. For the case of CO$_{2}$, we get total gate counts more than an order of magnitude smaller than those reported for qubit-based vibrational algorithms at this system size. These results demonstrate that bosonic quantum devices are a viable platform for excited-state quantum chemistry, particularly for vibrational problems where qubit-based methods incur substantial boson-to-qubit mapping overhead.

09.
arXiv (CS.CV) 2026-06-15

FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation

With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to the realm of text-driven image-to-image (I2I) translation, where a source image acts as visual guidance to the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework adapting off-the-shelf T2I diffusion model into the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting low-frequency band, mid-frequency band, and high-frequency band of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++ which improves upon FBSDiff mainly in three aspects: (1) accelerate inference speed by a large margin (8.9$\times$ speedup in inference) with refined model architecture; (2) improve the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; (3) extend model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify superiority of FBSDiff++ in I2I translation visual quality, efficiency, versatility, and controllability compared to related advanced approaches.

10.
arXiv (CS.LG) 2026-06-16

Using Reinforcement Learning to Optimize the Global and Local Crossing Number

arXiv:2509.06108v2 Announce Type: replace-cross Abstract: Graph drawing concerns the algorithmic visualization of graphs. A good drawing of a graph is easy to read and facilitates solving tasks on the graph. Several properties have been identified to occur in good drawings of graphs. Such properties include a low number of crossings, large angles between edges, short edges, and depicting symmetries. Many of these properties are explicitly measurable metrics. This brings us to the insight that graph drawing can be seen as a game. In this paper, we study a single-player optimization game in which the player iteratively moves vertices of a straight-line graph drawing to reduce edge crossings. This game arose naturally from the automatic track of the Graph Drawing Challenge, where solutions are obtained by repeatedly performing local vertex movements. We formalize this process as a game with full information and investigate whether reinforcement learning can discover effective strategies for playing it. Our reinforcement-learning agent observes the local geometric and structural context of a vertex and selects a movement direction with the goal of reducing either the global or the local crossing number, that is, the total number of crossings or the maximum number of crossings per edge. We compare the resulting strategies to existing methods and established crossing-minimization heuristics on standard benchmark graphs. While our approach does not out-compete state-of-the-art methods for minimizing the global crossing number, it is competitive and often superior for minimizing the local crossing number.

11.
arXiv (CS.CV) 2026-06-12

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

12.
arXiv (CS.CV) 2026-06-19

FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.

13.
arXiv (CS.CV) 2026-06-18

Intrinsic 4D Gaussian Segmentation from Scene Cues

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.

14.
arXiv (CS.CL) 2026-06-16

A Mechanistic Understanding of Pronoun Fidelity in LLMs

Faithful and robust pronoun use is important for fair and coherent generations, yet large language models largely fail when multiple referents use different pronouns. To study the interplay of reasoning, repetition, and bias in this task, prior work relies exclusively on behavioural approaches, which may not reflect a model's internal workings. Therefore, we provide a mechanistic, model-internal perspective on pronoun fidelity, testing whether three mechanisms – group entity binding (G), recency bias (R), and stereotypical bias (S) – are causally implemented across several SOTA language models. Using Boundless Distributed Alignment Search, we find all three coexist as causal subspaces distributed across network depth. No single mechanism fully explains model behaviour, but a combination of the three consistently accounts for 91-99.5%. An attention head analysis further reveals two competing copying routes; group binding and stereotype share a localized concept-level route that retrieves a bound occupation-pronoun unit, while recency uses a distributed token-level route that repeats surface forms. In sum, pronoun fidelity arises from competition between simultaneously active causal subspaces.

15.
arXiv (quant-ph) 2026-06-15

Note on the local calculation of decoherence of quantum superposition in the static black holes

arXiv:2606.14178v1 Announce Type: cross Abstract: We investigate the decoherence of a quantum spatial superposition of a static particle in Schwarzschild and Reissner-Nordstr\"{o}m black holes. By treating the particle as a localized classical source coupled to a quantum scalar field, we reformulate the decoherence process in the Danielson-Satishchandran-Wald (DSW) gedankenexperiment through coherent state generation and derive the local expression for the decoherence functional in terms of the Wightman function. In the long-time limit, the decoherence rate is shown to be characterized by the low-frequency behavior of the Wightman function. We then employ the asymptotic matching method to calculate the analytical expressions of the Wightman functions in the Boulware, Unruh, and Hartle-Hawking vacua. We show that the decoherence behavior depends on the quantum state of the environmental field. While the Boulware vacuum gives vanishing decoherence for a static superposition, the thermal effects associated with Hawking radiation in the Unruh and Hartle-Hawking vacua can induce nonvanishing decoherence.

16.
arXiv (CS.CV) 2026-06-11

Exploring Adaptive Masked Reconstruction for Self-Supervised Skeleton-Based Action Recognition

Recently, masked skeleton reconstruction models have emerged as strong action representation learners, driving significant progress in self-supervised skeleton-based action recognition. However, existing state-of-the-art methods must predict an exceedingly large number of spatiotemporal patches, significantly prolonging training time. Besides, by treating all spatiotemporal regions equally during reconstruction, these models are distracted from learning the critical motion patterns that underlie action semantics. To address these challenges, we propose Adaptive Masked Reconstruction (AMR), a faster and stronger pre-training framework. We first decouple the decoder from the encoder, enabling flexible prediction of larger spatiotemporal patches and dramatically reducing reconstruction complexity. Given that larger patches contain more complex information, which is challenging to predict and consequently degrades performance, we accordingly introduce an adaptive guidance module. This module identifies regions of high motion informativeness, guiding the model to focus on the most discriminative parts of each patch and alleviating reconstruction difficulty. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that AMR not only accelerates pre-training substantially but also improves downstream recognition accuracy, surpassing current state-of-the-art approaches.

17.
Nature (Science) 2026-06-17

The EU needs to back its ambition to end animal testing with cash

作者: 未知作者

The European Union has declared that it wants to stop using animals in chemical safety testing. Its goal will need a timeline and a serious funding commitment. The European Union has declared that it wants to stop using animals in chemical safety testing. Its goal will need a timeline and a serious funding commitment.

18.
arXiv (CS.AI) 2026-06-16

How to Detect and Measure the AI Dangers to Democracy

arXiv:2606.16054v1 Announce Type: cross Abstract: Research on artificial intelligence and democracy has grown quickly over the last decade. A shared conclusion in this literature is that AI does not create new democratic problems so much as it makes old ones worse. We now see this across information ecosystems, in elections, and in public administration. However, despite growing evidence, we lack a clear way to prioritize risks in this area, compare them across domains, and identify where democratic control is most likely to break down. So, our problem is: How can we systematize the problems that AI systems pose to democratic processes? This paper argues that principal agent theory may fit the task. In many phases of democratic systems, principals delegate key functions to AI systems and their providers without really being able to monitor how these systems operate or the outputs they produce. Treating AI as a delegation problem helps identify accountability gaps and other governance failures. Most importantly, as we shall illustrate, it provides metrics for empirical assessments of AI impact on democracy. As a second analytical element, we draw on the NIST AI Risk Management Framework and its seven characteristics of trustworthy AI, which supply substantive criteria for evaluating delegated tasks. Operationalized across the three domains through measurable indicators and domain specific trustworthiness criteria, we propose an analytical framework that centers on institutional assessability as the central condition for democratic control over AI. However, we stress that how severe a harm is, and how much risk is acceptable, are evaluative judgments that current methodologies neither acknowledge nor operationalize. This becomes acute when such evaluative judgments are (silently) delegated to private vendors. We identify this as a strong limitation left for future work.

19.
arXiv (quant-ph) 2026-06-16

Ultrastrongly coupled open systems and fine grained time

arXiv:2606.16634v1 Announce Type: new Abstract: We study the dynamics of a d-level quantum system coupled to a bosonic reservoir when the coupling constant is large. It is known that in the limit of infinite coupling strength, the system undergoes an instantaneous nonselective measurement, resulting in the immediate decoherence in the measurement basis, followed by a unitary Zeno dynamics. Here we resolve this dynamical process by introducing a fine grained scaling regime of short times proportional to the inverse coupling. We provide a rigorous derivation of the open system dynamics in this regime of ultrastrong coupling and demonstrate how decoherence unfolds continuously in the new time scale. We show that Markovian dynamics which are not given by semigroups arise naturally, in contrast to what happens in the weak coupling theory.

20.
arXiv (quant-ph) 2026-06-15

Resurgence of the Thermal Transition between Bounce and Sphaleron

arXiv:2606.13778v1 Announce Type: cross Abstract: We study the thermal transition between the bounce and the sphaleron in quantum mechanics with a metastable vacuum from the viewpoint of Borel resurgence. For two models representing a second-order and a first-order transition, we compute the perturbative expansion of the thermal free energy to high orders and extract the leading Borel singularity data $(A,b,S)$ as functions of temperature. The Borel singularity location $A$ reproduces the on-shell action of the dominant saddle on both sides of the transition, joining smoothly in the second-order case and developing a kink in the first-order case. The characteristic exponent $b$ jumps between $0$ and $1/2$ across the transition, counting the zero modes of the corresponding saddle. The Stokes constant $S$ matches the one-loop determinant around the saddle. The perturbative expansion around the false vacuum thus determines the transition temperature, the order of the transition, and the decay rate including the one-loop prefactor without relying on semiclassical inputs.

21.
arXiv (CS.CV) 2026-06-12

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.

22.
arXiv (CS.CL) 2026-06-16

Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

The Rapid Response (RR) framework, deployed in production systems, including Anthropic's ASL-3 safeguards, continuously improves jailbreak-detection classifiers. When new jailbreaks emerge that bypass these classifiers, Rapid Response generates synthetic variants for training, helping the model generalize from the new attacks and quickly adapt. We reveal that prompt injection can infiltrate this pipeline to deliver poisoned samples into the classifier's training set, enabling two attack objectives: (I) targeted poisoning attacks that create false positives on harmless samples by categorizing them as a jailbreak, with a specific desired feature (e.g., certain formatting, subject, or keyword), (II) concept-based backdoor attacks that induce false negatives on jailbreak inputs, generalizing even to jailbreaks from attack strategies the defender explicitly trained against, when the backdoor trigger is present. Importantly, our threat model restricts adversaries to modifying only jailbreak samples (not benign data or labels), a constraint unexplored by prior work that makes the second objective particularly challenging. We address this with Omission Attack, which exploits a new phenomenon: when training on concept-absent unsafe samples, the classifier misassociates that concept's presence with the safe label. Both attacks cause substantial and in some cases near-complete label flipping at only a 1% poisoning rate, achieving up to 100% false positive rates and up to 96% false negative rates.

23.
arXiv (CS.AI) 2026-06-12

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

arXiv:2606.12767v1 Announce Type: new Abstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question generation strategies affect dataset quality for procedural and multi-hop reasoning. We compare three strategies: strict generation from Task-Method-Knowledge (TMK) models, transcript-first generation with post-hoc TMK filtering, and TMK-aware generation that combines transcripts with structured guidance. To evaluate generated items, we introduce a grounding validation framework based on closed-set evidence units extracted from TMK models. The framework measures whether answers are supported by the underlying representation, whether questions are self-contained, and whether they target multi-hop procedural reasoning. Across 23 instructional topics and 690 generated question-answer pairs, strict TMK generation achieves the strongest overall quality, with 96.5% grounded questions and 92.6% usable questions. Transcript-first generation produces more learner-like questions but more context-dependent or weakly grounded items, while TMK-aware generation yields high raw multi-hop coverage but lower grounding. These results show that procedural richness and natural phrasing do not guarantee representational grounding, motivating explicit representation-aware validation for evaluation datasets in AI-supported learning.

24.
arXiv (quant-ph) 2026-06-12

Cayley's First Hyperdeterminant is an Entanglement Measure

arXiv:2504.15511v2 Announce Type: replace Abstract: Previously, it was shown that both the concurrence and $n$-tangle on $2n$-qubit pure quantum states can be expressed in terms of Cayley's first hyperdeterminant [dobes2024qubits], indicating that Cayley's first hyperdeterminant, denoted $\mathrm{hdet}$, captures some aspects of a state's $2n$-way entanglement. In this paper, we rigorously prove that on both pure and mixed states, $|\mathrm{hdet}|^{2/d}$ is identically zero on separable states, is an LU invariant, and is non-increasing on average under LOCC, thus demonstrating that $|\mathrm{hdet}|^{d/2}$ is a physically meaningful and legitimate entanglement measure. Moreover, we discuss a few key examples to illustrate the particular type of entanglement Cayley's first hyperdeterminant is detecting: genuine full $d$-level GHZ-type entanglement across all $2n$ parties. Combined, this establishes Cayley's first hyperdeterminant (or $|\mathrm{hdet}|^{2/d}$ to be precise), as a genuine, physically significant generalization of the concurrence and the $n$-tangle to $2n$-qudit states.

25.
arXiv (CS.CV) 2026-06-18

Physics-IQ Verified

Video generative models ( VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6\% of all samples and improves over 34.8\% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark