Academic Intelligence · Curated Daily

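Explore the Frontiers of Global Research

AcademicHub brings together real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.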

01.
arXiv (CS.CV) 2026-06-16

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

02.
arXiv (CS.AI) 2026-06-12

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

arXiv:2606.13621v1 Announce Type: new Abstract: Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery – specification compilation, product game construction, attractor computation, and winning-region extraction – is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict – a formal certificate that a topology-specification pair is or is not defensible – with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
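
The attractor computation and winning-region extraction named in this abstract can be illustrated on a toy turn-based safety game. The following is a minimal sketch and not the paper's implementation: the function names, the five-state graph, and the `owner` labels are all hypothetical, and it assumes that every state has at least one successor, that edge lists contain no duplicates, and that the attacker's edges have already been restricted to the moves its specification allows.

```python
from collections import deque


def attacker_attractor(nodes, edges, owner, unsafe):
    """States from which the attacker can force the play into `unsafe`.

    nodes:  iterable of states
    edges:  dict state -> list of successor states (no duplicates assumed)
    owner:  dict state -> "attacker" or "defender" (who moves there)
    unsafe: set of states the defender must avoid
    """
    preds = {n: [] for n in nodes}
    for src, succs in edges.items():
        for dst in succs:
            preds[dst].append(src)

    # For defender states: number of successors not yet known to be losing.
    remaining = {n: len(edges[n]) for n in nodes}

    attractor = set(unsafe)
    queue = deque(unsafe)
    while queue:
        state = queue.popleft()
        for p in preds[state]:
            if p in attractor:
                continue
            if owner[p] == "attacker":
                # One losing successor is enough for the attacker.
                attractor.add(p)
                queue.append(p)
            else:
                # The defender loses only when every successor is losing.
                remaining[p] -= 1
                if remaining[p] == 0:
                    attractor.add(p)
                    queue.append(p)
    return attractor


def synthesize_shield(nodes, edges, owner, unsafe):
    """Return (winning region, shield) for the defender."""
    losing = attacker_attractor(nodes, edges, owner, unsafe)
    winning = set(nodes) - losing
    # The shield keeps only defender moves that stay in the winning region.
    shield = {
        s: [t for t in edges[s] if t in winning]
        for s in winning
        if owner[s] == "defender"
    }
    return winning, shield


# Hypothetical toy game: "bad" is the only unsafe state.
nodes = ["a", "b", "c", "d", "bad"]
edges = {"a": ["b", "c"], "b": ["bad", "a"], "c": ["c", "d"],
         "d": ["bad"], "bad": ["bad"]}
owner = {"a": "defender", "b": "attacker", "c": "defender",
         "d": "attacker", "bad": "attacker"}

winning, shield = synthesize_shield(nodes, edges, owner, {"bad"})
defensible_from_a = "a" in winning  # a yes/no verdict for initial state "a"
```

In this toy game the winning region is the set of states outside the attractor, and the verdict for a given initial state is simply whether that state lies in the winning region.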

03.
arXiv (CS.CV) 2026-06-16

An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

In recent years, there have been notable advancements in the area of instruction-based image editing (IIE), which focuses on the automatic alteration of input images using a model. Nevertheless, assessing the effectiveness of these editing models poses a considerable challenge due to the intricate nature of instructions and the wide variety of edits. To tackle this problem, one urgent task in this domain is the development of a robust evaluation framework that can precisely gauge the quality of editing outcomes and offer valuable benchmarks to guide future improvements. To address this challenge, we present a comprehensive evaluation benchmark named I2EBench2.0, designed for single-round and multi-round assessment of IIE models. I2EBench2.0 has four key features: 1) Evaluation Across Single and Multi-rounds: I2EBench2.0 simultaneously evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 encompasses a broad range of criteria, evaluating both high-level and low-level aspects of each IIE model. Specifically, it incorporates 16 dimensions for single-round evaluations and 7 for multi-round evaluations. 3) Alignment with Human Judgment: To ensure our benchmark aligns with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide critical insights aimed at directing future research in this area. We tested eight recently developed IIE models using I2EBench2.0 and derived academic insights through meticulous comparison and analysis. The related code, dataset, and images generated by all IIE models are available on GitHub: https://github.com/cocoshe/I2EBench.

04.
arXiv (CS.CV) 2026-06-16

CT-VDETR: Semi-supervised 3D Trauma Detection in Computed Tomography (CT) scans using Dense Vertex Relative Position Encoding

Accurate detection and localization of traumatic injuries in abdominal CT remain challenging because voxel-level annotations are limited and expensive to obtain. We present a label-efficient framework for 3D abdominal trauma detection that combines self-supervised pretraining with semi-supervised transformer-based detection. First, we use Masked Image Modeling (MIM) on 1098 CT volumes to pretrain a 3D U-Net encoder for anatomical representation learning. Next, we adapt V-DETR to dense volumetric CT through a feature adapter that converts the encoder feature grid into a compact token sequence for transformer decoding. The pretrained encoder is then integrated with V-DETR and 3D Vertex Relative Position Encoding (3D V-RPE) to improve the localization of irregularly shaped injuries. Finally, semi-supervised teacher-student consistency regularization leverages 2,000 additional unlabeled volumes during detector training. To the best of our knowledge, this is the first application of a 3D DETR-style detector to the RSNA abdominal trauma detection task. On this benchmark, the proposed method achieves 31.33% test mAP@0.50 using only 78 labeled training volumes, corresponding to a 1.53x improvement over supervised-only training. These results show that combining medical-domain pretraining with semi-supervised learning is an effective strategy for label-scarce 3D medical detection.

05.
arXiv (quant-ph) 2026-06-19

Applications of quantum annealing to magnetic dipole hyperfine structure constants: First results beyond energies for atoms

arXiv:2606.20166v1 Announce Type: new Abstract: We report the first results of the magnetic dipole hyperfine structure (HFS) constants of neutral $\mathrm{Li}$, Li-like $\mathrm{Be}$, neutral $\mathrm{Na}$, and Na-like $\mathrm{Mg}$ using a modified version of the Quantum Annealer Eigensolver (QAE) algorithm on D-Wave's quantum hardware. The results are benchmarked against relativistic configuration interaction with multiconfiguration Dirac Hartree-Fock (MCDHF) calculations using the General-purpose Relativistic Atomic Structure Package (GRASP), and simulated annealing. In our modified QAE, a zooming-and-sigma-annealing approach with a floating-point encoding scheme is adopted to estimate the ground-state eigenvalue and eigenvector of the relativistic Dirac-Coulomb Hamiltonian matrices ($H_{\mathrm{DC}}$) constructed from 11 or fewer configuration state functions (CSFs). For calculations with extended correlation orbital sets, we applied a CSF truncation scheme, retaining only CSFs (up to 12) that make significant contributions to the ground-state wavefunction. Our modified QAE precision is kept limited to three decimal places (up to 10 qubits). Hardware demonstrations on the D-Wave quantum processing unit (QPU) yielded results that were completely consistent with GRASP (at the chosen precision) in determining the magnetic dipole HFS constants, with accuracy varying across systems and $H_{\mathrm{DC}}$ matrix dimensions.

06.
arXiv (CS.LG) 2026-06-17

Beyond IGO-Flow: Toward Convergence Analysis of IGO in Continuous Spaces

arXiv:2606.17523v1 Announce Type: cross Abstract: Information-Geometric Optimization (IGO) provides a unified framework for black-box optimization by interpreting the adaptation of a search distribution as a natural gradient update. Despite its conceptual importance, the convergence theory of IGO remains limited: most existing results concern continuous-time idealizations such as the IGO flow, rather than discrete-time updates with non-infinitesimal learning rates. In this paper, we study discrete-time IGO in continuous spaces, formulated as natural gradient updates in the expectation-parameter coordinates of an exponential family. In particular, we analyze IGO over the multivariate Gaussian family on strongly convex quadratic objective functions. Our analysis covers a setting that simultaneously incorporates full covariance adaptation, a fixed positive learning rate, and quantile-based weights. In this setting, we prove that the covariance matrix converges to the zero matrix. We further show that the mean vector converges to the global optimum, provided that the condition number of the appropriately scaled covariance matrix is bounded at sufficiently frequent iterations. These results advance the convergence theory of IGO and help bridge the gap between the mathematical theory of IGO and practical covariance-adaptive search methods such as CMA-ES.

07.
arXiv (CS.LG) 2026-06-19

Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning

arXiv:2606.19481v1 Announce Type: new Abstract: Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset's structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.

08.
arXiv (CS.LG) 2026-06-19

On the Variance of Temporal Difference Learning and its Reduction Using Control Variates

arXiv:2606.20357v1 Announce Type: new Abstract: We analyze the variance of temporal difference (TD) learning using the phased setting with tabular representation, and show that one of the mechanisms behind its ability to reduce variance is by effectively aggregating over a larger number of independent trajectories. Based on this insight, we demonstrate that (1) the variance of TD is asymptotically bounded from above by Monte Carlo (MC) estimators, and (2) shorter horizon updates incur less variance for a fixed number of samples. Beyond TD, we show that Direct Advantage Estimation (DAE), a method for estimating the advantage function, can be seen as a type of regression-adjusted control variate, which achieves a tighter bound on the variance compared to TD in the large-sample limit. Finally, we numerically illustrate the behaviors of these estimators with carefully designed environments.

09.
medRxiv (Medicine) 2026-06-10

Resolving Diagnostic Discordance in Group 2 Pulmonary Hypertension Through Staged Physiologic Testing: Insights From PVDOMICS

Background: World Symposium on Pulmonary Hypertension (WSPH) Group 2 pulmonary hypertension (PH) is a clinically integrated phenotype attributed to left heart disease, whereas pre- versus post-capillary classification is operationalized primarily by pulmonary capillary wedge pressure (PCWP). Although current recommendations emphasize contextual interpretation and provocative testing for intermediate PCWP values, the relationship between PCWP-based classification and underlying phenotype has not been systematically evaluated. We aim to quantify phenotype-hemodynamic discordance across the PCWP spectrum and evaluate a staged physiology-guided framework incorporating inhaled nitric oxide (iNO), ventricular geometry, and provocative testing. Methods: We studied 1,032 participants from the NHLBI-sponsored PVDOMICS cohort with multidisciplinary adjudicated phenotypes integrating clinical, imaging, physiologic, and hemodynamic data. Stage-specific PCWP thresholds classified pre- versus post-capillary physiology at rest, during iNO, and during provocation (fluid challenge or invasive cardiopulmonary exercise testing [iCPET]). Echocardiographic right ventricular-to-left ventricular (RV/LV) ratio was evaluated as a marker of ventricular interdependence. Restricted cubic spline and staged concordance analyses defined certainty-based PCWP ranges and incremental diagnostic yield. Results: Adjudicated Group 2 phenotype was present in 37.0% of participants. Resting PCWP demonstrated good discrimination (AUC 0.86), but substantial bidirectional phenotype-hemodynamic discordance persisted across intermediate PCWP ranges. At a resting PCWP of 12 mmHg, 25% of participants classified as pre-capillary had adjudicated Group 2 PH, whereas at 18 mmHg, 35% classified as post-capillary remained discordant non-Group 2. Concordance did not approach 90% until PCWP values were 24 mmHg. Dynamic testing incrementally improved concordance within these overlap zones. Nearly half of adjudicated Group 2 PH participants (46.5%) were not identified by resting PCWP alone; incorporation of iNO and provocative testing increased cumulative Group 2 identification by 63.4% and improved sensitivity from 79.9% to 83.7%. Model discrimination improved from an AUC of 0.863 to 0.908 (likelihood-ratio P

10.
arXiv (CS.AI) 2026-06-25

What Does a Pathological Speech Assessment Model Know about Acoustic Features? A Case Study on Oral and Oropharyngeal Cancer Patients

arXiv:2606.24949v1 Announce Type: cross Abstract: This work investigates the interpretability of a Wav2Vec 2.0-based speech intelligibility assessment model for oral and oropharyngeal cancer patients through canonical correlation analysis. By measuring the correlation between the model embeddings and eGeMAPS low-level descriptors (LLDs) as an interpretable reference, we analyze how acoustic information is encoded across the model layers. The analysis is conducted at two levels: individual LLDs layer-wise, and group-level: prosodic, spectral, and voice quality. Results show that the learned representations are most strongly correlated with spectral and prosodic features, with the first MFCC coefficient yielding the highest correlations across all layers. At the group level, spectral and prosodic groups achieve correlations of 0.77 and 0.71 respectively, while voice quality reaches 0.65. Beyond model interpretability, this work also offers practical guidance on acoustic feature selection for pathological speech assessment.

11.
arXiv (CS.CL) 2026-06-16

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.
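
The confidence gating idea described above (debate only on uncertain pairs, accept the initial prediction otherwise) can be sketched as a simple routing function. This is an illustrative sketch only: the function names, the `0.8` threshold, and the toy classifier and debate stubs are assumptions, since the abstract does not specify how confidence is computed or where the gate is set.

```python
def gated_relation(pair, classify, debate, threshold=0.8):
    """Route one component pair through the gate.

    classify(pair) -> (label, confidence in [0, 1])
    debate(pair, initial_label) -> final label after the multi-agent debate
    Returns (label, route), where route records which path was taken.
    """
    label, confidence = classify(pair)
    if confidence >= threshold:
        # High confidence: accept the initial prediction, skip the debate.
        return label, "initial"
    # Low confidence: escalate to the (more expensive) debate.
    return debate(pair, label), "debate"


# Hypothetical stand-ins for the LLM calls, for demonstration only.
def toy_classify(pair):
    _source, target = pair
    if "however" in target.lower():
        return "Support", 0.55  # hedged wording makes the stub unsure
    return "Support", 0.93


def toy_debate(pair, initial_label):
    _source, target = pair
    return "Attack" if "contradicts" in target.lower() else initial_label


confident = gated_relation(
    ("Claim A", "Premise backing A"), toy_classify, toy_debate)
uncertain = gated_relation(
    ("Claim A", "However, evidence B contradicts A"), toy_classify, toy_debate)
```

In a setup like this, the threshold trades debate cost against the risk of accepting a wrong initial label, so it would normally be tuned on held-out data.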

12.
arXiv (CS.LG) 2026-06-11

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d_{lim}{=}25$) VLMPPOLag$+$Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95\% bootstrap CI $[-26,-5]$\,pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

13.
arXiv (CS.CV) 2026-06-25

Geometry-Anchored Transport Framework for Exemplar-Free Class-Incremental Learning

Exemplar-free class-incremental learning (EFCIL) requires stable decision boundaries within a shifting feature space. While maintaining class-conditional Gaussian statistics provides a principled classification strategy, these parametric summaries remain sensitive to anisotropic representation drift. Existing methods often transport these statistics across tasks using a decoupled, post-hoc paradigm: optimizing a backbone without explicit geometric constraints can distort the legacy manifold, limiting the precision of retroactive alignment. In this paper, we formulate feature transport as an endogenous training constraint rather than a separate post-task step, presenting the Geometry-Anchored Transport Framework. First, we derive an Analytic Geometric Anchor via Mahalanobis-aligned regression to mitigate macroscopic anisotropic drift. Second, we introduce a Topology-Aware Evolution objective that regularizes localized manifold degradation while calibrating a residual network against the analytic prior. By coupling manifold evolution with transport constraints during the primary training phase, our framework mitigates evaluation errors without requiring decoupled fine-tuning. Experiments across CIFAR-100, TinyImageNet, and ImageNet-100 demonstrate that the proposed framework consistently improves upon existing post-hoc alternatives under strict exemplar-free constraints.

14.
arXiv (CS.AI) 2026-06-16

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

arXiv:2606.16210v1 Announce Type: new Abstract: Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

15.
arXiv (CS.AI) 2026-06-18

DRIFT: Refining Instruction Data via On-Policy Data Attribution

arXiv:2606.18307v1 Announce Type: cross Abstract: Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

16.
arXiv (CS.CV) 2026-06-11

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose the first physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradients approximation and motion-aware temporal modeling without complex motion decomposition while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Last, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
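
The detection metric mentioned above, the Maximum Mean Discrepancy between feature sets, can be illustrated with a small standard-library sketch. It uses the biased (V-statistic) estimate of squared MMD with a Gaussian kernel; the kernel choice, the bandwidth, and the two-dimensional toy vectors standing in for NSG features are assumptions, not details taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy placeholder features (hypothetical, not real NSG features).
real_features = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
close_features = [[0.05, 0.0], [0.0, 0.05], [-0.05, -0.05]]
far_features = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

score_close = mmd_squared(real_features, close_features)
score_far = mmd_squared(real_features, far_features)
```

A detector built this way would flag a test video when its score against the reference set of real videos exceeds a chosen threshold; how that threshold is set is outside what the abstract states.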

17.
arXiv (CS.AI) 2026-06-24

OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

arXiv:2606.24799v1 Announce Type: cross Abstract: Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today's generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.

18.
arXiv (CS.AI) 2026-06-15

HierSVA: A Data Synthesis Pipeline, Dataset, and Benchmark for LLM-Driven Hierarchical Hardware Formal Verification

arXiv:2606.13706v1 Announce Type: cross Abstract: We present HierSVA, an integrated suite that combines a pipeline, dataset, and benchmark for LLM-driven hierarchical hardware formal verification. HierSVA-SP pairs an RTL preprocessing toolchain with an LLM-in-the-loop formal verification flow to produce reference SystemVerilog Assertions (SVA) on hierarchical RTL. Applying it to BaseJump STL yields HierSVA-DS, a dataset of 342 modules, with hierarchy metadata and depths 0–9, accompanied by a deep subset of 28 module-bug pairs with natural-language specifications and bug variants. HierSVA-B decomposes assertion quality into six metric axes: syntax correctness, assertion proof success rate, vacuity, specification faithfulness, mutation coverage, and formal core coverage. Applying HierSVA-B to twelve recent LLMs reveals three findings. First, the module-level compile rate is 67.1%; among generated assertions in evaluable runs, 82.1% prove non-vacuously, but the corresponding assertion sets detect only 70.2% of eligible injected faults and cover 36.2% of the formal core. Second, on 211 evaluable model–module entries in the deep subset, assertion sets flag buggy RTL with 0.87 recall, but 40% of predicted-buggy outcomes are false positives on correct RTL, limiting precision to 0.60. Third, agentic mode improves S1-style provability and strength metrics, but gains plateau and oscillate. Codes and artifacts are available at https://github.com/HierSVAAnon/HierSVACodeAndArtifacts. Dataset is available at https://huggingface.co/datasets/AnonymousHierSVA/HierSVA.

19.
arXiv (CS.AI) 2026-06-16

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

arXiv:2605.29874v2 Announce Type: replace-cross Abstract: Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

20.
arXiv (CS.AI) 2026-06-15

SkillAudit: Ground-Truth-Free Skill Evolution via Paired Trajectory Auditing

arXiv:2606.14239v1 (new). Agent skills are structured procedural packages that guide frozen LLM agents in specialized workflows. Skills rarely remain sufficient after deployment: edge cases, API changes, and deployment constraints become visible only through use, making skill evolution a practical necessity. Existing methods depend on privileged feedback such as held-out validation scores, hidden test outcomes, or environment rewards, signals often unavailable when a practitioner has only a task description and workspace data. We introduce SkillAudit, a framework for evolving agent skills without ground-truth feedback. The key idea is paired trajectory auditing: at each iteration, the same task is executed with and without the candidate skill, isolating how the skill changes agent behavior without external labels. To turn behavioral differences into edit guidance, SkillAudit uses Process-Aligned Contrastive Evaluation (PACE), a cluster of evaluators that maps trajectory divergences to diagnostic signals linked to specific passages in the skill document. A structural verifier, compiled once from the task specification and then fixed, checks task constraints and rolls back harmful updates. SkillAudit routes edits through two pipelines: Refine removes noisy or irrelevant guidance from broadly useful skills, while Repair replaces passages that conflict with the task. Across 89 containerized tasks spanning 8 professional domains, SkillAudit achieves 73.9% average task reward, outperforming an agent without skills (40.9%) and the static expert skill (56.7%). These gains are obtained without accessing hidden tests, reference solutions, or external scoring functions during evolution.
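
The control flow of paired trajectory auditing with rollback can be sketched roughly as below. This is only a schematic reading of the abstract: `evolve_skill`, `run_agent`, `propose_edit`, and `verifier` are hypothetical names, the PACE evaluators and the Refine/Repair routing are collapsed into a single `propose_edit` callable, and nothing here reflects the authors' actual implementation.

```python
def evolve_skill(skill, task, run_agent, propose_edit, verifier, iterations=3):
    """Evolve a skill document without ground-truth feedback (schematic only).

    run_agent(task, skill_or_None) -> trajectory
    propose_edit(skill, with_traj, without_traj) -> candidate skill
    verifier(trajectory) -> bool, fixed for the whole run
    """
    for _ in range(iterations):
        # Paired execution: same task, with and without the current skill.
        with_skill = run_agent(task, skill)
        without_skill = run_agent(task, None)

        # Turn the behavioural difference into a candidate edit.
        candidate = propose_edit(skill, with_skill, without_skill)

        # Accept the edit only if the fixed structural verifier passes;
        # otherwise roll back by keeping the previous skill.
        if verifier(run_agent(task, candidate)):
            skill = candidate
    return skill
```

Passing the agent, editor, and verifier in as callables keeps the loop itself free of any external scoring function, which is the property the abstract emphasises.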

21.
arXiv (CS.CV) 2026-06-12

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
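
One simple way to picture a reliability-weighted distillation term is a soft cross-entropy between a voxel's class logits and the VLM audit's class hypotheses, scaled by the audit's reliability. The sketch below assumes exactly that form; the function names and the loss shape are illustrative assumptions, and the paper's taxonomy, attribute-factor, and audit graph losses are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def audit_distillation_loss(logits, audit_probs, reliability):
    """Soft cross-entropy against audit class hypotheses, scaled by reliability.

    logits:       per-class scores predicted for one voxel
    audit_probs:  class distribution from the offline audit (sums to 1)
    reliability:  audit confidence in [0, 1]; 0 disables the term
    """
    probs = softmax(logits)
    cross_entropy = -sum(
        q * math.log(max(p, 1e-12)) for q, p in zip(audit_probs, probs) if q > 0
    )
    return reliability * cross_entropy
```

Because the term only shapes training, inference needs nothing beyond the usual logits, consistent with the abstract's statement that no VLM is required at test time.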

22.
arXiv (CS.AI) 2026-06-16

MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance

arXiv:2606.16923v1 (new). Simulation-based inference (SBI) of latent parameters is often hindered by simulator misspecification, the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, the recent state-of-the-art for robust SBI, addresses this through optimal transport between learned representations of real and simulated observations, but requires ground-truth parameter calibration pairs that are typically unavailable in the very settings where SBI is needed. What practitioners do have is unstructured side-information such as regime labels, instruction text, and policy bulletins. We propose Misspecification-Aware Simulation-Based Inference (MA-SBI), a calibration-free framework that turns this side-channel into a posterior correction. A learned corrector maps side-channel text to an observation-space shift applied before any pre-trained amortized posterior, requiring no retraining and no parameter ground-truth. Our main theorem bounds achievable bias reduction by the mutual information between misspecification and side-channel, with a non-vacuous constant that extends to all sub-Gaussian noise via Donsker-Varadhan. On hide-the-calibration benchmarks, MA-SBI with text alone matches the oracle posterior across 10 seeds and two backbones (TOST equivalence), while RoPE given more data does not. The two approaches are complementary: where misspecification is structural and recoverable from parameter pairs, RoPE dominates, as the theory predicts. A stochastic variant improves posterior-predictive log-likelihood on real COVID and OxCGRT epidemiological data, and correctly leaves the posterior unchanged on a well-specified cognitive-science corpus.

23.
arXiv (CS.CL) 2026-06-25

Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents

Recent LLM role-playing systems build character agents from novels by extracting characters, scenes, and relations. Yet long-narrative role-playing suffers from two failures: Factual Overreach, where shared retrieval or parametric memory lets a character use facts outside its perspective, and Stylistic Monotony, where profile descriptions flatten a character into a fixed voice. To address these failures, we propose REVERIEMEM, a three-layer memory architecture for book-based character agents. The episodic layer stores first-person scene memories; the semantic layer stores visibility-tagged facts; and the personality layer stores situation-dependent speech and behaviour patterns. For evaluation, we construct KBF-QA, a 4,386-question benchmark over eight novels for testing knowledge boundaries. REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points over the strongest prior method. On BOOKWORLD's five-dimension pairwise narrative protocol, REVERIEMEM achieves a win rate of roughly 79%, suggesting that perspective-bounded memory improves both boundary fidelity and character-grounded narrative generation.
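
The idea of visibility-tagged facts can be illustrated with a small data structure in which every fact carries the set of characters allowed to know it, and retrieval filters on that set. The abstract does not say how REVERIEMEM represents visibility, so the `Fact` and `SemanticLayer` classes, the set-of-names tagging, and the keyword match below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    text: str
    visible_to: frozenset  # characters who witnessed or were told this fact


@dataclass
class SemanticLayer:
    facts: list = field(default_factory=list)

    def add(self, text, visible_to):
        self.facts.append(Fact(text, frozenset(visible_to)))

    def retrieve(self, character, keyword):
        """Return only facts that are inside `character`'s perspective."""
        key = keyword.lower()
        return [
            f.text
            for f in self.facts
            if character in f.visible_to and key in f.text.lower()
        ]


memory = SemanticLayer()
memory.add("The letter was hidden in the library.", {"Anna"})
memory.add("The library closes at dusk.", {"Anna", "Boris"})
```

With this toy store, `memory.retrieve("Boris", "library")` returns only the closing-time fact, so the agent playing Boris cannot overreach into what only Anna knows.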

24.
arXiv (CS.CV) 2026-06-17

GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.
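
Why a continuous Gaussian representation supports arbitrary output scales can be seen in a few lines: primitives live in normalized coordinates, so the same set can be sampled on any pixel grid. The sketch below uses isotropic Gaussians for a single band, which is a simplifying assumption; the names `render_residual` and `inject`, and the tuple layout of a primitive, are hypothetical and not taken from GSPan.

```python
import math


def render_residual(primitives, height, width):
    """Sample a sum of isotropic 2D Gaussians on a height x width grid.

    primitives: list of (cx, cy, sigma, amplitude), with centres and sigma
                expressed in normalized [0, 1] image coordinates.
    """
    grid = [[0.0] * width for _ in range(height)]
    for row in range(height):
        y = (row + 0.5) / height  # pixel-centre coordinate
        for col in range(width):
            x = (col + 0.5) / width
            for cx, cy, sigma, amplitude in primitives:
                dist_sq = (x - cx) ** 2 + (y - cy) ** 2
                grid[row][col] += amplitude * math.exp(-dist_sq / (2 * sigma ** 2))
    return grid


def inject(upsampled_band, residual):
    """Add the rendered residual detail to an upsampled multispectral band."""
    return [
        [u + r for u, r in zip(u_row, r_row)]
        for u_row, r_row in zip(upsampled_band, residual)
    ]
```

Calling `render_residual` with the same primitives at 3x3 and at 5x5 yields two consistent samplings of one underlying field, which is the property that would let primitives be estimated at a reduced resolution and rendered at the target resolution.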

25.
arXiv (CS.AI) 2026-06-15

VISTA: View-Consistent Self-Verified Training for GUI Grounding

arXiv:2606.14579v1 (new). When applying Group Relative Policy Optimization (GRPO) for GUI grounding, rollouts are sampled from a single screenshot view; groups often become either all failures on difficult instances or all successes on easy ones, yielding no useful relative advantage. We propose VISTA (View-Consistent Self-Verified Training), a GRPO-based training framework that constructs each comparison group from multiple target-preserving views of the same GUI instance. Each view is generated by a crop that keeps the target element visible and remaps its box exactly, so model rollouts are compared across semantically equivalent but geometrically different inputs. To stabilize short coordinate generation without turning reinforcement learning into unconditional imitation, VISTA further adds a self-verified cross-view anchor: an oracle answer optimized with an advantage-weighted loss, excluded from the group baseline and activated only when the model has produced a maximum-reward rollout. Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves grounding accuracy. On ScreenSpot-Pro, it raises Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Robustness analyses further show higher worst-view accuracy and lower prediction flip rates.
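
Two mechanics in this abstract lend themselves to a short sketch: cropping a screenshot so the target stays fully visible while its box is remapped exactly, and the group-relative advantage that vanishes when every rollout in a group earns the same reward. The code below is an illustration under stated assumptions, not VISTA's implementation: integer pixel boxes, the `min_scale` parameter, and both function names are hypothetical, and the advantage uses the commonly described mean/standard-deviation normalization.

```python
import random
import statistics


def target_preserving_crop(img_w, img_h, box, rng, min_scale=0.6):
    """Pick a random crop that fully contains `box` and remap the box into it.

    box is (x1, y1, x2, y2) in integer pixels of the original screenshot.
    Returns (crop, remapped_box); crop is in original-image coordinates.
    """
    x1, y1, x2, y2 = box
    crop_w = max(x2 - x1, int(img_w * rng.uniform(min_scale, 1.0)))
    crop_h = max(y2 - y1, int(img_h * rng.uniform(min_scale, 1.0)))
    # The crop origin must lie left of / above the box, yet close enough
    # that the box's far edge still falls inside the crop and the image.
    left = rng.randint(max(0, x2 - crop_w), min(x1, img_w - crop_w))
    top = rng.randint(max(0, y2 - crop_h), min(y1, img_h - crop_h))
    crop = (left, top, left + crop_w, top + crop_h)
    remapped = (x1 - left, y1 - top, x2 - left, y2 - top)
    return crop, remapped


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group whose rewards are all equal, `group_relative_advantages` returns all zeros, which is the degenerate case the abstract describes; drawing rollouts from several different crops of the same instance is one way such groups can regain reward variance.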