Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-16

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

arXiv:2606.01602v2 Announce Type: replace-cross Abstract: Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

02.
arXiv (CS.AI) 2026-06-16

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

作者:

arXiv:2606.17005v1 Announce Type: new Abstract: Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over $1{,}000$ systems is compatible with two pre-terminal histories, yielding times of $23.03$ or $75.13$ to reach within $0.05$ of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.

03.
medRxiv (Medicine) 2026-06-18

Human Intuition vs. Computational Precision: Neurologists, Feature-based Models, and Deep Learning for Stroke Prognosis

Background: Prognostication in large vessel occlusion (LVO) stroke remains challenging. Although several prognostic models exist, their comparison to clinician performance, human-model interaction, and specific sources of human bias remain poorly understood. Methods: Using pre-treatment clinical and CT data from the MR CLEAN trial (n=500), six neurologists predicted three-month modified Rankin Scale (mRS) scores for 40 patients, both unaided and assisted by a validated feature-based model (MR PREDICTS). Human performance was benchmarked against MR PREDICTS and a multimodal, interpretable deep learning (DL) approach using raw imaging data. We explicitly assessed neurologists? ability to estimate model-required imaging features and identified systematic human biases. Models were additionally validated in a larger MR CLEAN trial cohort (n=404). Results: For predicting the full mRS distribution, standalone models achieved good ordinal agreement (MR PREDICTS quadratic weighted kappa (QWK) 0.51 [0.24 to 0.70]; DL model 0.49 [0.25 to 0.67]), significantly outperforming unaided neurologists (QWK 0.27 [0.10, 0.42]). Neurologists showed systematic overoptimism, predicting lower mRS scores than observed. Furthermore, there was poor accuracy in extracting imaging features. Raters? ASPECTS predictions deviated by 3.4 points from the confirmed scores, and collateral score accuracy was 44.6%. However, for predicting binary mRS (0-2 vs. 3-6), accuracy was comparable between unaided neurologists (64.17% [55.42% to 72.92%]) and models (MR PREDICTS 67.50% [52.50% to 82.50%]; DL model 63.16% [47.37% to 78.95%]). Model-assistance modestly improved and harmonized neurologists? predictions (QWK 0.41 [0.22 to 0.55]; binary accuracy 68.75% [58.33% to 78.34%]. Model performance remained robust in the larger cohort. Conclusions: Multimodal prognostic models outperform clinicians in predicting the full range of mRS outcomes, while human error in imaging assessment and systematic optimism bias are primary drivers of prognostic inaccuracy. End-to-end DL models eliminate human-input variability and hold strong potential as an automated second opinion to support prognostication and decision-making in acute LVO stroke.

04.
arXiv (CS.LG) 2026-06-17

Adaptable Method for Crystal Design across Diverse Constraints and Objectives with Pretrained Property Predictors

arXiv:2410.08562v5 Announce Type: replace-cross Abstract: Advanced crystal design can accelerate materials discovery across applications from photovoltaics to spintronics. Practical design must satisfy multiple properties and physical constraints, yet existing machine-learning-based approaches to such design often depend on large datasets, retraining, or task-specific generators. Here, we show that direct predictor-guided gradient optimization enables data-efficient, constraint-rich crystal design by combining off-the-shelf predictors with site-wise element masks, template initialization, and task-specific losses. In perovskites, it outperformed generative and Bayesian baselines under three targets – band gap, formation energy, and tolerance factor – and two hard constraints. DFT assessment further showed band-gap targeting competitive with a leading generative model despite using predictors trained on roughly one-tenth of the data. By flexibly combining pretrained predictors with application-oriented masks and custom losses, the same framework supported half-metal design. Such modularity could help researchers and engineers translate diverse application requirements directly into optimized candidate crystals with minimal computational cost.

05.
arXiv (CS.CV) 2026-06-11

DarkVGGT: Seeing Through Darkness Using Thermal Geometry without Daylight Tax

Recent feed-forward 3D reconstruction methods have demonstrated strong performance and flexibility in efficient end-to-end scene geometry estimation from image streams. However, their reliance on visible-light appearance makes them vulnerable in dark and low-visibility environments, where RGB cues are severely degraded and geometric evidence becomes ambiguous. To address this challenge, we propose DarkVGGT, an RGB-T feed-forward geometry framework that uses physics-aware thermal modeling for robust 3D estimation in low-light scenes. DarkVGGT introduces two complementary modules. First, physics-inspired thermal factorization extracts emissive-dominant, geometry-consistent thermal cues while isolating sparse reflective residuals that may introduce geometric ambiguity. Second, geometry-shared thermal routing isolates modality-invariant geometric structures from thermal-specific patterns, selectively injecting reliability-aware structural guidance into the RGB stream. Together, these components enable accurate thermal-informed geometry estimation under degraded RGB conditions while largely preserving performance in well-lit environments. Experiments on low-visibility RGB-T benchmarks demonstrate consistent improvements in both depth and camera pose estimation over existing feed-forward geometry baselines.

06.
arXiv (CS.AI) 2026-06-19

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

arXiv:2606.20041v1 Announce Type: cross Abstract: We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

07.
arXiv (CS.CV) 2026-06-16

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

08.
medRxiv (Medicine) 2026-06-17

Cost-effectiveness of measles rapid diagnostic tests for replacing or expanding laboratory testing in Ethiopia

Background: In low- and middle-income countries, laboratory testing to rapidly detect measles outbreaks is limited by infrastructure availability and high costs. This study estimates the potential impact and cost-effectiveness of measles rapid diagnostic tests (RDTs) if implemented nationally in Ethiopia to either replace or expand current testing. Methods: An agent-based model to simulate measles outbreaks was calibrated to Ethiopian measles surveillance data. Modelled outbreak outcomes were aggregated over a 10-year period. Scenarios included using RDTs to (1) replace laboratory testing; (2) replace epidemiological linkage; and (3) increase case detection, in addition to replacing laboratory testing and epidemiological linkage. Testing and outbreak response costs (in 2025 US$) were obtained from Ethiopian Public Health Institute from a government perspective. Total costs and disability-adjusted life years (DALYs) for each scenario were compared to baseline. Results: All scenarios were cost saving compared to baseline. Replacing laboratory testing with RDTs saved US$4.2M (3.2M-4.9M) over 10-years, but due to very low testing rates the benefits of eliminating laboratory testing delays were offset by missed cases from the lower RDT sensitivity, leading to similar outbreak detection times and DALYs. Replacing epidemiological linkage with RDTs had similar DALYs but increased the cost savings to US$9.7M. Using RDTs to double case detection reduced outbreak detection time from 113 to 80 days, averted 17,000 DALYs, and saved US$4.3M. Conclusions: In Ethiopia, use of measles RDTs could be cost saving, and if used to expand testing could prevent measles infections through faster outbreak detection and response.

09.
arXiv (CS.LG) 2026-06-19

A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data

arXiv:2606.19711v1 Announce Type: cross Abstract: Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.

10.
arXiv (CS.CV) 2026-06-18

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

11.
arXiv (quant-ph) 2026-06-12

Non-invertible symmetries out of equilibrium: Eigenstate order and Floquet physics

arXiv:2508.14213v2 Announce Type: replace-cross Abstract: Through the study of the Rep($D_8$) non-invertible symmetry, we show how non-invertible symmetries manifest in dynamics. Results are presented for dynamics generated by Hamiltonians as well as Floquet unitaries. For both examples, the role of the non-invertible symmetry is studied through the appearance of non-invertible symmetry protected edge modes. In addition, the role of the non-invertible symmetry for the Hamiltonian is studied through eigenstate order. In particular, by considering the effect of symmetry preserving disorder, the non-invertible symmetry is shown to give rise to degeneracies in the spectra of the Hamiltonian that can only be completely lifted at orders of perturbation that scale with system size. The eigenstates of disordered Hamiltonians, whose ground state correspond to non-trivial symmetry protected topological (SPT) states, are shown to have either trivial or non-trivial SPT order that are detected as non-zero expectation value of string order-parameters. In contrast, non-trivial SPT order is absent in the eigenstates of trivial SPT Hamiltonians with disorder. The interface between two different SPT phases host edge modes whose dynamics is studied numerically and analytically. The edge mode is shown to oscillate at frequencies related to different effective chain lengths that are weighted by the temperature, becoming an exact zero mode in the limit of zero temperature. A Floquet model with the non-invertible symmetry is constructed whose edge mode is shown to exhibit period-doubled dynamics at low effective-temperatures. The zero and period-doubled edge modes differ from those in conventional SPTs by being symmetric under the invertible symmetry, while being charged under the non-invertible symmetry.

12.
arXiv (CS.CL) 2026-06-11

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.

13.
arXiv (CS.AI) 2026-06-11

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

arXiv:2606.12071v1 Announce Type: cross Abstract: LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.

14.
arXiv (CS.CV) 2026-06-15

Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection

Deep neural networks have demonstrated excellent performance in SAR target detection tasks but remain susceptible to adversarial attacks. Existing SAR-specific attack methods can effectively deceive detectors; however, they often introduce noticeable perturbations and are largely confined to digital domain, neglecting physical implementation constrains for attacking SAR systems. In this paper, a novel Adversarial Attenuation Patch (AAP) method is proposed that employs energy-constrained optimization strategy coupled with an attenuation-based deployment framework to achieve a seamless balance between attack effectiveness and stealthiness. More importantly, AAP exhibits strong potential for physical realization by aligning with signal-level electronic jamming mechanisms. Experimental results show that AAP effectively degrades detection performance while preserving high imperceptibility, and shows favorable transferability across different models. This study provides a physical grounded perspective for adversarial attacks on SAR target detection systems and facilitates the design of more covert and practically deployable attack strategies. The source code is made available at https://github.com/boremycin/SAAP.

15.
arXiv (CS.CV) 2026-06-12

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex heterogeneous environments rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which utilizes two attention-based modules, Local Context Fusion Module (LCFM) uses the RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly considering small and partially obscured objects, while Global Context Attention Module (GCAM) converts the co-occurrence of objects priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and co-occurring objects detection. Our method is evaluated on two datasets, Cityscapes and BDD100K which demonstrate significant improvement on relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.

16.
arXiv (CS.LG) 2026-06-19

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

arXiv:2606.19617v1 Announce Type: cross Abstract: We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.

17.
arXiv (quant-ph) 2026-06-12

Achieving Heisenberg limit under noisy conditions with quantum Zeno dynamics and dynamical decoupling

arXiv:2606.13205v1 Announce Type: new Abstract: Quantum Zeno dynamics (QZD) and dynamical decoupling (DD) are useful tools that enable the effective suppression of noise in quantum systems. We consider the problem of when (i) noise can be suppressed and (ii) Heisenberg limit (HL) can be achieved in quantum metrology, and prove necessary and sufficient conditions for when QZD and DD are useful for achieving these two goals. We also show that in the Markovian regime, there are scenarios where preventing errors using QZD/DD may enable HL to be achieved where current QEC methods may not. Finally, we demonstrate that the combination of both techniques can allow individually imperfect QZD and DD strategies to saturate HL.

18.
arXiv (CS.CV) 2026-06-17

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

19.
arXiv (CS.AI) 2026-06-16

The Perils of Agency: How Developers Perceive, Prioritize, and Address Risks in Agentic AI Products

arXiv:2606.15485v1 Announce Type: cross Abstract: Agentic AI systems act autonomously, use tools, adapt to context, and operate in complex real-world environments. However, these same characteristics can create or exacerbate product risks. We studied how industry developers (n=35) perceive, prioritize, and address the risks in their agentic AI products. We found that developers' perceptions of risk were closely tied to the qualities that made the product agentic, such as autonomy, tool use, and usage in a real-world context. Developers prioritized product and business risks before considering downstream societal risks like job displacement and end-user privacy. This prioritization also impacted developers' ability and motivation to mitigate agentic risks. Finally, developers lacked mature controls for containing agentic risks, often relying on constraining the same characteristics that make agents useful: e.g., autonomy and goal complexity. These findings reveal a capability vs. risk control tension in agentic AI development: developers need to address risks that emerge from agentic capabilities, yet they currently have limited support for doing so without constraining agentic functionality.

20.
arXiv (CS.CV) 2026-06-16

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This become particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase the performance. Beyond fine-tuning, VIOLIN improves various small scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited data settings.

21.
arXiv (CS.AI) 2026-06-15

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

arXiv:2606.14581v1 Announce Type: cross Abstract: Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

22.
arXiv (CS.CL) 2026-06-16

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

23.
arXiv (CS.CL) 2026-06-12

M\"OVE: A Holistic LLM Benchmark for the German Public Sector

We present M\"OVE (Modelle für die \"Offentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. M\"OVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silverstandard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. M\"OVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

24.
arXiv (CS.AI) 2026-06-16

StyleShield: Exposing the Fragility of AIGC Detectors through Continuous Controllable Style Transfer

arXiv:2605.00924v2 Announce Type: replace-cross Abstract: AI-generated content (AIGC) detectors are increasingly deployed in high-stakes settings such as academic integrity screening, yet their reliability rests on a fundamental paradox: as language models are trained on human-written corpora, the statistical boundary between AI and human writing will inevitably dissolve as models improve. Commercial incentives have further distorted this landscape – detection services and "de-AIification" tools often operate within the same supply chain, replacing evaluation of content quality with judgment of content origin. We present StyleShield, the first flow matching framework for conditional text style transfer, operating directly in continuous token embedding space via a DiT backbone with zero-initialized cross-attention adapters conditioned on frozen Qwen-7B representations. At inference, we adapt the SDEdit paradigm from image synthesis to text embeddings, with a single parameter gamma providing smooth continuous control over the evasion-preservation trade-off. On a multi-domain Chinese benchmark, StyleShield achieves 94.6% evasion against the training detector and >=99% against three unseen detectors, maintaining 0.928 semantic similarity. We further introduce RateAudit, a document-level scheduling algorithm that demonstrates detection-rate verdicts can be set to arbitrary values, directly questioning the reliability of score-based evaluation.

25.
arXiv (CS.CV) 2026-06-11

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose we propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.