Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-16

Bayesian Tensor Decomposition with Diffusion Model Prior

arXiv:2606.03212v2 Announce Type: replace Abstract: Low-rank tensor decomposition (TD) is usually effective on clean, fully observed data, but it often degrades under severe missingness or noise. Low-rankness is itself a useful but limited structural prior, and additional handcrafted priors (e.g., sparsity or smoothness) still fall short of capturing the rich statistics of real-world data. To compensate for this weak inductive bias under heavy corruption, one would like to inject a learned, data-driven prior; however, the state-of-the-art diffusion models are not readily compatible with current TD and tractable posterior inference. To address these challenges, we introduce DiffBCP, a hybrid-prior Bayesian CP decomposition framework that couples a cumulative shrinkage process prior over the CP factors for automatic rank selection with an off-the-shelf pre-trained diffusion model as an implicit data prior on the reconstructed tensor. To make posterior inference tractable despite the coupling among the likelihood, low-rank constraint, and diffusion prior, we develop a split Gibbs sampler: CP factors admit conjugate updates, while the diffusion block is sampled via low-rank-guided denoising. A noise-adaptive coupling schedule further reduces sensitivity to hand-tuned annealing. Experiments on image inpainting and denoising, including high-resolution out-of-distribution images, show consistent gains over Bayesian, nonlinear, and plug-and-play TD baselines.

02.
arXiv (CS.CV) 2026-06-12

Mana: Dexterous Manipulation of Articulated Tools

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (

03.
arXiv (quant-ph) 2026-06-17

Asymptotically Optimal Circuit Depth for Diagonal Unitary Synthesis and Compilation on Two-Dimensional Grids

arXiv:2606.17589v1 Announce Type: new Abstract: Diagonal unitaries are a fundamental but resource-intensive class of quantum operations, arising as the phase separators of QAOA and the time-evolution blocks of Hamiltonian simulation. Under all-to-all connectivity their optimal depth is established, but on nearest-neighbor hardware general-purpose compilers fall back on heuristic search, which yields no analyzable cost bound and becomes intractable at the very sizes where depth is the bottleneck. We address synthesis and compilation jointly. On the synthesis side, we develop a Gray-Path Framework (GPF) that realizes any $n$-qubit diagonal unitary in asymptotically optimal $R_z$ and CNOT depth $O(2^n/n)$ without ancillas. Our main result is that compiling GPF onto a two-dimensional nearest-neighbor grid preserves this optimality: routing adds depth $\Theta(2^n/n)$ and gate count $\Theta(2^n)$. Because GPF fixes its entire interaction structure in advance, routing reduces to scheduling a known sequence, with no heuristic search. We give the construction both with and without ancillas: the ancilla-free, cost-optimized layout is a two-row grid, and a $2k$-row layout introduces a space–time tradeoff that cuts depth by $1/k$ while remaining asymptotically optimal for the enlarged register; both are deterministic and analyzed in closed form. The same complexity is also attained on a linear nearest-neighbor chain, so the preservation is topology-independent, holding on any architecture that contains such a chain. All routing bounds are closed-form, giving the concrete resource estimates that heuristic compilers cannot provide at scale.

04.
arXiv (CS.LG) 2026-06-11

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

arXiv:2606.11255v1 Announce Type: new Abstract: Bernstein–Schur kernels are products of a finite-feature kernel (one with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter's one-dimensional Bernstein–Widder scale and then applying Gaussian random Fourier features (whose frequency is still $d$-dimensional). The feature dimension is then $Dm$, set by the sketch size $m$ and the radial-draw count $D$, free of the $O(d^2)$ size of the exact modulation feature. Keeping the modulation \emph{exact is the analyzable limit ($m\to\infty$): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crude $N\max_{ij}$ entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus a single additive sketch term, tunable by $m$ independently of $D$. The motivating instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, $b\ge0$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance-optimal at a fixed radial-feature budget.

05.
arXiv (CS.LG) 2026-06-15

DTVEM-RE: A Hierarchical Random-Effects Extension of the Differential Time-Varying Effect Model for Person-Specific Multi-Lag Estimation in Intensive Longitudinal Data

arXiv:2606.14116v1 Announce Type: new Abstract: The Differential Time-Varying Effect Model (DTVEM) of Jacobson et al. (2019) is a popular tool for finding the best time lag in intensive longitudinal data, but it assumes everyone shares the same lag structure. The original authors named fixing this as future work, and it clashes with the premise of modern clinical research, which is that people differ. We present DTVEM-RE, an extension that lets each person have their own lag coefficients, with two versions of the confirmatory step: a discrete-time hierarchical Bayesian VAR in Stan, which pools across people and gives calibrated uncertainty, and a continuous-time per-person Ornstein-Uhlenbeck model in ctsem, which handles unevenly spaced beeps directly. We report four results. A simulation shows the Bayesian version recovers the between-person spread tau_a with bias below 0.01 and coverage of 90 to 93 percent. On the Fisher et al. (2017) EMA dataset (N=40), person-specific lag-1 effects vary by an order of magnitude across three mood items, the Bayesian and GAMM estimates agree closely (r=0.87 to 0.92), and DTVEM-RE gives the best one-step-ahead prediction among four discrete-time methods. A multi-lag version shows all nine tau_k values have credible intervals excluding zero, and the lag where people differ most changes across items, something lag-1-only methods like mlVAR cannot detect. Finally, the two versions agree almost exactly on person-specific lag-1 estimates (r >= 0.995), differing only as shrinkage predicts. DTVEM-RE is, to our knowledge, the first person-specific implementation of DTVEM-style lag detection, and it contains standard DTVEM as a special case.

06.
arXiv (CS.AI) 2026-06-17

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

arXiv:2602.14211v3 Announce Type: replace-cross Abstract: Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as trusted guidance and steer downstream tool use. Existing skill-based prompt-injection attacks are often manual and brittle, because explicit malicious instructions are rejected or ignored when they are not aligned with the original workflow. We propose SkillJect, the first automated framework for generating poisoned skills against skill-enabled agent systems. SkillJect uses two coordinated channels. In the artifact channel, it hides the payload inside an auxiliary helper script. In the instruction channel, it rewrites SKILL.md with a front-loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or initialization step. The rewritten instruction explicitly references the helper-script path and provides an executable example command, making the helper appear to be a legitimate setup step before normal skill operations. SkillJect further adopts a closed-loop multi-agent process to improve attack effectiveness. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with the poisoned skill, and an Evaluate Agent inspects execution traces to determine whether the hidden payload was executed. The Attack Agent then uses this feedback to diagnose failure causes and rewrite SKILL.md, while keeping the payload fixed. Experiments across skill-enabled platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual skill-injection attacks, highlighting poisoned skills as a persistent threat in reusable skill ecosystems.

07.
arXiv (CS.AI) 2026-06-12

An LLM System for Autonomous Variational Quantum Circuit Design

arXiv:2606.13380v1 Announce Type: cross Abstract: The design of high performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit designs under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM driven agentic system as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

08.
arXiv (CS.CL) 2026-06-15

CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), such a system, redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, bounding rewriter-failure blast radius. Online A/B from two sequential production launches, first deploying the rewriter at finerank, then extending consumption to recall and rawrank, delivers statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.

09.
arXiv (CS.AI) 2026-06-12

The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review

arXiv:2606.12423v1 Announce Type: cross Abstract: The rapid integration of artificial intelligence (AI) into critical infrastructure including healthcare, finance, energy, and defense, offers transformative benefits but also conflicts with evolving regulatory and governance frameworks. This paper presents a systematic literature review (SLR) to examine the challenges of balancing AI compliance and technological innovation across critical infrastructure sectors. The review follows established SLR guidelines to extract and synthesize insights from peer-reviewed articles, report, and institutional sources published between 2020-2025. The study identifies three interrelated challenges: fragmented regulations, excessive compliance burdens for smaller to medium enterprises (SMEs), and misaligned governance models. To address these challenges, the study highlights practical governance strategies, including risk-tiered regulation, compliance by design, and explainable AI, to support scalable and trustworthy AI deployment in critical sectors. Key contributions include a concise mapping of core AI-governance challenges and a conceptual diagram illustrating their overlap, as well as actionable strategies for policymakers and practitioner to harmonize oversight with innovation.

10.
arXiv (CS.CV) 2026-06-11

Global Geometry Is Not Enough for Vision Representations

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input–output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input–output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

11.
medRxiv (Medicine) 2026-06-16

Cross-sectional study of the association between depressive symptoms and attentional bias to emotional stimuli in patients with acute stroke: Study protocol

Post-stroke depression affects approximately 30% of patients after stroke and is associated with delayed recovery in activities of daily living, reduced rehabilitation effectiveness, and poorer quality of life. Attentional bias modification may provide a low-burden, nonpharmacological approach for patients in the acute phase of stroke. However, before such an intervention can be implemented in clinical practice, it is necessary to clarify whether attentional bias is present in patients with acute stroke and depressive symptoms, whether cognitive function influences the manifestation of this bias, and which task and stimulus formats are most appropriate for assessment. This multicenter, cross-sectional observational study will enroll patients with acute stroke between 7-30 days after stroke onset. Depressive symptoms will be assessed using the depression subscale of the Hospital Anxiety and Depression Scale. Attentional bias will be measured under four task conditions based on the dot-probe task and the cue-target task, using face and word stimuli. Secondary assessments will include cognitive function, anxiety symptoms, activities of daily living, health-related quality of life, and clinical background variables. The aims of this study are to investigate the association between depressive symptoms and attentional bias in patients with acute stroke, compare attentional bias characteristics across task and stimulus types, and examine the potential influence of cognitive function on this association. The findings are expected to provide an empirical basis for designing future attentional bias modification protocols targeting post-stroke depression in the acute phase. This study has been registered with the UMIN Clinical Trials Registry (UMIN000059166).

12.
arXiv (CS.AI) 2026-06-16

The Distributed Detectability Band Against Marginal-Preserving Attacks

arXiv:2606.10456v2 Announce Type: replace-cross Abstract: AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

13.
arXiv (CS.LG) 2026-06-17

Statistical Learning from Attribution Sets

arXiv:2602.06276v2 Announce Type: replace Abstract: We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

14.
arXiv (CS.CL) 2026-06-18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

15.
arXiv (CS.CV) 2026-06-11

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

Understanding spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low accuracy class in USDA Cropland Data Layer (CDL). Geospatial foundation model (GFM), Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale that are ill-suited for the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids through scaling of single stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine tuning (PEFT) schemes: Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves a mAP@50 of 0.9479 with the Diou loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter free one-stage detection under LoRA improves the adapter-free anchor-based approach by 6.42%, and the best configuration improves baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.

16.
arXiv (CS.CL) 2026-06-16

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge and report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter: ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"). Critically, the panel shares no judge with the prior release (five paid commercial APIs replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, a reliability-quantified benchmark whose central classification axis is shown stable across corpus expansion and judge-panel replacement.

17.
arXiv (CS.CV) 2026-06-17

MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded, it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.

18.
arXiv (CS.AI) 2026-06-18

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

arXiv:2606.19199v1 Announce Type: cross Abstract: The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging – e.g., based on reinforcement learning (RL) – can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.

19.
arXiv (CS.CV) 2026-06-15

Self-Evolving Visual Questioner

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric and grounded questions remains underexplored. Existing visual questioners' performance is bottlenecked by the availability of high-quality training data or the cost of curating them. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining their exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially enhances the quality and substantially expands the difficulty boundary of autonomous question generation. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.

20.
arXiv (CS.CL) 2026-06-19

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

21.
arXiv (CS.LG) 2026-06-12

How Reliable are Fairness Audits with Unreliable Data?

arXiv:2506.23033v3 Announce Type: replace Abstract: Fairness audits are a key component of responsible machine-learning deployment. Yet, audit-recommendation reliability under incomplete protected-label access is still poorly understood. In this work, we focused on protected-label missingness in fairness mitigation audits. We introduced a seed-calibrated stress test to separate missingness effects from seed-to-seed movement already present under complete labels. Across ACS/Folktables tasks, missingness settings that retain some protected labels usually do not move selected mitigation methods beyond a complete-label seed-to-seed baseline. At $0%$ protected-label access, candidates collapse to an empirical-risk-minimization baseline and deterministic tie-breaking rather than revealing a broad missingness effect. We also found that threshold optimization can turn fairness gains on a single protected axis into intersectional harm above a seed baseline, and this threshold-optimizer finding persists under random-forest validation. Overall, our results highlight that protected-label missingness should be reported with seed-null calibration, candidate-set context, and intersectional consequences before it is treated as evidence of audit fragility.

22.
arXiv (quant-ph) 2026-06-17

DRAG-Compatible Leakage Suppression in Landau–Zener Control via Isoprobability Twins

arXiv:2506.19572v4 Announce Type: replace Abstract: Analytically solvable models – particularly the Landau-Majorana-Stückelberg-Zener (LMSZ) and Allen-Eberly-Hioe (AEH) models – underpin many quantum-gate implementations and population-transfer protocols. However, their canonical pulse shapes are incompatible with modern leakage-suppression techniques and some systems. Most notably, the constant Rabi envelope of the LMSZ pulse prevents many leakage-suppression approaches, which require smoothness. We address both limitations by developing the concept of isoprobability twin models: distinct pairs of Rabi frequency $\Omega(t)$ and detuning $\Delta(t)$ that yield identical post-pulse transition probabilities based on the Delos-Thorson transformation. In this work, we formalise the method by experimentally demonstrating the equivalence of multiple LMSZ and AEH twin models on IBM's ibm_kyiv processor. Finally, we show a staggering leakage reduction by more than 3 orders of magnitude using a custom DRAG implementation of a cosine LMSZ isoprobability model.

23.
arXiv (CS.AI) 2026-06-18

A Variational Framework for LLM Generator-Regulator Games

作者:

arXiv:2606.18424v1 Announce Type: cross Abstract: This paper develops a variational framework for regulated language generation. Starting from autoregressive token sampling, we derive the induced distribution over complete messages and relate it to an entropy-regularized Gibbs law. Regulation is modeled as an optimal discriminator whose convex-dual value is an f-divergence, and the generator-regulator interaction is formulated as a saddle-point problem. The framework applies to moderation, censorship, AI deception detection, compliance auditing, phishing defense, and manipulation control, where regulation concerns a distribution over possible messages rather than a single output. The equilibrium clarifies the tradeoff among utility, entropy, regulatory alignment, and finite-length detectability. Two finite-vocabulary case studies, censorship filtering and phishing defense, illustrate how the theory can be evaluated through utility, entropy, divergence, receiver-side scores, and detection probability.

24.
arXiv (quant-ph) 2026-06-16

REGRID-QAOA: A Resource-Efficient Graph-Reduced Hybrid QAOA Framework for Physics-Constrained Power System Islanding

arXiv:2606.15083v1 Announce Type: new Abstract: Quantum computing has rapidly emerged as a powerful paradigm for tackling computationally demanding problems. In particular, quantum optimization shows strong promise for hard combinatorial problems in power systems, where increasing distributed energy penetration heightens the need for intentional islanding to maintain grid reliability and resilience. However, power system islanding is an NP-hard combinatorial optimization problem that becomes computationally prohibitive for classical solvers as network size grows, motivating the use of quantum computing as a promising alternative pipeline. This study develops a resource-efficient hybrid QAOA islanding framework that brings physics-constrained power-system partitioning into the quantum optimization workflow. The framework combines coherency-informed graph reduction, physics-aware constraint modeling, and structured post-processing to efficiently convert shallow-circuit QAOA samples into high-quality feasible islanding decisions without deep circuits or large shot budgets. The proposed framework is validated on the standard IEEE benchmark systems (9-, 14-, 24-, 30-, 39-, and 57-bus), demonstrating that the hybrid workflow achieves Gurobi-optimal solution quality with a clear quantum resource advantage over vanilla QAOA, while the resulting islanding solutions satisfy all physical feasibility requirements after network separation. This study establishes QAOA-based islanding as a viable quantum approach for critical infrastructure, with structured post-processing as the key enabler of quantum resource efficiency.

25.
arXiv (CS.CL) 2026-06-16

Can LLM Coding Agents Reason About Time Series?

Large language models (LLMs) are increasingly being used for automated decision-making systems in finance, healthcare, or environmental monitoring. Time series data are ubiquitous in these fields, yet hard to process automatically. Can time series be analyzed by LLM agents? We examine three approaches: providing the agent with raw numerical data, using the LLM as a coding agent, or a combination of both. In the coding agent setup, the model iteratively queries the data using Python code. Using two time series understanding benchmarks, we show that agents with code access can outperform models processing raw data by up to 10%. However, even the best performing agent still answers about 22-34% of the questions incorrectly. To get insights into models' strategies and reasoning gaps, we analyze the model outputs with a strong LLM judge. Our analysis reveals that coding agents can select appropriate statistical tests, but often miss important nuances. Meanwhile, models with access to raw data can reach the right conclusions using back-of-the-envelope calculations.