Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-17

Multi-strain Probiotics Alter Gut Microbiota and Estrobolome Pathways in Primary Dysmenorrhea

Background: Exact cause of primary dysmenorrhoea is unknown but recent evidence uncovers a potential link between gut dysbiosis and benign gynaecological disorder via disruption of estrobolome. Methods: A randomized controlled trial to investigate the effects of multi-strain oral probiotics on primary dysmenorrhoea has been conducted. This is a secondary analysis comparing the stool microbiome in women with primary dysmenorrhoea and those without (control), and the effects of treatment with probiotics versus placebo. Results: Although microbial richness and evenness were comparable between groups (alpha diversity, p > 0.05), gut microbial community composition differed significantly (Bray Curtis PERMANOVA, p = 0.015), characterised by reduced Bifidobacterium adolescentis and Blautia and enrichment of Faecalibacterium in dysmenorrhoea, alongside condition-specific core taxa. Post-intervention analysis revealed significant shifts in microbial community structure between pre- and post-treatment groups (PERMANOVA, F = 2.11, p = 0.005), with probiotic supplementation inducing more consistent and directed microbiome changes than placebo, without altering alpha diversity (p > 0.05). Functional prediction showed no significant difference in overall beta glucuronidase pathway abundance (p > 0.05); however, dysmenorrhoea was associated with higher abundance of beta glucuronidase producing taxa (MaAsLin2, q < 0.05) that were differentially modulated by probiotic treatment. Conclusion: This discovery provides evidence on the microbial disruption in primary dysmenorrhoea as well as the benefit of probiotics to modulate the intestinal microbiota to improve the condition.

02.
arXiv (CS.CV) 2026-06-11

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

03.
arXiv (CS.AI) 2026-06-16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

arXiv:2606.14801v1 Announce Type: cross Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

04.
Science (Express) 2026-06-02

Another red alert for American science | Science

作者: 未知作者

Although research has bipartisan support in the US Congress, and trust in science is above 75% across the country, the Trump administration seems as determined as ever to mortally wound the nation’s scientific enterprise. After the scientific community persuaded Congress to restore most of the president’s draconian cuts to research funding last year, the White House Office of Management and Budget (OMB), under Russell Vought, has found new ways to circumvent the will of Congress and starve American science. At the beginning of this year, OMB dragged its feet in releasing instructions to federal agencies for how to distribute the funding appropriated by Congress, leading to lags in dispersal. Now, OMB has proposed revising the rules that govern how federal dollars are spent. The changes would inevitably lead to unlegislated reductions in funding and damage US leadership in science, both in academia and industry.

05.
arXiv (CS.AI) 2026-06-16

Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

arXiv:2606.15762v1 Announce Type: cross Abstract: We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Code reference finding, the behavior was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions. The benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap. Snyk Code static application security testing (SAST) was deterministic and better at systematically enumerating repeated data-flow sinks. The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other.

06.
arXiv (CS.CL) 2026-06-11

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics – Faithfulness, Coverage, Informativeness, and Acuity – to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

07.
arXiv (CS.CL) 2026-06-18

Application of integrated gradients explainability to sociopsychological semantic markers

Classification of textual data in terms of sentiment, or more nuanced sociopsychological markers (e.g., agency), is now a popular approach commonly applied at the sentence level. In this paper, we exploit the integrated gradient (IG) method to capture the classification output at the word level, revealing which words actually contribute to the classification process. This approach improves explainability and provides in-depth insights into the text. We focus on sociopsychological markers beyond sentiment and investigate how to effectively train IG in agency, one of the very few markers for which a verified deep learning classifier, BERTAgent, is currently available. Performance and system parameters are carefully tested, alternatives to the IG approach are evaluated, and the usefulness of the result is verified in a relevant application scenario. The method is also applied in a scenario where only a small labeled dataset is available, with the aim of exploiting IG to identify the salient words that contribute to building the different classes that relate to relevant sociopsychological markers. To achieve this, an uncommon training procedure that encourages overfitting is employed to enhance the distinctiveness of each class. The results are analyzed through the lens of social psychology, offering valuable insights.

08.
arXiv (quant-ph) 2026-06-12

Quantum Logic Codes: Complete Transversal Logical Clifford Instruction Sets for High-Rate Stabilizer Quantum Error Correcting Codes

作者:

arXiv:2606.13521v1 Announce Type: new Abstract: We study the structure and transversal logical capabilities of stabilizer quantum error correcting codes. Among our results, we identify universal lower bounds on circuit depth to generate a full logical Clifford algebra, and develop novel constructions of logical transversal gates including a new depth-one transversal phase $\mathrm{\overline{S}}$ gate in the rotated surface code and a depth-one intra-block $\mathrm{\overline{CZ}}$ gate in the 2D-toric code that generalizes to all odd distances and all lengths $L\ge3$, respectively. Finally, we construct a high-rate non-LDPC CSS code family with parameters $[[n,\sqrt{n},\Theta({n^{\beta}})]]$ where $\beta \approx 0.2823$ in one demonstrated case, that provably possesses a constant-depth complete 2-local transversal logical Clifford basis instruction set architecture (ISA) composed of all individually targeted $\mathrm{\overline{S}}$, $\mathrm{\overline{SHS}} = \sqrt{X}$, and $\mathrm{\overline{CZ}}$ gates. This ISA is depth-one for certain subfamilies that we design and generally constant-depth under certain conditions. The code family is built from a small code with parameters $[[n_0, 2, d_0]]$, and is tunable in the standard way: it tiles out to form utility-scale logical qubit counts, and it scales up through concatenation to achieve higher distances and error suppression. We show that this construction preserves the depth-one complete transversal logical Clifford basis ISA when composed with these commuting construction actions, inheriting structure from the core codes so that at scale the complete logical Clifford basis ISA remains depth-one up to depth-two addressable operations between tiled cores. We call these Quantum Logic Codes.

09.
arXiv (CS.CV) 2026-06-19

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.

10.
arXiv (quant-ph) 2026-06-19

Arrival times of an atomic Bose-Einstein condensate

arXiv:2606.20281v1 Announce Type: cross Abstract: The times of flight of an atomic Bose-Einstein condensate are theoretically investigated in the experimentally unexplored regime corresponding to detection close to the trap of the condensate. In this regime, there is no consensus on how to calculate the distribution of times of arrival onto the detector. For non-interacting particles, distinct theoretical predictions have been made in the past. This work analyses how these predictions are modified for an interacting Bose-Einstein condensate. For this purpose, a time-dependent Gross-Pitaevskii equation is solved analytically and numerically.

11.
arXiv (CS.CV) 2026-06-18

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.

12.
arXiv (CS.LG) 2026-06-18

KANEL\'E: Kolmogorov-Arnold Networks for Efficient LUT-based Evaluation

arXiv:2512.12850v3 Announce Type: replace-cross Abstract: Low-latency, resource-efficient neural network inference on FPGAs is essential for applications demanding real-time capability and low power. Lookup table (LUT)-based neural networks are a common solution, combining strong representational power with efficient FPGA implementation. In this work, we introduce KANEL\'E, a framework that exploits the unique properties of Kolmogorov-Arnold Networks (KANs) for FPGA deployment. Unlike traditional multilayer perceptrons (MLPs), KANs employ learnable one-dimensional splines with fixed domains as edge activations, a structure naturally suited to discretization and efficient LUT mapping. We present the first systematic design flow for implementing KANs on FPGAs, co-optimizing training with quantization and pruning to enable compact, high-throughput, and low-latency KAN architectures. Our results demonstrate up to a 2700x speedup and orders of magnitude resource savings compared to prior KAN-on-FPGA approaches. Moreover, KANEL\'E matches or surpasses other LUT-based architectures on widely used benchmarks, particularly for tasks involving symbolic or physical formulas, while balancing resource usage across FPGA hardware. Finally, we showcase the versatility of the framework by extending it to real-time, power-efficient control systems.

13.
arXiv (CS.CL) 2026-06-16

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.

14.
arXiv (CS.AI) 2026-06-19

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

arXiv:2606.19381v1 Announce Type: cross Abstract: Code-switch (CS) Automatic Speech Recognition (ASR) remains challenging due to limited availability of high quality CS text-speech pairs for training. Although synthetic data augmentation via Text-to-speech (TTS) has been explored, existing CS TTS approaches primarily optimise reconstruction fidelity and do not explicitly enforce language-boundary consistency, thereby limiting their effectiveness for CS ASR augmentation. This paper proposes a code-mixing guided preference-learning framework that steers synthetic speech generation toward improved code-switching fidelity using the Code Mixing Index (CMI). Experiments on the SEAME Mandarin-English conversational corpus demonstrate that the proposed method enhances the utility of synthetic data for ASR fine-tuning. Specifically, when fine-tuning Whisper Large, the proposed approach reduces Mixed Error Rate (MER) from 12.1%/17.8% to 8.9%/14.2% on the DevMAN and DevSGE sets, respectively.

15.
arXiv (CS.CL) 2026-06-11

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

16.
arXiv (CS.AI) 2026-06-17

Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

arXiv:2606.17102v1 Announce Type: cross Abstract: Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative – from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

17.
arXiv (CS.AI) 2026-06-16

Scaling Adaptive Depth with Norm-Agnostic Residual Networks

arXiv:2606.16112v1 Announce Type: cross Abstract: Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.

18.
medRxiv (Medicine) 2026-06-15

Unveiling the Awareness of Private Health Insurance Coverage among Healthcare Professionals in Freetown, Sierra Leone: Insights Extracted from Their Perspectives.

Our study is an assessment of the knowledge, personal coverage, and related determinants of private health insurance as revealed by healthcare professionals in Freetown, the urban capital of Sierra Leone. This study stands as a precursor for Low- and Middle-Income Countries (LMICs), like Sierra Leone, seeking to establish Universal Health Coverage (UHC) to provide healthcare access and coverage through publicly arranged risk pooling, designed to help protect against unmanageable medical costs. In parallel, such countries face significant challenges with achieving sustainable universal coverage due to limited public resources, inefficient allocation systems, uneasy reliance on out-of-pocket payments, and large struggling populations. Our research sheds particular light on how healthcare professionals view their own participation with private healthcare options. A cross-sectional, analytical study was conducted, openly recruiting individuals from various facilities in Freetown. Using the Yamane Formula, a sample size of 109 participants was calculated. STATA 14.0 was used for data analysis. Our findings revealed that 96 (88.9%) participants did not have private health insurance, while 12 (11.1%) did have private coverage. However, 105 (97.2%) reported other modes of health insurance, with only 3 (2.8%) uninsured. Notably, 97.2% expressed willingness to join a private health insurance scheme. Our study found no statistically significant associations between selected indicators (demographic or socioeconomic fac tors) and current insurance coverage among study participants. These results highlight a low prevalence and understanding of private health insurance among healthcare professionals in a representative urban center in Sub-Saharan Africa (SSA), while acknowledging high willingness to enroll. The lack of any significant determinants suggests other unexamined factors, such as cost, accessibility, or awareness, capable of influencing the adoption and implementation of a universal health program.

19.
arXiv (CS.CV) 2026-06-17

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

20.
arXiv (CS.CV) 2026-06-12

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $appears$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $command-preserving trajectory redirection$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

21.
arXiv (CS.CL) 2026-06-15

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

Despite advances in large-scale Automatic Speech Recognition (ASR), disfluent speech remains challenging, as state-of-the-art systems are often optimized to omit disfluencies, leading to information loss and hallucinations. Prior work has focused on verbatim transcription and the integration of disfluency markers, but adapting models on limited datasets can lead to catastrophic forgetting of general-domain knowledge. We address this gap by leveraging continual learning (CL) with explicit disfluency tokens. We first introduce these tokens into a pretrained ASR model to establish stable token mechanisms, and then continue training on additional datasets with varying disfluency distributions. Through a detailed analysis of model dynamics during training, we identify a trade-off between marker learning and ASR performance, and a consistent cross-attention head mechanism shared across CL methods.

22.
arXiv (CS.AI) 2026-06-16

Constitutional Value Potentials: reading and steering internal priority margins in language models

arXiv:2606.15420v1 Announce Type: cross Abstract: A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations in a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervised not by the prompt but by an independent judge's verdict on which value the model's own response actually preserved. The signed difference of two potentials is a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not. The monitor predicts conflict violations with AUROC up to 0.95, beats a strong hidden-state probe, and generalizes to held-out synthetic conflicts across three Qwen2.5 scales. The signal appears as the answer begins, from the prompt tail and first response token. Read this early, the same signal reveals whether an adversarial priority hack has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial. The same directions also support intervention tests: under selected steering settings, moving along a value direction shifts judged trade-offs in the intended direction. Together, these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior.

23.
arXiv (CS.LG) 2026-06-12

Multimodal Graph Negative Learning

arXiv:2606.12863v1 Announce Type: new Abstract: Multimodal attributed graphs (MAGs) integrate graph topology with heterogeneous modality attributes, such as text and images, thereby enabling richer modeling of complex relational systems. However, such expressiveness also makes learning on MAGs depend on multiple semantic sources, including structural topology, textual and visual attributes, each of which can be regarded as a branch for node representation. Node-level branch semantic imbalance arises when these branches differ across nodes in semantic informativeness and reliability: a branch that provides discriminative semantics for one node may mislead another due to bias in modality quality or structural context. Existing methods often mitigate such heterogeneity through cross-branch agreement or alignment, implicitly treating the dominant prediction as reliable supervision. When the dominant branch is biased, forced imitation may propagate its bias to other branches and suppress original semantics that are useful for classification. We propose GraphMNL, a graph-aware multimodal negative learning framework that addresses this issue by using Negative Learning as cross-branch guidance. Instead of forcing inferior branches to imitate a teacher prediction, the model teaches them which classes a node is unlikely to belong to. GraphMNL builds a branch library, identifies dominant and inferior branches via graph-aware reliability arbitration, gates unstable transfer, and applies target-preserving negative learning over non-target classes. This design decouples target supervision from branch guidance so that supervised losses learn the correct class, while Negative Learning suppresses unlikely alternatives when branch agreement is unreliable. Through the comprehensive experimental evaluation, GraphMNL achieves the best performance on Grocery datasets with 72.47% accuracy and 76.60 F1 score on Reddit M datasets.

24.
arXiv (CS.CL) 2026-06-11

Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models

While Text-to-Speech (TTS) systems enable emotional control via natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the textual semantics. We propose a Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG) method with dynamic scales based on the degree of inconsistency between the text emotion and the explicit speech emotion, replacing the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches applied to CosyVoice2 achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.

25.
arXiv (CS.AI) 2026-06-16

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

arXiv:2606.03489v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories–generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.