Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-12

ARROW: Augmented Replay for RObust World models

arXiv:2603.11395v3 Announce Type: replace-cross Abstract: Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: Tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same-size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.

02.
arXiv (CS.AI) 2026-06-19

A Comparative Study of Pretrained Transformer Models for Quranic ASR: Speech Representations, Label Formats, and Dataset Composition

arXiv:2606.19747v1 Announce Type: new Abstract: Quran Automatic Speech Recognition (ASR) aims to convert Quranic recitation into text, enabling applications such as aided memorisation tools and Quranic search engines. However, existing ASR models often exhibit high Word Error Rates (WER) on user-recited verses and lack full coverage of the Quranic corpus. This paper presents a systematic empirical study of domain-specific fine-tuning of pretrained Transformer-based models for Quranic ASR, using advanced speech feature extraction methods: Wav2Vec2.0, HuBERT, and XLS-R. These models apply self-supervised learning by masking portions of input audio and using Transformer architectures to learn context-aware speech features. The pretrained models are fine-tuned on a filtered Quranic dataset exceeding 870 hours of professional and user recitations. Through comprehensive ablation studies across feature extractors, output label formats, training strategies, and clip durations, we identify the key factors that affect transcription accuracy in this domain. Our best-performing configuration achieves a WER of 0.08 on the EveryAyah subset and 0.11 on the combined EveryAyah+Tarteel setting, representing roughly a five-percentage-point gain over the Citrinet baseline (WER = 0.163) while reducing combined-model training time from 140 hours to 40 hours. Arabic text without diacritics yields the best fine-tuning results, and Wav2Vec2-XLSR-53 provides the strongest overall representation. Future work includes improving dataset quality and developing phoneme-aware models to extract deeper speech feature representations for Tajweed-sensitive applications.

03.
arXiv (CS.CV) 2026-06-16

Region-Adaptive Sampling for Diffusion Transformers

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable qualities under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.

04.
arXiv (CS.LG) 2026-06-19

Multimodal Concept Bottleneck Models

arXiv:2606.19882v1 Announce Type: cross Abstract: Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

05.
arXiv (CS.AI) 2026-06-11

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

arXiv:2606.11675v1 Announce Type: new Abstract: Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

06.
arXiv (CS.AI) 2026-06-19

Computational Identifiability

arXiv:2606.19361v1 Announce Type: cross Abstract: Identification conditions describe the computability of a target query or parameter of interest as a function of the type and amount of information available. In causal identification, this information is often expressed in the form of a causal graph, and data are observed or collected for some subset of variables in the graph. Target queries may be for a single effect alone or for a class of effects in a given model. The derivation of an identification algorithm then defines mathematically the process by which the desired causal effect(s) can be uniquely determined, theoretically, in expectation. Identifiability in expectation, or 'theoretical identifiability,' generally assumes asymptotic properties, infinite data, or other mathematically idealized conditions. In this paper, we explore a fundamental distinction between this theoretical, idealized notion of identifiability and a proposed alternative that is computation-bound. The framework we propose - 'computational identifiability' - is to instead define a finite computational search procedure for an empirical estimator. If this process finds an estimator empirically, within a desired error tolerance, then identifiability is satisfied, conditional on the specified assumptions of the search (i.e., a prior distribution over the parameters) and conditional on the search procedure itself. Through several experiments, we demonstrate how this framework allows us to answer fine-grained, practical identification questions, such as identification with small finite samples, with ambiguous graphical criteria, with mixed observational-interventional data, and across counterfactual data and estimands. Code is available at https://github.com/lbynum/metadentify.

07.
arXiv (CS.AI) 2026-06-15

From Prompts to Responses: Dual-Sided Data Leakage and Defense in Split Large Language Models

arXiv:2606.14210v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed in privacy-sensitive domains, where users must balance the risk of data exposure through external APIs against the high computational cost of local deployment. Split learning has therefore emerged as a promising paradigm for LLM fine-tuning and inference under limited local resources. However, it introduces new privacy risks. Prior work primarily studies leakage of private input prompts, typically via inversion attacks on intermediate representations, while the potential for sensitive information leakage through generative response outputs remains largely unexplored. In this work, we unveil novel vulnerabilities of Split-LLM by presenting Patched Model Inversion with Dual-Sided Initialization (PIDI), a two-stage attack that simultaneously targets both private input prompts and output responses in Split-LLM settings. It combines dual-sided initialization with a patched inversion strategy to tackle long sequences, substantially outperforming prior inversion methods. To counter threats from both sides, we further propose the Adapter-based DualGuard with Mutual Information Defense (ADMI), which integrates an adapter-based local warmup strategy and mutual information regularization to provide a strong empirical privacy protection with minimal impact on task performance. Extensive experiments across diverse tasks and models demonstrate that ADMI effectively defends against PIDI and other state-of-the-art inversion attacks. Our code is publicly available at https://github.com/FLAIR-THU/VFLAIR-LLM.

08.
arXiv (CS.AI) 2026-06-17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Authors:

arXiv:2606.17929v1 Announce Type: new Abstract: Computer-using agents drive real software through the screen – clicking and typing – but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program-states that check the screen, transitions that act-and on later runs replays it directly instead of invoking the agent 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task-catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.

09.
arXiv (quant-ph) 2026-06-17

Effects of Josephson Junction Non-idealities on Adiabatic Quantum Flux Parametron Circuits

arXiv:2606.17338v1 Announce Type: new Abstract: Adiabatic quantum flux parametron (AQFP) gate is a promising approach to scale up the cryogenic microwave electronics for superconducting qubit multiplexed control. However, the performance of these circuits depends on the quality of the Josephson junctions which are ideally superconductor-insulator-superconductor (SIS) type following the ideal sinusoidal relation between current and quantum phase. We demonstrate how the non-sinusoidal current-phase relation in Superconductor-Normal metal-Superconductor (SNS) and weak link (WL) junctions affects the speed, delay, and margin of the AQFP gates. The JJ models are defined in the Keysight ADS simulator using symbolically defined device (SDD) method.

10.
arXiv (CS.AI) 2026-06-24

OpenThoughts-Agent: Data Recipes for Agentic Models

arXiv:2606.24855v1 Announce Type: new Abstract: Agentic language models dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.

11.
arXiv (CS.LG) 2026-06-12

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

arXiv:2606.12731v1 Announce Type: new Abstract: As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally-irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and varying their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint - a failure mode we characterize as moral deliberative sycophancy.

12.
arXiv (CS.CV) 2026-06-24

Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image

Generating explicit 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Unlike satellite-to-street-view synthesis, the task requires a reusable textured mesh with plausible geometry and controllable appearance rather than a 3D proxy optimized only for rendering a small set of images or videos. The ICCV Sat2City framework made a first step by conditioning cascaded sparse-voxel latent diffusion on satellite-derived height maps, but its appearance was random, its training data were synthetic, and its task-specific VAE did not scale well to noisy real-world reconstructions. We present Sat2City v2, a journal extension that adapts a pretrained native structured-latent 3D foundation model to weakly aligned satellite images and textured meshes. We build a real-world dataset with 16,241 satellite-mesh pairs across 24 regions in 9 cities. Instead of learning a 3D representation from noisy city meshes, Sat2City v2 encodes each mesh into a pretrained native 3D latent space, fine-tunes a satellite-conditioned geometry flow, and uses the decoded shape to anchor satellite-conditioned texturing. This retains Sat2City's geometry-to-appearance cascade while enabling appearance-controllable generation from the satellite input. Experiments on metric-scale DSM reconstruction and generative city-asset benchmarks for geometry and appearance show that Sat2City v2 achieves the best overall performance among evaluated baselines. Overall, Sat2City v2 advances satellite-to-city generation from rendering-oriented 3D proxies to explicit textured mesh assets, supported by, to the best of our knowledge, the first documented satellite-mesh paired dataset collected from matched geographic crops for this asset-level task. Project page: https://ai4city-hkust.github.io/Sat2City-v2/

13.
arXiv (quant-ph) 2026-06-24

Logical qubits with erasure conversion using metastable neutral atoms

arXiv:2506.13724v2 Announce Type: replace Abstract: Implementing large-scale quantum algorithms with practical advantage will require fault-tolerance achieved through quantum error correction, but the associated overhead is prohibitive. This overhead can be reduced by engineering physical qubits with fewer errors, and by shaping the residual errors to be more easily correctable. In this work, we demonstrate quantum error correcting codes and logical qubit circuits in a metastable ytterbium-171 nuclear spin qubit with a noise bias towards erasure errors. These errors can be located separately from any syndrome information diagnosing the error, and we demonstrate adaptive circuit execution based on erasure information. We show that dephasing errors on the qubit during coherent transport can be strongly suppressed, and implement entangling gates that maintain a high fidelity in the presence of gate beam inhomogeneity or pointing errors. Furthermore, we demonstrate logical qubit encoding in the [[4, 2, 2]] code, with error correction during decoding based on mid-circuit erasure measurements despite the fact that the code is too small to correct any Pauli errors. Finally, we demonstrate logical qubit teleportation between multiple code blocks with conditionally selected ancillas based on mid-circuit erasure checks, a key part of leakage-robust error correction schemes using neutral atoms.

14.
arXiv (CS.CL) 2026-06-16

PathRouter: Aligning Rewards with Retrieval Quality in Agentic Graph Retrieval-Augmented Generation

Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.

15.
arXiv (CS.CV) 2026-06-16

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

16.
arXiv (math.PR) 2026-06-11

Mean-field limits for stochastic particle systems on dense graphs

arXiv:2606.11369v1 Announce Type: new Abstract: We study stochastic interacting particle systems whose interaction structure is described by dense weighted directed graphs converging to a graphon. In the thermodynamic limit, we prove a law of large numbers for the empirical measure process and derive a deterministic nonlinear master equation describing the macroscopic evolution. The limiting equation retains the heterogeneous interaction structure of the microscopic system through the limiting graphon, allowing for spatially non-homogeneous behaviors such as localized or community-type interactions.

17.
PLOS Computational Biology 2026-06-18

Mechanisms underlying spontaneous and evoked calcium responses in oligodendrocyte precursor cells: A modeling investigation

Authors:

by Martin Lardy, Leqi Wang, Claire Guerrier, Veronica T. Cheli, Pablo M. Paez, Anmar Khadra Calcium (Ca2+) signaling has emerged as a central regulator of activity-dependent myelination in oligodendrocytes. These Ca2+ signals encompass both the stimulus-independent spontaneous Ca2+ local transients (SCaLTs) generated intrinsically in a voltage-independent manner or facilitated by the membrane voltage, as well as evoked responses triggered by ATP and glutamate release. To investigate the regulatory mechanisms underlying this combined spiking activity, we developed a stochastic spatiotemporal flux-balance model of Ca2+ transients in oligodendrocyte precursor cells (OPCs). The model incorporates all the relevant fluxes in these cells and integrates membrane voltage dynamics with a Ca2+-induced Ca2+-release (CICR) mechanism using parameters fitted to Ca2+ fluorescence recordings. The model reproduced the intrinsic and voltage-facilitated SCaLTs in OPCs in the absence of purinergic and glutamatergic receptors, and captured the three distinct patterns of evoked Ca2+ responses induced by prolonged ATP and glutamate stimulations identified using machine classifier. The model highlighted the role of ATP and glutamate in generating these clusters, and showed that the fast dynamics of CICR is key to producing these evoked responses. Further analysis of the model also revealed that voltage-gated L- and T-type Ca2+ channels slightly increase the frequency of SCaLTs, while stimulation with ATP and glutamate, using randomly distributed pulses mimicking in vivo conditions, leads to an increase in both the amplitudes of Ca2+ spikes (i.e., the combination of SCaLTs and evoked responses) and the prevalence of wide spikes, especially upon glutamate stimulation. Bifurcation analysis of the deterministic version of the model, in the absence of diffusion, demonstrated that ATP and glutamate stimulation can shift the system into an oscillatory regime, thereby increasing the deterministic component of SCaLT dynamics. This study thus offers a comprehensive representation of OPC Ca2+ transients linking recorded in vitro behaviors to in vivo dynamics.

18.
arXiv (CS.AI) 2026-06-16

RIDGECUT: Learning Graph Partitioning with Rings and Wedges

arXiv:2505.13986v4 Announce Type: replace-cross Abstract: Reinforcement learning (RL) has shown promise for combinatorial optimization problems on graphs by learning heuristics that generalize across instances. However, effectively incorporating domain knowledge into RL frameworks for graph partitioning remains challenging, as existing approaches typically rely on unconstrained node-level actions that lead to large action spaces and inefficient exploration. In this paper, we propose RidgeCut, an RL framework that constrains the action space to enforce structure-aware partitioning in the Normalized Cut problem. Using transportation networks as a motivating example, we introduce a novel concept that leverages domain knowledge about urban road topology – where natural partitions often take the form of concentric rings and radial wedges. By transforming the graph into linear or circular representations, our method enables the use of transformer-based policies and efficient learning via Proximal Policy Optimization. The resulting partitions from RidgeCut are not only aligned with expected spatial layouts but also achieve lower normalized cuts compared to existing methods. Experimental results on synthetic and real-world traffic graphs demonstrate that RidgeCut consistently outperforms existing methods while exhibiting strong inductive generalization across graph sizes. Although motivated by road networks, RidgeCut provides a general mechanism for embedding structural priors into RL frameworks for graph partitioning.

19.
arXiv (CS.AI) 2026-06-17

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

arXiv:2602.14211v3 Announce Type: replace-cross Abstract: Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as trusted guidance and steer downstream tool use. Existing skill-based prompt-injection attacks are often manual and brittle, because explicit malicious instructions are rejected or ignored when they are not aligned with the original workflow. We propose SkillJect, the first automated framework for generating poisoned skills against skill-enabled agent systems. SkillJect uses two coordinated channels. In the artifact channel, it hides the payload inside an auxiliary helper script. In the instruction channel, it rewrites SKILL.md with a front-loaded inducement strategy, placing injected content at the beginning and framing the helper script as a mandatory prerequisite or initialization step. The rewritten instruction explicitly references the helper-script path and provides an executable example command, making the helper appear to be a legitimate setup step before normal skill operations. SkillJect further adopts a closed-loop multi-agent process to improve attack effectiveness. An Attack Agent generates poisoned skills, a Victim Agent executes downstream tasks with the poisoned skill, and an Evaluate Agent inspects execution traces to determine whether the hidden payload was executed. The Attack Agent then uses this feedback to diagnose failure causes and rewrite SKILL.md, while keeping the payload fixed. Experiments across skill-enabled platforms, backend LLMs, and attack categories show that SkillJect substantially outperforms naive direct injection and prior manual skill-injection attacks, highlighting poisoned skills as a persistent threat in reusable skill ecosystems.

20.
arXiv (CS.LG) 2026-06-24

Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition

arXiv:2601.04181v2 Announce Type: replace Abstract: Reliable long-term decoding of gestures from surface electromyography (EMG) is hindered by signal drift caused by electrode displacement, muscle fatigue, and/or posture changes. Although modern models achieve high intra-session accuracy, their performance often degrades substantially across recording sessions. Existing approaches to mitigate this problem typically rely on large training datasets or computationally intensive pipelines that are unsuitable for energy-efficient wearable devices. We propose a lightweight test-time adaptation framework for EMG decoding. The framework includes three complementary adaptation strategies: (i) causal adaptive batch normalization for online statistical alignment, (ii) Gaussian Mixture Model alignment with experience replay to mitigate forgetting, and (iii) meta-learning for rapid few-shot calibration. We evaluate these methods on the multi-session NinaPro DB6 dataset. All approaches substantially improve inter-session robustness relative to a non-adaptive baseline while maintaining low computational overhead. Replay-regularized statistical alignment provides the most stable adaptation under limited data, while meta-learning achieves the highest accuracy when sparse calibration labels are available. Overall, our self-supervised test-time adaptation methods reach up to 82% inter-session accuracy, significantly improving upon prior approaches while maintaining resource-efficient operation. These results demonstrate that lightweight test-time adaptation can enable robust, long-term EMG decoding for wearable or prosthetic applications.

21.
arXiv (CS.CL) 2026-06-12

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

22.
arXiv (CS.AI) 2026-06-17

SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions

arXiv:2606.17266v1 Announce Type: new Abstract: Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for the same worker hours needed for production. Existing operations benchmarks usually treat labor as exogenous, while workforce-planning models with skills and learning are rarely released as reusable testbeds. We introduce SkillChain-Gym, a benchmark specification for reskilling-aware production-inventory control: a single-site environment with stylized worker skill-state dynamics, hard threshold certification, forgetting, and capacity-consuming training actions constrained by the same per-worker time budget as production. The benchmark includes seed-controlled disruption scenarios, three feasibility modes with projection diagnostics, deterministic replay, and metrics covering operations, resilience, capability growth, and training-access distribution. We evaluate production-only, reactive adaptive, water-filling adaptive, and static-insurance policies with budget variants over 60-shift horizons with paired statistical tests. The results are regime-dependent rather than a ranking. Training-capable policies dominate the production-only baseline, and maintenance training is necessary under forgetting even without disruptions. Among training-capable classes, adaptive training helps when bottlenecks are visible in the forecast, while a lean static cross-training plan, a deliberately favorable comparator whose structure encodes relevant skill contingencies, acts as strong insurance under surprise shocks and absenteeism. Capacity slack and the forgetting rate govern the boundary between these regimes. No policy class dominates across regimes, motivating forecast-driven controllers that decide when to buy skill insurance and when to react.

23.
arXiv (CS.LG) 2026-06-18

FinP: Fairness-in-Privacy in Federated Learning by Addressing Disparities in Privacy Risk

arXiv:2502.17748v4 Announce Type: replace Abstract: Federated Learning (FL) inherently mitigates mass data centralization risks; however, its privacy protections are not equally distributed - leaving vulnerable individuals disproportionately exposed to sophisticated privacy attacks. Crucially, statistical heterogeneity in human-centric FL environments often results in an inequitable distribution of privacy risks, particularly affecting those whose sensitive attributes or behaviors make them outliers. To address this critical gap, we introduce FinP, a novel framework designed to formalize and enforce fairness-in-privacy by mitigating disproportionate client vulnerability to Source Inference Attacks (SIA). FinP operationalizes a two-pronged defense strategy that tackles both the symptoms and root causes of privacy disparity, ensuring that no group of clients bears an excessive privacy burden. It combines a server-side adaptive aggregation mechanism, which dynamically weights client contributions based on their estimated privacy risk, with a client-side regularization technique to curb localized overfitting that drives unique data memorization. Extensive empirical evaluations on FEMNIST, Human Activity Recognition (HAR), and CIFAR-10 datasets demonstrate that FinP effectively aligns privacy fairness with primary task utility. Notably, FinP successfully mitigates SIA risks and reduces disparities in privacy exposure, establishing that strong fairness-in-privacy guarantees need not compromise model utility. Ultimately, FinP establishes equitable privacy protections by reducing vulnerability disparities by up to 57.14%, while preserving global model utility within a marginal +/- 1.75% of standard federated baselines.

24.
bioRxiv (Bioinfo) 2026-06-11

A Deep Hypergraph Learning Model for Predicting Antimicrobial Combination Effects Across Bacterial Targets

Antimicrobial resistance (AMR) creates an urgent need for efficient strategies to identify effective antibacterial combinations. Combination therapy, including antimicrobial peptides (AMPs) paired with conventional antibiotics, is a promising approach, but exhaustive experimental screening across drug pairs and bacterial targets is impractical. This study introduces a hybrid GCN-based hypergraph neural network (HGNN) for predicting antimicrobial-agent combination outcomes against bacterial targets. Each antimicrobial-agent-antimicrobial-agent-bacterium triplet is represented as a ternary hyperedge, enabling the model to learn context-dependent interaction patterns. The framework integrates SMILES-derived molecular graph embeddings for antimicrobial agents, including conventional antibiotics and AMPs, with taxonomy-derived bacterial representations. The prediction task was formulated as a three-class classification problem: synergy, antagonism, and non-interaction. The non-interaction class included experimentally verified indifferent records and synthetic presumed non-interaction triplets generated by negative sampling. Model development used drug-pair-grouped splitting, five-fold grouped cross-validation within the training/validation partition, and final evaluation on a held-out test set. On the held-out three-class test set, the selected GCN-based HGNN achieved an accuracy of 0.83, weighted F1-score of 0.84, macro F1-score of 0.80, and ROC-AUC of 0.95. Per-class evaluation showed accuracies of 0.80 for synergy, 0.92 for antagonism, and 0.85 for non-interaction. Pair-type analysis showed strong performance across AMP-AMP, AMP-conventional antibiotic, and conventional antibiotic-conventional antibiotic combinations. These findings suggest that hypergraph-based representation learning can support computational prioritization of antimicrobial combinations for experimental follow-up. Further studies will be needed to improve model interpretability and to perform prospective validation of predicted synergistic combinations.

25.
arXiv (CS.AI) 2026-06-24

Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring

arXiv:2606.24178v1 Announce Type: cross Abstract: Pretrained vision models often misclassify inputs that are rotated, scaled, or sheared, even though these affine transformations leave the object class unchanged. Robustness is usually restored either by building equivariance into the architecture or by retraining with augmentation, both of which require changing or retraining the model. Test-time canonicalization instead leaves the classifier untouched. It undoes the transformation of each input, mapping it to a canonical form near the training distribution before classification. Existing canonicalizers, however, rely on a narrow set of logit-based energy scores and bespoke search procedures, leaving the design space of scoring functions and optimizers unexplored. We reframe canonicalization as out-of-distribution (OOD) detection, which lets any OOD score serve as the energy minimized over transformations. Across benchmarks ranging from handwritten characters and sketches to natural images and 3D point clouds, we systematically evaluate around twenty OOD scores and nine search algorithms, finding that distance-based scores paired with random search and local refinement perform best overall. Because canonicalizing an already-aligned input can hurt accuracy, we add a gated mechanism that transforms an input only when its OOD score indicates this is needed, preserving most in-distribution accuracy while retaining the robustness gains on transformed inputs. Code is available at github.com/johschm/its.