Academic Intelligence · Curated Daily

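Explore the Global Frontiers of Academic Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefings across interdisciplinary fields.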

01.
arXiv (CS.AI) 2026-06-16

Orcheo: A Modular Full-Stack Platform for Conversational Search

arXiv:2602.14710v2 · Announce Type: replace-cross

Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 45+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at https://github.com/AI-Colleagues/orcheo.

02.
arXiv (CS.AI) 2026-06-16

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

arXiv:2606.15888v1 · Announce Type: cross

Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

03.
arXiv (CS.LG) 2026-06-19

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

arXiv:2604.13240v2 · Announce Type: replace-cross

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.
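As a rough illustration of the TCAV idea mentioned in this abstract, the sketch below computes a concept direction and a TCAV-style score in plain Python. It is not the authors' code and not Robust TCAV: the original TCAV method uses the normal vector of a linear classifier trained on concept versus random activations, whereas this sketch swaps in a simpler difference-of-means direction. All function names and the toy numbers are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_activation_vector(concept_acts, random_acts):
    """Difference-of-means direction pointing from random to concept activations.

    Assumption: a stand-in for the linear-classifier normal used by TCAV.
    """
    mc, mr = mean_vector(concept_acts), mean_vector(random_acts)
    return [c - r for c, r in zip(mc, mr)]


def tcav_score(gradients, cav):
    """Fraction of inputs whose class-logit gradient has a positive
    directional derivative along the concept direction."""
    positive = sum(
        1 for g in gradients if sum(gi * vi for gi, vi in zip(g, cav)) > 0
    )
    return positive / len(gradients)


# Toy layer activations for concept patches and random reference patches.
concept_acts = [[2.0, 0.0], [3.0, 1.0]]
random_acts = [[0.0, 0.0], [1.0, 1.0]]
cav = concept_activation_vector(concept_acts, random_acts)

# Toy gradients of a species logit with respect to the same layer.
gradients = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [2.0, 2.0]]
score = tcav_score(gradients, cav)
```

A score well above 0.5 would suggest the concept tends to push the prediction up; in practice the score is compared against scores obtained from random directions before drawing any conclusion.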

04.
arXiv (CS.CL) 2026-06-16

When Cognitive Graphs Meet LLMs: BDEI Cognitive Pathways for Panic Emotional Arousal Prediction

Predicting individual panic emotional arousal timing before manifestation is essential for proactive emergency intervention. Existing methods incorporate cognitive elements but none explicitly model the emotional arousal process, making them ill-suited for emotional arousal timing prediction. We argue that grounding prediction in appraisal emotion theory is necessary because it explicitly models this process, but three problems must be solved. (1) Appraisal theory posits that emotion arises from simultaneous evaluation across multiple threat dimensions, yet no prior work fuses these inputs into risk perception. (2) Existing cognitive models lack an Emotion node, decoupling threat appraisal from emotional arousal and forcing emotions to be inferred indirectly from behaviors. (3) Given their generalizable cognitive reasoning, current approaches adopt LLMs as the primary decision-maker, yet overlook the fragility and hallucination-proneness of their outputs. To address these issues, we introduce PanicCognitivePath (PCP), a framework that addresses all three. A Psychological Safety Distance (PSD) model, grounded in psychological distance theory, maps four-domain signals into a unified risk metric as the entry condition for subsequent cognitive reasoning. An explicit Emotion node grounded in appraisal emotion theory is introduced into BDI, forming a Belief-Desire-Emotion-Intention (BDEI) pathway. Agents whose risk metric exceeds the PSD threshold enter this pathway, coupling threat appraisal directly to emotional arousal. The BDEI pathway governs all state transitions while the LLM is confined to parameter estimation for the Belief-to-Desire transition, confining hallucinations to a single step and preventing error propagation. Experiments on Hurricane Sandy show that PCP improves arousal timing accuracy by 10.68% over baselines and reduces peak count error to 7.07%.

05.
arXiv (CS.AI) 2026-06-16

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

arXiv:2606.15251v1 · Announce Type: cross

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

06.
arXiv (CS.CV) 2026-06-16

When the Past Matters: FlashBack Memory for Precipitation Nowcasting

Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long-range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.

07.
arXiv (CS.AI) 2026-06-19

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

arXiv:2606.20323v1 · Announce Type: new

Deep Transfer Learning (DTL) allows for the efficient building of Intelligent Fault Diagnosis Systems (IFDS). However, DTL methods still rely heavily on large amounts of labelled data, which can be challenging to obtain for machine or structural faults. This paper proposes a novel approach to the design of vibration-based IFDS using DTL under conditions of severe data scarcity. A periodic multi-excitation level procedure leveraging intrinsic non-linearities of real-world systems is used to produce images that can be conveniently analysed by pre-trained Convolutional Neural Networks (CNNs) to diagnose faults. A new data visualization method and its augmentation technique are proposed in this paper to tackle the typical lack of data encountered during the design of IFDS. Experimental validation on a railway pantograph structure provides effective support for the proposed method.

08.
arXiv (CS.CL) 2026-06-11

Detecting Sensitive Personal Information in Japanese Pre-Training Corpora for Large Language Models

Sensitive personal information can appear in large-scale pre-training corpora for large language models (LLMs). Detecting and filtering such information is therefore essential to ensure compliance with privacy regulations and prevent unintended information leakage. However, in contrast to English and other languages, research into sensitive personal information has been limited in the Japanese language. In this study, we focus on sensitive personal data defined as special care-required personal information (SCPI) under Japan's Act on the Protection of Personal Information (APPI). We construct an SCPI dataset using LLM-based annotation and train machine learning models to rapidly detect SCPI in text. As a result, our SCPI classifier can effectively identify information related to SCPI. This study is the first to explore SCPI detection in Japanese text corpora, highlighting the challenges of accurate detection.

09.
arXiv (CS.CV) 2026-06-24

SignNet-1M: Large-Scale Multilingual Sign Language Video Dataset with Downstream Benchmarks

Sign language models are typically trained on datasets captured under constrained conditions, with limited viewpoint, background, and signer-identity diversity, leading to poor robustness under real-world distribution shifts. We introduce SignNet-1M, a large-scale augmented dataset spanning ASL, CSL, and German Sign Language (DGS). SignNet-1M synthesizes realistic variations along three axes: (i) novel-view rendering (rotation and zoom) via 3D Gaussian Splatting (3DGS), (ii) scene/identity editing via diffusion models for background replacement and signer substitution while preserving sign motion and linguistic content, and (iii) post-rendering augmentations that emulate capture and compression artifacts (e.g., pose/temporal perturbations and video-level corruptions) to better match in-the-wild recordings. Beyond data release, we provide a unified benchmark suite across downstream tasks (e.g., translation and recognition) and ablations that isolate each augmentation component. Experiments across backbones show that training with SignNet-1M consistently improves generalization under cross-view, cross-background, cross-identity, and post-rendering shifts, while maintaining strong in-distribution performance. The dataset, full augmentation pipeline, and benchmark are available at https://signnet.chatsign.ai/.

10.
arXiv (CS.AI) 2026-06-15

Generalized Discrete Diffusion with Self-Correction

arXiv:2603.02230v2 · Announce Type: replace-cross

Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.
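To make "uniform transitions" in discrete time concrete, here is a minimal sketch of a generic uniform-transition forward process: at each step a token is kept with probability 1 - beta and otherwise resampled uniformly from the vocabulary. This is the textbook uniform kernel, not SCDD's actual formulation, which the abstract does not spell out; the schedule values and function names are hypothetical.

```python
from math import prod


def uniform_step(probs, beta):
    """One forward step on a token distribution: keep the current token with
    probability 1 - beta, otherwise resample uniformly over the vocabulary."""
    k = len(probs)
    return [(1.0 - beta) * p + beta / k for p in probs]


def uniform_marginal(x0, betas, vocab_size):
    """Closed-form q(x_t | x_0) after applying every step in `betas`."""
    alpha_bar = prod(1.0 - b for b in betas)  # probability of never resampling
    return [
        alpha_bar * (1.0 if i == x0 else 0.0) + (1.0 - alpha_bar) / vocab_size
        for i in range(vocab_size)
    ]


betas = [0.1, 0.2, 0.3]        # hypothetical noise schedule
probs = [0.0, 1.0, 0.0, 0.0]   # one-hot x_0 = token 1, vocabulary of 4
for beta in betas:
    probs = uniform_step(probs, beta)

closed = uniform_marginal(1, betas, 4)  # matches the iterated result
```

Because every token can be replaced by any other token rather than by a mask, a denoiser trained against this kernel has to learn to revise tokens that are already filled in, which is the intuition behind using uniform noise for self-correction.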

11.
arXiv (CS.CL) 2026-06-12

The Pragmatic Persona: Discovering LLM Persona through Bridging Inference

Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches rely on surface-level lexical or stylistic cues, treating dialogue as a flat sequence of tokens and failing to capture the deeper discourse-level structures that sustain persona consistency. To address this limitation, we propose a novel analytical framework that interprets LLM dialogue through bridging inference – implicit conceptual relations that connect utterances via shared world knowledge and discourse coherence. By modeling these relations as structured knowledge graphs, our approach captures latent semantic links that govern how LLMs organize meaning across turns, enabling persona discovery at the level of discourse coherence rather than surface realizations. Experimental results across multiple reasoning backbones and target LLMs, ranging from small-scale models to 80B-parameter systems, demonstrate that bridging-inference graphs yield significantly stronger semantic coherence and more stable persona identification than frequency- or style-based baselines. These results show that persona traits are consistently encoded in the structural organization of discourse rather than isolated lexical patterns. This work presents a systematic framework for probing, extracting, and visualizing latent LLM personas through the lens of Cognitive Discourse Theory, bridging computational linguistics, cognitive semantics, and persona reasoning in large language models. Code is available at https://github.com/JiSoo-Yang/Persona_Bridging.git

12.
arXiv (CS.CV) 2026-06-11

Lighting-aware Unified Model for Instance Segmentation

Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing Lighting Convolutional-Attention (LCA), an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. LCA employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize LCA through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

13.
medRxiv (Medicine) 2026-06-22

Characteristics and Outcomes of Gene-Elusive Dilated Cardiomyopathy

Background and Aims: Genetic testing in dilated cardiomyopathy (DCM) guides risk stratification and family screening. Likely pathogenic or pathogenic (LP/P) variants are identified in approximately one-third of patients, leaving many without a genetic diagnosis. Cohort studies suggest that "gene-elusive" patients have a lower risk of adverse events. This study aims to better characterise this group and identify factors associated with adverse outcomes. Methods: Consecutive and unrelated DCM patients undergoing genetic testing and returning no LP/P variants were retrospectively recruited and compared to two control cohorts of DCM patients carrying LP/P variants in LMNA and TTN for a primary composite endpoint of end-stage heart failure (ESHF) or malignant ventricular arrhythmia (MVA). Results: Among patients without prior MVA, the composite endpoint occurred in 36/423 (8.5%) gene-elusive, 14/39 (35.9%) LMNA and 11/100 (11%) TTN cardiomyopathy patients (log-rank p

14.
Nature (Science) 2026-06-24

Small-molecule modulation of β-arrestins

β-Arrestins are multifunctional regulators of G-protein-coupled receptor (GPCR) signalling and orchestrate diverse downstream signalling events and physiological responses across the GPCR superfamily [1–3]. Although GPCR pharmacology has advanced to target orthosteric and allosteric sites, as well as G proteins and GPCR kinases, direct chemical tools to modulate β-arrestin activities have remained conspicuously absent. Here we report the identification of small-molecule inhibitors that selectively target β-arrestins and delineate their mechanism of action through integrated pharmacological, biochemical, biophysical and structural analyses. These inhibitors disrupt β-arrestin engagement with agonist-activated GPCRs, impairing desensitization, internalization and β-arrestin-dependent physiological functions while sparing G protein–receptor coupling. Cryo-electron microscopy, molecular dynamics simulations and structure-guided mutagenesis reveal that one modulator, Cmpd-5, engages a pocket within the central crest of β-arrestin1 formed by the middle, C and lariat loops, a critical receptor-binding interface, stabilizing a distinct conformation that is incompatible with full β-arrestin–receptor engagement. Together, these findings establish a mechanistic framework for β-arrestin modulation, reveal a novel allosteric site for structure-based drug design, and open new avenues for transducer-targeted, pathway-specific GPCR therapeutic agents. Integrated pharmacological, biochemical, biophysical and structural analyses of small-molecule β-arrestin inhibitors show how they block β-arrestin engagement with activated GPCRs, revealing their mechanism of action and uncovering a previously unrecognized allosteric regulatory site.

15.
arXiv (CS.AI) 2026-06-19

DataMagic: Transforming Tabular Data into Data Insight Video

arXiv:2606.20388v1 · Announce Type: cross

Data videos integrate dynamic charts, voice narration, and synchronized animations to communicate data insights as temporal narratives, making them an effective medium for improving data consumption efficiency in the data management lifecycle. However, producing high-quality data videos requires expertise spanning data analysis, narrative design, and video production. Existing approaches fall short: static visualization tools (e.g., BI dashboards) lack narrative logic and animation; authoring tools require users to pre-prepare visualizations rather than working from raw data; pixel-level video generation models cannot guarantee data fidelity or provenance. We demonstrate DataMagic, an end-to-end interactive system that transforms raw tabular data and natural language queries into narrative data-insight videos. To ensure data fidelity, DataMagic introduces the declarative specification DVSpec, which binds visual and animation elements to underlying data fields through data-driven semantic references. To address the combinatorial explosion of the design space, DataMagic adopts a Generate-then-Orchestrate multi-agent architecture that generates candidate scenes in parallel and then optimizes narrative coherence through global orchestration. Leveraging DVSpec's decoupling of logic and rendering, the system further supports three interaction modes and structured provenance-based data Q&A, transforming one-way videos into explorable interactive data interfaces. Evaluation on 109 real-world samples validates the effectiveness of DataMagic. Homepage: https://datamagic-home.github.io/

16.
arXiv (CS.AI) 2026-06-11

KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

arXiv:2605.19031v2 · Announce Type: replace

Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant to noise and computationally efficient. Replacing all MLP components with KANs in HAR models often degrades accuracy and computational efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that combines the strengths of both paradigms: it uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average relative improvement in macro F1 score of 5.33% over the pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, or other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.

17.
arXiv (math.PR) 2026-06-17

Killed resolvents and measure-valued stopping gains for reflected optimal stopping with max-type rewards

arXiv:2606.17517v1 · Announce Type: new

We study an infinite-horizon optimal stopping problem for a normally reflected two-dimensional diffusion in the positive quadrant with nonsmooth max-type reward \(G(x_1,x_2)=x_1\vee \alpha x_2\). The paper develops a conditional measure-theoretic framework for the associated reflected obstacle problem. The main innovation is to show that the stopping gain \(\Gamma=c+rG-\mathcal LG\) is a signed measure, not a function: the kink of \(G\) generates an explicit negative surface measure on \(\Delta=\{x_1=\alpha x_2\}\). We then prove that the correct potential representation uses the resolvent of the reflected diffusion killed on first entry into the stopping set, rather than the unrestricted reflected resolvent. Under explicit monotonicity, regularity, and measure-superharmonicity assumptions, we derive an epigraph representation, a continuation-side boundary-trace condition, and a candidate verification theorem. The framework clarifies hidden regularity and uniqueness assumptions in multidimensional nonsmooth optimal stopping.
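A worked special case may help show where the negative surface measure comes from. The abstract does not state the generator, so the sketch below assumes the simplest one, \(\mathcal L=\tfrac12\Delta+b\cdot\nabla\) with identity diffusion matrix and a bounded drift \(b\); the symbols \(u\), \(b\) and \(\sigma_\Delta\) are introduced here for illustration, and the paper's operator may well be more general.

```latex
% Assumption (not from the abstract): \mathcal{L} = \tfrac12\Delta + b\cdot\nabla.
% Write the max-type reward as a linear part plus the positive part of a linear function:
G = \alpha x_2 + u^{+}, \qquad u(x) = x_1 - \alpha x_2, \qquad \nabla u = (1,\,-\alpha).

% Since u is linear, the distributional Laplacian of u^{+} is concentrated on \{u = 0\} = \Delta:
\Delta G \;=\; |\nabla u|\,\sigma_\Delta \;=\; \sqrt{1+\alpha^{2}}\;\sigma_\Delta ,
% where \sigma_\Delta denotes arc-length measure on the line \Delta.

% The gradient \nabla G exists off \Delta, so the stopping gain splits into a
% function part and a negative singular part:
\Gamma \;=\; c + rG - b\cdot\nabla G \;-\; \tfrac12\sqrt{1+\alpha^{2}}\;\sigma_\Delta .
```

Under this assumption the singular part is negative because \(G\) is convex across the kink, which agrees with the sign claimed in the abstract.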

19.
medRxiv (Medicine) 2026-06-22

Referral pathways, ETAT triage acuity, and inpatient outcomes among children presenting to a national tertiary paediatric emergency unit in Ghana: a prospective cohort study

Emergency referral systems in sub-Saharan Africa are fragmented, and children reaching tertiary facilities through different referral pathways often arrive in advanced clinical states. Prospective data simultaneously characterising referral patterns, triage acuity at presentation, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit are lacking from West Africa. This prospective cohort study enrolled 675 consecutively presenting children aged one month to 12 years at the Paediatric Emergency Unit of Korle Bu Teaching Hospital, Accra, Ghana, from February to December 2019. The primary outcome was all-cause inpatient mortality. Key variables collected included referral status and facility tier, Emergency Triage Assessment and Treatment (ETAT) triage category, ICD-10 diagnostic classification, Oyedeji socioeconomic classification, and time from symptom onset to PEU registration. Crude odds ratios were computed for all candidate predictors. Multivariable logistic regression was conducted using complete case analysis (n = 613). Of 675 children, 63.0% (n = 425) were referred from another health facility; referred children had higher ETAT emergency triage category rates than self-presenting children (32.7% vs 27.6%, p < 0.001). Overall inpatient mortality was 9.9% (67/675). Mortality varied by referral source: 16.7% among secondary/regional hospital referrals, 11.0% among lower-tier facility referrals (district, municipal, CHAG, polyclinic, private, health centre, and maternity home facilities combined, n = 356), 7.6% among self-presenting children, and 7.4% among tertiary referrals. Overall, 30.8% of children were classified as ETAT emergencies on arrival, with a case fatality rate of 21.6%. The three most common diagnostic domains were respiratory conditions (17.2%), blood and haematological disorders (17.0%), and digestive presentations (16.4%).
Inpatient mortality was highest in neoplastic disease (33.3%, n = 30) and circulatory presentations (31.0%, n = 29). In the primary multivariable analysis (n = 613, 51 events; events-per-variable ratio 4.2), no referral tier was independently associated with inpatient mortality after adjustment. Referral from secondary/regional hospitals showed a borderline non-significant association (adjusted odds ratio 3.09, 95% CI 0.96 to 9.90, p = 0.058). School-going children (60-119 months) had higher odds of inpatient death than infants (adjusted odds ratio 5.56, 95% CI 1.16 to 26.53, p = 0.032), as did adolescents (adjusted odds ratio 10.01, 95% CI 2.15 to 46.69, p = 0.003). ETAT emergency category and lower socioeconomic status were not independently significant in this model. A pre-specified sensitivity analysis using the full analytic cohort (n = 674, events-per-variable ratio 6.7) with collapsed referral categories did not confirm any referral tier association; ETAT emergency category and lower SES were independently associated in the sensitivity model. All multivariable estimates should be regarded as exploratory. This prospective cohort provides simultaneous characterisation of referral patterns, ETAT triage acuity, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit in West Africa. The referral-mortality gradient and high ETAT emergency category proportion document the severity of illness arriving through different referral pathways at this facility. The association between secondary/regional hospital referral and inpatient mortality is hypothesis-generating and requires replication in an adequately powered multicentre study before any service-level conclusions can be drawn.
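The headline figures in this abstract can be reproduced with simple arithmetic. The short check below recomputes the two percentages from the stated counts and backs out the approximate number of model parameters implied by the reported events-per-variable (EPV) ratio. The parameter count is an inference from the reported numbers, not a figure the abstract states, and the helper name `pct` is hypothetical.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100.0 * numerator / denominator, 1)


mortality = pct(67, 675)    # overall inpatient mortality
referred = pct(425, 675)    # children referred from another health facility

# EPV = outcome events / model parameters, so a reported EPV of 4.2 with
# 51 events implies roughly 12 parameters (an inference, not a stated figure).
implied_parameters = round(51 / 4.2)
```

An EPV this far below the commonly cited rule of thumb of about 10 is consistent with the authors' own caution that the multivariable estimates are exploratory.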

20.
PLOS Computational Biology 2026-06-18

scMagnifier: Resolving fine-grained cell subtypes via GRN-informed perturbations and consensus clustering

Authors: Zhenhui He, Dong Kangning

Resolving fine-grained cell subtypes in single-cell RNA sequencing (scRNA-seq) data remains challenging, as their subtle transcriptional differences are often obscured by technical noise and data sparsity. Here, we present scMagnifier, a consensus clustering framework that leverages gene regulatory network (GRN)-informed in silico perturbations to amplify subtle transcriptional differences and uncover latent cell subpopulations. scMagnifier perturbs candidate transcription factors (TFs), propagates perturbation effects through cluster-specific GRNs to simulate post-perturbation expression profiles, and integrates clustering results across multiple perturbations into stable subtype assignments. Additionally, scMagnifier introduces regulatory perturbation consensus UMAP (rpcUMAP), a perturbation-aware visualization that provides clearer separation between cell subtypes and guides the selection of the optimal number of clusters. In both single-batch and multi-batch benchmarks, scMagnifier consistently improves the resolution and accuracy of fine-grained cell type identification. Notably, when integrated with spatial clustering methods such as STAGATE, scMagnifier is compatible with spatial transcriptomics workflows and effectively reveals tumor cell subtypes and their spatial organization in ovarian cancer.

21.
arXiv (CS.CV) 2026-06-15

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://github.com/LpyNow/MMA-82.

22.
arXiv (CS.CL) 2026-06-16

Not All Skills Help: Measuring and Repairing Agent Knowledge

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.
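
The randomized-masking idea in the abstract above can be illustrated with a small simulation. This is a minimal sketch and not the ASSAY implementation: the skill names, task types, `TRUE_EFFECT` table, `run_task` simulator, and `select_skills` helper are all hypothetical, and grouping by task type stands in for whatever per-task predictor the authors use. Each trial keeps every skill with probability 0.5, and a skill's effect is estimated as the success rate with the skill minus the success rate without it.

```python
import random
from collections import defaultdict

SKILLS = ["skill_a", "skill_b"]
BASE_RATE = 0.5  # hypothetical success probability with no skills active

# Hypothetical ground truth: skill_a helps on type_x and hurts on type_y,
# so its effect cancels when the two task types are pooled.
TRUE_EFFECT = {
    "type_x": {"skill_a": +0.3, "skill_b": +0.15},
    "type_y": {"skill_a": -0.3, "skill_b": +0.15},
}


def run_task(task_type, active_skills, rng):
    """Toy stand-in for running an agent: returns True on simulated success."""
    p = BASE_RATE + sum(TRUE_EFFECT[task_type][s] for s in active_skills)
    return rng.random() < p


def estimate_effects(trials_per_type=2000, seed=0):
    """Estimate per-skill effects from randomly masked skill libraries."""
    rng = random.Random(seed)
    outcomes = defaultdict(list)  # (group, skill, included) -> list of 0/1
    for task_type in TRUE_EFFECT:
        for _ in range(trials_per_type):
            # Randomized masking: keep each skill independently with p = 0.5.
            mask = {s: rng.random() < 0.5 for s in SKILLS}
            active = [s for s in SKILLS if mask[s]]
            success = int(run_task(task_type, active, rng))
            for s in SKILLS:
                outcomes[(task_type, s, mask[s])].append(success)
                outcomes[("all", s, mask[s])].append(success)

    def mean(xs):
        return sum(xs) / len(xs)

    effects = {}
    for group in list(TRUE_EFFECT) + ["all"]:
        for s in SKILLS:
            with_skill = mean(outcomes[(group, s, True)])
            without_skill = mean(outcomes[(group, s, False)])
            effects[(group, s)] = with_skill - without_skill
    return effects


def select_skills(task_type, effects):
    """Per-task masking: drop skills with a negative estimated effect."""
    return [s for s in SKILLS if effects[(task_type, s)] >= 0]


effects = estimate_effects()
for key in sorted(effects):
    print(key, round(effects[key], 3))
print("type_y keeps:", select_skills("type_y", effects))
```

In this toy setup the pooled ("all") estimate for `skill_a` sits near zero even though its per-type effects are large and opposite in sign, which is the cancellation the abstract says global curation cannot see.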

23.
arXiv (CS.CL) 2026-06-17

Perceptual compensation for tonal context in self-supervised speech models

This study examines the extent to which the wav2vec 2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

24.
arXiv (CS.AI) 2026-06-24

Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement

arXiv:2606.24745v1. Generative models, particularly diffusion and score-based approaches, have recently achieved strong performance in speech enhancement, but their iterative sampling process limits real-time deployment. Flow Matching offers an efficient alternative by transporting noisy speech toward clean speech through an ordinary differential equation with few function evaluations. In this work, we propose a skip-free encoder-decoder backbone for flow-matching speech enhancement, guided by Latent Representation Alignment (LRA). Instead of relying on U-Net skip connections, which may transfer noise-correlated low-level features to the decoder, the proposed model aligns its bottleneck and decoder representations with clean latent features extracted from a frozen Descript Audio Codec encoder-decoder without quantization. This codec-aligned supervision promotes compact clean-speech representations while preserving efficient few-step inference. Experiments on WSJ0-CHiME3 and VoiceBank-DEMAND show improved PESQ and perceptual quality, especially on VoiceBank-DEMAND, using only five function evaluations.
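
To make "five function evaluations" concrete, here is a minimal sketch of few-step ODE sampling with a fixed-step Euler solver. It is not the paper's model: `euler_sample` and `make_toy_velocity` are hypothetical names, the three-element vectors stand in for speech features, and the toy velocity field is an oracle that already knows the clean target, whereas a real system would call a trained network at each step. The choice of Euler is also an assumption, since the abstract does not name the solver.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x_start: Vector,
    num_steps: int = 5,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps.

    Each step calls `velocity` once, so `num_steps` equals the number of
    function evaluations.
    """
    x = list(x_start)
    h = 1.0 / num_steps
    for i in range(num_steps):
        t = i * h  # largest t used is 1 - h, so t never reaches 1
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Oracle velocity for a straight-line path that ends at `target`.

    Hypothetical stand-in for a learned network: along the linear path
    x_t = (1 - t) * x_0 + t * target, the velocity is (target - x_t) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]

    return velocity


clean = [0.5, -0.25, 1.0]  # toy "clean speech" features
noisy = [0.9, 0.1, 0.3]    # toy "noisy speech" starting point
enhanced = euler_sample(make_toy_velocity(clean), noisy, num_steps=5)
print(enhanced)
```

Because the toy path is a straight line, the velocity is constant along the trajectory and Euler integration reaches the target up to floating-point error; a learned velocity field is generally only approximately straight, so the step count trades quality against latency.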

25.
arXiv (CS.CL) 2026-06-17

A Two-Phase Stability Study of LLM Judges and Bar Council Examiners on Thai Bar-Exam Free-Form Essays

Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6-8; A minority at the lower band, score 1-2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells. The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.
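
The point that a high panel $\alpha$ can reflect convergence on one reading can be illustrated with a small agreement calculation. This sketch assumes that $\alpha$ denotes Krippendorff's alpha with an interval (squared-difference) metric, which the abstract does not state; the function name and the toy scores are hypothetical and are not the study's data.

```python
from itertools import permutations
from typing import Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha with the interval metric delta(a, b) = (a - b)^2.

    `units` holds one list of scores per graded item; items with fewer than
    two scores are not pairable and are skipped.
    """
    pairable = [list(u) for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: ordered pairs within each item, weighted by
    # 1 / (m_u - 1), averaged over all n pairable scores.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs across all scores, ignoring items.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_o / d_e


# Hypothetical scores for four answers from three raters each.
convergent_panel = [[7, 7, 8], [2, 2, 2], [5, 5, 6], [9, 9, 9]]
split_panel = [[7, 7, 1], [2, 2, 2], [5, 6, 1], [9, 9, 8]]

alpha_convergent = krippendorff_alpha_interval(convergent_panel)
alpha_split = krippendorff_alpha_interval(split_panel)
print(round(alpha_convergent, 3), round(alpha_split, 3))
```

In the toy data, the panel whose raters all land on the same reading scores a higher alpha than the panel in which one rater consistently takes a lower reading, even though neither number says which reading is right.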