Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-18

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.

02.
arXiv (CS.CV) 2026-06-12

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six different signal separation methods–FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition–across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). In all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.

03.
arXiv (CS.CL) 2026-06-12

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary – a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs up to 55\% on average with negligible impact on model performance.

04.
arXiv (CS.AI) 2026-06-12

EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence

arXiv:2606.12690v1 Announce Type: cross Abstract: In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and fully frozen Cosmos3 backbone network. Evaluated entirely under a zero-shot task protocol, EWAM is centrally focused on reducing the amount of additional deployment data required to adapt to new task layouts. Notably, no extra task-specific demonstration sets were introduced in any of the evaluations, and no fine-tuning was performed on the backbone network. Its performance gains stem entirely from an inference-time co-reasoning mechanism composed of four inserted lightweight neural layers: the Neural Experience Memory Layer located in the intermediate layers of the Diffusion Transformer (DiT) provides task-relevant execution context; the Neural Anomaly Detection Layer after the state prediction head monitors the divergence between predicted and actual states in real time; the Neural Policy Routing Layer dynamically selects direct execution, conservative replanning, or rollback recovery based on the anomaly severity; and the Neural Action Correction Layer refines the generated action chunks using execution diagnostics. Unlike naive feature fusion, the memory, anomaly detection, and correction modules are deeply integrated into the Cosmos3 forward path in a differentiable manner, with only the final routing decision being a discrete supervised one.

05.
PLOS Computational Biology 2026-06-15

WormSORT: A detection-based multiple object tracking model for individual silkworms in breeding environments

作者:

by Hongkang Shi, Linbo Li, Shiping Zhu, Haibo He, Minghui Zhu, Jianfei Zhang Variety breeding has long been a cornerstone of high-quality agriculture, and recent advances in artificial intelligence have opened new avenues for accelerating biological breeding. In this study, we applied multiple object tracking (MOT) technology to silkworm breeding to achieve efficient, non-invasive, and dynamic individual monitoring. Unlike pedestrian or vehicle tracking, silkworms pose unique challenges for MOT due to their small size, dense distribution, and high inter-individual similarity, which complicate accurate tracking and behavioral analysis. To address these issues, we propose WormSORT, an enhanced tracking method based on a tracking-by-detection framework with an optimized data association strategy. A pre-trained detection model identifies silkworms in each frame, and deep feature vectors are extracted using a re-identification network. Identity association is first performed using Intersection over Union (IoU) matching, followed by deep feature similarity for unmatched cases, improving both tracking accuracy and reliability. To further enhance tracking stability, we introduce a candidate input padding mechanism, including IoU padding and feature padding, ensuring that high-confidence unmatched trajectories and detections remain involved in the matching process. To validate the proposed tracking strategy, we constructed two multiple silkworm tracking (MST) datasets: MST-50, containing approximately 50 individuals over 1000 frames, and MST-100, containing approximately 100 individuals over 1200 frames. Experimental results demonstrate that WormSORT outperforms existing methods, including DeepSORT, StrongSORT, OCSORT, ByteTrack, and BotSORT, achieving superior tracking performance. This study provides a valuable reference for silkworm tracking and behavioral analysis, contributing to the advancement of high-quality silkworm rearing and management.

06.
medRxiv (Medicine) 2026-06-23

Multivariate Echocardiographic Phenotyping of Hypertensive Heart Failure Using Unsupervised Machine Learning: A Pilot Study

Background Heart failure in hypertensive patients is heterogeneous and poorly captured by traditional left ventricular ejection fraction (LVEF) based classification. Multivariate echocardiographic data combined with unsupervised machine learning may provide a more precise phenotypic characterization. This pilot study evaluated the feasibility of unsupervised clustering of routine transthoracic echocardiographic data to identify phenotypic subgroups of hypertensive heart failure. Methods This retrospective pilot study analyzed transthoracic echocardiography reports from hypertensive patients with clinical heart failure. After data cleaning and exclusion of incomplete records, 102 patients with 11 echocardiographic variables were included. Variables describing left ventricular geometry, systolic function, and diastolic performance were standardized and subjected to K-means clustering. Optimal cluster number was determined using the elbow method and silhouette analysis. Cluster characteristics were assessed using descriptive statistics and Kruskal Wallis testing. Concordance with LVEF based heart failure categories was evaluated. Results Three distinct echocardiographic phenotypes were identified. Cluster 0 (n = 50) demonstrated preserved LVEF with concentric remodeling, consistent with heart failure with preserved ejection fraction (HFpEF) phenotype. Cluster 1 (n = 37) showed marked ventricular dilation and reduced systolic function, consistent with heart failure with reduced ejection fraction (HFrEF). Cluster 2 (n = 15) exhibited concentric hypertrophy with intermediate LVEF, consistent with heart failure with mildly reduced ejection fraction (HFmrEF) like phenotype. All echocardiographic variables differed significantly across clusters (p < 0.001). While Cluster 0 showed strong concordance with HFpEF (96%), Clusters 1 and 2 demonstrated substantial overlap across LVEF categories, indicating partial discordance between structural phenotypes and LVEF based classification. Conclusion Application of unsupervised machine learning to routine echocardiographic data identifies distinct heart failure phenotypes in hypertensive patients. These phenotypes demonstrate significant structural heterogeneity beyond LVEF based classification, supporting the utility of data-driven approaches for refined cardiac phenotyping. This pilot study provides a foundation for larger prospective studies.

07.
arXiv (CS.LG) 2026-06-19

PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection

arXiv:2606.20055v1 Announce Type: new Abstract: Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies across multivariate variables. To mitigate the above drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, empowering the model to identify anomalous patterns amid intricate operational conditions. Moreover, a novel pretext task based on temporal patch-window sorting is customized to uncover intrinsic structural properties of time series, and triplet loss is leveraged to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Leveraging a compact network design, the presented model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.

08.
medRxiv (Medicine) 2026-06-15

Primary care practitioners preconception health literacy and information-seeking: A cross-sectional survey.

Background Parental health before pregnancy influences maternal and child outcomes. Primary care professionals, including general practitioners [GPs], midwives, and naturopaths, can provide preconception care, yet many report limited knowledge and difficulty accessing relevant information. This study described Australian GPs, midwives, and naturopaths preconception health literacy, including knowledge and ability to access information. Methods Between July and September 2022, Australian GPs, midwives, and naturopaths completed a 32-item online cross-sectional survey. Participants were recruited through professional associations, and data were analysed using descriptive and inferential statistics Results Participants (N=373) included naturopaths (40.7%), GPs (32.4%), and midwives (26.8%). Reported barriers to clinician health literacy including lack of preconception care resources (25.5%), and limited clinician knowledge (23.6%). The proportion identifying limited clinician knowledge differed significantly between professions (GP: 31.4%; midwives: 23.0%; naturopaths: 17.8%; p=0.030). The highest level of accurate knowledge regarding preconception exposures was for pre-pregnancy obesity (82.7%), while low birth weight was the most accurately identified preconception outcomes (83.7%). Incorrect responses were most common for maternal multivitamin use as an exposure (28.3%) and childhood leukaemia as an outcome (26.3%). Differences between professions were strongest for infant outcomes, with moderate associations observed for shoulder dystocia (V=.2355), precipitous labour (V=.2173), macrosomia (V=.2060), labour dystocia (V=.2018) and cryptorchidism (V=.2018). Discussion Preconception health literacy varies across primary care professions. Clinicians require greater access to targeted resources and education tailored to their differing scopes of practice and experience. Improving clinician preconception health literacy may strengthen consistent evidence-based care and support better maternal, child, and long-term family health outcomes.

09.
arXiv (quant-ph) 2026-06-19

Many-Body Protection of Topological Edge Memory in Strong Interacting Quenches

arXiv:2606.19437v1 Announce Type: cross Abstract: Quantum quenches drive edge states far from equilibrium, yet whether the memory of a topological initial state survives in a non-integrable, interacting system has remained largely unexplored. We study this question in the bond-alternating XXZ chain – an interacting Su–Schrieffer–Heeger model hosting symmetry-protected topological edge modes with markedly enhanced boundary magnetization – and analyze quenches across all combinations of single-particle and many-body initial and final Hamiltonians. The results organize by a single distinction as we rigorously establish in this work: whether the post-quench Hamiltonian is free or genuinely interacting. For a free post-quench Hamiltonian, the dynamics is solved exactly by a correlation-matrix approach; the boundary-mode return amplitude decays as $t^{-3/2}$, and initial interactions enter only through a dressed one-body density matrix. For a genuinely interacting post-quench Hamiltonian, finite-time stability bounds prove that away from local resonances the first-dimer magnetization remains stable on time windows growing as arbitrarily large powers of the inverse inter-dimer coupling. Matrix product state simulations across all four protocols show that interactions in the final Hamiltonian markedly extend finite-time boundary memory – with local suppression near the isotropic $SU(2)$ point – revealing a many-body protection mechanism in a non-integrable system where scrambling would otherwise wash out initial-state memory fast.

10.
arXiv (CS.CV) 2026-06-17

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global $Sim(3)$ optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at https://github.com/TrickyGo/MoonSplat.

11.
arXiv (quant-ph) 2026-06-19

Recursive perturbation approach to time-convolutionless master equations: Explicit construction of generalized Lindblad generators for arbitrary open systems

arXiv:2506.04095v2 Announce Type: replace Abstract: We develop a recursive perturbative expansion for the time-convolutionless (TCL) generator of an open quantum system in a generalized Lindblad form. This formulation provides a systematic approach to derive the generator at arbitrary order while preserving a Lindblad-like structure, without imposing assumptions on the system or environment beyond an initially uncorrelated state. The generator is written, at all orders, in a canonical form, which also corresponds to the minimal dissipation condition, which uniquely specifies the decomposition of the generator into Hamiltonian and dissipative contributions. To validate the method and show its effectiveness in addressing non-Markovian dynamics and strong-coupling effects, we compute the generator explicitly up to fourth order.

12.
arXiv (CS.AI) 2026-06-15

Design Methodology and Performance Trade-offs Management for Distributed and Compound AI Systems

arXiv:2606.14350v1 Announce Type: cross Abstract: Artificial Intelligence (AI) systems must typically satisfy service-level objectives including accuracy, latency, and cost. The prevailing model-centric approaches select a monolithic model at design time and apply identical computation regardless of input difficulty, cannot decompose tasks across specialized components, and have knowledge that is fixed at training time. During runtime, this can lead to performance degradation and increasing costs. Because the model is the main design variable, it determines the majority of system behavior, coupling operational objectives to a single design-time choice. Addressing these limitations requires shifting from model-centric to system-centric design. Compound AI systems realize this shift by orchestrating multiple models, algorithms, and tools as distributed AI systems through explicit control logic. The performance of such systems depends on their workflow topology, the models assigned to each task, and the parameters governing runtime behavior. We present a design methodology that organizes this space along two dimensions, workflow topology and configuration selection, and identifies eight design patterns, each consolidating techniques to address a specific limitation of monolithic deployment. We validate our methodology through three case studies. Across our case studies, Compound AI configurations approach accuracy of monolithic models within 2.5 to 4 percentage points while reducing latency by up to 60% and cost by up to 71%. We show that model selection and parameter configuration jointly determine system performance, but the resulting design space grows combinatorially, as workflows compose more patterns and components. Thus, we identify five open challenges that define a roadmap from manually configured prototypes towards systems that automatically discover and maintain SLO-compliance in Compound and Distributed AI systems.

13.
arXiv (CS.AI) 2026-06-12

The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism

arXiv:2606.12721v1 Announce Type: new Abstract: Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credibly. The Theory of Mind Utility (ToM-U) formalizes this epistemic state inference problem at the computational level of analysis, specifying what mentalizing computes and why without commitment to algorithmic or neural implementation. ToM-U achieves this by constructing Local Epistemic World Models (LEWMs) – directed typed graphs that represent agents, state nodes, and the epistemic relationships among them – and evaluating discrete candidate LEWMs against observed behavior until one achieves sufficient confidence. Five formal definitions specify the LEWM structure, agent node properties including ordered information access history, a bounded proliferation mechanism for recursive mentalizing, three inference procedures, and a residue function that captures the structured trace left by failed mentalizing attempts. ToM-U differs from Bayesian Theory of Mind and adjacent formal accounts, which presuppose rather than derive belief states, and from simulation theory and theory-theory, which lack a formal apparatus for epistemic state inference. The architecture generates directional, falsifiable predictions about mentalizing failure that follow from structural properties of the model rather than auxiliary assumptions, and positions ToM-U as a domain-agnostic mechanism upstream of goal inference and other downstream social cognitive processes.

14.
arXiv (CS.LG) 2026-06-19

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

arXiv:2510.08807v2 Announce Type: replace-cross Abstract: From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

15.
arXiv (quant-ph) 2026-06-11

Superspace Concentration and Adversarial Robustness in Quantum Algorithms

arXiv:2606.11580v1 Announce Type: new Abstract: We study superspace concentration as a quantum resource, formalized through the focus measure F(\r{ho}) = {\lambda}_max(\r{ho}_super) - the largest eigenvalue of the reduced superspace state - which quantifies the capacity of a quantum system to concentrate informational weight into a preferred subspace of an extended degree-of-freedom space. We develop a complete resource-theoretic framework around this measure and validate its properties through GPU-accelerated numerical simulation. Analytic decoherence predictions are confirmed to machine precision (1.11 x 10^{-16}) for superspace dimensions dS in {2,4,8,16,32}. Focus monotonicity holds across 10,000 random states with zero violations under four focus-non-generating channels across six system configurations. Focused quantum states resist coherent unitary attacks with significantly greater resilience than standard fidelity predicts, with focus remaining above 0.9 at attack strength {\epsilon} = 0.302 versus {\epsilon} = 0.174 for fidelity. We further demonstrate that the focus measure and the U(dS)-asymmetry measure are operationally distinct: asymmetry remains near zero and provides no robustness signal under coherent and targeted attacks while focus tracks spectral concentration and remains robust until {\epsilon} > 0.3. The connection between Grover's algorithm and superspace concentration is made explicit via the identity F(|{\psi}_k>

16.
arXiv (CS.CL) 2026-06-11

Measuring language complexity from hierarchical reuse of recurring patterns

We introduce the ladderpath index as a measure of language complexity grounded in algorithmic information theory. It counts the minimum steps needed to reconstruct a sequence through hierarchical reuse of repeated substructures, capturing an exactly computable but constrained form of algorithmic compressibility related to, but distinct from, Kolmogorov complexity. We apply the ladderpath approach to 21 parallel corpora from the Parallel Universal Dependencies dataset. The ladderpath index is approximately invariant across the languages, and varies much less than the corpus length. This is more pronounced when all corpora are mapped to a unified binary representation, providing evidence for the equi-complexity hypothesis from a representation-independent perspective. We also observe trade-offs between character inventory size and corpus length, and between vocabulary-level and corpus-level reconstruction complexity, supporting the trade-off hypothesis that total complexity is conserved and redistributed across linguistic levels. The reusable substructures identified by the ladderpath approach, without any linguistic input, overlap with words and morphological components attested in the natural vocabulary. The hierarchical reuse captured by the ladderpath approach parallels the chunking mechanisms proposed in cognitive science, where the human cognitive system compresses linguistic input into nested, reusable units under shared memory and processing constraints. This connection between cognitive chunking and the ladderpath approach provides a new interpretation for the equi-complexity and trade-off hypotheses, grounding both in the shared cognitive architecture that underlies language processing across human languages.

17.
arXiv (CS.LG) 2026-06-11

Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement

arXiv:2606.08493v2 Announce Type: replace-cross Abstract: Tissue graph counterfactuals ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize tissue graph counterfactuals as a class of spatial interventions that either rewire connections between cells (edge perturbation) or modify the expression of their neighbors (node perturbation). We then introduce Cellina (https://cellina.readthedocs.io) - a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, Cellina outperforms spatially-informed and non-spatial competitors in in-silico graph perturbations, disentanglement, and scalability. Additionally, we show that Cellina reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.

18.
medRxiv (Medicine) 2026-06-23

THE SILENT STRUGGLE: EXPLORING THE EFFECTS OF COMMUNICATION BREAKDOWNS IN HEALTHCARE DELIVERY IN THE NORTHERN REGION OF GHANA

Abstract Effective health communication is central to patient-centred care and improved health outcomes, particularly in culturally diverse healthcare settings. In clinical and assistive practice, communication breakdowns may negatively affect diagnosis, treatment adherence, and preventive care. A qualitative phenomenological design was employed, utilizing Semi-Structured interviews with purposively sampled twenty patients and healthcare professionals from Tamale Teaching Hospital, Yendi Hospital, and Bimbilla Hospital. The researchers adopted Content Analysis as the tool of analysis for the data. The findings of this study revealed that language discrepancies Poor attitudes of healthcare providers hinderer patient openness and the quality treatment. Logistical issues, such as inadequate medicines and medical supplies, resulted in delayed treatment and additional financial burden on patients and their relatives. Cultural and social factors discourage patients from discussing certain health conditions with healthcare providers, leading to delayed treatment. These hurdles adversely impact on treatment and assistive practice, specifically in culturally diverse environment and preventive care. The study recommends training and capacity-building programs for healthcare providers in cultural competence, fostering effective and ethical health communication between patients and healthcare providers, and recruiting professional interpreters to bridge the linguistics gap between patients and providers. Abstract Effective health communication is central to patient-centered care and improved health outcomes, particularly in culturally diverse healthcare settings. In clinical and assistive practice, communication breakdowns may negatively affect diagnosis, treatment adherence, and preventive care. A qualitative phenomenological design was employed, utilizing semi-structured interviews with twenty purposively sampled patients and healthcare professionals from Tamale Teaching Hospital, Yendi Hospital, and Bimbilla Hospital. The researchers adopted content analysis as the tool of analysis for the data. The findings of this study revealed that language discrepancies Poor attitudes of healthcare providers hinder patient openness and quality treatment. Logistical issues, such as inadequate medicines and medical supplies, resulted in delayed treatment and additional financial burden on patients and their relatives. Cultural and social factors discourage patients from discussing certain health conditions with healthcare providers, leading to delayed treatment. These hurdles adversely impact treatment and assistive practice, specifically in culturally diverse environments and preventive care. The study recommends training and capacity-building programs for healthcare providers in cultural competence, fostering effective and ethical health communication between patients and healthcare providers, and recruiting professional interpreters to bridge the linguistics gap between patients and providers.

19.
arXiv (CS.CL) 2026-06-18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

20.
arXiv (CS.CV) 2026-06-15

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

The rapid advancement of AI-generated multimodal video-audio content has raised significant concerns regarding information security and content authenticity. Existing synthetic video datasets predominantly focus on the visual modality alone, while the few incorporating audio are largely confined to facial deepfakes–a limitation that fails to address the expanding landscape of general multimodal AI-generated content and substantially impedes the development of trustworthy detection systems. To bridge this critical gap, we introduce the Multimodal Video-Audio Dataset (MVAD), the first comprehensive dataset specifically designed for detecting AI-generated multimodal video-audio content. Our dataset exhibits three key characteristics: (1) genuine multimodality with samples generated according to three realistic video-audio forgery patterns; (2) high perceptual quality achieved through diverse state-of-the-art generative models; and (3) comprehensive diversity spanning realistic and anime visual styles, four content categories (humans, animals, objects, and scenes), and four video-audio multimodal data types. Our dataset will be available at https://github.com/HuMengXue0104/MVAD.

21.
arXiv (CS.CV) 2026-06-16

Planning with Unified Multimodal Models

With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach self-discriminated filtering, where the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on embodied decision-making tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

22.
arXiv (CS.AI) 2026-06-19

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

arXiv:2606.20554v1 Announce Type: cross Abstract: Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

23.
arXiv (CS.AI) 2026-06-16

An AI Security Agent for University ACMIS: Multi-Vector Threat Detection and Automated Response

arXiv:2606.08270v2 Announce Type: replace-cross Abstract: University Academic Management Information Systems (ACMIS) are high-value targets for a wide spectrum of security threats including brute-force login attacks, payment fraud, privilege escalation, insider data theft, and academic integrity violations. Traditional rule-based intrusion detection systems are inadequate because many malicious activities are structurally indistinguishable from normal operations. This paper presents an AI-based security agent for ACMIS that combines supervised anomaly detection, behavioural analytics, and a natural language processing chatbot for secure password recovery. The agent monitors five operational layers: authentication, authorisation, financial transactions, user behaviour, and system health, and responds through a four-tier risk escalation framework. A modular architecture allows the core engine to be extended to other institutional systems. Experiments on a simulated ACMIS event log dataset of 147,922 sessions demonstrate a threat detection macro-average F1 of 0.966, compared to 0.156 for a rule-based baseline and 0.836 for a sequence-only (LSTM) baseline, with end-to-end critical-tier automated response latency under 1 ms on a single-node prototype. The integrated recovery chatbot achieves 97.1 percent identity verification accuracy and an 87.3 percent mass-reset attack detection rate with zero false positives on legitimate high volume recovery periods.

24.
arXiv (CS.AI) 2026-06-12

Multiagent Protocols with Aggregated Confidence Signals

arXiv:2606.13591v1 Announce Type: new Abstract: Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.

25.
arXiv (CS.CL) 2026-06-18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.