Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-16

ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. Unlike past approaches, which redact or edit the inputs directly, we propose a "regenerate, don't edit" paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams, StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

02.
arXiv (CS.AI) 2026-06-16

Thinking with Visual Grounding

arXiv:2606.16122v1 Announce Type: new Abstract: Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.

03.
arXiv (CS.LG) 2026-06-16

Drivers, Receivers, and Dynamic Linkages: The Directed Structure of SDG Interdependence, 2000–2024

arXiv:2601.20875v2 Announce Type: replace-cross Abstract: Governments with limited fiscal and administrative capacity need to know which Sustainable Development Goals (SDGs) propagate progress through the goal system and how quickly. We map the directed interdependence structure of all seventeen goals using a balanced panel of 114 countries observed annually from 2000 to 2024. The goal series are persistent, trending, and cross-sectionally dependent, so we apply two estimators matched to this regime: a Dumitrescu-Hurlin panel Granger non-causality test, run on first-differenced series, to recover the directed interaction network, and panel local projections with Driscoll-Kraay standard errors to measure the dynamic magnitude of 31 theory-derived indicator linkages. Of 272 directed goal pairs, 84 linkages survive false-discovery control (40 synergies, 44 trade-offs; network density 0.31). Synergies and trade-offs occur at comparable strength, so no single goal behaves as a universal accelerator, and the goal-level hierarchy itself is fragile. Driver-receiver rankings correlate weakly across lag orders and centrality metrics, and under a country bootstrap only two roles are distinguishable from zero: peace and strong institutions as the clearest net receiver, and poverty reduction as the most probable effect-size-weighted driver. The supported linkages are dynamic, accruing over four to five years: sanitation and poverty improvements are the strongest predictors of lower child mortality, and the education-child-health association is corroborated in independent World Development Indicators data across 183 countries. These results caution against rankings-based accelerator policy and support adaptive portfolios built on supported, time-lagged linkages monitored through constituent indicators.
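
The counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that network density means retained directed links divided by all ordered goal pairs without self-loops; the abstract does not spell out the definition, and the function names are illustrative only.

```python
def directed_pair_count(n_goals: int) -> int:
    """Number of ordered (source, target) pairs with source != target."""
    return n_goals * (n_goals - 1)


def network_density(n_links: int, n_goals: int) -> float:
    """Share of possible directed pairs that are retained as links (assumed definition)."""
    return n_links / directed_pair_count(n_goals)


synergies, trade_offs = 40, 44          # counts quoted in the abstract
links = synergies + trade_offs          # 84 surviving linkages
pairs = directed_pair_count(17)         # 17 goals -> 272 ordered pairs
density = network_density(links, 17)    # roughly 0.31
```

Under that assumed definition, the 272 pairs, 84 links, and density of 0.31 reported in the abstract are mutually consistent.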

04.
arXiv (CS.CV) 2026-06-17

Structured Adversarial Camouflage via Voronoi Diagrams

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (
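
To make the idea of a soft Voronoi assignment concrete, here is a minimal forward-rendering sketch. It is not the authors' implementation: the softmax over negative squared distances, the `temperature` parameter, and the one-colour-per-seed palette layout are all assumptions chosen for illustration.

```python
import numpy as np


def soft_voronoi(seeds, palette, height, width, temperature=0.01):
    """Render an RGB image as a softmax-weighted mix of per-seed colours.

    seeds:   (K, 2) array of (x, y) seed locations in [0, 1]^2
    palette: (K, 3) array, one fixed RGB colour per seed (assumed layout)
    """
    seeds = np.asarray(seeds, dtype=np.float64)
    palette = np.asarray(palette, dtype=np.float64)
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, height), np.linspace(0.0, 1.0, width), indexing="ij"
    )
    pixels = np.stack([xs, ys], axis=-1)                      # (H, W, 2)
    diff = pixels[:, :, None, :] - seeds[None, None, :, :]    # (H, W, K, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                        # (H, W, K)
    logits = -sq_dist / temperature
    logits -= logits.max(axis=-1, keepdims=True)              # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)            # soft cell membership
    return weights @ palette                                  # (H, W, 3)


# Two seeds, red on the left and blue on the right.
image = soft_voronoi([[0.25, 0.5], [0.75, 0.5]], [[1, 0, 0], [0, 0, 1]], 8, 8, 1e-4)
```

As the temperature shrinks, the image approaches a hard Voronoi diagram. Written in an autodiff framework, the same expression would be differentiable with respect to the seed locations, which is what makes optimizing only the seeds feasible.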

05.
arXiv (CS.LG) 2026-06-12

Scalable anomaly detection via a univariate Christoffel function

arXiv:2606.12483v1 Announce Type: new Abstract: Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function which is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel Function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.

06.
arXiv (CS.AI) 2026-06-12

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

arXiv:2606.13211v1 Announce Type: new Abstract: AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities?, (2) how do medical-specialized foundation models hallucinate less than general-purpose ones?, and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

07.
Nature Medicine 2026-06-17

General-purpose chatbots outperform clinical AI tools on physicians’ real-world questions

Specialized clinical AI tools are entering medical practice with little independent testing. In a head-to-head evaluation across two public benchmarks and real questions from physicians, three general-purpose frontier large language models outperformed two leading clinical AI tools, which performed no better than Google search AI overview.

08.
arXiv (CS.CV) 2026-06-17

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency

09.
arXiv (CS.AI) 2026-06-18

Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

arXiv:2606.18258v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.

10.
arXiv (CS.CL) 2026-06-11

Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.

11.
arXiv (CS.AI) 2026-06-18

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

arXiv:2606.19197v1 Announce Type: cross Abstract: Abduction is a central approach to explaining missing entailments from a knowledge base by providing a hypothesis that, if added to the knowledge base, would make the missing entailment true. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature restrictions and minimality in size and of introduced conflicts. Naturally, hypotheses that satisfy more than one of these properties or combine a property with an optimality criterion would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for EL_bot under brave and AR semantics. Our main observation is that requiring additional properties for hypotheses often does not lead to an increase in complexity.

12.
arXiv (CS.AI) 2026-06-19

SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

arXiv:2602.07628v2 Announce Type: replace Abstract: While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi, a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. The Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex, and BMI, to refine global representations. The Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of more than 20,000 PSG recordings (158K hours), SleepMaMi outperforms or matches state-of-the-art existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

13.
arXiv (quant-ph) 2026-06-17

Superconductor-"Metal" Transition of One-dimensional Interacting Bosons with Ohmic Quantum Dissipation

arXiv:2605.30746v2 Announce Type: replace-cross Abstract: The phase diagram of a system of interacting bosons (Cooper pairs) hopping on a one-dimensional (1D) lattice with onsite phase dissipation describing the Josephson tunneling to a nearby diffusive normal-metal electrode is studied. Starting from the system at commensurate lattice filling, it is shown by a combination of analytical techniques that the phase diagram contains two quantum phases: a dissipative Bose-Einstein condensate (D-BEC) or superconductor with long-range phase coherence, and a dissipative Mott insulator (D-Mott) or "metal" with exponentially decaying phase correlations in space and local imaginary-time correlations decaying as the local pairing correlations of the electrode. The D-Mott/metal phase can be described as a 1D array of dissipative boson puddles, weakly coupled by Josephson tunneling. The puddle size roughly corresponds to the length scale beyond which phase slips suppress phase coherence. The dissipative time-dependent Ginzburg-Landau theory phenomenologically used by Sachdev, Werner, and Troyer [Phys. Rev. Lett. 92, 237003 (2004)] for the superconductor-metal transition in quasi-1D wires is derived from this microscopic puddle picture. Thus, the criticality of the D-Mott/D-BEC transition is shown to belong to the Wilson-Fisher universality class with dynamical exponent $z\approx 2$. At small doping, the D-Mott/metal phase remains stable due to its finite compressibility, which is computed to leading order in a perturbation expansion of the dissipation strength and the inter-puddle Josephson coupling. At larger doping, using a mapping to a pseudospin chain combined with bosonization, the D-BEC/superconductor phase is the ground state for non-vanishing but arbitrarily small dissipation. Similarities and differences with the deconfinement transition of an array of 1D bosonic Mott insulators in anisotropic optical lattices are also discussed.

14.
arXiv (CS.CL) 2026-06-16

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

15.
arXiv (CS.LG) 2026-06-17

From Compression to Deployment: Real-Time and Energy-Efficient FastGRNN on Ultra-Constrained Microcontrollers

arXiv:2606.17249v1 Announce Type: cross Abstract: The dominant trajectory of modern machine learning has been to scale up: larger models, larger accelerators, larger memory budgets. Yet a multi-year global semiconductor supply constraint and the growing energy and carbon cost of always-online inference expose the fragility of this trajectory and motivate the opposite direction: refactoring AI and ML algorithms to fit the small, ubiquitous microcontrollers already in mass production in wearables, sensors, and edge appliances. We present an end-to-end open-source reproduction of FastGRNN, a compact gated recurrent cell, deployed on two bare-metal targets: the 8-bit Arduino (ATmega328P) and the 16-bit MSP430 (no hardware multiplier; 16 KB Flash; 512 B SRAM). Our compression pipeline combines low-rank weight factorization, iterative hard-thresholding sparsity, and per-tensor Q15 post-training quantization with explicit activation calibration. The deployed model occupies 566 bytes of weights and achieves macro F1 = 0.918 (seed 0; five-seed Q15 mean 0.853 ± 0.107) on the HAPT test set. It matches a PyTorch reference at 100% prediction agreement across 3,399 test windows (MCU seed 0; 99.91-100% C-equivalent across five seeds). Both platforms sustain real-time 50 Hz streaming inference (9.21 ms per sample on Arduino; 13 ms on MSP430), where a 256-entry sigmoid/tanh look-up table delivers a 30.5x speedup on the multiplier-less MSP430. Four contributions extend the original FastGRNN paper: (i) cross-platform bit-equivalent deterministic inference; (ii) characterization of recurrent warm-up latency (median 74 samples, 1.48 s; worst-case 125 samples, 2.50 s over 100 test windows); (iii) a deployable look-up-table recipe for multiplier-less embedded targets; and (iv) hardware energy characterization showing 17.7 mW active inference power,
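
For readers unfamiliar with Q15, the following is a minimal sketch of per-tensor Q15 quantization, the fixed-point format named in this abstract. It is not the paper's pipeline: using the tensor's maximum absolute value as the scale is an assumed calibration rule, and the function names are hypothetical.

```python
import numpy as np

Q15_ONE = 1 << 15  # 32768: Q15 stores values as signed 16-bit integers with 15 fractional bits


def quantize_q15(x, scale=None):
    """Map a float tensor to int16 Q15 using one scale for the whole tensor.

    scale defaults to max(|x|), an assumed calibration choice for this sketch.
    """
    x = np.asarray(x, dtype=np.float64)
    if scale is None:
        scale = float(np.max(np.abs(x))) or 1.0   # avoid dividing by zero for all-zero tensors
    q = np.round(x / scale * Q15_ONE)
    q = np.clip(q, -Q15_ONE, Q15_ONE - 1).astype(np.int16)  # saturate to the int16 range
    return q, scale


def dequantize_q15(q, scale):
    """Recover approximate float values from Q15 integers."""
    return q.astype(np.float64) * scale / Q15_ONE


weights = np.array([0.5, -0.25, 0.125])
q_weights, q_scale = quantize_q15(weights)
restored = dequantize_q15(q_weights, q_scale)
```

Each weight costs two bytes in this format, and the round-trip error stays within one quantization step (`scale / 32768`). The positive extreme saturates at 32767 because Q15 cannot represent +1.0 exactly.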

16.
Nature (Science) 2026-06-08

Daily briefing: Human embryo genomes precisely altered

The use of ‘base editing’ to precisely tweak human embryos has divided researchers. Plus, the number of lives saved by less-polluting cars in China and how to tip the world towards a sustainable future.

17.
arXiv (quant-ph) 2026-06-11

Entanglement generation between field modes mediated by a fluctuating conducting wall

arXiv:2606.12338v1 Announce Type: cross Abstract: We consider a movable conducting plate of finite mass, between two fixed ones, whose mechanical degrees of freedom are treated quantum-mechanically and bound to its equilibrium position by a harmonic potential. The movable wall is thus subjected to quantum fluctuations of its position. This creates a system of two sub-cavities separated by the movable fluctuating plate, and two massless one-dimensional scalar fields, one in each sub-cavity. This system is described by an appropriate generalization of the Law Hamiltonian. The presence of the movable wall yields an effective plate-fields interaction, as well as an effective interaction between the field modes. We obtain, at the second order in perturbation theory, the ground state of the interacting system and the reduced density operator of the fields in each sub-cavity by tracing out the wall's degrees of freedom. We calculate the entanglement between two field modes, one in each cavity, by evaluating analytically the negativity; we then evaluate numerically also the total multimode negativity. Our results show that in both cases the fields in the two sub-cavities are entangled, in contrast to the case in which the wall is fixed in space. We discuss the amount of the field entanglement present as a function of relevant physical parameters of the system such as the mass and oscillation frequency of the movable wall, its distance from the fixed walls and the frequencies of the field modes considered.

18.
arXiv (CS.AI) 2026-06-18

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

arXiv:2508.21720v3 Announce Type: replace Abstract: Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

19.
arXiv (CS.LG) 2026-06-16

Brownian Kernel Ladders

arXiv:2606.15812v1 Announce Type: new Abstract: Constructing mathematically tractable function spaces that capture hierarchical compositional representations remains a central challenge in statistical learning theory. We introduce Brownian kernel ladders (BKLs), a recursively defined hierarchy of integral reproducing kernel Hilbert spaces generated through Brownian-kernel integral constructions. Starting from linear functionals, each layer is obtained by integrating Brownian kernels over probability measures supported on subsets of the previous layer, yielding a recursive function-space model in which depth is encoded directly through the hierarchy. Based on this framework, we define canonical BKL spaces together with an associated complexity functional. We establish several analytical and statistical properties of these spaces. In particular, we show that BKL spaces form quasi-Banach spaces, satisfy depth-dependent Hölder regularity estimates, and exhibit strict monotonicity with respect to depth. We further prove existence results for regularized empirical risk minimization and derive Gaussian complexity bounds that remain uniformly controlled with respect to both the ambient dimension and the hierarchy depth. A key ingredient of the analysis is a combinatorial proof technique based on recursive subset decompositions and Brownian-kernel threshold representations. These estimates yield excess-risk guarantees of near-parametric order for regularized empirical risk minimization over BKL spaces. Our results provide a mathematically tractable hierarchical function-space framework for studying compositional representations in deep learning.

20.
arXiv (CS.AI) 2026-06-18

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

arXiv:2606.18976v1 Announce Type: cross Abstract: Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.
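
The Evidence Anchoring step described in this abstract relies on fuzzy matching with normalized Levenshtein distance. The sketch below shows one plausible form of such a check. The 0.8 threshold, the lowercase normalization, and the function names are assumptions for illustration, not details taken from CAPRA.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1 minus the edit distance normalized by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def anchor_evidence(quote, source_sentences, threshold=0.8):
    """Return (sentence, score) for the best match, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for sentence in source_sentences:
        score = similarity(quote.lower(), sentence.lower())
        if score > best_score:
            best, best_score = sentence, score
    return (best, best_score) if best is not None and best_score >= threshold else None


sources = ["The system uses a layerd architecture.", "Deployment runs on Kubernetes."]
anchored = anchor_evidence("The system uses a layered architecture", sources)
rejected = anchor_evidence("Uses blockchain consensus", sources)
```

A quote that differs from the student's text only by a typo is anchored to its source sentence, while a claim with no close counterpart returns `None` and could be discarded as ungrounded.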

21.
arXiv (CS.CV) 2026-06-16

Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges: 82.9%, a drop of 0.8 percentage points. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
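
The stroke score s(t) quoted in this abstract can be sketched as follows. The abstract gives only the weights, so several details here are assumptions: all three terms are treated as per-frame joint speeds, strokes are taken as local maxima above a threshold, and `threshold`, `min_gap`, and the function names are hypothetical.

```python
import numpy as np


def joint_speed(positions, fps):
    """Per-frame speed of one joint from (T, 3) coordinates; the first frame gets 0."""
    positions = np.asarray(positions, dtype=np.float64)
    step = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps
    return np.concatenate([[0.0], step])


def stroke_score(wrist, elbow, shoulder, fps=30.0):
    """s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, with speeds standing in for all terms."""
    return (0.5 * joint_speed(wrist, fps)
            + 0.3 * joint_speed(elbow, fps)
            + 0.2 * joint_speed(shoulder, fps))


def detect_strokes(score, threshold, min_gap):
    """Indices of local maxima above threshold, at least min_gap frames apart."""
    peaks = []
    for t in range(1, len(score) - 1):
        is_peak = score[t] > score[t - 1] and score[t] >= score[t + 1]
        if is_peak and score[t] >= threshold and (not peaks or t - peaks[-1] >= min_gap):
            peaks.append(t)
    return peaks


# Synthetic score with two bursts of motion around frames 20 and 45.
demo_score = np.zeros(60)
demo_score[[19, 21]] = 1.0
demo_score[20] = 2.0
demo_score[45] = 3.0
stroke_frames = detect_strokes(demo_score, threshold=1.5, min_gap=10)
```

Each detected frame could then serve as the centre of a fixed-length window, such as the 30-frame sequences the abstract mentions, without any manual annotation.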

22.
arXiv (CS.AI) 2026-06-16

ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

arXiv:2606.16050v1 Announce Type: cross Abstract: Robust deep learning under heavy-tailed and impulsive noise remains challenging because conventional losses such as mean squared error (MSE) exhibit unbounded sensitivity to outliers. Although correntropy-based objectives improve robustness, existing formulations rely on fixed kernel parameters that must be empirically tuned and remain static during training. To address these limitations, we propose an Adaptive Log-Correntropy Loss (ALCL), a heavy-tailed loss formulation that adaptively learns its robustness geometry during optimization. ALCL introduces a logarithmic residual model whose shape and scale parameters are learned jointly with network weights through differentiable reparameterization. This yields a principled maximum likelihood formulation whose influence function is formally bounded and redescending, allowing the loss geometry to adapt dynamically to evolving residual statistics while suppressing extreme outliers. Comparative experiments on four widely used benchmark datasets spanning grayscale and red-green-blue (RGB) image data under mixed heavy-tailed and impulsive noise demonstrate that ALCL consistently outperforms MSE and optimally tuned generalized correntropy losses in both reconstruction fidelity and downstream classification accuracy. While performance differences remain small under low-noise conditions, under high-noise regimes ALCL improves median accuracy by up to 4.75% on grayscale benchmarks and 4.51% on RGB datasets, with reduced variance across runs. These results demonstrate that adaptive robustness through joint learning of loss parameters provides a computationally efficient alternative to static correntropy-based losses for deep learning in non-Gaussian environments.
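The abstract does not give ALCL's formula, so the sketch below uses a Cauchy negative log-likelihood purely as a stand-in to show what a bounded, redescending influence function looks like next to MSE. The softplus reparameterization of the scale is one common way to keep a learned parameter positive and is also an assumption here, as are all function names.

```python
import numpy as np


def softplus(raw):
    """Map an unconstrained parameter to a strictly positive scale."""
    return np.logaddexp(0.0, raw)


def cauchy_nll(residual, raw_scale):
    """Negative log-likelihood of a Cauchy residual model (stand-in, not ALCL).

    The log(pi * scale) term is the normalizer; without it, a learned scale
    could grow without limit to shrink the loss.
    """
    scale = softplus(raw_scale)
    return np.log(np.pi * scale) + np.log1p((residual / scale) ** 2)


def log_influence(residual, scale):
    """Derivative of the Cauchy NLL with respect to the residual: 2r / (scale^2 + r^2)."""
    return 2.0 * residual / (scale ** 2 + residual ** 2)


def mse_influence(residual):
    """Derivative of r^2: grows without bound, so outliers dominate the gradient."""
    return 2.0 * residual
```

For this stand-in loss the influence peaks at a residual equal to the scale, with value 1/scale, and decays toward zero for larger residuals, whereas the MSE influence keeps growing linearly. That difference is why extreme outliers contribute little gradient under a log-type loss.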

23.
arXiv (CS.AI) 2026-06-12

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

arXiv:2508.04427v2 Announce Type: replace-cross Abstract: Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

24.
arXiv (CS.AI) 2026-06-18

Robust Regularized Policy Iteration under Transition Uncertainty

arXiv:2603.09344v3 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
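As a rough illustration of how a KL-regularized surrogate makes the worst-case problem tractable, the sketch below uses the standard closed form min over P of E_P[V] + tau * KL(P || P0) = -tau * log E_P0[exp(-V / tau)] inside a tabular Bellman backup. This is a generic construction, not necessarily RRPI's operator: the nominal model p0, the temperature tau, the greedy max over actions, and the function names are all assumptions.

```python
import numpy as np


def robust_expectation(p0, v, tau):
    """Closed form of min over P of E_P[v] + tau * KL(P || p0), computed stably."""
    z = -v / tau
    z_max = z.max()
    return -tau * (z_max + np.log(np.sum(p0 * np.exp(z - z_max))))


def robust_backup(q, rewards, p0, gamma, tau):
    """One sweep of Q(s,a) = r(s,a) + gamma * robust_expectation(p0[s,a], V).

    V(s) = max_a Q(s,a); p0 has shape (S, A, S) and holds nominal transitions.
    """
    v = q.max(axis=1)
    new_q = np.empty_like(q)
    for s in range(q.shape[0]):
        for a in range(q.shape[1]):
            new_q[s, a] = rewards[s, a] + gamma * robust_expectation(p0[s, a], v, tau)
    return new_q
```

The robust expectation always lies between the smallest next-state value and the nominal expectation, so a smaller tau gives a more pessimistic backup. Because the soft-min is non-expansive in the sup norm, this particular backup is a gamma-contraction, which matches the type of guarantee the abstract describes.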

25.
arXiv (quant-ph) 2026-06-15

Electromagnetic Wightman functions and vacuum densities for a brane intersecting the AdS boundary

arXiv:2604.17583v2 Announce Type: replace-cross Abstract: We investigate the combined effects of a brane intersecting the AdS boundary and of the background gravitational field on the local characteristics of the electromagnetic vacuum. Two types of boundary conditions on the brane are considered, which are higher-dimensional generalizations of the perfect electric (PEC) and perfect magnetic (PMC) boundary conditions in Maxwell's electrodynamics. The brane-induced contributions to the Wightman functions of the vector potential and field tensor are explicitly extracted. Simple expressions in terms of elementary functions are provided. The behavior of the vacuum expectation values (VEVs) is mimicked by a scalar field with a negative effective mass squared determined by the radius of the AdS spacetime. The expectation values of the squared electric and magnetic fields and of the energy-momentum tensor are investigated as local characteristics of the vacuum state. The brane-induced contributions to these VEVs have opposite signs for the PEC and PMC conditions. For the PMC condition, this contribution is negative for the electric field squared and positive for the magnetic field squared. The VEV of the energy-momentum tensor has a nonzero off-diagonal component. The brane-induced vacuum energy density is positive for the PMC condition, whereas the normal and parallel stresses change sign as functions of the distance from the brane. Unlike the problem involving a planar boundary in the Minkowski bulk, the vacuum energy-momentum tensor does not vanish in (3+1)-dimensional AdS spacetime.