Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

02.
arXiv (CS.CL) 2026-06-11

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to distribute these skills, but the security implications remain unstudied due to the absence of labeled threat data. This paper presents a systematic security analysis of 98,380 skills collected from two major registries. Through a combination of static pattern matching and dynamic behavioral verification, we identify 157 skills exhibiting confirmed malicious behavior, encompassing 632 distinct vulnerabilities across 13 attack techniques. Our analysis reveals that these threats are deliberate rather than accidental: each malicious skill contains an average of 4.03 vulnerabilities spanning multiple attack phases. We identify two dominant attack strategies with statistically significant negative correlation – credential theft via remote code execution, and agent manipulation through adversarial instructions embedded in documentation. Over half of all confirmed cases originate from a single threat actor employing templated brand impersonation at scale. We further observe that attack sophistication correlates with concealment investment, with advanced skills universally employing undocumented capabilities while also exploiting platform-native trust mechanisms. Following responsible disclosure, registry maintainers removed all 157 (100%) of the reported skills. Our dataset and detection pipeline are publicly available to facilitate future research on securing LLM agent ecosystems.

03.
arXiv (CS.CV) 2026-06-19

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.

04.
arXiv (CS.CV) 2026-06-17

OpenTie: Open-vocabulary Sequential Rebar Tying System

Robotic practices on the construction site emerge as an attention-attracting manner owing to their capability of tackling complex challenges, especially in the rebar-involved scenarios. Most of existing products and research are mainly focused on the collection of large amounts of data with model training demands. To fulfill this gap, we propose OpenTie, a 3D training-free rebar tying framework utilizing a RGB-to-point-cloud generation and an open-vocabulary rebar detection on the real-world test. We implement the OpenTie via a robotic arm with a binocular camera and guarantee a high accuracy by applying the prompt-based object detection method on the image filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training efforts and outperforms the training-based object detection, i.e., YOLO-based method, with the verification on the real-world sequential rebar tying test. The system is flexible for horizontal and vertical rebar tying tasks and holds the potential application to the real construction site with possibility of commercialization.

05.
arXiv (quant-ph) 2026-06-17

Full-state information-disturbance tradeoff for direction estimation with antiparallel spin-coherent pairs

arXiv:2606.18040v1 Announce Type: new Abstract: We determine the optimal information–disturbance tradeoff for estimating an unknown spatial direction encoded in two antiparallel spins. Rotational covariance reduces the optimization over all instruments to a finite-dimensional Choi problem: a positive seed operator obeys one trace constraint for each irreducible sector of the input representation, while both the directional score and the operation fidelity are linear functionals of this seed. For two antiparallel spin-$1/2$ particles, whose physical representation decomposes as $0\oplus1$, we derive the two-multiplier dual problem and characterize the optimal instrument from the kernel vectors of the dual slack operator. The optimal operation is a covariant filter with scalar–vector coherence and is generally not a convex interpolation between the identity channel and a measure-and-reprepare strategy. At maximum information we recover the Gisin–Popescu score, but the least disturbing output state is optimized independently, giving a smaller disturbance than both the parallel-spin benchmark and antiparallel measure-and-reprepare. We also formulate the parallel benchmark and, as a central extension of the method, treat antiparallel spin-coherent states of arbitrary spin $j$. In this case the signal coherently occupies all sectors $\ell=0,\ldots,2j$ of $j\otimes j$, the endpoint information is governed by nearest-neighbor sector coherences, and the endpoint disturbance is obtained from an explicit finite block-diagonal eigenvalue problem.

06.
arXiv (CS.LG) 2026-06-18

Zero-Shot Active Feature Acquisition via LLM-Elicitation

arXiv:2606.18933v1 Announce Type: new Abstract: Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

07.
arXiv (CS.AI) 2026-06-19

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

arXiv:2606.20381v1 Announce Type: new Abstract: FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

08.
arXiv (CS.CV) 2026-06-25

TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction

Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from the attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments show that TTSA3R achieves competitive performance on standard short-sequence benchmarks and provides substantially stronger robustness on extended sequences. On NRGBD, as sequences extend from 50 to 250 frames, TTSA3R exhibits only a 1.33x error increase, compared with over 4x degradation for CUT3R. This highlights the practical value of temporal-spatial adaptive updates for long-term reconstruction stability. Our code is available at https://github.com/anonus2357/ttsa3r.

09.
arXiv (CS.AI) 2026-06-25

Taxonomy of Risks on Automated Fact-Checking Systems Considering its Propagation

arXiv:2606.25645v1 Announce Type: cross Abstract: In recent years, the posting of fake news including disinformation and misinformation on social networking services (SNS) has become a social problem. To combat this fake news, fact-checking that is the process of assessing the veracity of posts on SNS has become increasingly important. While fact-checking is currently performed by fact-checking organizations, it is difficult to fact-check all posts on SNS. Therefore, the use of automated fact-checking systems is effective. Recent automated fact-checking systems utilize artificial intelligence and large language models, so there are risks of incorrect judgments and posting incorrect results on social media which can lead to the spread of misinformation or to engage in defamation. In this paper, as a first step toward enabling the safe use of automated fact-checking systems, we categorize the specific risks on automated fact-checking systems. In this categorizing, we consider a three-stage risk propagation: risk factors, hazardous situations, and harm. Our analysis revealed that 32 specific risks exist in automated fact-checking systems. In this paper, we utilize the categorized risks as analytical cues (guide words) to present the risk assessment of the automated fact-checking system DEFAME. This assessment result indicates that risks that cannot be derived using STRIDE, a conventional IT security risk assessment method can be derived using our guide words.

10.
arXiv (CS.CL) 2026-06-24

Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional texts, such as recipes, it has insufficiently addressed the complex logical structures, including conditional branching and parallel execution, that are pervasive in real-world regulatory and administrative documents. Furthermore, existing benchmarks are limited by simplistic schemas and shallow logical dependencies, restricting progress toward logic-aware large language models.To bridge this Logic Gap, we introduce BREX, a carefully curated benchmark comprising 409 real-world business documents and 2,855 expert-annotated rules. Unlike prior datasets centered on narrow service scenarios, BREX spans over 30 vertical domains, covering scientific, industrial, administrative, and financial regulations. We further propose ExIde, a structure-aware reasoning framework that investigates five distinct prompting strategies, ranging from implicit semantic alignment to executable grounding via pseudo-code generation. This enables explicit modeling of rule dependencies and provides an out-of-the-box framework for different business customers without finetuning their own large language models. We benchmark ExIde using 13 state-of-the-art large language models. Our extensive evaluation reveals that executable grounding serves as a superior inductive bias, significantly outperforming standard prompts in rule extraction. In addition, reasoning-optimized models demonstrate a distinct advantage in tracing long-range and non-linear rule dependencies compared to standard instruction-tuned models.

11.
arXiv (CS.LG) 2026-06-17

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

arXiv:2602.03420v2 Announce Type: replace-cross Abstract: Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

12.
arXiv (CS.AI) 2026-06-16

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

arXiv:2510.04212v4 Announce Type: replace-cross Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.

13.
arXiv (CS.AI) 2026-06-15

Regional Climate Model Emulation with Diffusion Approaches: What is the Added Value of Generative Machine Learning?

arXiv:2606.14570v1 Announce Type: cross Abstract: Emulators provide a cost-effective alternative to regional climate models (RCMs) by capturing their dynamical downscaling function. They link large-scale predictors simulated by global climate models (GCMs) to RCM-simulated high-resolution fields of the target variable, here precipitation. Machine learning methods, typically deep learning, are cheaper than running RCMs in computation time and energy. Among them, generative models are appealing because they can simulate ensembles of local high-resolution fields consistent with the predictors. This ensemble, which we call the uncertainty envelope, remains to be properly assessed for added value. Here, we make three contributions. First, we introduce ParamDiffusion, a new two-stage diffusion-based framework, and compare it with a state-of-the-art diffusion approach. Second, we expand standard validation through a comprehensive framework aligned with climate-science needs, examining specific precipitation events, including extremes. Third, within this framework, we assess the added value of diffusion approaches relative to deterministic methods. We intercompare four deep-learning models: a deterministic model designed to capture the precipitation tail; a parametric probabilistic model based on it; a recently proposed diffusion approach; and ParamDiffusion, which couples the parametric model with a diffusion model. Our results show that diffusion-based approaches reproduce climatological precipitation statistics with high skill, including distributional tails and spatially compounded extremes, while generating spatially detailed fields. However, none of the assessed models consistently accounts for the most extreme RCM-simulated events within its uncertainty envelope. Diffusion models are therefore promising for probabilistic RCM emulation, but progress is still required before they can reliably represent high-impact precipitation extremes.

14.
arXiv (CS.CV) 2026-06-16

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

Due to wavelength dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on visual quality improvement, while it is still difficult to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1\% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes, while balancing enhancement quality and processing efficiency.

15.
arXiv (CS.LG) 2026-06-15

On the Geometry and Optimization of Polynomial Convolutional Networks

arXiv:2410.00722v3 Announce Type: replace Abstract: We study convolutional neural networks with monomial activation functions. Specifically, we prove that their parameterization map is regular and is an isomorphism almost everywhere, up to rescaling the filters. By leveraging on tools from algebraic geometry, we explore the geometric properties of the image in function space of this map - typically referred to as neuromanifold. In particular, we compute the dimension and the degree of the neuromanifold, which measure the expressivity of the model, and describe its singularities. Moreover, for a generic large dataset, we derive an explicit formula that quantifies the number of critical points arising in the optimization of a regression loss.

16.
arXiv (CS.LG) 2026-06-16

Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization

arXiv:2606.16899v1 Announce Type: new Abstract: Matrix based optimizers such as Muon can substantially speed up language model pretraining, but their gains over AdamW are observed to shrink as model size and data scale grow when using standard constant decoupled weight decay. We propose Hyperball, a simple optimizer wrapper that addresses this issue. Given a base optimizer such as Adam or Muon, Hyperball sets the Frobenius norms of weight matrices and their corresponding optimizer updates to fixed constants. On Qwen3 style models up to 1.2B parameters, Muon Hyperball achieves 20–30% token equivalent speedup over weight decay baselines. Hyperball also improves learning rate transfer across widths and depths compared to decoupled weight decay. This method is motivated by prior theory showing that training with weight decay leads to an equilibrium weight norm that only depends on the training hyperparameters. Through this mechanism, the weight decay then decides the angular learning rate, i.e. how fast the direction of the weight matrix changes.

17.
arXiv (CS.CV) 2026-06-17

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.

18.
arXiv (CS.CL) 2026-06-16

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

In-context learning (ICL) is the standard method for low-resource classification, yet its efficacy in specialized domains remains largely unexplored. We address the challenge of classifying semantically complex, multi-party B2B conversations, where traditional ICL encounters significant limitations, especially as context length increases due to the concatenation of multiple few-shot examples. We introduce the \texttt{Call Playbook} dataset, featuring five classification tasks derived from real-world B2B conversations targeting core sales concepts. To bridge the gap between performance and practical utility, we propose novel knowledge extraction methods that distill verbose examples into compact, interpretable representations of structured classification criteria and precise task descriptions. Our approach achieves a 99\% reduction in token usage and improves macro-averaged AUC by up to 7\% over traditional ICL. Notably, it remains robust as context grows, unlike advanced token compression baselines which degrade by over 9 F1 points. Importantly, our framework enables direct refinement of classification logic, addressing critical needs for transparency, efficiency, and user interaction in real-world NLP applications.

19.
arXiv (CS.CV) 2026-06-18

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

The noise of Magnetic Resonance Imaging MRI poses challenges for Deep Learning DL when tumor boundaries are obscured tumor location and appearance are complex Therefore we develop BrainFusionNet that combines Convolutional Neural Networks CNNs Vision Transformers ViT and Gated Recurrent Units GRUs to extract spatial contextual and sequential features from MRI images for improved brain tumor classification Furthermore explainable AI such as SHAP LIME and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNets decisionmaking process The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets Kfold validation suggests 98 accuracy on both datasets The model was compared with the six stateoftheart SOTA CNNs and transfer learning Among the SOTA CNNs DenseNet121 and VGG16 achieved the highest accuracy of 96 The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images even in smallscale tumor regions and small tumor sizes The model has a balanced sequential CNN architecture to capture lowlevel and deeperlayer features a customized ViT that captures local features stabilizes gradient flow and reduces the risk of vanishing gradients during MRI image training The CNN and ViT outputs are fed into a GRU for final classification Furthermore we analyze pixel intensities to determine whether MRI image quality affects image classification Our findings are very novel in image interpretation as we found that the distribution of pixel intensities in MRI images affects DL performance

20.
arXiv (math.PR) 2026-06-12

Conditional means, vector pricings, amenability and fixed points in cones

arXiv:2512.13829v4 Announce Type: replace Abstract: We develop a generalization of conditional probability for arbitrary ordered vector spaces. A related problem is that of assigning a numerical value to one vector relative to another. We characterize the groups for which these generalized probabilities can be stationary, respectively invariant. Our results deviate from the setting of classical probability and lead to a new criterion for amenability and for fixed points in cones.

21.
arXiv (CS.CL) 2026-06-12

PiDA: Phonetically-Informed Data Augmentation for Robust Vietnamese Speech Translation

Cascaded speech translation (ST) systems suffer from error propagation when Automatic Speech Recognition (ASR) outputs incorrect transcripts. We present the first systematic categorization of ASR errors for Vietnamese ST, classifying substitution errors by phonetic cause and quantifying their impact on downstream Neural Machine Translation (NMT) performance using Linear Mixed-Effects Modelling. We confirm that most ASR substitution errors arise from phonetic confusions rather than random noise, and that these phonetic errors significantly degrade ST quality. Motivated by this finding, we propose Phonetically-Informed Data Augmentation (PiDA), which generates ASR-like corruptions by substituting words with phonetically similar alternatives using phonetic word embeddings. Fine-tuning on a PiDA-augmented version of FLEURS Vietnamese-English improves translation of erroneous ASR outputs (up to +2.04 BLEU over standard fine-tuning) while also slightly improving clean-text performance.

22.
medRxiv (Medicine) 2026-06-18

Expert in Ultrasound Skills: Feasibility of an IMU-video platform to describe technical profiles during focused cardiac ultrasound. Pilot study

Background: Focused cardiac ultrasound (FoCUS) is operator dependent and requires coordinated probe manipulation, image interpretation and iterative visual feedback. Existing assessment approaches often emphasize final image quality or expert rating. We developed Expert in Ultrasound Skills (EXUS) , a platform that synchronizes transducer-mounted inertial measurement unit (IMU) data with ultrasound video, and evaluated its technical feasibility during FoCUS acquisition. Methods: This observational pilot study included 6 operators performing two repetitions of a four-view FoCUS protocol, yielding 12 analytical sessions and 48 planned acquisitions. Feasibility was defined by acquisition completion, video availability, start/stop events, fused IMU-video windows, temporal coverage, complete human label entries and IMU integrity. A 100-image Likert rating task was used to summarize pairwise inter-rater agreement for still-frame image quality assessment. Results: All 48 planned acquisitions were completed with video, start/stop events, fused windows and complete human label entries. Temporal coverage was at least 90% in 47/48 acquisitions. IMU integrity endpoints exceeded the 80% threshold: 43/48 acquisitions had no extreme IMU-derived artifact, 43/48 had no active-segment IMU restart and 44/48 had no complete motion flatline. Mean pairwise exact agreement for the Likert task was 38.9%, with mean quadratic-weighted Cohen's kappa of 0.564. Post hoc profiles varied across duration, visual quality, mechanical load and motor efficiency. Conclusions: EXUS was technically feasible for synchronized IMU-video capture during FoCUS. The pilot supports multimodal acquisition data as a way to describe technical profiles and generate formative feedback hypotheses, but the post hoc indices are not validated competency measures. Keywords: focused cardiac ultrasound; point-of-care ultrasound; inertial measurement unit; medical education; deliberate practice

23.
arXiv (CS.CL) 2026-06-17

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

This work describes the participation of the MLLP-VRAIN research group in the shared task of the IWSLT 2026 Simultaneous Speech Translation track. Our submission utilizes the recently released Parakeet and Qwen 3.5 models to create a robust, cascaded solution for long-form SimulST through the use of adaptive "black-box" policies. We explore relaxations of these policies to achieve better quality-latency trade-offs. Compared to last year, we participate on all language directions. In addition to this, for the En$\rightarrow${De, It, Zh} directions we also participate in this year's new context track employing a combination of ASR word-boosting and a RAG mechanism of offline pre-translated exemplars to guide generation and enrich our system with domain-specific context. Finally, we provide a detailed latency analysis of our system. Compared to last year, results on the MCIF En$\rightarrow$De test set shows a substantial quality improvement of +5.82 XCOMET-XL. Our context track processing further improves performance by +1.03.

24.
arXiv (CS.LG) 2026-06-25

Efficient Analytic Uncertainty Quantification for Multi-Modal Regression

arXiv:2606.25188v1 Announce Type: new Abstract: Efficient uncertainty quantification (UQ) is essential for trustworthy large-scale learning. Existing UQ methods for regression tasks mainly operate under the assumption that the conditional label marginal satisfies single-peak parametric models, e.g., Gaussians, where the negative log-likelihood function simplifies to the mean square error. However, such single-peak assumptions fail in regression tasks featuring multi-modal distributions. On the other hand, semi-parametric methods which achieve strong regression performance for multi-modal distributions often lack efficient quantification on their prediction variances. In this work, we extend UQ techniques based on Variational Bayesian Inference (VBI) to two widely used semi-parametric regression models that yield histogram-like reconstructions of the conditional label densities: Quantile Regression (QR) and Classification Restoration (CR). Our approach introduces a unified, distribution-agnostic framework that simultaneously achieves accurate estimation of complex conditional distributions and highly efficient UQ. Theoretically, our method is grounded in novel formulations of QR and CR within the VBI framework, yielding analytic Evidence Lower Bounds (ELBO) to streamline training and a closed-form or analytically approximated predictive density for efficient inference. Empirically, we evaluate our methods on three large-scale regression benchmarks with multi-modal label distributions. Our framework outperforms state-of-the-art multi-modal regression baselines, and even matches predictive performance of computationally expensive ensemble models. Furthermore, by leveraging epistemic uncertainty estimation, our approach enables highly data-efficient active learning strategies.

25.
arXiv (CS.CV) 2026-06-17

Effective Gaussian Management for High-fidelity Object Reconstruction

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation and pruning of Gaussian. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SHs) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for above mangement process, we additionally introduce CoRe, an regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing size of the model. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.