Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-17

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

02.
arXiv (CS.AI) 2026-06-24

DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects

arXiv:2606.24779v1 Announce Type: cross Abstract: Birth defects are a major cause of fetal loss, neonatal morbidity and long-term disability. In the subset with suspected genetic etiologies, exome and genome sequencing have moved many cases from variant detection to post-sequencing interpretation: clinicians must rank patient-specific candidate variants under incomplete fetal or infant phenotypes and heterogeneous evidence from population genetics, variant-effect prediction, gene-disease validity, phenotype ontologies, cellular and pathway context, protein structure and clinical literature. We present DeepBD, a grounded agentic workflow for variant prioritization and diagnostic interpretation of genetic birth defects. DeepBD organizes the workflow into LLM-assisted case structuring, a pretrained evidence engine, specialist evidence modules and a grounded diagnostic review layer. The evidence engine learns patient-specific variant scores from structured rule evidence, sequence and variant-effect representations and phenotype-conditioned biological context, whereas specialist modules and the agentic layer provide tool-based refinement, candidate-pool review and diagnosis-oriented synthesis from ranked candidates. Developed using an in-house fetal and infant cohort comprising 18,622 cases, DeepBD achieved Recall@1/3/5/10 of 0.658/0.882/0.912/0.929 on an internal held-out solved-case benchmark, outperforming standalone Exomiser, DeepRare and prompted LLM reranking baselines evaluated on Exomiser-derived top-20 candidate variants. Ablation and overlap analyses show that rule evidence, mechanistic context, and specialist refinement provide complementary signals. These findings support a grounded agentic workflow that separates evidence integration, tool-based refinement, and LLM-assisted diagnostic review for retrospective variant prioritization in genetic birth defects.

03.
arXiv (CS.CV) 2026-06-17

ProCUA-SFT Technical Report

Training computer-use agents (CUAs) – models that interact with graphical desktops through screenshots and keyboard/mouse actions – requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing training UI-TARS 7B on AgentNet causes OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content – 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs – and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld – an 18.7 percentage-point improvement over the base model and over 35% above AgentNet-trained counterparts. A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.

04.
arXiv (CS.CV) 2026-06-11

Time-Conditioned and Multi-Time Survival Prediction from 2D PET/CT Projections in Lung Cancer

Accurate prediction of overall survival (OS) from positron emission tomography/computed tomography (PET/CT) can support personalized treatment and follow-up strategies in oncology. However, the impact of temporal modeling on imaging-based survival prediction remains insufficiently explored. We investigate how different temporal formulations influence survival prediction by developing two complementary approaches: Attention-guided Time-Conditioned Survival (ATCS) and Multi-Time Survival (MTS). We retrospectively analyzed pre-treatment PET/CT images from 848 patients with non-small cell lung cancer (NSCLC), including 556 for model development and 292 for held-out testing. A previously proposed Time-Conditioned Survival (TCS) model was used as a baseline. Models were trained using 5-fold cross-validation and evaluated on the test set using time-dependent area under the curve (AUC) at 6-month intervals from 0.5 to 5 years. Both ATCS and MTS outperformed the baseline TCS model, achieving mean AUCs of 0.794 and 0.793, respectively, compared to 0.767. ATCS performed better at earlier time points (0.5-3 years), whereas MTS performed better at later intervals (3.5-5 years). Combining tumor-specific and tissue-wise PET/CT features improved performance over either input alone. Finer temporal discretization improved short-term prediction, while coarser intervals provided more stable long-term estimates. These findings demonstrate that temporal modeling and input design influence PET/CT-based survival prediction. The proposed approaches enable time-specific survival estimation from pre-treatment imaging and may support improved risk stratification and clinical decision-making.

05.
arXiv (CS.CV) 2026-06-18

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion:synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse thanits classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

07.
arXiv (CS.CV) 2026-06-16

Gaussian Spatial Priors for Anatomy-Aware Object Detection in Surgical Videos

Detecting anatomical structures in surgical video is essential for intraoperative safety frameworks such as the Critical View of Myopectineal Orifice (CVMPO) in inguinal hernia repair. While prominent structures like the Cooper's Ligament and Triangle of Doom are reliably detected by standard methods, smaller structures such as the epigastric vessels remain challenging due to their visual ambiguity and intermittent visibility. We observe that the spatial relationship between structures is anatomically constrained, and propose a Gaussian Spatial Prior (GSP) module that encodes this relationship as a compact, parametric bias injected into the self-attention of a DAB-DETR decoder. The prior is computed offline from training annotations as a small set of frozen Gaussian parameters and recomputed at each decoder layer using the iteratively refined reference points. On a dataset of inguinal hernia repair videos with 5-fold cross-validation, GSP improves dependent class detection by $+33.5\%$ ($AP_{50}$) over DAB-DETR and $+53.9\%$ over YOLOv26, while also improving anchor detection by $+6.0\%$. These gains are statistically significant across all folds ($p=0.012$, paired $t-$test).

08.
arXiv (CS.AI) 2026-06-24

Beyond U-Net: A Latent-Representation-Aligned Skip-Free Backbone for Flow-Matching Speech Enhancement

arXiv:2606.24745v1 Announce Type: cross Abstract: Generative models, particularly diffusion and score-based approaches, have recently achieved strong performance in speech enhancement, but their iterative sampling process limits real-time deployment. Flow Matching offers an efficient alternative by transporting noisy speech toward clean speech through an ordinary differential equation with few function evaluations. In this work, we propose a skip-free encoder-decoder backbone for flow-matching speech enhancement, guided by Latent Representation Alignment (LRA). Instead of relying on U-Net skip connections, which may transfer noise-correlated low-level features to the decoder, the proposed model aligns its bottleneck and decoder representations with clean latent features extracted from a frozen Descript Audio Codec encoder-decoder without quantization. This codec-aligned supervision promotes compact clean-speech representations while preserving efficient few-step inference. Experiments on WSJ0-CHiME3 and VoiceBank-DEMAND show improved PESQ and perceptual quality, especially on VoiceBank-DEMAND, using only five function evaluations.

09.
arXiv (CS.CL) 2026-06-24

A Synthetic Reliability-Aware PINN Benchmark for Offshore Wind Turbine Support-Structure Monitoring with Bayesian Inverse Identification

Reliable structural health monitoring (SHM) of offshore wind turbine (OWT) support structures requires fast state estimation from sparse measurements. Repeated high fidelity finite element or aeroelastic analyses are difficult to use directly in online monitoring loops, while purely data-driven surrogates can require large training sets. This paper presents Digi Turbine, a synthetic reliability-aware Physics Informed Neural Network (PINN) benchmark for OWT monopile support structure monitoring. The workflow embeds a simplified Euler Bernoulli beam equation with Winkler soil foundation in the training objective, couples it with Bayesian-prior-informed inverse identification, and adds First Order Reliability Method (FORM) screening. All validation uses synthetic configurations with analytical or finite-difference ground truth motivated by the NREL 5MW reference turbine context.

10.
arXiv (CS.CL) 2026-06-11

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

11.
arXiv (CS.CV) 2026-06-11

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.

12.
arXiv (CS.CV) 2026-06-18

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios.However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBenc comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior,standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

13.
arXiv (CS.LG) 2026-06-18

Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference

arXiv:2605.20726v2 Announce Type: replace-cross Abstract: Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini-Hochberg procedure. This approach fails to provide high-probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite-sample, distribution-free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high-probability envelope for the empirical distribution function of null conformal p-values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope's shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real-data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.

14.
arXiv (CS.CV) 2026-06-16

Self-Supervised Learning as Discrete Communication

Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.

15.
arXiv (CS.CV) 2026-06-24

MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce MSPL, a framework that incorporates multi-step visual reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps-object localization, category recognition, and background grounding-where these intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that MSPL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/mspl.

16.
arXiv (CS.LG) 2026-06-24

Experiments with Optimal Model Trees

arXiv:2503.12902v4 Announce Type: replace Abstract: Model trees provide an appealing way to perform interpretable machine learning for both classification and regression problems. In contrast to ``classic'' decision trees with constant values in their leaves, model trees can use linear combinations of predictor variables in their leaf nodes to form predictions, which can help achieve higher accuracy and smaller trees. Typical algorithms for learning model trees from training data work in a greedy fashion, growing the tree in a top-down manner by recursively splitting the data into smaller and smaller subsets. Crucially, the selected splits are only locally optimal, potentially rendering the tree overly complex and less accurate than a tree whose structure is globally optimal for the training data. In this paper, we empirically investigate the effect of constructing globally optimal model trees for classification and regression with linear support vector machines at the leaf nodes. To this end, we present mixed-integer linear programming formulations to learn optimal trees, compute such trees for a large collection of benchmark data sets, and compare their performance against greedily grown model trees in terms of interpretability and accuracy. We also compare to classic optimal and greedily grown decision trees, random forests, and support vector machines. Our results show that optimal model trees can achieve competitive accuracy with very small trees. We also investigate the effect on the accuracy of replacing axis-parallel splits with multivariate ones, foregoing interpretability while potentially obtaining greater accuracy.

17.
medRxiv (Medicine) 2026-06-10

Global and local genetic overlap among ME/CFS, irritable bowel syndrome and psychiatric traits: a hypothesis-generating analysis

作者:

Background. Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) and irritable bowel syndrome (IBS) frequently co-occur following infection, yet shared genetic architecture at the locus level has not been systematically characterised. Aims. To estimate global and local genetic correlations between ME/CFS (including infection-onset subgroup), IBS, major depressive disorder (MDD) and loneliness/isolation, and characterise ME/CFS cell-type heritability enrichment. Method. GWAS summary statistics: DecodeME (15,579 ME/CFS; 9,738 infection-onset), FinnGen R9 (9,296 IBS), PGC MDD Wave 2 (45,396) and UK Biobank loneliness (N=455,364). LDSC for global correlations; LAVA for local correlations across 2,495 loci; MAGMA for cell-type enrichment (Descartes Human atlas); coloc.abf for colocalisation. Results. All pairwise global correlations were significant after Bonferroni correction, including ME/CFS-all-MDD (rg=0.598, 95% CI 0.46-0.74) and ME/CFS-all-IBS (rg=0.573, 0.39-0.75). Of 4,232 local tests, 16 reached FDR

19.
arXiv (CS.AI) 2026-06-12

M*: A Modular, Extensible, Serving System for Multimodal Models

arXiv:2606.12688v1 Announce Type: cross Abstract: We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

20.
arXiv (CS.CV) 2026-06-24

What Do Flow-Based Inverse Solvers Approximate? A Posterior-Transport View

A growing family of training-free solvers – FlowDPS, FLOWER, PnP-Flow and their diffusion ancestors (DPS, DAPS) – repurpose a pretrained flow-matching prior to solve imaging inverse problems by adding a measurement-guidance term to the deterministic probability-flow ODE. Despite strong empirical results, what these per-step corrections actually approximate – and how far the resulting samples are from the true posterior $p(x\mid y)$ – has not been characterized. We give a posterior-transport account of flow-based inverse problem solving. Our starting point is a simple but consequential fact: for a deterministic flow prior, Bayesian conditioning is realized entirely by a reweighting of the source distribution, not by a drift correction; pushing the reweighted source through the unmodified velocity field yields exact posterior samples. From this we show that trajectory-guidance solvers can be read as the minimum-kinetic-energy correction field needed to morph the unconditional source into the posterior, and that FlowDPS / FLOWER / PnP-Flow correspond to distinct zeroth-order / Gaussian / proximal approximations of this single object; we bound the resulting posterior bias in Wasserstein distance. A controlled $2$D study with a closed-form posterior confirms the theory decisively: source reweighting matches the true posterior to the Monte-Carlo floor on every metric, whereas trajectory guidance incurs $200$–$800\times$ larger error and collapses posterior modes, regardless of guidance strength. Guided by the analysis we propose a cheap, principled velocity-correction solver that is competitive across two in-domain priors (AFHQ, CelebA) and two out-of-distribution settings while, unlike point-estimate source-space optimizers, producing diverse posterior samples with uncertainty that correlates with reconstruction error.

21.
arXiv (CS.LG) 2026-06-19

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

arXiv:2606.20437v1 Announce Type: cross Abstract: Charged-particle tracking – reconstructing trajectories from sparse detector measurements – is a fundamental high-energy-physics inference problem and a canonical example of learning under extreme combinatorial ambiguity. At the High-Luminosity Large Hadron Collider (HL-LHC), tracking must remain accurate and efficient despite unprecedented collision densities. Graph neural networks perform strongly, but incur substantial costs from graph construction and processing, while transformer-based approaches rely on auxiliary stages that prevent end-to-end optimization. To address this, we present HEPTv2, an end-to-end point-transformer architecture that reconstructs tracks from detector hits in one trainable pipeline. HEPTv2 combines a locality-aware point encoder with a track decoder that predicts complete trajectories without graph-building, clustering, or filtering. The encoder uses locality-sensitive hashing in detector coordinate space to preserve tracking-relevant geometry while enabling efficient local attention. The decoder resolves ambiguities through sectorized decoding and direct hit-to-track prediction under joint encoder-decoder supervision, allowing the full pipeline to be optimized end-to-end. On TrackML, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, while requiring only $\sim$15~ms inference time and 0.4~GB peak memory per event on a NVIDIA A100 GPU. Latency and memory scale approximately linearly for events with up to $5\times10^5$ hits. HEPTv2 establishes a new state of the art in the accuracy-latency trade-off, improving efficiency by 4.5% over the strongest prior transformer and by 1.1–2.2% over optimized graph-based pipelines, while reducing latency by factors of 7 and 38–52, respectively. These results show end-to-end transformers can deliver the accuracy and efficiency required for real-time particle reconstruction at the HL-LHC.

22.
arXiv (CS.AI) 2026-06-18

Rescaling MLM-Head for Neural Sparse Retrieval

arXiv:2606.18811v1 Announce Type: cross Abstract: Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse under standard SPLADE training recipes. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

23.
arXiv (CS.LG) 2026-06-16

Audited Conformal Prediction for Classification under Unknown Distribution Shift

arXiv:2606.14909v1 Announce Type: cross Abstract: We consider the problem of uncertainty quantification for a pretrained classification model deployed under unknown distribution shift. We propose Audited Conformal Prediction (ACP), a method that leverages a small labeled dataset from the target population to train an auxiliary audit model identifying inputs where the legacy model is likely to fail. By integrating the audit model's outputs into the conformal prediction framework, ACP produces prediction sets that guarantee marginal coverage while achieving substantially higher conditional coverage in practice than existing approaches. We develop and analyze two complementary integration strategies – one targeting marginal coverage with improved conditional performance, the other providing explicit group-conditional coverage guarantees – and establish theoretical guarantees for both. Experiments on synthetic and real-world datasets validate the method and illustrate trade-offs between prediction set size and conditional coverage.

24.
arXiv (CS.CV) 2026-06-11

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, the Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\mathbf{\beta}$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\mathbf{\beta}=5$ provides the strongest result among the tested beta settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6 %, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.

25.
arXiv (CS.AI) 2026-06-24

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

arXiv:2605.01482v3 Announce Type: replace Abstract: Multi-Hop Fact Verification requires complex reasoning across disparate evidence, posing significant challenges for Large Language Models , which may suffer from hallucinations and fractured logical chains. Existing methods, while improving transparency via Chain-of-Thought , often lack explicit modeling of the structural dependencies between evidence and claims. In this work, we introduce an SCM-inspired framework that grounds reasoning in explicit directed dependency graphs, treating verification as a constructive structural reasoning process rather than full causal inference with interventions or counterfactual semantics. We empirically identify an "inverted U-shaped" correlation between reasoning-chain length and accuracy, revealing that excessive structural complexity can degrade performance. To address this, we propose a rule-based reinforcement learning strategy using Group Relative Policy Optimization. This approach dynamically optimizes the trade-off between structural depth and conciseness. Extensive experiments on HoVer and EX-FEVER demonstrate that our SCM-GRPO framework outperforms strong baselines while producing more traceable reasoning structures for complex fact verification.