Academic Intelligence · Curated Daily

Explore the Frontiers of Global Academic Research

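AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefings across interdisciplinary fields.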

01.
arXiv (CS.CV) 2026-06-16

SAMTok: Representing Any Mask with Two Words

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
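The "two tokens per mask" idea rests on residual vector quantization, which can be sketched in a few lines. This is a minimal illustration of the general technique only: the two tiny codebooks, the 2-D vectors, and the function names are assumptions made for the example, not SAMTok's actual encoder, codebook sizes, or training procedure.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes what the previous one missed."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]]
fine = [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]]

tokens = rvq_encode([4.4, 3.6], [coarse, fine])   # two indices, one per stage
recon = rvq_decode(tokens, [coarse, fine])
```

With two stages, every input is represented by exactly two discrete indices, which is what lets a language model treat the pair as two ordinary vocabulary items.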

02.
arXiv (CS.CV) 2026-06-18

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

In computer vision, indoor image recognition is challenging because of the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lack of indoor training images, we introduce a novel approach that uses Stable Diffusion (SD) to generate synthetic images, which serve as a powerful data augmentation tool. SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach recognizes SD-generated images with 100% accuracy using MobileNetV3.

04.
arXiv (CS.AI) 2026-06-12

The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems

arXiv:2606.13079v1. Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

05.
arXiv (CS.CL) 2026-06-11

Scenario-based Probing and Steering Cultural Values in Large Language Models–Extended Version

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
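Activation steering is usually implemented by adding a scaled direction vector to a hidden state at inference time. The sketch below builds that direction as a difference of mean activations, which is one common choice and an assumption here, since the abstract does not say how the steering directions are derived. All names and the toy 2-D activations are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def steering_vector(pos_acts, neg_acts):
    """Direction pointing from the 'negative' mean activation to the 'positive' one."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha sets strength and sign."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations collected on prompts expressing opposite ends of one value axis.
direction = steering_vector(
    pos_acts=[[1.0, 2.0], [3.0, 2.0]],
    neg_acts=[[0.0, 2.0], [0.0, 0.0]],
)
shifted = steer([0.0, 0.0], direction, alpha=0.5)
```

Because the shift is a plain vector addition, a direction built for one axis can have a nonzero projection onto another, which is one simple way to picture the entanglement the abstract reports.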

06.
arXiv (CS.AI) 2026-06-16

Do we have the knowledge we need? Rethinking human-AI decision-making in corporations

arXiv:2606.15575v1. Organizational knowledge is fragmented across a variety of software systems, tacit expertise, and manual documents that have traditionally been designed for human consumption. As AI systems are increasingly deployed and granted decision-making roles, they require access to this knowledge. This raises two questions: how should organizations store and maintain knowledge so that it remains accessible to both humans and future AI systems, and how should agency be allocated between humans and AI across tasks with different risks and levels of uncertainty? In this position paper, we describe how organizational knowledge evolves and contribute a framework that maps task attributes and knowledge availability to recommended agency allocations and control mechanisms. We illustrate the applicability of the framework on two different manufacturing tasks: a routine operation (visual quality inspection) and a one-off strategic decision (factory location), and conclude with opportunities for future research.

07.
arXiv (CS.CV) 2026-06-11

CoVEBench: Can Video Editing Models Handle Complex Instructions?

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

08.
arXiv (quant-ph) 2026-06-15

OQMD: Single-Qubit Rotation Control Improves Low-CNOT Multiclass Quantum Classification

arXiv:2606.14088v1. Near-term variational classifiers incur substantial error and latency from two-qubit gates, yet practitioners often assume that additional entangling depth is the default route to higher accuracy. This work studies Optimal Quantum Measurement Decoding (OQMD): optimizing how quantum outcomes are mapped to classical labels by training a readout layer before measurement, jointly with the variational circuit, without adding CNOTs. Experiments use trainable triple single-qubit rotations as one concrete, hardware-native realization of OQMD; other single-qubit parametrizations fit the same classical outer loop. On the Iris benchmark with a 30-point stratified test split, the best observed 0-CNOT configuration with OQMD reaches 83.33% accuracy, with 96% at 9 CNOTs, exceeding the best 18-CNOT controls (56.67%) and the best 18-CNOT configuration with OQMD (66.67%) under a common protocol. A six-point CNOT-depth series from 0 to 18 (fixed optimizer, iteration budget, random-seed count, and ZXZ readout) shows that the highest raw scores need not occur at the largest template, so aggregate complexity is not summarized by CNOT count alone. Because run-level accuracies are discrete and non-Gaussian, we emphasize best-observed scores and, where a global comparison of pooled runs is required, Mann–Whitney $U$ tests rather than parametric tests on means. Across architectures, OQMD shows statistically consistent but magnitude-dependent gains: large peak lifts on minimal circuits coexist with a small pooled mean shift on complex 18-CNOT runs ($p\approx 0.03$) that is not "universal" in the sense of uniformly large practical effects.
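A ZXZ readout layer can be pictured as three single-qubit rotations applied just before measurement. The sketch below composes Rz, Rx, Rz for one qubit and reports the probability of measuring 0. The angle ordering, the function names, and the use of a single qubit are assumptions for illustration; the abstract does not give the circuit layout.

```python
import cmath
import math


def rz(theta):
    """Rotation about Z by theta."""
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]


def rx(theta):
    """Rotation about X by theta."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def zxz(alpha, beta, gamma):
    """Rz(alpha) @ Rx(beta) @ Rz(gamma); the gamma rotation acts on the state first."""
    return matmul(rz(alpha), matmul(rx(beta), rz(gamma)))


def prob_zero(state, angles):
    """Probability of outcome 0 after applying the ZXZ layer to a 1-qubit state."""
    u = zxz(*angles)
    amp0 = u[0][0] * state[0] + u[0][1] * state[1]
    return abs(amp0) ** 2


p_half = prob_zero([1, 0], (0.0, math.pi / 2, 0.0))  # equal mix of 0 and 1
```

The three angles are ordinary real parameters, so they can be trained by the same classical optimizer as the rest of the circuit without adding any two-qubit gates.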

09.
arXiv (CS.CL) 2026-06-11

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness, that require no learned parameters. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$–$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, Platonic validity: the spectral signal tracks logical coherence rather than compiler acceptance, proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, architectural determinism: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$–$6.6$\%, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.
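Of the four diagnostics, the Fiedler value has a standard definition: the second-smallest eigenvalue of the graph Laplacian. The sketch below computes it for a small attention-like matrix using shifted power iteration in plain Python. Symmetrizing the matrix and dropping self-loops are assumptions made here to obtain an undirected graph; the paper's exact preprocessing and its definitions of HFER, spectral entropy, and smoothness are not reproduced.

```python
import math


def fiedler_value(attn, iters=300):
    """Second-smallest Laplacian eigenvalue of the symmetrized attention graph."""
    n = len(attn)
    # Assumption: average A and its transpose and remove the diagonal.
    w = [[0.0 if i == j else 0.5 * (attn[i][j] + attn[j][i]) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in w]

    def laplacian_times(v):
        # Computes (D - W) v.
        return [deg[i] * v[i] - sum(w[i][j] * v[j] for j in range(n))
                for i in range(n)]

    def project_and_normalize(v):
        # Remove the constant component (the eigenvalue-0 eigenvector), then rescale.
        mean = sum(v) / n
        v = [x - mean for x in v]
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    # 2 * max degree bounds the largest Laplacian eigenvalue; +1 keeps the
    # shifted operator strictly positive so the iterate never collapses to zero.
    shift = 2.0 * max(deg) + 1.0
    v = project_and_normalize([math.sin(i + 1.0) for i in range(n)])
    for _ in range(iters):
        lv = laplacian_times(v)
        # Power iteration on (shift*I - L) favors the smallest remaining eigenvalue of L.
        v = project_and_normalize([shift * x - y for x, y in zip(v, lv)])
    lv = laplacian_times(v)
    return sum(x * y for x, y in zip(v, lv))  # Rayleigh quotient


# A 3-token chain: token 1 attends to tokens 0 and 2, which do not attend to each other.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lam2 = fiedler_value(path)
```

A larger Fiedler value means the token graph is harder to cut into two weakly connected parts. In practice a library eigensolver would replace the power iteration; it is written out here only to keep the example dependency-free.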

10.
medRxiv (Medicine) 2026-06-23

Multivariate Echocardiographic Phenotyping of Hypertensive Heart Failure Using Unsupervised Machine Learning: A Pilot Study

Background Heart failure in hypertensive patients is heterogeneous and poorly captured by traditional left ventricular ejection fraction (LVEF) based classification. Multivariate echocardiographic data combined with unsupervised machine learning may provide a more precise phenotypic characterization. This pilot study evaluated the feasibility of unsupervised clustering of routine transthoracic echocardiographic data to identify phenotypic subgroups of hypertensive heart failure. Methods This retrospective pilot study analyzed transthoracic echocardiography reports from hypertensive patients with clinical heart failure. After data cleaning and exclusion of incomplete records, 102 patients with 11 echocardiographic variables were included. Variables describing left ventricular geometry, systolic function, and diastolic performance were standardized and subjected to K-means clustering. Optimal cluster number was determined using the elbow method and silhouette analysis. Cluster characteristics were assessed using descriptive statistics and Kruskal Wallis testing. Concordance with LVEF based heart failure categories was evaluated. Results Three distinct echocardiographic phenotypes were identified. Cluster 0 (n = 50) demonstrated preserved LVEF with concentric remodeling, consistent with heart failure with preserved ejection fraction (HFpEF) phenotype. Cluster 1 (n = 37) showed marked ventricular dilation and reduced systolic function, consistent with heart failure with reduced ejection fraction (HFrEF). Cluster 2 (n = 15) exhibited concentric hypertrophy with intermediate LVEF, consistent with heart failure with mildly reduced ejection fraction (HFmrEF) like phenotype. All echocardiographic variables differed significantly across clusters (p < 0.001). 
While Cluster 0 showed strong concordance with HFpEF (96%), Clusters 1 and 2 demonstrated substantial overlap across LVEF categories, indicating partial discordance between structural phenotypes and LVEF based classification. Conclusion Application of unsupervised machine learning to routine echocardiographic data identifies distinct heart failure phenotypes in hypertensive patients. These phenotypes demonstrate significant structural heterogeneity beyond LVEF based classification, supporting the utility of data-driven approaches for refined cardiac phenotyping. This pilot study provides a foundation for larger prospective studies.
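The Methods step of standardizing the variables and then running K-means can be sketched as follows. The six two-variable rows are invented placeholder values, not patient data from the study, and initializing the centers from the first k rows is a simplification chosen to keep the example deterministic.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so variables on different scales contribute equally."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def kmeans(rows, k, iters=100):
    """Lloyd's algorithm; centers start at the first k rows (an assumption)."""
    centers = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            new_centers.append(
                [mean(col) for col in zip(*members)] if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical rows: [ejection fraction in %, ventricular diameter in cm].
raw = [[60, 4.5], [25, 6.5], [62, 4.6], [28, 6.8], [58, 4.4], [30, 6.4]]
labels, centers = kmeans(standardize(raw), k=2)
```

Choosing k would then follow the elbow and silhouette analysis the abstract mentions, by repeating the fit for several values of k and comparing the resulting scores.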

11.
arXiv (CS.CV) 2026-06-16

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining and inter-layer dependencies that complicate optimization, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS), a training-free method that filters redundant noise channels at inference time. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%p) with 39.4% FLOPs reduction. ToaSt also transfers effectively to diverse downstream tasks (COCO detection, ADE20K segmentation, CIFAR-100 classification), achieving 52.2 versus 51.9 mAP on COCO. Code: github.com/SHANNonLab-HUFS/ToaSt

12.
bioRxiv (Bioinfo) 2026-06-19

Evaluation of analysis modes for RNA coexpression in single-cell and bulk tissue

Coexpression of transcripts presents the most common means of computational inference of transcription factor regulation, and is often combined with other data types to infer regulatory networks. With the growing popularity of single-cell approaches, there are questions about how best to extract coexpression information from the data. Recently we reported a simulation study that explored the differences among coexpression performed at different levels: across single cells (xCell, per cell type), across subjects from pseudobulked single-cell data (xSubject, per cell type), or across subjects using bulk tissue samples (xBulk). Here we test predictions made by those models using real data. We consider both preservation (consistency of coexpression findings across different levels of analysis of the same data) and replicability across independent studies, as well as biological interpretability. We find that preservation across levels is limited, indicating the choice of analysis level will affect outcomes. We show that xCell coexpression is more replicable across studies compared to xSubject. xBulk coexpression is dominated by patterns driven by variability in cellular composition and fails to capture much coexpression that is reliably detected at finer resolutions. While all modes of analysis exhibit some enrichment for known regulatory relationships, it was highest with the xCell mode. Finally, we present a case study of the effect of analysis modes on a schizophrenia-associated pattern, reinforcing the importance of analytic choices in the interpretation and replicability of coexpression analyses. Together with our modeling study, this work emphasizes the importance of understanding sources of expression covariation as they relate to the goals of the analysis, and recommends that single-cell-based data with biological replicates be the focus of attempts to infer dynamic regulatory interactions that are more likely to be replicable by others.

13.
bioRxiv (Bioinfo) 2026-06-22

EMAlign: accurate alignment of cryo-EM maps through main-chain probability using deep learning

Accurate alignment of cryo-EM density maps is essential for comparing conformational states, searching map libraries, and guiding atomic model building, but remains challenging for noisy experimental maps and partially overlapping structures. Existing alignment methods are often based on raw maps, which may result in reduced accuracy due to the density noise, or require manual intervention for local alignment, which limits their general applicability. To address these limitations, we present EMAlign, an automatic global and local cryo-EM map alignment method that uses main-chain probability predicted by deep learning. First, EMAlign predicts main-chain probability maps from raw cryo-EM density maps using a BiMCUNet network. Then, a fast Fourier transform (FFT)-based search strategy is used to globally search for the accurate alignment between cryo-EM maps based on the predicted main-chain probability maps. As such, the main-chain probability map overcomes the noisy raw map problem, and the FFT-based exhaustive global search ensures the general applicability of alignment. EMAlign is evaluated on 64 global map pairs, 195 local map pairs, and 60 structure-to-map pairs at 3-10 Å resolution and compared with gmfit, fitmap, VESPER, and CryoAlign. It is shown that EMAlign outperforms the other methods in both global and local alignment, achieving mean RMSDs of 1.03 Å (global), 2.56 Å (local), and 0.82 Å (structure-to-map), with success rates of 100.0%, 100.0%, and 98.3% under the criterion of RMSD < 10 Å. The EMAlign package is freely available at https://github.com/huang-laboratory/EMAlign/.

14.
bioRxiv (Bioinfo) 2026-06-11

A quantitative coordinate system for developmental dynamics

Quantitative comparison of morphogenesis across individuals remains a fundamental challenge, as developing embryos vary in shape, orientation and developmental tempo. Moreover, real-time three-dimensional imaging generates large, heterogeneous four-dimensional datasets that are difficult to directly align. As a result, developmental variability is typically described qualitatively rather than measured. Here we introduce STERN, a quantitative framework that learns continuous spatiotemporal representations of morphogenesis directly from in vivo 4D imaging data. By embedding embryos into a shared spatiotemporal space, STERN defines a quantitative developmental coordinate system that enables direct comparison of developmental trajectories across individuals without requiring explicit registration or staging. Applied to mouse embryogenesis, STERN reveals that embryos follow conserved developmental trajectories while progressing at distinct temporal rates, providing a quantitative measure of developmental heterochrony. Extending this framework to zebrafish neural crest light-sheet timelapse imaging, we further show that developmental order is preserved across distinct imaging views even with altered anatomical coverage, supporting the generality of the learned representation across vertebrate imaging contexts. Finally, in developing mouse hearts, where morphogenesis proceeds through subtle and continuously evolving structural changes, STERN resolves fine-scale developmental dynamics at minute-scale temporal resolution that are difficult to localize reproducibly using human experts or general-purpose multimodal AI. Together, these results establish a shared quantitative coordinate system for morphogenesis, in which developmental trajectories become directly comparable across individuals and developmental variability becomes a measurable property.

15.
arXiv (quant-ph) 2026-06-12

Supersymmetry of dissipative Bose-Fermi systems with application to Jaynes-Cummings and Dicke models

arXiv:2606.12682v1. We demonstrate how supersymmetries of Hamiltonians for coupled Bose-Fermi systems can be used to place the Hamiltonians of the Jaynes-Cummings model and Dicke model under the rotating wave approximation in matrix form and provide explicit analytic solutions for their eigenvalues. We then use this supersymmetry to place the Liouvillians of the associated Markovian open systems in matrix form and provide explicit solutions for their eigenvalues. These results are a consequence of the fact that the Hamiltonian of the Jaynes-Cummings model commutes with the linear Casimir invariant of the superalgebra $u(1|1)$ and that the Hamiltonian of the Dicke model commutes both with the linear invariant of $\sum_{i} u_{i}(1|1)$ and with the invariant of an additional $su(2)$ algebra. Our methods apply to various coupled Bose-Fermi systems with $u(1|1)$ and more generally with $u(n|m)$ dynamical superalgebras, and may provide efficient tools for studying more complicated examples.
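For orientation, the textbook matrix form of the Jaynes-Cummings Hamiltonian under the rotating wave approximation is reproduced below with $\hbar = 1$. The symbols $\omega$ (mode frequency), $\omega_0$ (two-level splitting), $g$ (coupling), and $\Delta = \omega_0 - \omega$ are notation assumed here, and the derivation is the standard excitation-number argument, not the paper's supersymmetric construction.

```latex
% Jaynes-Cummings Hamiltonian in the rotating wave approximation
H = \omega\, a^{\dagger} a + \tfrac{\omega_0}{2}\,\sigma_z
    + g\left(a\,\sigma_{+} + a^{\dagger}\sigma_{-}\right)

% H only couples |n, e> with |n+1, g>, so in that basis it is a 2x2 block:
H_n =
\begin{pmatrix}
  n\,\omega + \tfrac{\omega_0}{2} & g\sqrt{n+1} \\
  g\sqrt{n+1} & (n+1)\,\omega - \tfrac{\omega_0}{2}
\end{pmatrix},
\qquad n = 0, 1, 2, \dots

% Eigenvalues of each block, with detuning \Delta = \omega_0 - \omega:
E_{n,\pm} = \left(n + \tfrac{1}{2}\right)\omega
            \pm \tfrac{1}{2}\sqrt{\Delta^{2} + 4 g^{2}(n+1)}

% The state |0, g> is uncoupled and has energy -\omega_0 / 2.
```

The block structure arises because $H$ conserves the total excitation number $a^{\dagger}a + \sigma_{+}\sigma_{-}$, so each block can be diagonalized on its own.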

16.
arXiv (CS.CV) 2026-06-19

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

17.
arXiv (CS.LG) 2026-06-16

Decomposing one-class support vector machine into an ensemble of one-data support vector machines

arXiv:2606.16002v1. One-class classification (OCC) is a classification problem in which the training data contains only one class. The one-class support vector machine (OCSVM) is one of the most competitive OCC algorithms. However, OCSVM has scalability issues with large-scale datasets. This paper proposes an acceleration strategy for OCSVM. The idea is to decompose the dataset into individual samples and train an OCSVM model for each single data point. Subsequently, ensemble learning is applied to combine all models to compute the OCSVM model for the dataset. In addition, further acceleration is achieved through a data-reduction strategy with an OCSVM model trained on the average of the training samples. The experiment compared the proposal and traditional OCSVM using a Python package. The proposed strategy is faster than traditional OCSVM, while achieving similar classification results. Moreover, the proposed strategy can create a one-to-one correspondence between samples and models. Source code is available at https://github.com/ToshiHayashi/ODSVM

18.
medRxiv (Medicine) 2026-06-22

Virtual Responsive Neurostimulation Implantation: From Intracranial Connectivity to Optimized Lead Placement

Responsive neurostimulation (RNS) is an implanted device that delivers direct brain stimulation for drug-resistant focal epilepsy. Individual responses are highly variable, and no validated framework exists to predict outcome or guide lead placement before implantation. We hypothesized that this variability is partly explained by lead placement in relation to patterns of functional connectivity in brain networks. Forty-nine patients with drug-resistant focal epilepsy who underwent pre-implantation intracranial EEG (iEEG) and RNS implantation across three independent epilepsy centers were retrospectively studied. We developed a composite functional connectivity score, based on simple Spearman correlation, combining the standard deviation and kurtosis of interictal iEEG connectivity distributions to predict the response outcome in a training cohort (HUP, n=18) and validated it in two independent cohorts (NYU, n=17; UCSF, n=14). We accounted for a spatial mismatch between iEEG and RNS electrodes with a distance-based correction. The score was extended to generate patient-specific 3D maps of predicted RNS efficacy across 200 simulated, or virtual RNS, lead configurations. Accuracy of the score in predicting clinical outcome was 72% at the group level, 61% at the individual patient level, and, after distance-based optimization, 100% in patients with RNS electrodes placed close to the location of iEEG electrodes. Applied to the validation cohort, the same score reached 68% accuracy (71% balanced accuracy, 55% sensitivity, 88% specificity). The spatial combination of the scores at different SEEG contact locations gives a spatial score for each patient. Responders showed significantly higher spatial scores than non-responders, supporting that actual RNS lead placement in responders was located in map-identified favorable regions. 
Interictal iEEG functional connectivity predicts individual RNS response across independent epilepsy centers, and patient-specific 3D maps derived from this biomarker could prospectively guide lead implantation toward favorable network regions, opening a promising avenue toward network-informed RNS surgical planning.
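Balanced accuracy is the mean of sensitivity and specificity, which makes the validation-cohort figures easy to cross-check: the rounded 55% and 88% average to about 71.5%, in line with the reported 71% given that the published inputs are themselves rounded. The helper names and the confusion-matrix counts below are illustrative, not taken from the study.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2.0


def rates_from_counts(tp, fn, tn, fp):
    """Sensitivity and specificity from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


# Cross-check using the rounded figures quoted in the abstract.
reported = balanced_accuracy(0.55, 0.88)

# Same calculation starting from hypothetical counts.
sens, spec = rates_from_counts(tp=6, fn=4, tn=9, fp=1)
from_counts = balanced_accuracy(sens, spec)
```

Unlike plain accuracy, this measure does not reward a classifier for favoring the majority class, which matters when responders and non-responders are unevenly represented.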

19.
arXiv (CS.AI) 2026-06-11

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

arXiv:2602.08986v2. In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.
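The combination of node-wise imbalance weighting and focal weighting can be illustrated with a small loss function. This sketch substitutes the standard focal term $(1 - p_t)^\gamma$ for the paper's ensemble-uncertainty weighting, which the abstract does not specify, and uses inverse node frequency as the imbalance weight. Both choices and all names are assumptions for illustration.

```python
import math


def node_weights(label_matrix):
    """Inverse-frequency weight per node (column), rescaled to average 1."""
    n = len(label_matrix)
    freqs = [max(sum(col), 1) / n for col in zip(*label_matrix)]
    raw = [1.0 / f for f in freqs]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]


def weighted_focal_bce(probs, labels, weights, gamma=2.0):
    """Binary cross-entropy per node, scaled by node weight and a focal factor."""
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p_t = p if y == 1 else 1.0 - p          # probability given to the true label
        focal = (1.0 - p_t) ** gamma            # small when the model is already confident
        total += -w * focal * math.log(max(p_t, 1e-12))
    return total / len(probs)


# Toy hierarchy with three nodes: root (always on), child, rare grandchild.
train_labels = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
weights = node_weights(train_labels)

miss_rare = weighted_focal_bce([0.9, 0.9, 0.1], [1, 1, 1], weights)
miss_root = weighted_focal_bce([0.1, 0.9, 0.9], [1, 1, 1], weights)
```

Weighting by node rather than by sample means an error on the rare grandchild costs more than the same error on the root, which is the behavior the abstract describes as emphasizing rare nodes.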

20.
arXiv (CS.CV) 2026-06-16

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering within a single framework. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.

21.
arXiv (CS.CV) 2026-06-15

Avatar V: Scaling Video-Reference Avatar Video Generation

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.

22.
arXiv (CS.CL) 2026-06-12

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

As the capabilities of Large Language Models (LLMs) grow, there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset precondition a model to generate certain outputs. For example, one might want to know which training samples could be the source of toxic behavior in the trained LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods in this family are effective, they lack the processing speed and storage compactness needed for practical use on large datasets. We propose Influcoder, a quick and cost-effective approach to influence-based Data Attribution at scale.

23.
arXiv (CS.CV) 2026-06-19

GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.

24.
arXiv (CS.AI) 2026-06-17

Optimism Stabilizes Thompson Sampling for Adaptive Inference

arXiv:2602.06014v2

Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by \citet{halder2025stable}, using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.
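
The two optimistic modifications described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' code: the function names, the `inflation` and `bonus` parameters, the Gaussian reward model, and the choice to scale the bonus by the per-arm standard error are all assumptions made for illustration. Each arm's index is drawn from a Gaussian centered at its sample mean; `inflation > 1` widens the index variance, while `bonus > 0` shifts the index center upward and leaves the variance unchanged.

```python
import math
import random


def gaussian_ts(means, horizon, inflation=1.0, bonus=0.0, noise_sd=1.0, seed=0):
    """Simulate Thompson sampling with Gaussian randomized indices.

    Returns per-arm pull counts and per-arm sample means.
    `inflation` and `bonus` are hypothetical knobs for this sketch.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each sample mean is defined
        else:
            indices = []
            for a in range(k):
                scale = noise_sd / math.sqrt(counts[a])  # standard error of arm a
                center = sums[a] / counts[a] + bonus * scale  # optional mean bonus
                indices.append(rng.gauss(center, math.sqrt(inflation) * scale))
            arm = max(range(k), key=indices.__getitem__)  # play the largest index
        reward = rng.gauss(means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += reward
    return counts, [s / n for s, n in zip(sums, counts)]


def wald_interval(mean, n, noise_sd=1.0, z=1.96):
    """Wald interval for one arm, treating the pull count n as if it were fixed."""
    half = z * noise_sd / math.sqrt(n)
    return mean - half, mean + half


if __name__ == "__main__":
    # Two optimal arms and one suboptimal arm, with variance inflation.
    counts, mu_hat = gaussian_ts([0.5, 0.5, 0.0], horizon=2000, inflation=2.0)
    for n, m in zip(counts, mu_hat):
        print(n, wald_interval(m, n))
```

Comparing runs with `inflation=1.0` against `inflation=2.0` over many seeds gives a rough empirical view of how evenly pulls spread across the two optimal arms, which is the stability property the abstract refers to.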

25.
bioRxiv (Bioinfo) 2026-06-18

Bayesian modeling of longitudinal metatranscriptomes of broiler meat spoilage microbiomes shows shared predictive signature associated with spoilage at refrigerated temperatures

Microbial spoilage of packaged meat is driven by complex microbial succession and related metabolic activity, yet conventional shelf-life assessment is mainly based on shelf-life studies relying on culturing and sensory analysis. In routine quality assurance, results are obtained retrospectively, and they are only indirectly linked to the metabolic activity related to sensory deterioration. Functional, time informative approaches that capture the active metabolic state of the spoilage microbiome and predict the rate of spoilage are lacking. We developed a censoring-aware Gaussian process (CAGP) framework to model longitudinal pathway expression profiles from broiler meat metatranscriptomes collected over consecutive storage days at 4 or 6 °C. Samples were annotated using odor-based sensory scores defining fresh, early-spoilage, and late-spoilage phases. Because observed zeros in pathway-level data may reflect non-detection rather than true absence, the model treats low values as left-censored observations below a detection threshold while estimating smooth temporal trajectories with uncertainty. In leave-one-out prediction within the 4 °C time series, predicted sampling days differed from the true days by an average of 0.43 days, and predicted spoilage phases agreed with the sensory classification. Trajectories learned at 4 °C also transferred to an independent 6 °C time series at the spoilage-phase level, suggesting that shared functional spoilage programs are preserved despite temperature-dependent changes in spoilage rate. Cross-entropy ranking further identified pathway modules carrying time- and phase-informative signals across temperatures. Overall, this framework provides a probabilistic approach for linking metatranscriptomic functional dynamics to sensory spoilage progression, supporting shelf-life assessment beyond retrospective microbial enumeration.
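
The left-censoring idea in this abstract can be made concrete with a standard Tobit-style Gaussian likelihood. The sketch below is an illustration only and is not confirmed to match the authors' CAGP model: it assumes Gaussian observation noise with a known standard deviation, takes the latent trajectory values (`means`) as given rather than inferring them with a Gaussian process, and all function names are hypothetical. A value at or below the detection threshold contributes the probability of falling below that threshold; a detected value contributes the usual Gaussian density.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(x, mean, sd):
    """Log of P(X <= x) for X ~ N(mean, sd^2).

    Uses erfc directly, so it is only suitable for moderate z values;
    far in the lower tail erfc underflows to 0 and the log fails.
    """
    z = (x - mean) / sd
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def censored_loglik(values, means, sd, threshold):
    """Log-likelihood with left-censoring at `threshold` (assumed form)."""
    total = 0.0
    for y, m in zip(values, means):
        if y <= threshold:
            # Non-detection: only "true value is below threshold" is known.
            total += normal_logcdf(threshold, m, sd)
        else:
            # Detected value: ordinary Gaussian density.
            total += normal_logpdf(y, m, sd)
    return total


if __name__ == "__main__":
    observed = [0.0, 0.0, 1.2, 2.5]   # zeros treated as below detection
    latent = [0.1, 0.4, 1.0, 2.4]     # hypothetical trajectory values
    print(censored_loglik(observed, latent, sd=0.5, threshold=0.2))
```

The practical effect is that an observed zero does not pull the fitted trajectory toward zero itself; it only constrains the trajectory to be consistent with a value somewhere below the threshold.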