Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity label as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.

03.
arXiv (CS.CV) 2026-06-16

SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.

04.
arXiv (CS.AI) 2026-06-18

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

arXiv:2606.19004v1 Announce Type: cross Abstract: Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69–77\% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1)~we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training. (2)~SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4$-$6.4\times$ while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution $512\times512$ and $1280\times1280$.

05.
arXiv (CS.CL) 2026-06-19

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE). We formalize TOTEN as the triple : the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the instantiator family yields a self-descriptive structured representation. Robustness derives from deterministic coupling with three external oracles: Pint (dimensional), Unicode Character Database (typographic), and RSLP (Portuguese morphology). Intrinsic evaluation covers four properties verifiable by construction – ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction – over an internal, physically validated benchmark (EngQuant, N=800) and four Brazilian Portuguese external corpora (N=1771 eligible cases). We also report detection recall, distinguishing coverage from conditional atomicity. Against eight state-of-the-art baselines, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775-0.904 on external corpora, vs. 0.627-0.703 for the best baseline (Quantulum3); on EngQuant, 0.780 vs. 0.340. Differences are statistically significant (McNemar with Holm correction). Spearman correlation between internal and external rankings confirms concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

06.
arXiv (CS.AI) 2026-06-18

Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

arXiv:2602.19591v3 Announce Type: replace-cross Abstract: Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 0.003 on a temporally-split test set, outperforming an MLP baseline (0.590 0.002) and R-GCN (0.608 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14 lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.

07.
arXiv (CS.LG) 2026-06-17

Differential Privacy of Gaussian Process Posterior Sampling

arXiv:2606.17995v1 Announce Type: cross Abstract: We study the privacy of releasing posterior sample paths from a Gaussian process (GP) when the entire training set including covariates and responses is private. Unlike standard differential-privacy (DP) mechanisms that add external noise, posterior sampling is random by construction. We show that this intrinsic randomness yields DP guarantees by deriving explicit Rényi-DP bounds for GP posterior sample-path release. The bounds separate posterior-mean leakage from data-dependent posterior-covariance leakage showing that meaningful privacy depends sharply on effective ridge regularisation. We apply membership-inference attacks to show that empirical leakage follows the predicted dependence on regularisation, posterior variance and the number of released posterior sample-paths. Utility experiments on downstream posterior-sampling tasks identify noisy-observation regimes where privacy-compatible regularisation preserves useful decisions with modest utility loss. When stronger privacy is needed, the intrinsic guarantee can be sharpened by adding calibrated GP noise, providing an explicit additional privacy knob.

08.
arXiv (CS.LG) 2026-06-19

How to sketch a learning algorithm

作者:

arXiv:2604.07328v3 Announce Type: replace Abstract: How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are only $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.

09.
arXiv (CS.AI) 2026-06-16

Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

arXiv:2603.15952v2 Announce Type: replace Abstract: Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues – where ML approaches fail – achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

10.
bioRxiv (Bioinfo) 2026-06-21

GENATATORs: ab initio Gene Annotation With DNA Language Models

Inference of gene structure and location from genome sequences - known as de novo gene annotation - is a fundamental task in biological research. However, sequence grammar encoding gene structure is complex and poorly understood, often requiring costly transcriptomic data for accurate gene annotation. In this work, we benchmark current solutions and develop new methods of gene annotation. We show that pretrained DNA language model (DNA LM) embeddings do not capture the features necessary for precise gene segmentation, and that task-specific fine-tuning remains essential. We comprehensively evaluate the impact of model architecture, training strategy, receptive field size, dataset composition, and data augmentations on gene segmentation performance. We revisit standard evaluation protocols, showing that commonly used per-token and per-sequence metrics fail to capture the challenges of real-world gene annotation. We introduce and theoretically justify new biologically grounded metrics, along with benchmarking datasets that better capture annotation quality. We show that fine-tuned DNA LMs outperform existing annotation tools, generalizing across species separated by hundreds of millions of years from those seen during training, and providing segmentation of previously intractable non-coding transcripts and untranslated regions of protein-coding genes. Our results thus provide a foundation for new biological applications centered on accurate gene annotation.

11.
arXiv (CS.CL) 2026-06-12

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which a LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

12.
arXiv (CS.LG) 2026-06-18

AGDN: Learning to Solve Traveling Salesman Problem with Anisotropic Graph Diffusion Network

arXiv:2606.19185v1 Announce Type: new Abstract: The Traveling Salesman Problem (TSP) is a cornerstone of combinatorial optimization and arises in many practical scenarios. Although graph-based learning approaches have been explored for TSP, the question of how to exploit graph structure more effectively remains open. We present the Anisotropic Graph Diffusion Network (AGDN), a new Graph Neural Network framework designed to solve TSP. Our method tackles two central difficulties: (1) the lack of informative topological prior in fully connected TSP graphs, and (2) losing connected nodes in the optimal solution after the commonly used graph sparsification techniques. To overcome these issues, we construct a MixScore transition matrix that merges node similarity with pairwise distance, and we develop an anisotropic graph diffusion strategy that supports efficient information exchange across multiple hops. Comprehensive experiments spanning diverse instance sizes and node distributions show that AGDN consistently outperforms existing methods while keeping computation time competitive. Furthermore, AGDN generalizes well to problem sizes and distributions beyond those seen during training. The implementation is publicly available at: https://github.com/LabRAI/AGDN.

13.
arXiv (CS.LG) 2026-06-16

Early Anomaly-Onset Detection based on Wigner–Ville Distribution Slice Spectra: A Transmission-Grid Test Case

arXiv:2606.15856v1 Announce Type: cross Abstract: Operational disturbance monitoring in power networks requires decisions to be made from waveform windows as they arrive, rather than from completed records after the event. This study evaluates full-vector Wigner–Ville Distribution Slice (WVDS) spectra for sequential anomaly-onset detection in high-voltage grid-voltage waveforms. The approach keeps the bilinear midpoint interaction structure of the Wigner–Ville distribution and represents each 128-sample voltage window by a 128-dimensional slice spectrum, avoiding manually selected fault-frequency markers. WVDS is used with a baseline-normalized deviation (BND) score and is compared against the BND of Fast Fourier Transform (FFT-BND), raw-window autoencoders, FFT autoencoders, and WVDS autoencoders under the same thresholding and three-window persistence rule. A synthetic autoencoder–clustering teacher is used to select RTE fault records that start from an initially normal region and then transition to anomalous behavior. On the filtered test set, FFT-BND achieves the highest sensitivity, whereas WVDS-BND provides the lowest false-alarm operating point, reducing record-level pre-onset false alarms to 0.69%. The autoencoder comparison follows the same selectivity pattern: WVDS reconstruction decreases false alarms relative to FFT reconstruction but misses more examples. The results indicate that preserved WVD cross-term information can form a selective representation for online grid-waveform anomaly monitoring when false alarms are costly.

14.
arXiv (CS.CV) 2026-06-18

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

15.
arXiv (CS.AI) 2026-06-11

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

arXiv:2606.12016v1 Announce Type: cross Abstract: Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

16.
arXiv (CS.CV) 2026-06-11

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

17.
arXiv (quant-ph) 2026-06-19

Passive-User Bell-State Loop-Back Key Establishment without Quantum Detectors at the User Nodes

arXiv:2606.19551v1 Announce Type: new Abstract: We propose and analyze a Bell-state extension of the Loop-Back quantum key distribution architecture for secret-key establishment between two passive users that do not require quantum transmitters or quantum detectors. In the proposed setting, a single active station, Alice, provides the entangled-state infrastructure, retains one qubit of an initially prepared Bell pair, and sends the traveling subsystem through two passive users, denoted by $B_1$ and $B_2$. Each passive user applies a local Pauli operation to the same traveling subsystem, so that the operation observed by Alice is only the effective composition $U_{\mathrm{eff}}=U_2U_1$. After the subsystem returns, Alice performs a Bell-state measurement and, using her private knowledge of the initial Bell state, deterministically identifies the effective Pauli operation. However, the individual factors $U_1$ and $U_2$ remain algebraically hidden from Alice whenever the local choices are uniformly and independently selected. The public effective operation acts as a parity-like constraint: each passive user can infer the operation applied by the other from its own private choice, while the active station learns only the global composition. This construction transfers the essential distributed-transformation mechanism of passive-user Loop-Back QKD to the entangled-state regime. Unlike single-qubit passive-user schemes, whose useful events are intrinsically post-selected, the Bell-state version is limited primarily by the success probability of the Bell-state measurement. We discuss the algebraic structure of the protocol, its interpretation as an infrastructure-assisted mediated key-establishment mechanism, and the physical assumptions required to protect passive Pauli modulators against active injection or Trojan-horse-type attacks.

18.
medRxiv (Medicine) 2026-06-22

Study protocol: Feasibility and clinical implications of real-time cerebral autoregulation monitoring in major noncardiac surgery with the Medtronic Cotrending algorithm (AUTOREGULATE-NONCARDIAC-COTRENDING)

Background: Perioperative hypotension is associated with postoperative organ injury. However, trials of hypotension avoidance have not found meaningful improvements in postoperative cardiovascular, renal, neurological or functional outcomes. One possible explanation is that organ perfusion depends on patients individual autoregulatory ranges. Hence, technology enabling monitoring of the autoregulatory status of vital organs, e.g. the brain, could provide a physiologic basis for personalising of blood pressure targets. However, current established methodologies for monitoring cerebral autoregulation in noncardiac surgery, e.g. the cerebral oximetry index (COx), are limited by performance and usability. The Medtronic Cotrending algorithm has been developed to provide automated, near real-time assessment of cerebral autoregulation. While feasibility was demonstrated in cardiac surgery, its applicability in major noncardiac surgery remains unknown. This study aims to evaluate the technical feasibility and clinical implications of Cotrending-based cerebral autoregulation monitoring in major noncardiac surgery. Objectives: Primary objective: To evaluate the technical feasibility of using the Medtronic Cotrending algorithm to monitor intraoperative cerebral autoregulation in real-time during major noncardiac surgery, drawing comparisons to the COx algorithm. Secondary objectives: to investigate the potential clinical implications of Cotrending-based cerebral autoregulation monitoring. Design: Single-centre, prospective cohort study. Setting: Swiss tertiary care centre Patients: Patients enrolled in AUTOREGULATE-NONCARDIAC who were monitored intraoperatively with the Medtronic INVOS(TM) 5100 near-infrared spectroscopy (NIRS) system. Outcomes: Technical feasibility outcomes include success rate of determination of the lower limit of cerebral autoregulation, intraoperative uptime, time to first estimate of the lower limit of cerebral autoregulation, sensitivity to external factors and to data artefacts; agreement of Cotrending-derived lower limit of cerebral autoregulation with COx-derived lower limit of cerebral autoregulation. Conclusions: N/A Trial registration: Clinicaltrials.gov NCT07630129

19.
arXiv (CS.AI) 2026-06-12

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

arXiv:2604.27960v2 Announce Type: replace Abstract: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning – essential components of human cognition. We present "LLM+ASP," a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior "LLM+ASP" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a "context rot" phenomenon where excessive context hinders constraint adherence.

20.
arXiv (CS.CV) 2026-06-19

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate and verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.

21.
bioRxiv (Bioinfo) 2026-06-12

PHI-Reason: evidence-grounded species-level phage-host prediction from structured biological text profiles

Phage–host interaction (PHI) prediction is a fundamental problem in microbiology with applications in microbial ecology and microbiome engineering. Existing computational approaches typically convert phage and host information into numerical representations derived from sequence similarity, protein content, genome composition or reference databases, then score candidate hosts or train host-prediction models. Although effective, such representations often make it difficult to inspect which biological evidence supports a prediction. Here, we present PHI-Reason, a species-level PHI prediction framework that reformulates host prediction as constrained biological text reasoning. Instead of embedding phages and hosts directly as numerical vectors, PHI-Reason converts heterogeneous PHI-related evidence from phage genomes, host genomes, functional annotations, homology searches and biological metadata into modular natural-language profiles. A frozen large language model then performs species-level candidate-host ranking or pairwise PHI assessment by integrating the supplied evidence at inference time. Across species-level benchmarks, PHI-Reason achieved competitive host-prediction performance and recovered complementary correct assignments relative to established sequence- and reference-based methods. Its explicit profile design enabled systematic evidence perturbation and rationale-grounding analyses, showing that predictions depend on coherent multi-source biological evidence and that hallucination risk from unsupported or incomplete profiles can be made operationally measurable. These results position PHI-Reason as a constrained evidence-integration framework for species-level PHI prediction. Rather than replacing sequence-based predictors, it provides an interpretable layer that shows how far explicit biological evidence can support host inference, and where that evidence falls short.

22.
arXiv (CS.LG) 2026-06-16

TCHG: Tri-Trust Conditioned Heterogeneous Graph Learning for Reliable Dynamic Trust Prediction

arXiv:2606.16611v1 Announce Type: new Abstract: Trust prediction infers latent user-user trust relations and provides important support for social recommendation, fake-review and manipulation detection, and risk identification. Graph neural networks have become a prominent approach to trust prediction because of their ability to learn network structures and complex trust dependencies. However, existing methods often rely on a unified representation of trust signals and do not disentangle heterogeneous trust evidence into separate evidence channels, failing to exploit the distinct roles that different evidence channels should play during trust modeling. To address this gap, this paper argues that trust evidence should not be treated as an undifferentiated input, but should be decomposed and used as functional control factors over graph propagation. We propose TCHG, a tri-trust conditioned heterogeneous graph learning framework that decomposes trust evidence into three channels and assigns them distinct functional roles in propagation: entity reliability governs message admission, interaction-behavior reliability modulates propagation strength, and contextual trust adjusts the propagation mode through context-conditioned operator selection. Since the three evidence channels evolve at different temporal scales, TCHG maintains independent temporal states with non-uniform decay rates to prevent rapidly changing contextual signals from overwriting slowly accumulated entity reliability. It further predicts trust probability and calibrates the output probability, improving predictive confidence under sparse or conflicting evidence. Extensive experiments on multiple public trust datasets show that TCHG achieves effective and reliable trust prediction compared with representative trust prediction and heterogeneous graph baselines.

23.
arXiv (CS.LG) 2026-06-16

Best Arm Identification with Minimal Regret

arXiv:2409.18909v2 Announce Type: replace Abstract: Motivated by real-world applications that necessitate responsible experimentation, we introduce the problem of best arm identification (BAI) with minimal regret. This variant of the multi-armed bandit problem elegantly amalgamates two of its most ubiquitous objectives: regret minimization and BAI. More precisely, the agent's goal is to identify the best arm with a prescribed confidence level $\delta$, while minimizing the cumulative regret up to the stopping time. Focusing on single-parameter exponential families of distributions, we leverage information-theoretic techniques to establish an instance-dependent lower bound on the expected cumulative regret. Moreover, we present an impossibility result that underscores the tension between cumulative regret and sample complexity in fixed-confidence BAI. Complementarily, we design and analyze the Double KL-UCB algorithm, which achieves asymptotic optimality as the confidence level tends to zero. Notably, this algorithm employs two distinct confidence bounds to guide arm selection in a randomized manner. Our findings elucidate a fresh perspective on the inherent connections between regret minimization and BAI.

24.
arXiv (CS.CV) 2026-06-11

MedVeriSeg: Teaching LISA-Like Medical Segmentation Models to Verify Query Validity Without Extra Training

Despite recent progress in text-prompt-based medical image segmentation, existing LISA-like MLLM-based methods typically generate masks regardless of whether the target specified in the query is present, leading to hallucinated segmentation. In this work, we propose MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. MedVeriSeg first quantifies the response quality between the [SEG] token and image features through a Similarity Response Quality Scoring Module. To further improve robustness, it employs a Lightweight Routed Multi-Agent Verification Module, which fuses quantitative score evidence with qualitative agent evidence to comprehensively verify the validity of the query. To support systematic evaluation, we construct MedVeriSeg-Bench, a benchmark designed for query verification in medical image segmentation. Experimental results demonstrate that MedVeriSeg effectively identifies false segmentation queries and reduces hallucinated segmentation, while maintaining a high acceptance rate for valid queries, thereby largely preserving the segmentation utility of LISA-like medical segmentation models.

25.
arXiv (CS.AI) 2026-06-12

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

arXiv:2606.13607v1 Announce Type: new Abstract: When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.