Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

libhmm: A Modern C++20 Library for Hidden Markov Models with Correct MLE Emission M-Steps

arXiv:2605.29208v2 Announce Type: replace-cross Abstract: We describe libhmm, a C++20 library for Hidden Markov Model parameter estimation, sequence decoding, and model selection. libhmm addresses two gaps in existing software: the absence of a well-maintained, zero-dependency C++ HMM library suitable for embedding in production systems, and the widespread use of method-of-moments (MOM) approximations in the emission distribution M-step of the Baum-Welch algorithm. The library implements correct maximum likelihood estimators for sixteen scalar emission distributions, including an ECME algorithm for the location-scale Student-t distribution, Newton-Raphson maximization for Gamma, Beta, Weibull, and Negative Binomial distributions, and the von Mises distribution for circular data. All forward-backward and Viterbi calculations operate in full log-space. SIMD acceleration is provided for AVX-512, AVX2, SSE2, and ARM NEON via compile-time dispatch with scalar fallback. Version 4 adds multivariate observation support via the BasicHmm template, with three multivariate emission families (diagonal Gaussian, full-covariance Gaussian, and independent components) each with correct weighted MLE M-steps. Python bindings are available via the companion package pylibhmm. We compare libhmm against established C and C++ HMM libraries and against published R reference packages on seven real-data benchmarks, and discuss the architectural tradeoffs made in the design.
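
To illustrate what "full log-space" forward calculations mean, here is a minimal Python sketch of the forward pass using a log-sum-exp reduction. It is not libhmm's API; the function names and the list-of-lists layout are assumptions made for this example.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_forward(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log emission density of observation t under state j
    (Hypothetical layout chosen for illustration.)
    """
    n = len(log_pi)
    # Initialisation: alpha_0(j) = pi_j * b_j(o_0), kept in log form.
    alpha = [log_pi[j] + log_B[0][j] for j in range(n)]
    # Recursion: sums of products become log-sum-exp of sums.
    for t in range(1, len(log_B)):
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n)]) + log_B[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)
```

Working with log-values throughout avoids the underflow that a probability-space recursion hits on long sequences or very small emission densities.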

02.
arXiv (CS.CV) 2026-06-25

Efficient Real-World Dehazing via Physics-Inspired Global-Local Decoupling

Real-world single image dehazing is highly ill-posed due to spatially and spectrally varying scattering, while practical deployment demands lightweight and low-latency models. Existing approaches either rely on fragile physical inversion under simplified assumptions or adopt heavy blind architectures unsuitable for edge deployment. To overcome these limitations, we propose PGL-Net (Physics-Inspired Global-Local Decoupling Network), a lightweight framework that incorporates physical inductive biases via operator-level emulation, avoiding explicit parameter estimation. It decouples dehazing into global distribution rectification and local structural refinement. A Physics-Inspired Affine Fusion (PAF) module performs globally conditioned alignment across hierarchical skip connections to compensate for haze-induced bias, while a compact Degradation-Aware Modulation (DAM) block adaptively restores spatially and spectrally variant details through dynamic feature modulation. Extensive experiments on multiple real-world benchmarks demonstrate that PGL-Net achieves state-of-the-art restoration quality with significantly reduced complexity. Compared with the recent SOTA SGDN, the Tiny variant (PGL-Net-T) improves PSNR by up to 2.6dB and consistently enhances downstream object detection accuracy, while achieving over a 10x reduction in inference latency. Code is publicly available at: https://github.com/sc-30-bit/PGL-Net.

03.
arXiv (CS.CV) 2026-06-12

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the losses of both the generator and the discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

04.
medRxiv (Medicine) 2026-06-23

Timing of S. aureus-related mortality in a large randomized clinical trial: Implications for future study design

Background: Longer follow-up periods in clinical trials for S. aureus bacteremia (SAB) may capture unrelated deaths, adding random noise that risks biasing trial results towards the null. Objective: To evaluate the timing and infection-relatedness of deaths within a large SAB clinical trial platform. Design: Blinded duplicate adjudication of trial deaths using a modified 7-point Likert scale. A third reviewer settled disagreements. Setting: 37 Canadian hospitals participating in the S. aureus Network Adaptive Platform (SNAP) Trial. Participants: 1515 adult patients recruited to SNAP between February 2022 and May 2026. Measurements: Timing and relatedness of 90-day deaths, categorized as at least possibly SAB-related or not likely to be SAB-related. The optimal follow-up cut-off was determined using Youden's index and graphical inspection. Results: 247 deaths occurred; 97 (39.3%) were adjudicated as at least possibly SAB-related and 150 (60.7%) as not likely related. For probably/definitely related deaths, interrater agreement was 85.0% (Gwet's AC 0.73, substantial); for at least possibly related, it was 77.3% (Gwet's AC 0.55, moderate). Median survival was significantly shorter for SAB-related deaths (12 vs. 30.5 days; difference: 19 days earlier, 95% CI: 12-26, p
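
As a rough illustration of how Youden's index can pick a follow-up cut-off, the sketch below treats "death on or before day c" as a predictor of infection-relatedness and maximises J = sensitivity + specificity - 1 over candidate days. The data are toy values invented for this example, not trial data, and the function name is hypothetical.

```python
def youden_cutoff(days, related):
    """Return (best_cutoff_day, best_J).

    days[i]    : day of death for patient i
    related[i] : True if the death was adjudicated infection-related
    A death on or before the cutoff counts as 'predicted related'.
    """
    pos = sum(1 for r in related if r)
    neg = len(related) - pos
    best_cutoff, best_j = None, -1.0
    for c in sorted(set(days)):
        tp = sum(1 for d, r in zip(days, related) if r and d <= c)
        tn = sum(1 for d, r in zip(days, related) if not r and d > c)
        j = tp / pos + tn / neg - 1.0  # sensitivity + specificity - 1
        if j > best_j:
            best_cutoff, best_j = c, j
    return best_cutoff, best_j

# Toy data (illustrative only, NOT from the trial).
toy_days = [3, 5, 8, 12, 20, 25, 30, 45, 60, 80]
toy_related = [True, True, True, True, True, False, False, True, False, False]
cutoff, j = youden_cutoff(toy_days, toy_related)
```

In this toy set the index peaks at the day after which most remaining deaths are unrelated, which is the intuition behind shortening follow-up to reduce noise.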

05.
arXiv (CS.CV) 2026-06-12

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Multimodal agents, which integrate a controller (e.g., a vision-language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose SPORT, an iterative tool usage exploration method for multimodal agents that requires no pre-collected data and uses step-wise preference optimization to refine tool usage trajectories. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.
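
The step-verification stage can be pictured as turning scored candidate steps into a preference pair. The sketch below is one plausible reading, not the SPORT implementation: the scoring scale, the dictionary keys, and the rule of skipping ties are all assumptions.

```python
def build_preference_pair(candidates):
    """Build one step-wise preference record from sampled steps.

    candidates: list of (step_text, verifier_score) tuples for the SAME
    point in a trajectory (hypothetical format).
    Returns {"chosen": ..., "rejected": ...}, or None when the verifier
    does not separate the candidates.
    """
    best = max(candidates, key=lambda c: c[1])
    worst = min(candidates, key=lambda c: c[1])
    if best[1] == worst[1]:
        return None  # no preference signal in a tie
    return {"chosen": best[0], "rejected": worst[0]}

# Hypothetical sampled steps with verifier scores.
pair = build_preference_pair([
    ("call OCR tool on image", 0.9),
    ("call calculator on image", 0.1),
    ("answer directly", 0.4),
])
```

Records of this shape are what a preference-tuning objective such as DPO typically consumes; the abstract does not specify which objective is used.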

06.
Nature (Science) 2026-06-17

A 98-qubit trapped-ion quantum computer with all-to-all connectivity

Quantum computers require both high-fidelity operations and large qubit numbers to surpass classical capabilities [1]. Trapped-ion platforms have demonstrated the highest gate fidelities of any modality [2–6], but scaling to larger qubit numbers while preserving performance has remained a central challenge. We report on Quantinuum Helios, a 98-qubit trapped-ion quantum processor based on the quantum charge-coupled device (QCCD) architecture [7]. Helios features ^137Ba^+ hyperfine qubits [8,9], all-to-all connectivity enabled by a rotatable ion storage ring connecting two quantum operation regions by a junction [10,11], speed improvements from parallelized operations [12] and a new software stack with real-time compilation of dynamic programs [13]. Averaged over all operational zones in the system, we achieve average infidelities of 2.5(1) × 10^−5 for single-qubit (1Q) gates, 7.9(2) × 10^−4 for two-qubit (2Q) gates and 3.3(5) × 10^−4 for state preparation and measurement (SPAM), none of which is fundamentally limited and all of which can probably be improved. These component infidelities are predictive of system-level performance in both random Clifford circuits and random circuit sampling (RCS), the latter demonstrating that Helios operates well beyond the reach of classical simulation and establishes a new frontier of fidelity and complexity for quantum computers [14]. A new quantum computer, Quantinuum Helios, which is a 98-qubit trapped-ion quantum processor built on the QCCD architecture, demonstrates performance well beyond classical capabilities and provides a path for scaling up quantum computing.
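
One common back-of-envelope way to relate component infidelities to circuit-level fidelity is to assume independent errors and multiply per-operation success probabilities. The sketch below does only that; it is not the paper's model, it ignores memory and transport errors, and the gate counts are hypothetical.

```python
def estimated_circuit_fidelity(n_1q, n_2q, n_qubits,
                               e_1q=2.5e-5, e_2q=7.9e-4, e_spam=3.3e-4):
    """Product-of-fidelities estimate under an independent-error assumption.

    Default infidelities are the averages quoted in the abstract;
    n_1q, n_2q and n_qubits are whatever the circuit of interest uses.
    """
    return ((1.0 - e_1q) ** n_1q
            * (1.0 - e_2q) ** n_2q
            * (1.0 - e_spam) ** n_qubits)

# Hypothetical circuit: 98 qubits, 2000 single-qubit and 1000 two-qubit gates.
f = estimated_circuit_fidelity(n_1q=2000, n_2q=1000, n_qubits=98)
```

Under this simple model the two-qubit gate count dominates the estimate, since the 2Q infidelity is roughly thirty times the 1Q value.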

07.
arXiv (CS.CV) 2026-06-24

MSPL: Multi-Step Pseudo-Labeling for Open-Vocabulary Object Detection

Open-vocabulary object detection (OVD) aims to recognize and localize object categories beyond the training set. Recent approaches leverage vision-language models to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on single-step image-text matching, neglecting the intermediate reasoning steps crucial for interpreting semantically complex visual contexts, such as crowding or occlusion. In this paper, we introduce MSPL, a framework that incorporates multi-step visual reasoning into the pseudo-labeling process for OVD. It decomposes complex scene understanding into three interpretable steps (object localization, category recognition, and background grounding), where these intermediate reasoning states serve as rich supervision sources. Extensive experiments on standard OVD evaluation protocols demonstrate that MSPL achieves state-of-the-art performance with superior pseudo-labeling efficiency, outperforming the strong baseline by 9.4 AP50 for novel classes on OV-COCO and improving box and mask APr by 3.2 and 2.2, respectively, on OV-LVIS. Code and models are available at https://github.com/hchoi256/mspl.

08.
arXiv (CS.AI) 2026-06-11

A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge

arXiv:2605.10592v2 Announce Type: replace Abstract: Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).

09.
arXiv (CS.AI) 2026-06-19

Analyzing the Narration Gap in LLM-Solver Loops

arXiv:2606.19588v1 Announce Type: new Abstract: Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution without formal guarantees, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision, but not narration, which is the step that turns a formal tool's output into the user-facing answer. To fill this narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-source models under prompt injection and find that certificate gating makes the solver verdict sound, while an adversary can still invert a verified conclusion across phrasings and channels. We study mitigation through a hardened prompt, which reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and the empirical studies, we show that, in the LLM-solver loop, robustness does not extend to the answer that the user finally reads.
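
The abstract does not define "certificate gating" in detail. One plausible reading, sketched below for the SAT case only, is that a solver's claimed model is re-checked against the formula before the verdict is handed to the narration step. The function names, the DIMACS-style clause encoding, and the returned strings are assumptions for this example.

```python
def check_sat_certificate(clauses, assignment):
    """Check a claimed satisfying assignment against a CNF formula.

    clauses    : list of clauses, each a list of nonzero ints
                 (positive = variable, negative = negated variable)
    assignment : dict mapping variable number -> bool
    """
    for clause in clauses:
        # A literal is satisfied when the variable's value matches its sign.
        if not any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
            return False
    return True

def gated_verdict(clauses, solver_status, model=None):
    """Release a SAT verdict only when its certificate checks out."""
    if (solver_status == "SAT" and model is not None
            and check_sat_certificate(clauses, model)):
        return "SAT (certificate checked)"
    return "UNVERIFIED"
```

A gate like this protects the verdict itself. It says nothing about what the language model later writes about that verdict, which is the gap the abstract describes. UNSAT answers would need a proof checker and are outside this sketch.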

10.
arXiv (math.PR) 2026-06-11

The $K$-th nearest neighbor random walk on a Poisson point process gets trapped

arXiv:2606.11271v1 Announce Type: new Abstract: The $K$-th nearest neighbor random walk $(X_n)_{n \geq 0}$ on a homogeneous Poisson point process $\chi$ on $\R^d$ ($d\geq 1$), starts at the origin and at each step picks its next Poisson point among its closest neighbors according to i.i.d. labels having the same distribution as $K$. Our main result (Theorem 1) states that the number of Poisson points visited by $(X_n)_{n \geq 0}$ admits an exponential decay whenever the random variable $K$ has a bounded support (BS). In particular, the $K$-th nearest neighbor random walk visits finitely many Poisson points if and only if $K$ satisfies Assumption (BS). To prove it, we introduce the key notion of pioneer point which allows us to deal with the region of $\R^d$ already explored by $(X_n)_{n \geq 0}$. Still under Assumption (BS), we also prove an exponential decay for the Euclidean length of the trajectory performed by $(X_n)_{n \geq 0}$ (Theorem 2). Finally, and quite surprisingly, we exhibit an example of label distribution with bounded support for which the $K$-th nearest neighbor random walk discovers new Poisson points after a number of steps whose tail distribution is at least polynomial (Theorem 3).

11.
medRxiv (Medicine) 2026-06-17

High burden of subclinical TB in Africa revealed from a postmortem cohort.

Tuberculosis (TB) is increasingly recognised as a spectrum of infection and disease, yet the prevalence of viable, asymptomatic Mycobacterium tuberculosis (M.tb) infection remains uncertain. Subclinical tuberculosis (scTB), defined as microbiologically confirmed M.tb infection in the absence of recognised symptoms, is underdetected by symptom-, sputum-, and imaging-based approaches. We conducted postmortem examinations of 94 adults who died from non-infectious causes, none of whom were clinically suspected of TB or reported TB-related symptoms prior to death. Lung and extrapulmonary tissues were cultured for M.tb. Viable M.tb was confirmed in six individuals, corresponding to a prevalence of 6.4% (95% CI: 2.4 to 13.4%). These findings provide direct tissue-based evidence that viable, asymptomatic M.tb infection can persist beyond the reach of conventional clinical detection. Our data suggest that a biologically active reservoir of infection may exist undetected within high-burden settings, with implications for surveillance strategies aimed at TB elimination.
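
The abstract does not say how the confidence interval was computed. For small counts such as 6 of 94, an exact (Clopper-Pearson) binomial interval is one common choice, and the sketch below computes it with the standard library only. Function names are hypothetical.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve_increasing(g, target, iters=100):
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided binomial confidence interval for k successes in n."""
    # Lower bound: P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

low, high = clopper_pearson(6, 94)
```

The interval is wide relative to the point estimate because only six positive cultures were observed.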

12.
medRxiv (Medicine) 2026-06-11

Decoding the Genetic Architecture of Autistic Traits in the Aging Population

Autism research has mostly focused on diagnostic frameworks in childhood. However, autistic traits including social skills, communication, attention switching, attention to detail, and imagination may also vary in many undiagnosed individuals beyond childhood, and the genetic architecture of autistic traits in undiagnosed aging adults remains poorly understood. Here, we performed an exome-wide association study of autistic traits in adults aged >=40 from the UK Biobank (n = 161,269) and independently validated key findings in the SPARK cohort (n = 142,357). We identified exome-wide significance at 17q21.31, represented by a lead variant associated with social skills (rs199533, beta = 0.081, P = 2.04e-11). In addition, we identified an independent signal for communication (rs12632110, beta = 0.042, P = 3.07e-12) and two independent signals for attention switching (rs690733, beta = 0.046, P = 4.26e-12; rs2164272, beta = -0.047, P = 1.73e-12). Gene-based analyses further implicated loss-of-function variation in ZSCAN2 (beta = 1.00, P = 2.44e-6), which was associated with communication differences. Enrichment analyses revealed preferential expression of implicated genes in the cerebral cortex, while phenotypic and neuroimaging analyses linked those variants to cortical brain structure and regional volume. Taken together, these findings delineate the genetic architecture of autistic traits in the aging population and link genetic variation to downstream molecular and neuroanatomical mechanisms.

13.
Nature (Science) 2026-06-24

Crude oil fractionation by means of mesoporous polyacrylonitrile membranes

Atmospheric and vacuum distillation consume more than 1,100 TWh year^−1 and emit more than 160 million metric tonnes of CO2 equivalent annually [1,2], making membrane-based pre-fractionation a compelling retrofit strategy for lowering the energy and carbon intensity of petroleum refining [3–10]. Here we demonstrate that porous polyacrylonitrile (PAN) membranes, typically used as support layers, achieve effective molecular refining of crude oil at steady state. Under tangential flow, PAN membranes exhibited high crude oil permeances of up to 0.591 ± 0.040 l m^−2 h^−1 bar^−1, a more than 23-fold increase over the previous benchmark (<0.1 l m^−2 h^−1 bar^−1) [1,11], selectively yielding enriched lighter hydrocarbon fractions such as naphtha and kerosene. This unexpected selectivity arises from the dynamic deposition of heavy hydrocarbons within the initially approximately 15-nm surface mesopores, which narrows the pore diameter to sub-2-nm dimensions. Depth-resolved chemical identification reveals selective accumulation of n-alkanes, suggesting a self-limiting pore constriction mechanism that stabilizes selective transport pathways. Once the n-alkane deposition is stabilized, selective enrichment of raw crude oils occurs with sustained stability over 4 weeks. Process simulations show that PAN-membrane-based pre-fractionation could reduce energy by 31.6%, cooling water by 20.7% and CO2 emissions by 37.6% compared with traditional atmospheric distillation. Porous polyacrylonitrile membranes, typically used as non-selective support layers, can be used to achieve effective molecular refining of crude oil at steady state, enabling substantial reductions in energy consumption, cooling water, and CO2 emissions compared with distillation processes.

14.
arXiv (CS.LG) 2026-06-16

Near-Optimal Stochastic Linear Bandits with Delay

arXiv:2606.16656v1 Announce Type: new Abstract: We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but potentially depends on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension; (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.

15.
arXiv (CS.LG) 2026-06-12

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

arXiv:2606.12895v1 Announce Type: new Abstract: Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.
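
To see why a fractional-order operator carries memory that a first-order one lacks, consider the Grünwald-Letnikov discretisation of a fractional derivative, whose weights are w_k = (-1)^k * binom(alpha, k). The abstract does not state which discretisation LongSpike uses, so this is only an illustration of the general idea; the function name is hypothetical.

```python
def gl_weights(alpha, n_terms):
    """First n_terms Grünwald-Letnikov weights for order alpha.

    Uses the recurrence w_0 = 1, w_k = w_{k-1} * (1 - (alpha + 1) / k),
    which equals (-1)^k * binom(alpha, k).
    """
    w = [1.0]
    for k in range(1, n_terms):
        w.append(w[-1] * (1.0 - (alpha + 1.0) / k))
    return w

first_order = gl_weights(1.0, 5)   # only the two most recent samples matter
half_order = gl_weights(0.5, 5)    # every past sample keeps a nonzero weight
```

For alpha = 1 the weights vanish after the second term, which is the ordinary backward difference. For a non-integer alpha they decay slowly but never reach zero, so the operator depends on the entire history.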

16.
arXiv (CS.CV) 2026-06-25

TemPose-TF-ASF: Two-Stage Bidirectional Stroke Context Fusion for Badminton Stroke Classification

Accurate badminton stroke prediction is crucial for fine-grained sports analysis and tactical decision support. However, existing methods struggle to model rich temporal context. This paper introduces TemPose-TF-ASF (Adjacent-Stroke Fusion), a context-aware extension of TemPose. It enhances stroke recognition by incorporating stroke-type information from both preceding and subsequent strokes. A two-stage training and inference strategy is adopted. Preliminary predictions from the baseline model are reused as estimated temporal context. These predictions guide the joint optimization of the ASF module and the classifier. By explicitly modeling bidirectional temporal stroke dependencies, the proposed method can be seamlessly integrated into existing state-of-the-art models. Experiments on a large-scale badminton match dataset show consistent improvements over the baseline and its variants in terms of Accuracy and Macro-F1. Moreover, integrating ASF into other advanced methods yields notable performance gains. These results demonstrate strong transferability and generalization capability.

17.
arXiv (CS.CL) 2026-06-17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether PSI- (colloquial) relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that PSI colloquial cues prevail and are strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further affirms reciprocity bids aligned with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

18.
arXiv (CS.LG) 2026-06-11

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

arXiv:2506.08473v4 Announce Type: replace Abstract: Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
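
The idea of penalising updates orthogonal to the alignment direction can be sketched with a plain vector projection. The code below works on flattened Python lists and uses a hypothetical weight `lam`; the abstract does not give the exact regulariser, so this should be read as an illustration rather than the AsFT objective.

```python
def split_update(update, direction):
    """Decompose `update` into parts parallel and orthogonal to `direction`."""
    dot = sum(u * d for u, d in zip(update, direction))
    norm_sq = sum(d * d for d in direction)
    coeff = dot / norm_sq
    parallel = [coeff * d for d in direction]
    orthogonal = [u - p for u, p in zip(update, parallel)]
    return parallel, orthogonal

def orthogonal_penalty(update, direction, lam=1.0):
    """Squared norm of the orthogonal component, scaled by lam (assumed)."""
    _, orth = split_update(update, direction)
    return lam * sum(o * o for o in orth)

# direction: aligned weights minus unaligned weights (toy 2-D stand-in).
alignment_direction = [1.0, 0.0]
penalty_off_axis = orthogonal_penalty([3.0, 4.0], alignment_direction)
penalty_on_axis = orthogonal_penalty([2.0, 0.0], alignment_direction)
```

Adding a term like this to the fine-tuning loss leaves movement along the alignment direction free while discouraging movement out of the basin.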

19.
arXiv (quant-ph) 2026-06-16

Inverted Dirac oscillator

arXiv:2606.15303v1 Announce Type: new Abstract: The Dirac oscillator is obtained from the Dirac Hamiltonian $H^{\mathrm{D}} = c\vec{\alpha}\cdot \vec{p} + mc^{2}\beta$ by modifying the momentum through a non-Hermitian substitution $\vec{p} \rightarrow \vec{p} \pm i\omega \beta \vec{q}$. Despite the non-Hermitian nature of this momentum operator, the full Hamiltonian remains Hermitian due to the presence of the Dirac matrix $\vec{\alpha}$. However, if one instead introduces a Hermitian modification of the form $\vec{p} \rightarrow \vec{p} \pm \omega \beta \vec{q}$, the resulting Hamiltonian is no longer Hermitian. In this case, the system corresponds to an inverted Dirac oscillator $H^{\mathrm{r}}$, where the potential becomes unbounded from below, the energy spectrum becomes continuous, and the eigenfunctions fail to be square-integrable, leading to normalization difficulties. We show that the Hamiltonian $H^{\mathrm{r}}$ is a pseudo-$\mathcal{PT}$-symmetric operator, and we introduce an unbounded, non-unitary transformation that establishes a connection between $H^{\mathrm{r}}$ and $H^{\mathrm{D}}$. The purpose of this work is to analyze this relativistic quantum system, known as the Dirac inverted oscillator, which, despite its various applications, admits an exact analytical solution

20.
medRxiv (Medicine) 2026-06-17

MedAgent: A Retrieval-Augmented Clinical Decision Support Agent with Verifiable Evidence Grounding for Evidence-Based Medicine

Evidence-based medicine demands clinical answers that are not only fluent and medically plausible, but also anchored in traceable evidence, tailored to patient-specific clinical questions, sensitive to the hierarchy of evidence, and respectful of clinical safety boundaries. While general-purpose large language models (LLMs) exhibit strong medical language generation ability, they tend to lean on parametric memory, underuse retrieved evidence, hallucinate citations, conflate evidence levels, and draw conclusions that are not fully supported by the underlying literature. Such limitations pose particular risks in clinical decision support, where answer reliability, evidence traceability, and reasoning consistency are paramount. To address these issues, we present MedAgent, an evidence-based medical agent trained through an end-to-end pipeline that integrates supervised fine-tuning (SFT) cold start, reward modeling, and Group Relative Policy Optimization (GRPO). The agent is designed to execute a structured workflow encompassing clinical question understanding, PICO extraction, evidence retrieval, evidence stratification, citation-grounded answer generation, and quality evaluation. Specifically, a Qwen2.5-14B-Instruct backbone is first cold-started on 200 human-verified agent trajectories, equipping it with tool invocation, PICO parsing, structured response generation, and citation faithfulness. Next, a Qwen2.5-7B reward model is trained on 2,099 pairwise preference samples to provide semantic-level quality signals for evidence-based responses. Finally, GRPO reinforcement learning is conducted in a retrieval-augmented agent environment, where every rollout involves real evidence retrieval and is scored jointly by rule-based rewards and reward-model signals.
To avoid over-reliance on training rewards, we further construct an independent evidence-based medical evaluation benchmark, MedTrustBench, which contains 200 clinical questions spanning 10 specialties and four difficulty levels. Each question is annotated with standardized PICO elements and rubric-based scoring criteria. The benchmark includes 1,187 rubrics across seven dimensions: question relevance, evidence hierarchy, evidence quality and timeliness, evidence-answer consistency, completeness and depth, logical rigor, and medical terminology. Under an identical RAG pipeline, retrieval tool, retrieval configuration, and evaluation protocol, MedAgentv17 attains 78.6 points, outperforming GPT-4.1 (75.3) and approaching GPT-5.4 (80.3). These results show that a 14B domain-aligned model can surpass strong general-purpose baselines on specialized evidence-based medical reasoning, while delivering practical advantages in cost, privacy, controllability, and hospital-oriented private deployment. The model and associated datasets are publicly released at https://www.modelscope.cn/profile/InfoxmedModel

21.
arXiv (CS.AI) 2026-06-16

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

arXiv:2606.16478v1 Announce Type: new Abstract: Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of \(N\) agents as a third-order tensor \(T \in \mathbb{R}^{N \times H \times A}\) over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal \(\epsilon\)-approximate CP rank \(R^*\) defines a computable coordination complexity measure, with \(CC(\Pi) = (R^* - N)/N\). We prove that \(R^* = N\) is necessary and sufficient for plan independence. The residual \(E = T - T_{R^*}\) defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as \(R^*(N) = 3.9N + 0.5\), supporting its use as a predictor of coordination complexity.
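
The coordination complexity measure and the reported linear fit can be combined in a few lines of arithmetic. The sketch below only evaluates the two formulas quoted in the abstract; the function names are hypothetical, and computing the CP rank itself (which requires a tensor decomposition) is not shown.

```python
def coordination_complexity(cp_rank, n_agents):
    """CC = (R* - N) / N, as defined in the abstract."""
    return (cp_rank - n_agents) / n_agents

def fitted_rank(n_agents):
    """Empirical linear fit reported in the abstract: R*(N) = 3.9 N + 0.5."""
    return 3.9 * n_agents + 0.5

# Complexity implied by the fit for the 2-, 3- and 4-agent settings.
implied = {n: coordination_complexity(fitted_rank(n), n) for n in (2, 3, 4)}
```

Substituting the fit into the definition gives CC = 2.9 + 0.5/N, so the implied complexity stays close to 3 across the tested sizes, while a fully independent plan (R* = N) gives CC = 0.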

22.
arXiv (CS.CV) 2026-06-11

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. Code is available at https://github.com/elmma/mllm-reroute/
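
The difference between rank-and-remove and recoverable routing described above can be illustrated with a toy sketch. It is not the Reroute implementation: the function names, the per-stage scores, and the keep counts are all hypothetical, and real scores would come from decoder attention rather than fixed lists.

```python
from typing import List, Sequence


def top_k(pool: Sequence[int], scores: Sequence[float], k: int) -> List[int]:
    """Return the k highest-scoring token indices from pool, in index order."""
    return sorted(sorted(pool, key=lambda t: scores[t], reverse=True)[:k])


def rank_and_remove(stage_scores, keep_counts) -> List[List[int]]:
    """Baseline: a token dropped at one stage is never considered again."""
    pool = list(range(len(stage_scores[0])))
    history = []
    for scores, k in zip(stage_scores, keep_counts):
        pool = top_k(pool, scores, k)  # the pool shrinks permanently
        history.append(pool)
    return history


def recoverable_routing(stage_scores, keep_counts) -> List[List[int]]:
    """Deferred tokens skip the stage but stay in the candidate pool."""
    pool = list(range(len(stage_scores[0])))
    history = []
    for scores, k in zip(stage_scores, keep_counts):
        selected = top_k(pool, scores, k)  # same per-stage budget k
        history.append(selected)  # unselected tokens remain in the pool
    return history


if __name__ == "__main__":
    # Token 2 scores low at stage 0 but highest at stage 1 (assumed values).
    scores = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.7, 0.95, 0.2]]
    budget = [2, 1]
    print(rank_and_remove(scores, budget))
    print(recoverable_routing(scores, budget))
```

Both functions process the same number of tokens at each stage, which mirrors the claim that the compute budget is preserved; only the pool from which each stage selects differs.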

23.
arXiv (CS.AI) 2026-06-19

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

arXiv:2606.19632v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with

24.
arXiv (CS.CV) 2026-06-16

ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.
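
The idea of pruning the vocabulary to an image-conditioned active set and decoding the retained classes in buckets can be sketched as follows. This is an illustration under stated assumptions, not the ActiveSAM code: the threshold rule, the `min_keep` fallback, the bucket size, and all names and scores are hypothetical, since the abstract does not specify how the active set is chosen.

```python
from typing import Dict, List


def select_active_classes(
    presence: Dict[str, float], threshold: float = 0.3, min_keep: int = 1
) -> List[str]:
    """Keep classes whose preview presence score reaches the threshold.

    The thresholding rule is an assumption made for illustration.
    """
    ranked = sorted(presence, key=presence.get, reverse=True)
    active = [c for c in ranked if presence[c] >= threshold]
    if len(active) < min_keep:
        active = ranked[:min_keep]  # never return an empty active set
    return active


def bucket_prompts(classes: List[str], bucket_size: int) -> List[List[str]]:
    """Group retained classes into fixed-size buckets for batched decoding."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [classes[i:i + bucket_size] for i in range(0, len(classes), bucket_size)]


if __name__ == "__main__":
    # Hypothetical low-resolution presence scores for one image.
    preview = {"road": 0.9, "car": 0.6, "boat": 0.05, "sky": 0.8, "cat": 0.1}
    active = select_active_classes(preview)
    print(active)
    print(bucket_prompts(active, bucket_size=2))
```

In this toy case only three of the five classes reach full-resolution decoding, which is where the speedup on large vocabularies would come from.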

25.
arXiv (CS.CV) 2026-06-11

SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior work relies heavily on intermediate representations, such as pose skeletons for motion or masked backgrounds for the environment, which inevitably lose information. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address the lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then build a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous character animation tasks. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancies in detailed regions, we propose Bias-Aware DPO, which constructs preference items to mitigate these errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. A large subset of the synthetic data and the model weights will be released on our project page: https://teal024.github.io/SCAIL-2/.