Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (math.PR) 2026-06-18

Very large cliques in a scale-free random graph

arXiv:2606.18722v1 Announce Type: new Abstract: In this short article we consider a preferential attachment random graph model with edge steps, studied by Alves, Ribeiro and Sanchis. Starting with an initial graph $\mathbb{G}_1$ formed by a vertex with a self-loop attached to it, the model evolves as follows. At every subsequent (discrete) time step, either with probability $p$ we add a vertex to the graph and connect it to exactly one of the older vertices selected with probability proportional to its degree, or with probability $1-p$ we add one edge between two existing vertices, both selected (independently) with probability proportional to their degrees. Let $\omega(\mathbb{G})$ be the clique number of a graph $\mathbb{G}$, i.e.\ the number of vertices in a largest complete subgraph of $\mathbb{G}$. Alves, Ribeiro and Sanchis showed that, for any given $\varepsilon>0$, we have $\omega(\mathbb{G}_{2t})\geq t^{\frac{1-p}{2-p}(1-\varepsilon)}$ with high probability (i.e.\ with probability tending to $1$ as $t\rightarrow \infty$). Here we strengthen this bound by showing that, for any function $f:\mathbb{N}\to \mathbb{N}$ that satisfies $f(t)\rightarrow \infty$ as $t\rightarrow \infty$, with high probability \[\omega(\mathbb{G}_{2t}) = \Omega\left(t^{\frac{1-p}{2-p}}\Big(\log^{\frac{1}{2-p}}(t)f(t)\Big)^{-1}\right).\]
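
The growth rule described in this abstract can be sketched as a short simulation. This is only an illustrative reading of the abstract, not the authors' code: the function name `simulate_edge_step_graph`, the `steps` and `seed` parameters, and the endpoint-list trick for degree-proportional sampling are all assumptions made here. `steps` counts the steps taken after the initial graph $\mathbb{G}_1$.

```python
import random


def simulate_edge_step_graph(steps, p, seed=0):
    """Simulate a preferential attachment graph with edge steps.

    Hypothetical helper: returns (num_vertices, edges), where edges is a
    list of (u, v) pairs and vertices are labelled 0, 1, 2, ...
    """
    rng = random.Random(seed)
    # Initial graph: a single vertex 0 carrying a self-loop.
    edges = [(0, 0)]
    # Every edge contributes both of its endpoints to this list, so a
    # uniform draw from it selects a vertex with probability proportional
    # to its degree (a self-loop counts twice).
    endpoints = [0, 0]
    num_vertices = 1
    for _ in range(steps):
        if rng.random() < p:
            # Vertex step: a new vertex attaches to one older vertex
            # chosen proportionally to degree.
            target = rng.choice(endpoints)
            new_vertex = num_vertices
            num_vertices += 1
            edges.append((new_vertex, target))
            endpoints.extend((new_vertex, target))
        else:
            # Edge step: two existing vertices are chosen independently,
            # each proportionally to degree (they may coincide).
            u = rng.choice(endpoints)
            v = rng.choice(endpoints)
            edges.append((u, v))
            endpoints.extend((u, v))
    return num_vertices, edges
```

Each step adds exactly one edge, so after `steps` steps the graph has `steps + 1` edges, while the number of vertices is random and grows roughly like `p * steps`.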

02.
arXiv (CS.LG) 2026-06-16

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

arXiv:2606.17043v1 Announce Type: cross Abstract: When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

03.
arXiv (CS.CL) 2026-06-11

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.

04.
arXiv (CS.LG) 2026-06-17

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

arXiv:2606.18209v1 Announce Type: new Abstract: Dataset distillation (DD) has emerged as a prominent approach in data-centric machine learning, aiming to synthesize compact training sets for efficient training by compressing the information in large datasets into a small number of synthetic samples. However, DD methods are often evaluated under inconsistent evaluation protocols, ranging from standard ERM to single/multi-teacher supervision, making it difficult to isolate the effectiveness of distilled data from evaluation. Moreover, many prior methods claim that DD outperforms data pruning approaches such as coreset selection (CS), based on the assumption that restricting condensed datasets to subsets of real samples fundamentally limits their expressiveness. In this work, we critically evaluate DD methods through large-scale experiments using standardized datasets and evaluation protocols to assess their intrinsic effectiveness. We benchmark seven state-of-the-art (SOTA) DD methods on ImageNet-1K, ImageNet100, and ImageNette, using three widely adopted training protocols against three CS strategies. Our results show that while some DD methods fail to outperform even simple random subsets, the SOTA DD approaches are comparable to or worse than coresets on large-scale datasets and incur a substantially higher cost for construction. Beyond accuracy, we also evaluate the representativeness, diversity, and quality of condensed sets, and find that coresets consistently achieve better coverage of the original data distribution. These findings highlight the limited practical advantages of current DD methods and show that coresets remain competitive and are often a more computationally efficient alternative for data-centric learning.

05.
arXiv (CS.AI) 2026-06-16

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

arXiv:2606.15673v1 Announce Type: new Abstract: Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.

06.
arXiv (CS.AI) 2026-06-12

A Theory of Training Profit-Optimal LLMs

arXiv:2605.16430v3 Announce Type: replace-cross Abstract: Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.

07.
arXiv (CS.AI) 2026-06-24

Offline Reinforcement Learning for Warehouse SLAM Throughput Control

arXiv:2606.23978v1 Announce Type: cross Abstract: We present an offline reinforcement learning (RL) framework for optimizing SLAM throughput control in a warehouse fulfillment environment. SLAM (Scan/Label/Apply/Manifest) throughput directly influences system congestion and operational efficiency. Our RL-based control approach dynamically recommends SLAM throughput settings that adaptively balance throughput maximization with downstream stability through intelligent adjustment of throttling behavior. We include a history-informed state representation, action space abstraction for delayed-impact control, and a reward function that captures both upstream and downstream operational metrics. Our approach is algorithm-agnostic, enabling integration of multiple offline RL methods under a unified architecture. We instantiate our framework with three state-of-the-art offline RL algorithms and train the models offline using de-identified historical operational logs from a large-scale warehouse. Policy performance is evaluated using a comprehensive multi-method strategy. This strategy combines model-free approaches, namely immediate reward estimation via regression models and long-horizon Fitted Q Evaluation (FQE), with model-based Deep Koopman dynamics evaluation. Empirical results reveal that the CQL policy consistently outperforms alternatives, improving system health by 22.97% and reducing average throttling duration by 3.18%. These findings demonstrate the potential of offline RL for safe and scalable warehouse throughput control optimization.

08.
arXiv (math.PR) 2026-06-16

An Algebraic Matrix Spencer Theorem

arXiv:2606.16005v1 Announce Type: new Abstract: We develop an algebraic approach to matrix discrepancy based on the representation theory of finite-dimensional $C^*$-algebras. As an application, we resolve a substantial structured special case of the Matrix Spencer conjecture. In particular, we show that for every family of contractions $A_1,\ldots,A_n$ that are contained in a finite-dimensional $C^*$-algebra $\mathcal A$ with $\dim_{\mathbb C} (\mathcal A) \lesssim n$, there exist signs $x\in\{\pm1\}^n$ such that $\|\sum_{i=1}^n x_i A_i\| \le O(\sqrt n)$. As a noteworthy special case, our main result also resolves the Group Spencer conjecture of (Bandeira'24). We furthermore prove that Matrix Spencer continues to hold for low-rank perturbations of matrix families coming from a $C^*$-algebra of small dimension.

09.
medRxiv (Medicine) 2026-06-12

Immunologically Optimized Zmp1 Peptides Reveal a Translational Serological Biomarker Platform for Tuberculosis Diagnosis Across Disease Manifestations

Tuberculosis (TB) diagnosis remains challenging, particularly for extrapulmonary TB (EPTB), where invasive sampling, low bacillary burden, and suboptimal sensitivity of nucleic acid-based tests in peripheral specimens hinder timely detection. Here, we report an immunology-driven strategy for biomarker discovery and development of a peptide-based serological assay targeting Mycobacterium tuberculosis zinc metalloprotease-1 (Zmp1). Leveraging fundamental principles of adaptive immunity that antigenic regions containing overlapping B-cell and CD4 T-helper cell epitopes would preferentially generate high antibody titers through linked recognition and cognate T-cell help, we used an immunoinformatics pipeline to identify two nested immunodominant peptide regions within Zmp1 (Mtb-Zp-NT and Mtb-Zp-CT) enriched for overlapping B- and T-cell epitopes. The diagnostic potential of these peptides was evaluated through ELISA-based serological assays. A blinded pilot study (N=137) demonstrated a clear discrimination between active TB and TB-recovered individuals. The assay was subsequently validated in an expanded cohort (N=875) by screening 6,086 individuals, which identified 457 TB-positive cases. The cohort included pulmonary TB (PTB), EPTB, TB-recovered individuals, household contacts, non-specific infections, and healthy controls. Receiver operating characteristic analyses, supported by DeLong and bootstrap comparisons, revealed superior diagnostic performance of the peptide-based assays relative to full-length Zmp1. Mtb-Zp-CT exhibited the highest accuracy (AUC=0.93; specificity >90%), while Mtb-Zp-NT also demonstrated strong discriminatory power (AUC≈0.89). These findings establish that the immunologically optimized Zmp1 peptides are highly promising serological biomarkers for TB and EPTB. More broadly, they demonstrate how mechanistically informed epitope selection can accelerate translation of pathogen-specific immune signatures into sensitive, minimally invasive, and potentially point-of-care diagnostic platforms for resource-limited settings.

10.
arXiv (CS.AI) 2026-06-17

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv:2606.17199v1 Announce Type: cross Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. However, we show that this estimator suffers from severe training pathologies in practice: sample inefficiency, unstable generation dynamics, and a substantial performance gap compared to exact full-vocabulary OPD. Reward-level diagnosis traces these pathologies to the log-ratio reward, which is unbounded by construction, producing extremely high-variance gradients concentrated at early positions and persisting throughout training; standard post-hoc scaling methods fail because they operate only after this distortion occurs. To solve this problem, we propose PowerOPD: a family of natively bounded, sign-consistent rewards from the Box-Cox power transformation, parameterized by alpha > 0, of which the log-ratio is the degenerate alpha -> 0 limit. Across six mathematical reasoning benchmarks and four Qwen3 teacher-student pairs, PowerOPD achieves benchmark-averaged Avg@8/Pass@8 gains of up to +6.37/+5.71 over vanilla OPD, +3.01/+3.54 over post-hoc stabilization, and +2.59/+8.90 over full-vocabulary OPD, while reducing wall-clock time by 59.2% and peak GPU memory by 23.1%. Larger alpha generally improves accuracy, consistently shortens responses, and keeps gradient norms more than 3,000x smaller than vanilla OPD.
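
The role of the alpha -> 0 limit can be illustrated with the textbook Box-Cox transform, (x^alpha - 1) / alpha. The sketch below is not the PowerOPD reward itself, whose exact form the abstract does not give; the helper name `box_cox` and the choice to apply it to a bare positive ratio `x` are assumptions for illustration. It shows only two standard facts: the transform approaches log(x) as alpha shrinks, and for alpha > 0 it stays above -1/alpha as x approaches 0, where log(x) diverges.

```python
import math


def box_cox(x, alpha):
    """Textbook Box-Cox power transform of x > 0 (hypothetical helper).

    alpha = 0 is defined as the limiting case log(x).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if alpha == 0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


# For a tiny ratio the logarithm is very negative, while the transform
# with alpha = 0.5 cannot fall below -1 / alpha = -2.
tiny_ratio = 1e-300
log_value = math.log(tiny_ratio)
power_value = box_cox(tiny_ratio, 0.5)
```

How PowerOPD obtains a bound on both sides and sign consistency is specified in the paper rather than in this abstract, so the sketch covers only the lower bound of the plain transform.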

11.
arXiv (math.PR) 2026-06-16

Convergence to the Brownian CRT for critical branching Markov processes

arXiv:2601.05906v2 Announce Type: replace Abstract: We prove an invariance principle for a general class of continuous time critical branching processes with finite variance (non-local) branching mechanism. We show that the genealogical trees, viewed as random compact metric measure spaces, converge under rescaling to the Brownian continuum random tree in the Gromov-Hausdorff-weak topology, establishing a universal scaling limit for critical finite variance branching processes.

12.
bioRxiv (Bioinfo) 2026-06-19

Geometric Deep Learning Reveals Ligandable and Cryptic RNA Binding Small Molecule Pockets (SMARTPocket)

RNAs are important therapeutic targets; however, identifying ligandable small-molecule binding pockets remains a major barrier to RNA-targeted drug discovery. Here, SMARTPocket, an atomic-level geometric deep learning framework for predicting RNA-small molecule binding pockets directly from three-dimensional structure, is introduced. SMARTPocket represents RNA as full-atom point clouds and uses transfer learning from more than 110,000 protein binding interface structures to overcome the limited number of experimentally elucidated RNA-ligand complexes. Across four established single-chain benchmarks and three broader curated benchmarks, SMARTPocket consistently outperforms existing RNA pocket predictors and general biomolecular modeling approaches. The model generalizes to apo RNA structures when conformational changes are modest, identifies cryptic ligandable pockets, and recapitulates experimentally validated binding sites in the SARS-CoV-2 frameshifting element and an RNA aptamer evolved to bind small molecules. SMARTPocket-guided docking further improves near-native RNA-ligand pose recovery and computational efficiency compared with blind docking. These results establish SMARTPocket as a generalizable framework for structure-based identification of ligandable RNA pockets and for accelerating discovery of RNA-targeted small molecules.

13.
arXiv (quant-ph) 2026-06-16

A New Definition of Quantum Superposition

arXiv:2606.15607v1 Announce Type: new Abstract: The usual description of the superposition of two (pure quantum) states is ambiguous, since the binary operation of summation in a Hilbert space does not pass down to the quotient projective space. Even though Dirac noted this as early as 1930, it is often asserted that the superposition is a binary operation acting on two states with a value that is a unique state. The goal for this note is to motivate a rigorous, geometrical definition of the superposition of states in the setting of complex projective space, which has been argued elsewhere to be the natural geometric phase space for quantum theory. The upshot is that the new definition of the superposition of two pure states, viewed as two distinct points in the projective space, is the unique (complex) line on which those two points lie. Finally, a comparison is given between superposition and expansion in an orthonormal basis.

14.
arXiv (CS.CL) 2026-06-19

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to $4.7\times$ speedup over autoregressive decoding, and up to $1.57\times$ over a tuned dynamic decoding baseline while improving accuracy by up to $4.5$ points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is $4.4\times$ faster than the static baseline with slightly higher accuracy.

15.
arXiv (CS.CL) 2026-06-11

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.

16.
arXiv (CS.CV) 2026-06-25

Pulmonary Embolism Risk Stratification from CTPA and Medical Records: Vascular Graphs Are Not All You Need

Risk stratification for pulmonary embolism (PE) is critical for clinical decision-making. Stratification guidelines are based on patient medical records, parameters measured from computed tomography pulmonary angiography (CTPA), and blood tests. However, blood tests are often missing in routine practice. This work studies whether state-of-the-art models can accurately classify risk stratification from only medical records and biomarkers extracted from CTPA images. We benchmark different approaches to combine medical records and cardiac biomarkers with rich pulmonary vascular information; we add vascular biomarkers to tabular models and apply graph neural networks (GNNs) on the vascular tree's intrinsic graph representation. We use a private dataset (n=353) with uniquely complete data for PE risk stratification. Our results show that, among global features, medical records and cardiac biomarkers are the most significant predictors, while vascular biomarkers do not further improve stratification. Even more surprisingly, GNNs on vascular graphs fail to outperform a strong tabular baseline on global features. We consider hypotheses, on both models and data, that could explain this suboptimal performance. Our investigation suggests that, counter-intuitively, vascular graphs might hold no discriminative information for PE risk stratification. Code is available from https://github.com/creatis-myriad/GENESIS.

17.
arXiv (CS.AI) 2026-06-24

Can Scale Save Us From Plasticity Loss in Large Language Models?

arXiv:2606.24752v1 Announce Type: new Abstract: The loss of plasticity (plasticity being the ability of a network to learn new information after having already learned older information) is a fundamental challenge in creating artificial neural networks capable of continual learning. Although this phenomenon has been known for decades, it has mostly been studied in older, relatively small architectures and rarely in natural-language domains. To determine whether loss of plasticity remains a problem in the modern transformer-based LLM paradigm, we study plasticity loss in GPT-style Transformer models trained on a multilingual continual learning problem. Consistent with prior work, we find evidence of plasticity loss across models ranging from 5M to 314M non-embedding parameters, as measured by deterioration on a held-out Vietnamese probing task. We further find that the onset of plasticity loss follows a predictable scaling law, growing sublinearly with model size. These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to completely prevent it. We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes. Overall, our results suggest that even large Transformer language models trained on natural language will eventually lose the ability to efficiently adapt to new data after sufficiently long training, in both continual and stationary settings.

18.
arXiv (CS.CL) 2026-06-11

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

19.
arXiv (CS.CV) 2026-06-16

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

20.
arXiv (CS.LG) 2026-06-25

Multi-Stream Temporal Fusion for Financial Fraud Detection

arXiv:2606.25007v1 Announce Type: new Abstract: Financial fraud detection in digital banking requires reasoning over multiple heterogeneous event streams – transactions, login sessions, risk signals – that individually appear benign but collectively reveal fraudulent patterns. We propose the Multi-Stream Fraud Transformer (MSFT), a unified architecture that encodes each event stream with independent Transformer encoders and fuses their representations through configurable mechanisms. We conduct a systematic ablation study comparing five fusion strategies: concatenation, gated fusion, time-aware positional encoding, cross-stream attention, and a full combination. On a large-scale dataset (10M users, 1.5% fraud rate) with 85M parameter models, we demonstrate that (1) sequence models significantly outperform gradient-boosted trees operating on aggregated features (0.99 vs. 0.74 AUROC), (2) per-stream encoding is essential – a single-stream Transformer baseline with matched parameter budget reaches only 0.82 AUROC, an 18-point gap that confirms the multi-stream inductive bias is necessary, (3) time-aware positional encoding achieves the highest discrimination (0.9961 AUROC), (4) gated fusion yields the best precision (0.989) suitable for production deployment, and (5) the risk event stream provides the strongest individual signal contribution. We further validate on proprietary production data from a digital banking platform, showing over 22% relative AUROC improvement over the XGBoost baseline.

21.
arXiv (CS.AI) 2026-06-24

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

arXiv:2606.23238v2 Announce Type: replace Abstract: Logical reasoning is essential for reliable AI, yet existing benchmarks are largely first-order-logic-centric, focusing on object-level deduction over fixed predicates. This misses many realistic scenarios where models must reason over rules, predicates, functions, constraints, and decision procedures themselves. We introduce HOLMES (Higher-Order Logic Meets real-world Explainable Symbolic reasoning), the first real-world benchmark for higher-order symbolic reasoning in LLMs, containing 1379 instances. Built on higher-order logic, HOLMES pairs natural-language problems with HOL formalizations, ground-truth answers, verifiable reasoning traces, and fine-grained controllable reasoning factors across law and finance. Experiments show that current LLMs still struggle on HOLMES, with an average accuracy of only 50.64% and the best model reaching 59.54%. Our analyses further reveal that high final-answer accuracy can mask shortcut reasoning in conflict-resolution settings, while performance drops sharply under scope-conditioned and compositional reasoning. These findings identify higher-order symbolic reasoning as a key bottleneck for building reliable and verifiable LLMs. The project code and dataset are publicly available at https://github.com/wuyucheng2002/HOLMES.

22.
arXiv (CS.CV) 2026-06-25

BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning

Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP's existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
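
The idea of an update that is orthogonal to past task features can be illustrated in a few lines. The sketch below is not BOFA's Orthogonal Low-Rank Fusion: it covers only the orthogonality constraint, applied to a single update vector, and it assumes plain Gram-Schmidt for building a basis of the past-feature span. All names and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for vec in vectors:
        residual = list(vec)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # skip vectors already in the span
            basis.append([r / norm for r in residual])
    return basis


def project_to_safe_subspace(update, past_features):
    """Remove from `update` every component lying in the span of past features."""
    safe = list(update)
    for b in orthonormal_basis(past_features):
        coeff = dot(safe, b)
        safe = [s - coeff * bi for s, bi in zip(safe, b)]
    return safe


past = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # toy features from earlier tasks
raw_update = [2.0, 3.0, 4.0]
safe_update = project_to_safe_subspace(raw_update, past)
print(safe_update, [dot(safe_update, f) for f in past])
```

Because the projected update has zero inner product with every past feature, applying it leaves the layer's response to those features unchanged, which is the intuition behind avoiding forgetting without data replay.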

23.
arXiv (CS.CV) 2026-06-25

LastAct: Trajectory-Guided Latest-Activity Localization for Real-Time Smart-Home Activity Recognition

Human Activity Recognition (HAR) from ambient sensors enables smart-home applications such as health monitoring and assisted living. In realistic deployments, however, sensor events arrive as a continuous stream and activity boundaries are unknown. Sliding-window inference therefore produces many windows that straddle transitions and contain mixed activities, creating boundary contamination that violates the pre-segmented instance assumption used by most benchmarks and models. Moreover, many pipelines under-use spatial context by treating sensor IDs as independent tokens. We present LastAct, a trajectory-centric framework for streaming smart-home HAR that targets the most recent activity under mixed windows while explicitly modeling spatial structure. LastAct projects sensor events onto the home floorplan to form a layout-aligned trajectory image sequence that preserves spatial continuity. A lightweight gate identifies contaminated windows, and a boundary localizer estimates the last transition to enable boundary-guided masking that emphasizes post-boundary evidence and suppresses stale context. For efficiency, we reuse a precomputed layout-aligned template cache to avoid repeated rendering. Empirically, across four public smart-home datasets under near-realistic mixed-activity protocols, LastAct achieves competitive or superior performance on pure windows and yields substantial Macro-F1 gains on cross/mixed windows, demonstrating improved robustness under near-realistic sliding-window regimes.

24.
arXiv (CS.LG) 2026-06-24

FuseSampleAgg: One-Pass Neighborhood Estimation for Budgeted Knowledge-Graph Refresh and Validation

arXiv:2511.13645v2 Announce Type: replace Abstract: Operational knowledge-graph (KG) pipelines in networking and cybersecurity increasingly need to refresh embeddings under strict time, memory, and audit budgets, especially as curated feeds and LLM-assisted extraction accelerate KG updates. A recurring per-step cost in mini-batch KG learning is neighborhood-context estimation: uniform neighbor sampling without replacement followed by mean aggregation. Common frameworks implement this estimator through sampled-subgraph materialization and intermediate feature gathers, adding kernel launches, allocator pressure, and transient memory spikes. We present One-Pass Neighborhood Estimation, a fused PyTorch CUDA operator that samples neighbors and directly emits the sampled-neighborhood mean, avoiding explicit block construction while preserving GraphSAGE-mean semantics for the same sampled neighbor IDs. It supports seed-controlled sampling and optional saved-index replay for reproducible validation and regression testing. Across large-graph mini-batch workloads, it improves FP32 end-to-end step latency by 2.24x-3.48x over tuned DGL baselines and reduces transient GPU memory by up to 160x in our measurements. On OGB KG completion benchmarks such as WikiKG2 and BioKG, it reduces step time and peak VRAM while matching ranking quality within seed variability, improving time-to-quality for budgeted KG refresh.
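
The estimator described above, uniform neighbor sampling without replacement followed by mean aggregation, can be sketched on the CPU as follows. This is a plain-Python reference under stated assumptions, not the fused CUDA operator: the CSR layout (`indptr`, `indices`), the function name, and the choice to return the sampled IDs for replay are illustrative guesses.

```python
import random


def sample_and_mean(indptr, indices, features, seed_nodes, fanout, seed=0):
    """Sample up to `fanout` neighbors per seed node and average their features.

    The graph is assumed to be in CSR form: the neighbors of node v are
    indices[indptr[v]:indptr[v + 1]]. No sampled subgraph is materialised;
    each mean is accumulated directly. Nodes without neighbors get zeros.
    Returns (means, sampled_ids) so the sampled IDs can be replayed later.
    """
    rng = random.Random(seed)  # seed-controlled sampling for reproducibility
    dim = len(features[0])
    means, sampled_ids = [], []
    for v in seed_nodes:
        neighbors = indices[indptr[v]:indptr[v + 1]]
        k = min(fanout, len(neighbors))
        chosen = rng.sample(neighbors, k) if k > 0 else []
        acc = [0.0] * dim
        for u in chosen:
            for j in range(dim):
                acc[j] += features[u][j]
        if k > 0:
            acc = [a / k for a in acc]
        means.append(acc)
        sampled_ids.append(chosen)
    return means, sampled_ids


# Toy graph: node 0 -> {1, 2}, node 1 -> {0}, node 2 has no neighbors.
indptr = [0, 2, 3, 3]
indices = [1, 2, 0]
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
means, ids = sample_and_mean(indptr, indices, features, [0, 1, 2], fanout=2)
print(means, ids)
```

Reusing the same `seed` reproduces the same sampled IDs, which is the property that makes saved-index replay useful for validation and regression testing.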

25.
arXiv (CS.CV) 2026-06-16

Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

We present a hierarchical model for ball action anticipation in football broadcast video. Given a 30-second observation window, the system predicts actions occurring in the subsequent 5-second window across 10 classes. A shared local Transformer encodes clip-level features within each 5-second sub-window; a GRU then aggregates temporal context across all sub-windows; finally, a Transformer decoder with K input-conditioned event slots decodes the anticipation target via three decoupled heads (objectness, class, temporal offset). We introduce frequency-reweighted Hungarian matching that systematically favours rare action classes, and Gaussian soft targets for temporal bin supervision. On the SoccerNet Ball Action Anticipation benchmark, our method achieves 17.91% mAP on the test server.
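
Gaussian soft targets for temporal bin supervision can be illustrated with a short sketch. The abstract does not state the bin count, bin width, or standard deviation, so the values below (ten 0.5-second bins over the 5-second anticipation window, sigma of 0.5 s) are assumptions chosen only for the example, as is the function name.

```python
import math


def gaussian_soft_target(true_offset, num_bins, bin_width, sigma):
    """Soft label over temporal bins, centred on the true event offset.

    Each bin's weight is a Gaussian of the distance between the bin centre
    and `true_offset` (both in seconds); weights are normalised to sum to 1.
    """
    centers = [(i + 0.5) * bin_width for i in range(num_bins)]
    weights = [
        math.exp(-0.5 * ((c - true_offset) / sigma) ** 2) for c in centers
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed setup: event 2.3 s into a 5 s window split into ten 0.5 s bins.
target = gaussian_soft_target(true_offset=2.3, num_bins=10, bin_width=0.5, sigma=0.5)
print([round(p, 3) for p in target])
```

Compared with a one-hot label, the soft target still peaks at the bin containing the event but gives partial credit to adjacent bins, so a prediction that is off by one bin is penalised less than one that is far away.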