Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-15

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.

02.
arXiv (CS.AI) 2026-06-24

Invariant Graph Representations for Continuous-Time Dynamic Graphs Under Distribution Shifts

arXiv:2405.19062v2 Announce Type: replace-cross Abstract: Continuous-Time Dynamic Graphs (CTDGs) enable fine-grained modeling of evolving relational systems. However, most existing CTDG representation learning methods are tailored to in-distribution settings and exhibit limited robustness under out-of-distribution (OOD) shifts. Although recent causal approaches learn invariant representations via interventions, they are primarily designed for static or discrete-time graphs and become computationally prohibitive for CTDGs due to the combinatorial explosion of structural and temporal variations. To address these challenges, we propose CIR, a framework grounded in a novel structural causal model termed the ICCM. To avoid exhaustive interventions, we leverage the Normalized Weighted Geometric Mean (NWGM) to efficiently approximate interventional predictions. We further instantiate ICCM within a practical deep learning architecture that jointly captures invariant structural and temporal patterns through dedicated subgraph extractors, and maintains an environment memory bank to model distributional shifts across evolving contexts. Extensive experiments demonstrate that CIR consistently outperforms existing methods under diverse OOD scenarios.

03.
arXiv (CS.LG) 2026-06-16

Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance

arXiv:2606.16663v1 Announce Type: new Abstract: Money laundering through insurance claims poses a threat to insurers both through fraudulent payouts and reputational and regulatory risk. Despite this, little research has examined how such laundering can be prevented. This paper examines whether machine learning can help insurers flag suspicious claims before payout, shifting the focus from passive reporting to active prevention. Using production data from a major Norwegian insurer, we train gradient-boosted decision tree models to detect claims later reported to authorities for suspected money laundering. Because fraud and laundering may share behavioural patterns, we also examine whether insurance fraud labels can serve as an auxiliary training signal. We compare different learning setups using the Budget-Weighted Capture Rate, a metric introduced in this paper to measure how many laundering cases are captured when only a small share of claims can be manually reviewed. The results show that incorporating fraud-related investigation labels substantially improves laundering detection. The best-performing model captures nearly two-thirds of laundering cases within the top-ranked 2 to 6 percent of claims selected for investigation. To our knowledge, this is the first empirical study of machine learning for money laundering detection in insurance claims.

04.
arXiv (CS.CL) 2026-06-11

A Geometric Profile of Semantic Information in Text: Frame-Conditional Uniqueness and a Trade-Off Triangle for Scalar Summaries

How much meaning does a text carry? Shannon's theory measures uncertainty over symbols and is intentionally indifferent to meaning, while pairwise metrics such as BERTScore compare two texts rather than characterizing one. We develop a geometric framework that measures semantic content from the structure of a text's sentence embeddings. The framework has three parts. First, within a fixed embedding and baseline, six natural axioms uniquely determine a scalar measure up to scale, a frame-conditional uniqueness theorem. The resulting scalar is empirically too coarse, motivating a richer representation. Second, we propose a three-coordinate semantic profile capturing novelty (displacement from generic discourse), breadth (diversity of distinct ideas), and integration (connectedness among them), together with a discrete minimal unit (the semantic quantum) whose resolution is fixed by a clustering threshold $\tau$. Third, we prove a no-go theorem: no scalar summary of the profile can simultaneously satisfy analytic stability under paraphrase and concatenation, ordinal robustness across text scales, and cross-representation comparability. We exhibit two practical scalars, $S_{\mathrm{minmax}}$ and $S_{\mathrm{rank}}$, each occupying a distinct corner of this trade-off triangle. Validation across 23 synthetic categories, 5 Project Gutenberg novels, and 3 embedding models confirms the trade-off. The recommended rank-normalized configuration passes 25 of 28 ordinal checks as point estimates (21 of 28 after Benjamini-Hochberg correction), outperforming seven baselines including unigram entropy and a BERTScore-based novelty signal. A separate variational result connects the breadth coordinate to the log-determinant of a determinantal point process (Spearman $\rho = 0.985$ over 507 Gutenberg chapters), giving an optimization-theoretic foundation for breadth.

05.
arXiv (quant-ph) 2026-06-15

Jones-matrix analysis of phase accumulation in a linear-optical multi-pass interferometer

arXiv:2606.14422v1 Announce Type: new Abstract: Quantum information science has traditionally relied on nonclassical resources, such as entangled photon pairs and squeezed states, to achieve measurement performance beyond classical limits. Here, we revisit the multi-pass photonic scheme reported in Nature 450, 393 (2007) to clarify the physical origin of the observed superresolution and the associated claim of supersensitivity. Using a rigorous Jones-matrix formalism, we show that the round-trip evolution of the HQMQ linear optics unit is equivalent to the product of two reflections in polarization space, resulting in an effective rotation operator. This equivalence reveals that the accumulated phase arises from coherent polarization-state rotation on the Poincare'e sphere. The resulting phase accumulation is interpreted geometrically as a progressive realignment of the polarization state during successive forward and backward propagations. To validate the theoretical model, a classical-wave implementation is experimentally conducted, analyzed, and compared with the corresponding Jones-matrix solution. Finally, the scaling behavior of the Fisher information is analyzed to examine the origin of the claimed supersensitivity. The results are further compared with a recently developed coherence de Broglie wavelength framework, which achieves identical superresolution through repeated coherent interactions in a cascaded interferometeric architecture.

06.
arXiv (CS.CL) 2026-06-24

Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking

Recent multi-bit watermarking methods for large language models (LLMs) prioritize capacity over reliability, often conflating decoding with detection. Our analysis reveals that existing ECC-based extractors suffer from catastrophic false positive rates (FPR), and applying rejection thresholds merely collapses detection sensitivity (TPR) to random guessing. To resolve this structural limitation, we propose BREW (Block-wise Reliable Embedding for Watermarking), a framework shifting the paradigm to designated verification. BREW employs a two-stage mechanism: (i) blind message estimation via independent block voting, followed by (ii) window-shifting verification that rigorously validates the payload against local edits. Experiments demonstrate that BREW achieves a TPR of 0.965 with an FPR of 0.02 under 10% synonym substitution, demonstrating that the high-FPR issue is not an inherent trade-off of multi-bit watermarking, but a solvable structural flaw of prior decoding-centric designs. Our framework is model-agnostic and theoretically grounded, providing a scalable solution for reliable forensic deployment.

07.
arXiv (CS.CV) 2026-06-25

TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction

Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from the attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments show that TTSA3R achieves competitive performance on standard short-sequence benchmarks and provides substantially stronger robustness on extended sequences. On NRGBD, as sequences extend from 50 to 250 frames, TTSA3R exhibits only a 1.33x error increase, compared with over 4x degradation for CUT3R. This highlights the practical value of temporal-spatial adaptive updates for long-term reconstruction stability. Our code is available at https://github.com/anonus2357/ttsa3r.

08.
arXiv (CS.CV) 2026-06-25

DRM: Diffusion-based Reward Model With Step-wise Guidance

Current mainstream methods of aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture the essential perceptual qualities-such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that use the pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.

09.
arXiv (CS.CL) 2026-06-12

BLUEmed: Retrieval-Augmented Multi-Agent Debate for Clinical Error Detection

Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.

10.
arXiv (CS.CV) 2026-06-15

Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems

The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital twin driven robotic sorting system that integrates grasp prediction, multi modal perception, and semantic reasoning for real world textile classification. A dual arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state of the art Visual Language Models (VLMs). We benchmark nine VLM s from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9 %), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed accuracy trade offs for edge deployment. A digital twin combined with MoveIt enables collision aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.

12.
arXiv (CS.CL) 2026-06-24

On the Stability of Prompt Ranking in Large Language Model Evaluation

Prompt-based interaction has become a dominant paradigm for using large language models (LLMs), where multiple candidate prompts are evaluated and the top-ranked one is selected for downstream use. This workflow implicitly assumes that prompt rankings are stable under minor variations in evaluation conditions. In this paper, we systematically study prompt ranking stability under common sources of variability, including random seeds and limited evaluation subsets. Across three open-weight LLMs and two benchmark tasks, we find that while overall rank correlations are often moderate to high, the identity of the top-performing prompt frequently changes, leading to unreliable selection decisions. To address this issue, we propose a simple stability-aware selection strategy based on a lower confidence bound, which accounts for both performance and variance. Our results show that this approach improves robustness in unstable settings while remaining competitive in more stable regimes. These findings highlight the importance of accounting for evaluation uncertainty in prompt selection and LLM benchmarking.

13.
arXiv (CS.AI) 2026-06-25

AI-Assisted Computational Reproducibility on the FABRIC Testbed

arXiv:2606.25879v1 Announce Type: cross Abstract: Computational reproducibility remains difficult despite being central to scientific research. In this paper, we show how the international FABRIC testbed, combined with large language model (LLM) coding assistants through LoomAI, can simplify reproducing published experiments across multiple domains. We reproduced three case studies on FABRIC, covering BBR-family congestion-control evaluations, LAMMPS molecular dynamics scaling benchmarks on a CPU-only MPI cluster, and stress protein homeostasis genomics pipelines. Rather than focusing only on matching numerical outputs, we evaluate whether the reproduced experiments support the same scientific conclusions as the original studies. The AI assistant was effective in setting up the environment, adapting code, and debugging, but struggled with the analysis stages that lacked clearly defined workflows, which required human guidance to establish execution order and data dependencies. Across the case studies, the AI-assisted workflow reduced reproduction effort by roughly 4–6 times. We conclude with practical recommendations for improving AI-assisted reproducibility on research testbeds.

14.
arXiv (CS.CL) 2026-06-15

Succeeding at Scale: Enterprise Retrieval Benchmark Construction and Index-Preserving Query Adaptation for Multi-Tenant Search

Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data." This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further study and systematically evaluate index-preserving query-only adaptation strategies that fine-tune only the query-encoder while keeping the document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that parameter-efficient fine-tuning of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise multi-tenant retrieval.

15.
arXiv (CS.CV) 2026-06-24

REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs

Vision-language models (VLMs) are increasingly used as perception-reasoning backbones for embodied intelligence in safety-critical physical systems, where perception or reasoning errors can lead to unsafe decisions or actions. Although many red-teaming methods have been developed to probe VLM vulnerabilities, their evaluation remains fragmented across datasets, metrics, and threat models, making direct comparison difficult and obscuring whether observed differences arise from stronger attacks, more vulnerable models, or incompatible evaluation settings. Existing chatbot-centric red-teaming benchmarks mainly standardize jailbreak and content-safety evaluation, but they do not systematically capture physically grounded functional failures or cover red-teaming methods that target physical-world VLMs. This raises the key challenge of comparing diverse attack methods under a unified protocol while targeting the same scenario-specific failures. We introduce REALM, to our knowledge the first unified red-teaming benchmark for physical-world VLMs. REALM integrates 12 red-teaming methods, 3 model-agnostic defenses, and 13 VLMs under a practical black-box threat model with shared datasets and metrics. To align adversarial objectives across attack families, REALM introduces an agentic target-generation pipeline that constructs shared, scenario-specific, and physically grounded attack objectives for each scene, enabling fair comparison of diverse red-teaming methods under aligned adversarial goals. Our evaluation shows that text and typographic injection attacks induce the most failures, multimodal co-optimization yields the strongest visual-perturbation transfer, single-pass attacks approach iterative methods at much lower cost, and model scale alone does not confer adversarial robustness. Code is available at https://github.com/UCF-ML-Research/REALM.

16.
arXiv (CS.CV) 2026-06-11

DeceptionX: Explainable Deception Detection with Multimodal Large Language Models

Deception detection is a critical and highly challenging task within affective computing and behavioral analysis. Existing deep learning methods typically treat this task as a straightforward classification problem; however, this black-box approach lacks interpretability and fails to capture the complex logical deduction processes utilized by human experts when identifying lies. While Multimodal Large Language Models (MLLMs) have shown potential, applying them effectively requires a bridge between low-level audiovisual cues and high-level logical reasoning. In this paper, we propose DeceptionX, a novel MLLM framework that shifts the paradigm of deception detection from black-box classification to an interpretable Observe-Think-Summarize reasoning process. To address the scarcity of high-quality reasoning data, we first constructed DeceptChain, a high-quality dataset developed through a human-in-the-loop process. This dataset synthesizes fine-grained visual and auditory evidence (such as micro-expressions and vocal tremors) into structured chain-of-thought reasoning data. Furthermore, we propose a three-stage training pipeline and a Discrepancy-Aware Redundancy Elimination~(DARE) strategy for DeceptionX to further enhance the model's generalization capabilities. Extensive experiments demonstrate that DeceptionX not only outperforms existing MLLM baselines and state-of-the-art methods on standard real-world benchmarks but also provides transparent, expert-level reasoning paths, bridging the critical gap between accuracy and interpretability in multimodal deception detection.

17.
arXiv (CS.CV) 2026-06-16

SAMTok: Representing Any Mask with Two Words

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.

18.
arXiv (CS.CV) 2026-06-18

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

19.
arXiv (CS.AI) 2026-06-17

Learn to Quantify Social Interaction with Constraints for Pedestrian Walking

作者:

arXiv:2606.17897v1 Announce Type: new Abstract: Long-term human path forecasting in crowds is critical for autonomous moving platforms (like autonomous driving cars and social robots) to avoid collision and make high-quality planning. Although the current research take into account social interactions for prediction, they don't reveal the exact kinds of social interactions happened among people and how the social interactions affect the decision-making process of pedestrians, which further limits its robustness. Social interactions in pedestrian walking are intuitively massive and hard to label and quantify. In this paper, we explore creatively to quantify and interpret how pedestrians interact with others by proposing Learn to Cluster. Our clustering social interactions is probabilistic latent variable generative, learning directly from sequential trajectory observations, scalable to arbitrary number of pedestrians. Learn to cluster is label-free and can be naturally integrated into the training process of the prediction model. The latent variables will then serve as 'labels' to categorize social interactions. Extensive experiments over several trajectory prediction benchmarks demonstrate that our method is able to learn the patterns of social interactions and effectively integrate the patterns to pedestrian trajectory prediction.

20.
arXiv (CS.LG) 2026-06-17

Evaluating Intersectional Fairness across Clinical Machine Learning Use Cases using Fairlogue and the All of Us Research Program

arXiv:2604.16450v2 Announce Type: replace-cross Abstract: Intersectional biases in healthcare data can produce compound disparities in clinical machine learning models, yet most fairness evaluations assess demographic attributes independently. FairLogue, a toolkit for intersectional fairness auditing, was applied across multiple clinical prediction tasks to evaluate disparities across combined demographic groups. Using the All of Us dataset, two published models were selected for replication and evaluation: (A) prediction of selective serotonin reuptake inhibitor associated bleeding events and (B) two-year stroke risk in patients with atrial fibrillation. Observational fairness metrics were computed across race, gender, and intersectional subgroups, followed by counterfactual analysis to evaluate whether disparities were attributable to group membership. Intersectional evaluation revealed larger disparities than single-axis analyses; however, counterfactual diagnostics indicated that most observed disparities were comparable to those expected under randomized group membership. These results highlight the importance of intersectional fairness auditing and demonstrate how FairLogue provides deeper insight into bias in clinical machine learning systems.

21.
arXiv (CS.CV) 2026-06-12

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); On real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.

22.
arXiv (CS.LG) 2026-06-17

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

arXiv:2606.17500v1 Announce Type: new Abstract: Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at https://github.com/KastnerRG/particle_transformer_aie.

23.
arXiv (CS.CL) 2026-06-25

Membox: Weaving Topic Continuity into Long-Range Memory for LLM Agents

Long-term human-agent dialogues are organized by topic continuity: adjacent turns often develop the same goal, plan, problem, or event, while related activities may recur across distant sessions. Yet many LLM agent memory systems first decompose histories into isolated turns or fixed-size chunks, then compensate through enrichment, consolidation, or retrieval mechanisms still tied to semantic proximity or fragment-level records. This weakens temporal and causal organization and biases memory access toward semantic proximity rather than task- or topic-level continuity. We introduce Membox, a hierarchical memory architecture that instantiates topic continuity as an explicit organization layer for agent memory. Its Topic Loom incrementally organizes dialogue streams into boxes whose internal turns follow the same local topic, while its Trace Weaver links extracted events across boxes into macro-topic traces that recover recurring activities, goals, and factual developments across distant sessions. On LoCoMo, Topic-Loom-only retrieval improves over the best Mem0/A-MEM retrieval-depth setting by 13.00 F1 points (53.95 vs. 40.95), and trace-expanded retrieval further raises F1 to 55.28; with GPT-4o, trace-expanded retrieval reaches 59.71 F1. Additional DialSim results show the same gain from adding cross-box traces in multi-party dialogue. These results show that local topic-continuity organization and macro-topic trace expansion improve long-range memory beyond semantic retrieval over fragmented records.

24.
arXiv (CS.CL) 2026-06-25

Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

Large language models are widely employed as evaluators, a paradigm commonly referred to as LLM-as-a-judge. Prior research has predominantly examined point-wise or pair-wise evaluation protocols; in contrast, our focus is on rubric-based evaluation, which has been attracting increasing attention owing to its utility for training models in domains where verification is otherwise difficult. In this work, we show that rubric-based evaluation implicitly resembles a multiple-choice setting and therefore exhibits position bias: LLMs tend to prefer score options that appear at specific positions within the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate that this position bias is consistent. Its direction, however, is model-specific: some judges favor the first option, while others favor the last. We further identify a second, orthogonal axis of bias: when a prompt scores several criteria simultaneously, the ordering of the criteria itself shifts the resulting scores. We additionally explore permuting the order of the rubric options as a means of mitigating position bias, and find that although the bias can be attenuated, improvements in the correlation between model judgments and human annotations are obtained primarily for models that exhibit strong bias. Our results recast rubric-based LLM-as-a-judge as a multiple-choice problem with measurable, model-specific position bias, and we further confirm that only a small number of random order permutations are sufficient to reduce the error introduced by this bias for the majority of models.

25.
arXiv (CS.AI) 2026-06-11

A Physics-Inspired Optimizer: Velocity Regularized Adam

arXiv:2505.13196v3 Announce Type: replace-cross Abstract: We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, and damping oscillations. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non convex objective under mild assumptions. We demonstrate that VRAdam exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.