Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-12

Counterfactual Explanations for Deep Two-Sample Testing

arXiv:2606.04009v2 Announce Type: replace-cross Abstract: Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.

02.
arXiv (CS.AI) 2026-06-17

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

arXiv:2606.17648v1 Announce Type: new Abstract: Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

03.
arXiv (math.PR) 2026-06-12

(Non)-hyperuniformity of perturbed lattices

arXiv:2405.19881v3 Announce Type: replace Abstract: We ask whether a stationary lattice in dimension $d$ whose points are shifted by identically distributed but possibly dependent perturbations remains hyperuniform. When $d = 1$ or $2$, we show that it is the case when the perturbations have a finite $d$-moment, and that this condition is sharp. When $d \geq 3$, we construct arbitrarily small perturbations such that the resulting point process is not hyperuniform. As a side remark of independent interest, we exhibit hyperuniform processes with arbitrarily slow decay of their number variance.

04.
arXiv (CS.AI) 2026-06-12

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

arXiv:2606.13233v1 Announce Type: cross Abstract: Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-$M$ NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to $\sim\!$2 points over the NVFP4 baseline. Our CUDA-core small-$M$ kernel further improves latency-critical decoding, delivering up to $2.5\!\times$ kernel-level speedup over NVFP4 vLLM and approximately $2\!\times$ end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.

05.
bioRxiv (Bioinfo) 2026-06-21

Antibody-Antigen Affinity Prediction with Chain-Aware Protein Language Modeling

Motivation: Antibody-antigen affinity determines which antibodies advance in therapeutic discovery, repertoire analysis and affinity maturation, but experimental measurements are sparse relative to the scale of sequence libraries. Structure-based predictors can exploit interface geometry when reliable complexes are available, yet early discovery often requires ranking many heavy-light chain pairs against antigens for which no complex structure exists. Existing sequence-based models are scalable, but frequently compress heavy and light chains into a single antibody representation or concatenate antibody and antigen features obscuring the chain-specific and epitope-specific signals that drive binding. Results: We present AbAffinity, a sequence-only chain-aware three-stream architecture that maintains heavy chain, light chain and antigen as distinct streams. It integrates frozen ESM-2 embeddings with heavy-chain CDR-focused pooling, heavy-light self-attention, adaptive fusion gating and gated cross-attention, training only a compact interaction module. On the SAAINT-DB benchmark, AbAffinity achieves strong predictive performance under ten-fold cross-validation and maintains robust accuracy on novel antigens. It consistently outperforms recent sequence-based models across external benchmarks including SAbDab, AB-Bind and SKEMPI 2.0. Ablation studies highlight the contributions of chain-specific representations, CDR-focused pooling and the gated interaction pathway. Integrated Gradients attributions recover known paratope and epitope residues at structurally validated interfaces. AbAffinity provides a lightweight, explainable sequence-first framework for antibody triage and prioritisation when structural information is limited or unavailable.

06.
arXiv (CS.AI) 2026-06-19

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

arXiv:2606.20508v1 Announce Type: new Abstract: Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

07.
arXiv (CS.AI) 2026-06-12

Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

arXiv:2603.02274v3 Announce Type: replace-cross Abstract: Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacological response samples are sparse. While deep learning achieves predictive accuracy, it frequently fails to provide the mechanistic clarity required for clinical adoption. We present the Contextual Invertible World Model (CIWM), a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning emulator with a Large Language Model reasoning layer. Utilising a stringently curated, high-fidelity data engineering pipeline on the Sanger GDSC dataset (\( N=83 \)), we isolate true biological signals from in vitro artifacts to establish a rigorous baseline predictive correlation for complex transcriptomics (\( r=0.268 \)). Through Inverse Reasoning, we perform in silico CRISPR perturbations across the colorectal landscape. The framework autonomously overturns classical mechanistic assumptions, identifying a hierarchical dominance of mutant KRAS over the APC/Wnt-axis in driving 5-fluorouracil resistance (\( \Delta=-0.0469 \)) via a "KRAS Shield" mapped to MAPK/PI3K networks. Furthermore, the agentic layer identified a "PIK3CA Paradox", revealing that repairing PIK3CA inadvertently increases chemoresistance (\( \Delta=+0.0085 \)) by triggering a compensatory feedback loop that hyperactivates the dominant MAPK survival pathway.

09.
arXiv (CS.AI) 2026-06-17

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

arXiv:2606.17735v1 Announce Type: new Abstract: Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced early in generation can propagate irreversibly along the Markov decision process flow, triggering cascading failures that drive the reasoning trajectory toward collapse. To overcome this autoregressive cascade, in which a single early mistake can compromise all subsequent reasoning steps, we propose dynamic epistemic entropy orchestrated erasable reinforcement learning ($E^3RL$). $E^3RL$ eliminates reliance on external signals by grounding the model's endogenous local autoregressive cross-entropy as an intrinsic coordinate of epistemic uncertainty. By introducing segment-level adaptive dynamic thresholds and advantage allocation, $E^3RL$ enables the model to precisely excise localized logical defects while reusing historical key-value (KV) cache streams, thereby endowing the reasoning process with a self-healing capability. We train $E^3RL$ on the DeepMath-103k dataset. Experimental results show that $E^3RL$ reshapes the exploration efficiency of long-sequence reasoning and improves sample efficiency while maintaining linear memory overhead. On mathematical reasoning benchmarks such as AIME, $E^3RL$ achieves substantial performance gains, with the 4B and 8B parameter models surpassing previous state-of-the-art (SOTA) results by 5.349\% and 6.514\%, respectively. These findings suggest that $E^3RL$ shatters the autoregressive curse in long-sequence reasoning and establishes a theoretical and systems-level foundation for the next generation of self-healing artificial general intelligence (AGI).

10.
arXiv (math.PR) 2026-06-17

Limit theorems for descents and inversions of shelf-shuffles

arXiv:2510.00343v2 Announce Type: replace Abstract: We prove central limit theorems for the number of descents and inversions of permutations produced by shelf-shuffles. These are a model for casino card shuffling machines. We show the asymptotic normality of the number of descents in two limiting regimes depending on the ratio of cards to shelves. On the other hand, we study the inversions by employing a modification of the techniques from Islak's analysis of the statistics of riffle shuffles. In particular, we obtain a bound for the rate of convergence for inversions that is independent of the number of shelves.

11.
arXiv (CS.CL) 2026-06-16

Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization

LLMs increasingly integrate auto-suggestion optimization modules, enabling them to rewrite and display user input before generating the final response. While this design aims to enhance transparency and trust, its process of autonomously selecting a single best result from multiple candidate solutions allows attackers to hijack this optimization process by inducing subtle, imperceptible semantic shifts. To address this, we propose a semantic preservation hijacking attack method based on black-box conditions: Adaptive Greedy Local Search. This method hierarchically decomposes the input text, masks key language units, and dynamically adjusts candidate replacement words at predefined semantic checkpoints. This maximizes the deviation between the model output and the original intent while strictly maintaining semantic similarity to the original text. Experimental results on commercial and open-source LLMs demonstrate that, under the same semantic similarity constraints, this method achieves a higher attack success rate than existing attack methods in over 2400 test cases. Code is available at: https://github.com/franz-chang/DOBS

12.
arXiv (CS.AI) 2026-06-17

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

arXiv:2606.04513v2 Announce Type: replace Abstract: Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

13.
arXiv (CS.CL) 2026-06-15

ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Scientific papers advance $claims$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $1{,}617$ ACL Anthology papers $(1979 - 2025)$ that are manually annotated with $5{,}689$ claims and $4{,}871$ cross-paper claim relations, indicating whether a citing paper $\texttt{supports}$, $\texttt{extends}$, $\texttt{qualifies}$, $\texttt{refutes}$, or references a cited claim as $\texttt{background}$. Building on $\texttt{ClaimFlow}$, we define a new task – $Claim Relation Classification$ – which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating neural models and large language models on this task, we report baseline performance of $0.81$ macro-F1, suggesting that the task is tractable while leaving room for improvement. We then scale this framework to $\sim$$13k$ NLP papers to study claim evolution across decades of NLP research. We show that $63.5\%$ claims are never reused; only $11.1\%$ are ever challenged. Widely propagated claims are more often $reshaped$ through qualification and extension than supported or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP.

14.
arXiv (CS.CV) 2026-06-15

Vanishing Depth: Training Generalized Depth Adapters with Sinusoidal Depth Preprocessing for Pretrained RGB Encoders

Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose a self-supervised training approach that extends pretrained RGB encoders with a depth adapter to incorporate and align metric depth into a combined latent space without interfering with the pretrained RGB feature extraction. In combination with our sinusoidal depth encoding, the depth adapter enables generalized and robust depth density and distribution invariant feature extraction. Our depth adapters improve a wide set of generalized RGB baselines across a spectrum of relevant RGBD downstream tasks in segmentation, pose estimation, and depth completion – without the necessity of finetuning. Most importantly, we achieve 56.05 mIoU in the SUN-RGBD segmentation, while outperforming SOTA depth-aware and multi-modal encoders in our experiments. When no depth is present, one can activate our depth adapter with an empty map, use single pixel depth clues, or monocular depth estimation to include the depth aware feature extraction into subsequent downstream tasks.

15.
arXiv (CS.CV) 2026-06-11

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation.The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.

16.
arXiv (CS.AI) 2026-06-16

FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (June 13th version)

arXiv:2606.06510v2 Announce Type: replace-cross Abstract: Conventional HPC holds that native hardware FP64 is the irreducible foundation of scientific computing. On AI-optimized GPUs of the NVIDIA B300 generation and beyond, native FP64 throughput has collapsed to ~1.3 TFLOPS even as FP8 tensor throughput has grown to multiple PFLOPS. We argue something stronger than that this is survivable: the FP8 tensor-core matrix-multiply is the sole computational primitive on which double-precision scientific computing needs to be built. Every canonical kernel – dense and sparse linear algebra, spectral transforms, stencils – and every application composing them reduces, via the Chinese Remainder Theorem-based Ozaki Scheme II, to sequences of FP8 matrix operations; the only non-FP8 arithmetic is a bounded, fixed-width integer accumulation at reconstruction. Native FP64 is thereby demoted from a hardware requirement to a derived accuracy guarantee obtained by composition over the FP8 primitive. We organize the claim as a five-layer hierarchy – the FP8 op, Ozaki II, the basic kernels or Berkeley "dwarfs", composite solvers, and full applications – and, because the dwarf taxonomy already spans scientific computing, establish it by exhibiting the reduction for every dwarf rather than a sample. The claim is falsifiable, and we build the instrument that tests it: a Tensor-Memory Equilibrium (TME) model extending the Roofline with emulation parameters (alpha, beta, gamma). We identify register-level fusion as the mechanism that keeps emulation memory-bound, project recovered FP64 performance across B300 and Rubin against an H100 baseline, and close the kernel coverage with a companion FFT analysis and compensated reductions. The model could have returned a negative verdict; instead it passes across the dwarfs and their compositions. This is the analytical half of a two-part program, with a follow-on implementation to validate the thesis on real silicon.

17.
arXiv (CS.CV) 2026-06-15

Orchestra-o1: Omnimodal Agent Orchestration

The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and collaboration. However, existing orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to more complex settings where heterogeneous modalities coexist and interact. This limitation becomes particularly pronounced in omnimodal scenarios, where tasks require the unified understanding and coordination of diverse inputs such as text, image, audio, and video. In this work, we propose Orchestra-o1, an omnimodal agent orchestration framework designed to support efficient agent collaboration across multiple modalities. Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution. This scalable design allows agent systems to effectively tackle complex real-world tasks involving heterogeneous information sources, surpassing the second-best approach by 10.3% accuracy on the OmniGAIA benchmark. Furthermore, we introduce decision-aligned group relative policy optimization (DA-GRPO), an efficient agentic reinforcement learning approach for training Orchestra-o1-8B, which also achieves state-of-the-art performance against all existing open-source omnimodal agents.

18.
arXiv (CS.CV) 2026-06-15

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Reinforcement Learning with Verifiable Rewards ( RLVR ) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models ( LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning ( SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT ), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models ( MLLM s) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization ( GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.

19.
arXiv (quant-ph) 2026-06-19

Random Projections for Multi-Copy Quantum Algorithms

arXiv:2606.20238v1 Announce Type: new Abstract: Estimating nonlinear properties of quantum states is a central task in quantum information science. Multivariate traces, $\mathrm{tr}(\rho_1 \cdots \rho_K)$, and nonlinear observables such as $\mathrm{tr}(\rho^K)$, for integer $K$, can be accessed through collective measurements on multiple state copies, but standard protocols based on swap tests require coherent operations on the full Hilbert space and become experimentally unfeasible for large systems. In this work, we introduce a framework for multi-copy measurements based on random projections onto lower-dimensional subspaces prior to the collective measurement, which is then performed only on the reduced Hilbert space. This procedure yields a tunable tradeoff between coherent quantum resources and statistical sampling overhead, allowing the amount of coherent processing to be matched to the capabilities of the underlying hardware. We derive explicit formulas relating the Haar-averaged projected moments to multivariate traces of the original states and analyze the sampling overhead induced by the projection procedure. Specifically, after compressing an $n$-qubit state to a reduced $q$-qubit subspace, estimating $\mathrm{tr}(\rho^K)$ requires approximately $O(2^{(n-q)(K-1)})$ copies of $\rho$, with each qubit projected out increasing the sampling cost by a factor of $2^{K-1}$. Our results establish how coherent multi-copy operations can be traded for additional state copies, enabling multi-copy quantum protocols to be optimized for the available hardware resources.

20.
arXiv (CS.AI) 2026-06-19

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

arXiv:2605.25160v2 Announce Type: replace Abstract: GUI agents powered by large language models are advancing rapidly, creating urgent needs for evaluation and training based on realistic environments. However, directly doing so in real-world environments introduces some challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or docker images demand high resource requirements and suffer from slow response speeds, which limit the efficiency. We present \sys, a framework that could produce high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms including mobile, desktop, and automotive/in-vehicle interfaces based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experiment results on five state-of-the-art mobile GUI agents reveal substantial headroom – the average success rate is only 27.92\%, dropping to 17.82\% on long-horizon subset – while humans reach 92.08\%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

21.
arXiv (CS.AI) 2026-06-12

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

arXiv:2601.19072v3 Announce Type: replace-cross Abstract: Large Language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations – where the generated review comments are ungrounded in the actual code – poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable methods for a hallucination detection in LLM-generated code review comments without the reference. In this work, we design HalluJudge that aims to assess the grounding of generated review comments based on the context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference of the actual LLM-generated code review comments in the real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with the developer preference of the actual LLM-generated review comments in the online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

22.
arXiv (CS.LG) 2026-06-11

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

arXiv:2606.11268v1 Announce Type: new Abstract: Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have been recently applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce \textsc{LakeFM}, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that \textsc{LakeFM} learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior-forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

23.
arXiv (math.PR) 2026-06-18

Formation of clusters and coarsening in weakly interacting diffusions

arXiv:2510.17629v3 Announce Type: replace-cross Abstract: This paper studies the clustering behavior of weakly interacting diffusions under the influence of sufficiently localized attractive interaction potentials on the one-dimensional torus. We describe how this clustering behavior is closely related to the presence of discontinuous phase transitions in the mean-field PDE. For local attractive interactions, we employ a new variant of the strict Riesz rearrangement inequality to prove that all global minimizers of the free energy are either uniform or single-cluster states, in the sense that they are symmetrically decreasing. We analyze different timescales for the particle system and the mean-field (McKean-Vlasov) PDE, arguing that while the particle system can exhibit coarsening by both coalescence and diffusive mass exchange between clusters, the clusters in the mean-field PDE are unable to move and coarsening occurs via the mass exchange of clusters. By introducing a new model for this mass exchange, we argue that the PDE exhibits dynamical metastability. We conclude by presenting careful numerical experiments that demonstrate the validity of our model.

24.
arXiv (CS.CL) 2026-06-16

A Large-Scale Multi-Dimensional Empirical Study of LLMs for Conversation Summarization

Despite the significant advancement of LLMs in conversation summarization, their evaluation remains limited by insufficient scenarios, input lengths, and sample sizes. Furthermore, existing benchmarks often omit frontier reasoning systems and efficient small models, or lack fine-grained, multi-dimensional assessments. To bridge these gaps, we propose OmniCSEval, a unified benchmark comprising 1,800 diverse conversations across six real-world scenarios, featuring context lengths ranging from 128 to 32k tokens. For fine-grained evaluation, we employ a bidirectional fact-checking framework that integrates key fact matching to assess completeness and conciseness, alongside summary fact verification to evaluate faithfulness. To ensure reliable assessment, we establish a human-LLM collaborative pipeline for key fact extraction and a multi-LLM consensus verifier for summary fact decomposition. Leveraging this framework, we evaluate 28 LLMs across four distinct categories grouped by reasoning capability and model scale. Our extensive empirical study reveals critical insights regarding the cross-scenario challenges current LLMs continue to face, the impacts of reasoning and scale, and the efficiency and adaptability of reasoning models. We also provide guidance for system selection in real-world deployments.

25.
arXiv (math.PR) 2026-06-11

Micro-macro population dynamics models of benthic algae with long-memory decay and generic growth

arXiv:2505.04289v4 Announce Type: replace Abstract: Benthic algae as a primary producer in riverine ecosystems develop biofilms on the riverbed. Their population dynamics involve growth and decay processes, the former owing to the balance between biological proliferation and mortality, while the latter to mechanical abrasion because of the transport of sediment particles. Contrary to the assumptions of previous studies, the decay has experimentally been found to exhibit long-memory behavior, where the population decreases at an algebraic rate. However, the origin and mathematical theory of this phenomenon remain unresolved. The objective of this study is to introduce a novel mathematical model employing spin processes to describe microscopic biofilm dynamics. A spin process is a continuous-time jump process transitioning between states 0 and 1, and the continuum limit of these processes captures the long-memory decay and generates generic growth. The proposed framework leverages heterogeneous spin rates, achieved by appropriately superposing spin processes with distinct rates, to reproduce the long-memory decay. Computational simulations demonstrate the behavior of the model, particularly emphasizing rate-induced tipping phenomena. This mathematical model provides a computationally tractable interpretation of benthic algae dynamics and their long-term prediction, relevant to river-engineering applications.