Paper Plaza - AcademicHub

01.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.05922

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

Authors:

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
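
The selection step can be illustrated with a minimal sketch. It assumes a round-robin comparison in which the candidate with the most pairwise wins is kept; the abstract does not say how RHO aggregates preferences, and `select_by_pairwise_preference` and `prefer` are hypothetical names standing in for the agent's own judgment.

```python
from itertools import combinations


def select_by_pairwise_preference(candidates, prefer):
    """Return the candidate with the most pairwise wins.

    `prefer(a, b)` is a hypothetical callable that returns True when the
    agent prefers candidate `a` over candidate `b`.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    # Compare every unordered pair exactly once.
    for i, j in combinations(range(len(candidates)), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Ties go to the earliest candidate, because max keeps the first maximum.
    best = max(range(len(candidates)), key=lambda i: wins[i])
    return candidates[best]
```

Round-robin needs a number of comparisons quadratic in the candidate count, which is tolerable only for a small candidate pool.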

02.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.12429

Muse Spark Safety & Preparedness Report

Authors:

Cristina Menghini, Peter Ney, Hamza Kwisaba, Zifan Wang, Miles Turpin, Felix Binder, Jean-Christophe Testud, Aidan Boyd, Nathaniel Li, Ivan Evtimov, Klaudia Krawiecka, …

Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

03.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2606.13931

DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation

Authors:

Li Zhang, Yuzhen Shi, Yiran Hu, Jingwen Zhang, Wenbo Lv, Yubo Ma, Wei Wang, Rongyao Shi, Yuanyang Qiu, Xinran Xu, Yuemeng Qi, Linlin Miao, …

Lawyer-client consultation is a critical starting point for legal services. Effective legal assistance hinges on eliciting sufficient and truthful information from clients in order to devise strategies that best protect their interests. This task requires Large Language Models (LLMs) not only to perform robust legal reasoning, but also to strategically elicit material facts through multi-turn interactions and effectively guide clients with diverse personalities. Yet existing legal benchmarks overlook this interactive capability. To fill this gap, we introduce DLawBench, a diagnostic benchmark for real-world legal consultation. Drawing on realistic client behavior, we characterize lawyer-client interactions into four types: Cooperative, Dependent, Withdrawn, and Adversarial. Using dialogues grounded in real cases, DLawBench evaluates whether LLMs can effectively conduct legal consultation under realistic conditions. DLawBench comprises 461 cases from Chinese and U.S. law, 5,532 paired fact entries, 3,411 inquiry rubrics, and 3,348 issue-resolution rubrics, and evaluates 26 representative LLMs. Systematic experiments show substantial headroom: the best-performing model, GPT-5.5, achieves only 0.562 on consultation-grounded legal reasoning. More importantly, DLawBench exposes both sycophancy in legal consultation and a paradox: models perform worse when clients need guidance most.

04.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19407

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

Authors:

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.
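
To make the "explicit process state" concrete, here is a minimal sketch of what such a record might look like. Only the five field names come from the abstract; the class name `DiagnosticState`, the string-list representation, and the closure rule in `can_close` are assumptions for illustration, not JustDiag's actual design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagnosticState:
    """Hypothetical process state for one incident (fields named in the abstract)."""
    evidence: List[str] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)
    hypotheses: List[str] = field(default_factory=list)
    conflicts: List[str] = field(default_factory=list)
    next_checks: List[str] = field(default_factory=list)

    def can_close(self) -> bool:
        # Assumed rule: close only when a single hypothesis survives and
        # nothing is left contradicted or unchecked; otherwise stay open.
        return (
            len(self.hypotheses) == 1
            and not self.conflicts
            and not self.next_checks
        )
```

Under this assumed rule, a state with two competing hypotheses stays open, which is one way to read the "calibrated non-closure" the abstract describes.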

05.

arXiv (CS.CV) 2026-06-11 DOI: arXiv:2606.11285

EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

Authors:

Zhiting Zhou, Xingchen Liu, Xinglin Yu, Jiashen Chen, Haoyang Wang, Jingao Xu, Yunhao Liu, Xinlei Chen

Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.

06.

arXiv (CS.CL) 2026-06-17 DOI: arXiv:2606.17474

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Authors:

Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, …

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

07.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2606.17846

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Authors:

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, …

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including $\pi$0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

08.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15874

LLM-as-Code Agentic Programming for Agent Harness

Authors:

Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao

Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM-as-Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path. With control in the program, the LLM's context is built from the execution history's call tree and forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by accumulation over steps. A case study of computer-use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.
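
The inversion of control described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's framework: `summarize_all` and the `llm` callable are hypothetical, and any function mapping a prompt string to a reply string can be plugged in.

```python
from typing import Callable, Dict, List


def summarize_all(
    files: Dict[str, str],
    llm: Callable[[str], str],
    max_items: int = 10,
) -> List[str]:
    """Summarize files with ordinary code owning the loop and the stop condition.

    `llm` is a hypothetical stand-in for a model call. It is invoked once per
    file and cannot change which files are visited or when the loop ends.
    """
    summaries = []
    # Ordering, iteration, and the item cap are deterministic program logic.
    for name in sorted(files)[:max_items]:
        # Each call receives only its own input, so the prompt does not
        # grow with the number of steps already executed.
        prompt = f"Summarize {name}:\n{files[name]}"
        summaries.append(llm(prompt))
    return summaries
```

Because the loop lives in the program, termination is guaranteed by the `for` statement, not by the model deciding to stop.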

09.

arXiv (CS.LG) 2026-06-18 DOI: arXiv:2606.18844

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Authors:

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.

10.

arXiv (CS.CL) 2026-06-19 DOI: arXiv:2504.02885

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Authors:

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRG methods primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. First, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Second, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances pathological feature perception and diagnostic accuracy for MRG via fine-tuned LVLMs.

11.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.15558

Degeneracy Cannot Violate the Quantum Hamming Bound

Authors:

Yu-Xuan Zhang, Jing-Ling Chen

The quantum Hamming bound is the standard finite-length sphere-packing bound for exact correction of arbitrary qubit errors. Whether degeneracy can evade this bound has remained unresolved in full generality for nearly three decades: distinct correctable errors may act identically on the code space, so the usual disjoint-sphere argument breaks down. We prove that every exact binary quantum subspace code with $K>1$ obeys the bound, without assuming either nondegeneracy or additivity. Our proof turns the Li–Xing linear-programming polynomial into an exact intersection count for quaternary Hamming balls. Monotonicity in block length and in ball-center separation then reduces the problem to a local node–edge charging inequality at the shortest admissible length. Thus degeneracy can merge correctable error sectors, but cannot enlarge the finite-length binary Hamming bound.
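
For reference, the bound in its standard textbook form, for a code of dimension $K$ on $n$ qubits that corrects every error of weight at most $t$ (the symbols $n$ and $t$ are notation assumed here, not taken from the abstract):

$$K \sum_{j=0}^{t} 3^{j} \binom{n}{j} \le 2^{n}$$

The factor $3^{j}$ counts the nontrivial Pauli operators $X$, $Y$, $Z$ on each of the $j$ affected qubits. As a check, $n=5$, $K=2$, $t=1$ gives $2\,(1 + 3\cdot 5) = 32 = 2^{5}$, so the five-qubit code meets the bound with equality.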

12.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.20005

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

Authors:

Guangda Liu, Yiquan Wang, Chengwei Li, Wenhao Chen, Jing Lin, Yiwu Yao, Danning Ke, Wenchao Ding, Jieru Zhao

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
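
To show why a one-pass KL reduction is possible at all, the following CPU sketch computes the divergence for a single query row while reading the key logits tile by tile. It is a generic online-softmax style derivation and should not be read as StreamKL's formulation or kernel. It uses the identity $\mathrm{KL}(P\|Q) = \sum_i p_i (s_i - t_i) - \log Z_s + \log Z_t$ for $P=\mathrm{softmax}(s)$ and $Q=\mathrm{softmax}(t)$; the names `streaming_kl` and `reference_kl` are hypothetical.

```python
import math


def streaming_kl(s, t, tile=4):
    """Return KL(softmax(s) || softmax(t)), reading the logits tile by tile."""
    if not s or len(s) != len(t):
        raise ValueError("s and t must be non-empty and of equal length")
    m_s = m_t = -math.inf  # running maxima of each logit row
    l_s = l_t = 0.0        # running sums of exp(logit - running max)
    a = 0.0                # running sum of exp(s_i - m_s) * (s_i - t_i)
    for start in range(0, len(s), tile):
        s_tile = s[start:start + tile]
        t_tile = t[start:start + tile]
        new_m_s = max(m_s, max(s_tile))
        new_m_t = max(m_t, max(t_tile))
        # Rescale the accumulators to the new maxima.
        # On the first tile exp(-inf) is 0.0, which is harmless.
        scale_s = math.exp(m_s - new_m_s)
        scale_t = math.exp(m_t - new_m_t)
        l_s *= scale_s
        a *= scale_s
        l_t *= scale_t
        for si, ti in zip(s_tile, t_tile):
            w = math.exp(si - new_m_s)
            l_s += w
            a += w * (si - ti)
            l_t += math.exp(ti - new_m_t)
        m_s, m_t = new_m_s, new_m_t
    # a / l_s is sum_i p_i (s_i - t_i); m + log(l) is each log-partition term.
    return a / l_s - (m_s + math.log(l_s)) + (m_t + math.log(l_t))


def reference_kl(s, t):
    """Two-pass reference that materializes both distributions."""
    def softmax(x):
        m = max(x)
        e = [math.exp(v - m) for v in x]
        z = sum(e)
        return [v / z for v in e]

    p, q = softmax(s), softmax(t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Only five scalars are carried between tiles, so the extra state per query row is constant in the number of keys.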

13.

arXiv (CS.CV) 2026-06-19 DOI: arXiv:2606.20092

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Authors:

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, …

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.

14.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.12913

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

Authors:

Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
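
A greedy selection by marginal gain can be sketched as follows. The objective is an assumption made for illustration (sum of selected node weights plus the edge weights between selected pairs); the abstract does not give the paper's exact objective or tie-breaking, and `greedy_prune`, `node_w`, and `edge_w` are hypothetical names.

```python
def greedy_prune(node_w, edge_w, k):
    """Greedily pick k sample indices by marginal gain.

    node_w[v]    : intrinsic value of sample v (assumed given).
    edge_w[u][v] : extrinsic (pairwise) value of keeping u and v together,
                   assumed symmetric.
    The marginal gain of adding v to the current selection S is
    node_w[v] + sum(edge_w[u][v] for u in S).
    """
    selected = []
    remaining = set(range(len(node_w)))
    while remaining and len(selected) < k:
        def gain(v):
            return node_w[v] + sum(edge_w[u][v] for u in selected)

        # Sorting makes ties deterministic: the lowest index wins.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with `node_w = [1.0, 0.9, 0.8]` and edge weights that score samples 0 and 2 as highly complementary, the second pick is sample 2 even though sample 1 has the larger intrinsic value.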

15.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.16337

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Authors:

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

16.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.15320

Conditional Multi-Event Temporal Grounding in Long-Form Video

Authors:

Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, …

Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.

17.

arXiv (quant-ph) 2026-06-12 DOI: arXiv:2606.13290

Understanding quantum behaviors of an electron in a uniform magnetic field alternatively

Authors:

Jin-Ming Wang, Yuan-Zao Gao, Dai-Lin Cun, Jian Jing

Quantum mechanically, an electron moving in a uniform magnetic field forms Landau levels. A curious feature is that for states with a negative angular quantum number, the total probability current vanishes, which appears to contradict the classical picture of cyclotron motion. While a geometric interpretation based on classical orbits exists, alternative interpretations remain of interest. In this paper, we examine the probability current density and identify a critical radius that naturally partitions the plane into an inner clockwise-flow region and an outer counterclockwise-flow region. We show that the vanishing total current results from an exact cancellation between these two regions. Furthermore, by defining a partitioned kinetic angular momentum with respect to the critical radius, we reveal an intrinsic competitive structure: the electron simultaneously carries two opposing rotational components. The negative quantum number manifests in the strength of the inner counter-rotation, while the net kinetic angular momentum remains positive. This bidirectional flow picture also provides a dynamical interpretation of the infinite degeneracy of Landau levels.
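
For reference, the textbook Landau-level spectrum for motion in the plane perpendicular to the field, ignoring spin (the symbols $n$, $\omega_c$, and $m_e$ are standard notation assumed here, not taken from the abstract):

$$E_n = \hbar\omega_c\left(n + \tfrac{1}{2}\right), \quad n = 0, 1, 2, \dots, \qquad \omega_c = \frac{|e|B}{m_e}$$

The energy depends only on the level index $n$. States within one level that differ in their angular quantum number share the same energy, which is the degeneracy the abstract refers to.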

18.

arXiv (CS.AI) 2026-06-24 DOI: arXiv:2605.06177

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Authors:

Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Ayush Noori, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton

Reproducing and comparing deep research agents today is hard: the same backbone evaluated on the same benchmark can report different accuracies across papers because the harness and tool registry differ, and integrating a new model into a comparable evaluation surface costs weeks of model-specific engineering. These are symptoms of a broader reproducibility problem in deep research agent research. Here, we introduce BioMedArena, an open-source toolkit that addresses this reproducibility gap and provides an arena for comparing deep research agents under a shared evaluation environment. BioMedArena decouples six layers of biomedical agent evaluation – benchmark loading, tool exposure, tool selection, harness mode, context management, and scoring – and exposes 166 biomedical benchmarks and 75 biomedical tools across 9 functional families. Adding a new model, benchmark, or tool can be accomplished with a few-line provider adapter. Beyond evaluation infrastructure, BioMedArena ships a library of high-quality reference components: 6 agent harnesses (including our proposed Mutual-Evolve) and 6 context-management strategies, any of which can be equipped on any backbone. Equipping these components substantially improves all 12 backbones; on each of 8 representative biomedical benchmarks, the best equipped backbone surpasses prior state-of-the-art (SOTA), by 15.01 percentage points on average. The toolkit, configurations, and per-task traces are available at https://github.com/AI-in-Health/BioMedArena.

19.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.13473

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Authors:

Jiacheng Chen ↗Xinyu Zhang ↗Shunkai Zhang ↗Yanmohan Wang ↗Lin Li ↗Tiancheng Qin ↗Qin Wang ↗Zhengmao Zhu ↗Tianle Li ↗Jingyang Li ↗Zehan Li ↗Binyang Jiang ↗…

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities – proof generation, proof verification, and critique-conditioned proof repair – using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
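
The abstract names tournament selection as the final step but does not describe the bracket format or the judge. As a rough illustration only, the sketch below assumes a single-elimination bracket over a list of candidates with a pairwise comparator; `tournament_select` and `prefer` are hypothetical names, not part of MaxProof, and in practice the comparator would be a model-based ranker rather than a numeric comparison.

```python
def tournament_select(candidates, prefer):
    """Return one winner from `candidates` via a single-elimination bracket.

    `prefer(a, b)` is a hypothetical pairwise judge: it returns True when
    candidate `a` should advance over candidate `b`.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Pair up neighbours; the preferred candidate of each pair advances.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if prefer(a, b) else b)
        # With an odd pool size, the last candidate gets a bye.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


if __name__ == "__main__":
    # Toy stand-in: candidates are scores, and the higher score is preferred.
    scores = [3, 9, 4, 7, 1]
    print(tournament_select(scores, lambda a, b: a >= b))
```

A bracket of this kind needs only n - 1 pairwise comparisons for n candidates, which is one plausible reason to prefer it over ranking every pair, though the paper may use a different scheme.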


20.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.17070

KFTD: Koopman-Fourier Time-Differentiable Network for Continuous Ocean Spatiotemporal Forecasting

Authors:

Qinghui Chen ↗Zekai Zhang ↗Hailong Liu ↗Jinglin Zhang ↗Cong Bai ↗

Accurate oceanic forecasting is critical for climate monitoring and disaster early warning. However, ocean spatiotemporal forecasting faces the dual challenges of modeling complex dynamical systems and ensuring computational efficiency. We present the Koopman-Fourier Time-Differentiable (KFTD) Network, a time-continuous two-stage paradigm that decouples interpolation from prediction to achieve efficient and scalable spatiotemporal modeling. We map complex nonlinear dynamics into the Koopman linear space and exploit Fourier analysis to enable continuous-time interpolation at arbitrary sub-steps. A lightweight residual network consumes the high-fidelity intermediate states to yield the final forecast. Unlike diffusion models, KFTD eliminates multi-step noise sampling and directly evolves the system in continuous time, yielding a 4$\times$ computational speedup. We further introduce a DPP Loss that supports arbitrary PDE constraints in an end-to-end manner, breaking the physical consistency bottleneck of purely data-driven approaches. Empirical results on four ocean datasets confirm that our continuous-time framework reduces MSE by an average of 5.6% (up to 12.7% for SST) and improves efficiency over MCVD by 76.25%.


21.

arXiv (CS.CV) 2026-06-25 DOI: arXiv:2507.16863

Position: Reasoning After Perception Means Reasoning Without Vision

Authors:

Hongcheng Gao ↗Zihao Huang ↗Jingyi Tang ↗Lin Xu ↗Xinhao Li ↗Haoyang Li ↗Yue Liu ↗Minhua Lin ↗Xinlong Yang ↗Taihang Hu ↗Ge Wu ↗Balong Bi ↗…

A common belief in multimodal research is that the perceptual weaknesses of vision–language models can be compensated by stronger language reasoning (e.g., chain-of-thought, in-context learning, or external tools). We challenge this assumption. We argue that for a broad class of visual tasks hard to specify in language, failures stem from a structural fatality where the temporal decision of when to reason strictly dictates the spatial constraint of where reasoning takes place. When visual reasoning is deferred to language generation, current architectures do not merely delay computation; they displace it from the continuous visual representation to a discrete textual space. Consequently, the sequential "Perception-then-Reasoning" paradigm degenerates perception into a passive, one-off feature encoding process, rendering it functionally equivalent to "Reasoning-in-Text-Space", where task-critical spatial signals are collapsed before reasoning begins. We substantiate this claim with the Turing Eye Test (TET): tasks that must be resolved in visual space and are hard to verbalize; results show text-only reasoning cannot remedy these perceptual failures. Our findings suggest rethinking the architectural divide: shifting from reasoning about perception to reasoning within perception. This facilitates actively reasoning-driven perception that operates directly on pixel-level visual representations, rather than within a collapsed textual space.


22.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2501.05103

Exactly Solvable Quantum Model with Spin-Dependent Coulomb Interaction

Authors:

Jiang-Lin Zhou ↗Yu-Xuan Zhang ↗Choo Hiap Oh ↗Jing-Ling Chen ↗

In this work, we report an exactly solvable quantum model featuring a spin-dependent Coulomb interaction, described by the spin vector potential $\vec{\mathcal{A}} = k (\vec{r} \times \vec{S}) / r^2$ together with a Coulomb-type scalar potential $\varphi = \kappa / r$. The model is governed by the Schrödinger-type Hamiltonian $\mathcal{H}_S = \vec{\Pi}^2 / (2M) + q \varphi$ in nonrelativistic quantum mechanics and by the Dirac-type Hamiltonian $\mathcal{H}_D = c \vec{\alpha} \cdot \vec{\Pi} + \beta M c^2 + q \varphi$ in relativistic quantum mechanics, where $\vec{\Pi} = \vec{p} - (q/c)\vec{\mathcal{A}}$ is the canonical momentum. We demonstrate two main results: (i) Just as the Coulomb-type scalar potential $\mathcal{S}_{\mathrm{Maxwell}} = \{\vec{\mathcal{A}} = 0,\ \varphi = \kappa / r\}$ is a local exact solution of Maxwell's equations on $r \neq 0$, the gauge potential $\mathcal{S}_{\mathrm{YM}} = \{\vec{\mathcal{A}} = k (\vec{r} \times \vec{S}) / r^2,\ \varphi = \kappa / r\}$ constitutes a local exact solution of the Yang–Mills equations on the punctured region $r \neq 0$. (ii) Both Hamiltonians $\mathcal{H}_S$ and $\mathcal{H}_D$ can be solved exactly in the presence of this spin-dependent Coulomb interaction. The resulting energy spectra are derived, and they naturally reduce to those of the ordinary hydrogen atom when the spin-dependent terms are neglected. Finally, we clarify the quantization conditions and the fixed-background interpretation of the model.


23.

arXiv (CS.LG) 2026-06-19 DOI: arXiv:2606.19549

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

Authors:

Lin Tang ↗Wei Zhang ↗Jing Li ↗Hongyu Chen ↗Ming Zhao ↗Yuxuan Wang ↗

Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training – chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.


24.

arXiv (CS.AI) 2026-06-24 DOI: arXiv:2606.23888

E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis

Authors:

Sijing Li ↗Zhongwei Qiu ↗Zhuoya Wang ↗Boxiang Yun ↗Zhenyu Yi ↗Jianwei Xu ↗Wenqiao Zhang ↗Yingda Xia ↗Ling Zhang ↗

While Vision-Language Models (VLMs) show great promise in volumetric medical report generation, they frequently suffer from visual hallucinations and a lack of grounding in 3D CT data. Current Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) strategies typically optimize text fidelity alone, essentially rewarding correct diagnoses derived from language priors rather than genuine visual perception. To address this, we propose cross-view aligned Evidence-driven Multimodal Reinforcement Learning (Evidence-MRL, denoted E-MRL), a reliable RL reasoning framework that formulates the generation process as a Markov Decision Process of "diagnosis-localization-verification". Unlike standard approaches, our model is explicitly trained to identify a "key evidence slice" alongside the global diagnostic report, grounding its findings in verifiable visual evidence. Crucially, we introduce a novel cross-view consistency reward, which validates the semantic alignment between the golden-standard report and a local visual re-query of the selected key slice, providing additional rewards for correctly-localized reasoning. Experiments on large-scale 3D CT tumor datasets demonstrate that E-MRL significantly reduces hallucinations and improves diagnostic accuracy compared to SFT and RL baselines, offering a clinically interpretable solution for visually-grounded tumor analysis.


25.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2504.14582

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

Authors:

Zheng Chen ↗Kai Liu ↗Jue Gong ↗Jingkai Wang ↗Lei Sun ↗Zongwei Wu ↗Radu Timofte ↗Yulun Zhang ↗Xiangyu Kong ↗Xiaoxuan Yu ↗Hyunhee Park ↗Suejin Han ↗…

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and the methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.
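
For readers unfamiliar with the restoration-track metric, the sketch below computes PSNR from the standard definition, 10 log10(peak^2 / MSE). It is a minimal illustration under assumptions: images are nested lists of 8-bit grayscale values, and the function name `psnr` and the default `peak=255.0` are ours. The abstract does not state the challenge's exact protocol (color space, border cropping), so this should not be read as the official scoring code.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists of pixel intensities (rows of values).
    `peak` is the maximum possible intensity; 255 assumes 8-bit data.
    """
    ref_flat = [p for row in reference for p in row]
    out_flat = [p for row in restored for p in row]
    if not ref_flat or len(ref_flat) != len(out_flat):
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, out_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    hr = [[100, 100], [100, 100]]
    sr = [[110, 110], [110, 110]]
    print(round(psnr(hr, sr), 2))
```

Higher values mean the restored image is closer to the reference pixel by pixel, which is why PSNR suits the restoration track but not the perceptual one.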

