论文广场 - AcademicHub

01.

arXiv (CS.LG) 2026-06-16 DOI: arXiv:2312.06173

Concrete Subspace Learning based Interference Elimination for Multi-task Model Fusion

作者:

Anke Tang ↗Xianglin Luo ↗Li Shen ↗Yong Luo ↗Liang Ding ↗Han Hu ↗Bo Du ↗Dacheng Tao ↗

arXiv:2312.06173v2 Announce Type: replace Abstract: Merging models fine-tuned from a common, extensively pre-trained large model but specialized for different tasks has been demonstrated as a cheap and scalable strategy to construct a multi-task model that performs well across diverse tasks. Recent research, exemplified by task arithmetic, highlights that this multi-task model can be derived through arithmetic operations on task vectors. Nevertheless, current merging techniques frequently resolve potential conflicts among parameters from task-specific models by evaluating individual attributes, such as the parameters' magnitude or sign, overlooking their collective impact on the overall functionality of the model. In this work, we propose the CONtinuous relaxation of disCRETE (Concrete) subspace learning method to identify a common low-dimensional subspace and utilize its shared information to track the interference problem without sacrificing much performance. Specifically, we model the problem as a bi-level optimization problem and introduce a meta-learning framework to find the Concrete subspace mask through gradient-based techniques. At the upper level, we focus on learning a shared Concrete mask to identify the subspace, while at the inner level, model merging is performed to maximize the performance of the merged model. We conduct extensive experiments on both vision domain and language domain, and the results demonstrate the effectiveness of our method. The code is available at https://github.com/tanganke/subspace_fusion

阅读与讨论 → 访问原文 →

02.

arXiv (CS.CV) 2026-06-12 DOI: arXiv:2606.13108

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

作者:

Yubo Zhang ↗Xueqing Wang ↗Manhui Lin ↗Yue Zhang ↗Penglongyi Deng ↗Ting Sun ↗Tingquan Gao ↗Zelun Zhang ↗Jiaxuan Liu ↗Changda Zhou ↗Hongen Liu ↗Suyin Liang ↗…

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

阅读与讨论 → 访问原文 →

03.

arXiv (quant-ph) 2026-06-25 DOI: arXiv:2606.25638

Towards Robust Optimal Measurements Against Noise in Quantum Metrology

作者:

Xinglei Yu ↗Xinzhi Zhao ↗Liangsheng Li ↗Stanis{\l}aw Kurdzia{\l}ek ↗Chengjie Zhang ↗Chuan-Feng Li ↗Guang-Can Guo ↗

arXiv:2606.25638v1 Announce Type: new Abstract: Quantum parameter estimation utilizes quantum mechanical effects to attain higher measurement precision than classical schemes. In practical implementations, however, noise is inevitably present during the measurement process, causing a decrease in precision. Quantifying the impact of noise on different measurements is of considerable significance. Here, we experimentally investigate robust optimal measurements based on the theory of Fisher information measurement noise susceptibility (FI MENOS), which quantifies how susceptible a measurement is to noise. By constructing a polarizing Mach-Zehnder interferometer, we implement phase estimation under controlled noise. Our results indicate that different measurements exhibit distinct sensitivities to noise. To assess the influence of diverse noise types on precision, we further construct an experimental setup capable of introducing various forms of noise. The experimental results affirm that FI MENOS represents the worst-case scenario for estimation precision, enabling us to evaluate the noise immunity of optimal measurements. Our work provides a deeper insight into quantum metrology with noise, marking a notable advancement in quantifying the robustness of quantum estimation schemes against measurement noise effects.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.CV) 2026-06-17 DOI: arXiv:2507.14632

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

作者:

Haiquan Wen ↗Tianxiao Li ↗Zhenglin Huang ↗Yiwei He ↗Guangliang Cheng ↗

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted $SFT \rightarrow RL$ post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

阅读与讨论 → 访问原文 →

05.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2606.15258

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

作者:

Jierui Zhang ↗Siyuan Tan ↗Xinhang Li ↗Longzhuangzhi Lin ↗Dailin Li ↗Chengfeng Gu ↗Xinping Li ↗Yaxian Hao ↗Shengjia Liang ↗Yuxiang Ren ↗Wenhao Liu ↗

arXiv:2606.15258v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.CL) 2026-06-18 DOI: arXiv:2606.18797

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

作者:

Qingyu Lu ↗Ruochen Li ↗Liang Ding ↗Yufei Xia ↗Youxiang Zhu ↗Dacheng Tao ↗

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.CV) 2026-06-25 DOI: arXiv:2606.25041

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

作者:

Lianghua Huang ↗Zhifan Wu ↗Wei Wang ↗Yupeng Shi ↗Mengyang Feng ↗Junjie He ↗Chenwei Xie ↗Yu Liu ↗Jingren Zhou ↗Ang Wang ↗Bang Zhang ↗Baole Ai ↗…

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.CL) 2026-06-18 DOI: arXiv:2606.18620

BCL: Bayesian In-Context Learning Framework for Information Extraction

作者:

Haoliang Liu ↗Chengkun Cai ↗Xu Zhao ↗Han Zhu ↗Shizhou Huang ↗Xinglin Zhang ↗Tao Chen ↗Jenq-Neng Hwang ↗Zhang Huaping ↗Lei Li ↗

Existing information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. Building on this, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps initialization, observation, weight update, and resampling, BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.

阅读与讨论 → 访问原文 →

09.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19950

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

作者:

Yuetian Du ↗Yucheng Wang ↗Ming Kong ↗Tian Liang ↗Qiang Long ↗Bingdi Chen ↗Qiang Zhu ↗

arXiv:2606.19950v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) show great potential in medical tasks, but their elicited confidence often misaligns with actual accuracy, potentially leading to misdiagnosis or overlooking correct advice. This study presents the first comprehensive analysis of the relationship between accuracy and confidence in medical MLLMs. It proposes a novel method that combines Multi-Strategy Fusion-Based Interrogation (MS-FBI) with auxiliary expert LLM assessment, aiming to improve confidence calibration in Medical Visual Question Answering (VQA). Experiments demonstrate that our method reduces the Expected Calibration Error (ECE) by an average of 40\% across three Medical VQA datasets, significantly enhancing MLLMs' reliability. The findings highlight the importance of domain-specific calibration for MLLMs in healthcare, offering a more trustworthy solution for AI-assisted diagnosis.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.LG) 2026-06-17 DOI: arXiv:2606.18218

Finite-Time Queue Peak Laws in Stochastic Networks: Logarithmic Scaling After Geometric Thresholds

作者:

Hao Liang ↗Cheng Tang ↗Yunzong Xu ↗

arXiv:2606.18218v1 Announce Type: cross Abstract: We study finite-horizon queue peaks in generalized switches, a standard stochastic-network model in which many queues share constrained service resources. Arrivals may be dependent, time-varying, and adapted to the past; the standing load condition is uniform interior slack, meaning the conditional mean arrival vector stays in a fixed contraction of the capacity region. We show that this slack reshapes the finite-time peak law for drift-minimizing scheduling policies such as MaxWeight. The square-root envelope that is sharp without slack persists only up to a geometry-dependent threshold; beyond that threshold, the running maximum grows only logarithmically with the horizon, both with high probability and in expectation. The mechanism is self-normalization: in the current queue direction, the projected fluctuation scale is normalized by the stabilizing drift scale. This removes capacity geometry from the logarithmic coefficient, while geometry remains in the threshold. Matching lower bounds show that both the logarithmic term and a geometric threshold are unavoidable. When finite-time state-space collapse is available, the threshold can be sharpened using local bottleneck geometry. For generalized input-queued switches, we obtain finite-time peak bounds with tight logarithmic coefficients. Simulations illustrate the two-phase envelope, local geometric refinements, and variance-sensitive improvements predicted by the theory.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.AI) 2026-06-25 DOI: arXiv:2606.25396

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

作者:

Kaicheng Shen ↗Lingyu Li ↗Wen Wu ↗Yan Teng ↗Liang He ↗Yingchun Wang ↗

arXiv:2606.25396v1 Announce Type: new Abstract: AI companions powered by large language models increasingly interact with cognition-developing users, including children and adolescents, creating risks that may accumulate over time. Existing safety evaluations largely rely on single-turn or short-session tests, which cannot capture risks that emerge only through prolonged interaction. To address this gap, we propose TSJ (Theater-Stage-Judge), a longitudinal framework combining persona-driven user simulation, dynamic psychological-state updating and retrospective evaluation. We evaluate six mainstream models across four developmental stages, twenty-four risk dimensions and three psychological-vulnerability personas, covering 12,960 simulated person-day interactions. TSJ shows that short-horizon testing systematically underestimates developmental risks, for which TSJ yields a stable risk estimate only after 140 turns within prolonged simulated relationships. Applying TSJ further identifies early childhood and emerging adulthood as the most vulnerable stages, with cognitive trust and emotional dependency as the weakest domains. TSJ provides a scalable methodology for longitudinal cognitive developmental risk evaluation in AI companion systems.

阅读与讨论 → 访问原文 →

12.

arXiv (CS.CL) 2026-06-15 DOI: arXiv:2509.24102

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

作者:

Guangliang Liu ↗Xi Chen ↗Bocheng Chen ↗Han Zi ↗Xitong Zhang ↗Kristen Johnson ↗

While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and what is morally implied. In this paper, we build on metapragmatic links and Moral Foundations Theory to close this gap. Specifically, we develop a pragmatic inference approach that enables LLMs, given a moral situation, to acquire the metapragmatic links between moral reasoning objectives and the social variables that influence them. We adapt this approach to three different moral reasoning tasks to demonstrate its adaptability and generalizability. Experimental results show that our approach significantly enhances LLMs' generalization in moral reasoning, paving the way for future research to leverage pragmatic inference across a wide range of moral reasoning tasks.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CL) 2026-06-25 DOI: arXiv:2606.25207

ASAP: Agent-System Co-Design for Wall-Clock-Centered Auto HPO Research for ML Experiments

作者:

Taicheng Guo ↗Haomin Zhuang ↗Kehan Guo ↗Yujun Zhou ↗Nitesh V. Chawla ↗Olaf Wiest ↗Xiangliang Zhang ↗

Hyperparameter Optimization (HPO) is essential for maximizing machine learning model performance, and its core challenge is sample efficiency: finding strong configurations within a limited budget. Because every HPO tool relies on a surrogate prior that imparts its own inductive bias, individual tools struggle once problems become sufficiently diverse and drift from these priors. Motivated by the reasoning and generalization capabilities of LLMs, recent work has explored using LLMs for HPO and reports improved per-iteration performance. Yet these methods share two limitations with a common origin: they use the LLM as a single-tool replacement evaluated by iteration count. (i) Deployed in place of prior tools, the LLM is itself constrained by its pretraining objective to one family of inductive-biased proposals; this single-source setup still fails to handle the full diversity of problems. (ii) Per-iteration evaluation ignores that, in real runs, LLM inference or tool execution is paid serially on top of model evaluation every round, so iteration-count gains do not translate into end-to-end wall-clock gains. We present ASAP, an agent-system co-design that addresses both limitations. On the agent side, ASAP uses the LLM to integrate a diverse pool of inductive-biased optimizers and to select among their proposals each round. On the system side, ASAP re-architects the loop to reduce end-to-end wall-clock while preserving regret quality: a prefix-stable prompt maximizes KV-cache reuse across rounds; speculation parallelism hides the remaining LLM and tool latency under model evaluation via a relative-error accept test; and a Self-Tuner adapts the speculation threshold from execution logs off the critical path. Extensive experiments on diverse modern HPO tasks show that ASAP consistently outperforms baselines, underscoring the value of tool integration and agent-system co-design.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.14832

PhoneHarness: Harnessing Phone-Use Agents through Mixed GUI, CLI, and Tool Actions

作者:

Chenxin Li ↗Zhengyao Fang ↗Zhengyang Tang ↗Pengyuan Lyu ↗Xingran Zhou ↗Xin Lai ↗Fei Tang ↗Liang Wu ↗Yiduo Guo ↗Weinong Wang ↗Junyi Li ↗Yi Zhang ↗…

Phone agents are increasingly expected to complete real mobile workflows rather than merely predict the next screen action. However, much of the current mobile-agent literature still evaluates agents primarily as GUI controllers that observe a screen, emit taps and swipes, and are scored by target app state. Real phone-use tasks are broader: they require deciding when to use app GUIs, device-side commands, or structured tools, while leaving evidence that the intended side effect actually occurred. We introduce PhoneHarness, a mixed-action benchmark and execution harness for studying phone-use agents on verifiable mobile workflows. PhoneHarness runs a device-side agent loop over GUI, CLI, and host-side tool actions, combining deterministic action routing with bounded GUI delegation and auditable execution traces. Its benchmark, PhoneHarness Bench, evaluates whether agents complete tasks with observable side effects, not only whether they produce plausible final answers. On the annotated evaluation split, PhoneHarness reaches a 75.0% pass rate, outperforming the strongest non-PhoneHarness settings by 12.9 percentage points. PhoneHarness and PhoneHarness Bench therefore play distinct but mutually dependent roles: the harness makes mixed phone workflows executable, while the benchmark measures whether agents can use that harness reliably and safely. Our findings suggest that reliable phone automation depends on action-surface routing and verifiable execution, not only visual GUI control.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CV) 2026-06-25 DOI: arXiv:2606.25034

Yuvion VL: A Multimodal Foundation Model for Adversarial Content and AI Safety

作者:

Shikai Qiu ↗Xiaowen Xu ↗Benlei Cui ↗Ting Ma ↗Xiufeng Huang ↗Wenjing Jiang ↗Shaoxuan He ↗Haolei Xu ↗Chunyang Chai ↗Yujian Li ↗Yiliang Zhang ↗Guanghui Wang ↗…

General-purpose models often struggle to reliably identify and understand real-world multimodal risks, largely due to the inherent multimodal adversarial nature of content and AI safety. We present Yuvion VL, a family of multimodal large language models purpose-built for content and AI safety, with both instruction-tuned and reasoning-oriented variants. Yuvion VL addresses this gap by treating safety as an inherently adversarial and multimodal problem and designing the entire pipeline around adversarial robustness. For data construction, we develop an automated pipeline integrating adversarial-aware data synthesis with multi-stage quality control, producing large-scale, high-quality multimodal samples augmented with domain knowledge and reasoning annotations. For training, we adopt a three-stage pipeline that includes continued pretraining for risk-concept cross-modal alignment, instruct post-training for production-grade safety tasks, and reasoning post-training for enhanced interpretability and performance in complex tasks. We further introduce Confuse-then-Contrast Fine-Tuning, a contrastive framework that mines model-specific confusions and constructs multi-image contrastive groups to enforce explicit discrimination of fine-grained visual-semantic elements, enabling the model to distinguish between visually similar cases with different safety implications in adversarial safety tasks. To support rigorous evaluation, we further introduce Yuvion VL RiskEval (YVRE), a collection of benchmarks covering diverse open and internal evaluations, with a focus on content and AI safety, adversarial robustness, and real-world capability requirements. Experiments show that Yuvion VL-32B achieves industry-leading safety performance, surpassing comparably sized open-source models and best closed-source commercial models, while maintaining comparable general capabilities.

阅读与讨论 → 访问原文 →

16.

arXiv (CS.AI) 2026-06-24 DOI: arXiv:2606.24062

RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting

作者:

Cheng He ↗Zhenyu Guan ↗Xijie Liang ↗Defu Lian ↗Jiajia Li ↗Enhong Chen ↗Patrick P. C. Lee ↗Geng Hu ↗Zehao Chen ↗

arXiv:2606.24062v1 Announce Type: cross Abstract: Financial time series forecasting presents structural challenges absent from standard benchmarks. Log-returns are non-stationary, exhibit exceptionally low signal-to-noise (SNR) ratios, and are governed by regime-dependent temporal dependencies. We identify a key limitation of state-of-the-art (SOTA) time series models in financial settings. A fixed context window is mismatched to the time-varying optimal look-back of non-stationary price processes. We propose the Regime-Aware Variable-context Expert Network (RAVEN), a Mixture-of-Experts framework designed to adaptively determine the temporal context for each input sample. Instead of relying on a fixed look-back horizon, RAVEN constructs a hierarchy of nested contiguous windows whose lengths are determined by the data itself. Specifically, RAVEN scores patches by learned importance in reverse chronological order and applies the Cumulative Importance Thresholding (CIT) mechanism to derive nested prefix windows, each routed to a scale-specialized expert. A Global Compressed Representation (GCR) branch runs in parallel over the full context, preserving global temporal coherence that local experts cannot guarantee. Because the nested routing induces structured overlap among expert inputs, we introduce a Correlation-Aware Weighting (CAW) to align variable-length expert outputs and penalize pairwise cosine similarity prior to aggregation. Experiments on cumulative log-return prediction (HS300, S&P500) and fund sales forecasting demonstrate that RAVEN achieves SOTA performances, improves Pearson correlation by 9.2% on HS300 and 20.2% on S&P500, and reduces MSE by 18.2% on fund sales forecasting, while achieving the best results in 14 of 16 metrics on four PEMS traffic benchmarks.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.CL) 2026-06-18 DOI: arXiv:2606.18709

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

作者:

Han Chen ↗Ming Li ↗Chenguang Wang ↗Yijun Liang ↗Dawei Zhou ↗Hong jiao ↗Tianyi Zhou ↗

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CL) 2026-06-25 DOI: arXiv:2606.23049

PhoneBuddy: Training Open Models for Agentic Phone Use

作者:

Zhengyang Tang ↗Xin Lai ↗Pengyuan Lyu ↗Xinyuan Wang ↗Tianyi Bai ↗Chenxin Li ↗Yiduo Guo ↗Huawen Shen ↗Yuxuan Liu ↗Junyi Li ↗Zhengyao Fang ↗Yang Ding ↗…

Phones are becoming an important execution surface for general-purpose agents, but training open models for reliable phone use remains difficult because the environment that matters at deployment, real devices running real apps, is slow, stateful, side-effectful, and hard to reset or verify, while scalable mock environments only approximate real behavior. We present PhoneBuddy, a training recipe and open-model line for agentic phone use that combines a real-app environment with a mock-app environment, PhoneWorld, which reconstructs runnable mock apps from real GUI usage structure. PhoneBuddy first builds a shared supervised fine-tuning stage from trajectories collected in both environments, then compares real-app RL against mixed RL across both environments. Across a 150-task human evaluation on real phones spanning apps, mini-apps, and cross-app workflows, task success rate improves from 36.67\% after supervised fine-tuning to 40.67\% after real-app RL and 45.33\% after mixed RL. On AndroidWorld, the same progression rises from 60.3\% to 77.2\% to 83.2\%. These results show that mock-app training is not a replacement for real-app RL, but a complementary source of scalable, resettable, and automatically checked interaction. The gains are strongest on app and mini-app tasks, while long-horizontal cross-app workflows remain an important open challenge.

阅读与讨论 → 访问原文 →

19.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.12871

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

作者:

Jingxuan Han ↗Wei Liu ↗Mingyang Zhu ↗Youpeng Wang ↗Ziwen Wang ↗Lin Qiu ↗Xuezhi Cao ↗Xunliang Cai ↗Zheren Fu ↗Licheng Zhang ↗Zhendong Mao ↗

arXiv:2606.12871v1 Announce Type: new Abstract: Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring web sources and synthesizing information into comprehensive responses. For SAs evaluation, prior benchmarks mainly focus on specialized tasks that are unlikely to arise in real-world user scenarios. Moreover, their reliance on coarse task-level rubrics often limits evaluation interpretability. To bridge this gap, we introduce DailyReport, an open-ended benchmark to evaluate SA capabilities on daily search tasks. It contains 150 open-ended tasks with 3,546 associated rubrics, capturing widely discussed and timely information demands of real-world users. Each task is decomposed into subtasks and evaluated with cascade rubrics across disentangled dimensions. Through cascade performance attribution and user-centric aggregation, we derive highly interpretable scores for each dimension, along with a user preference score. Our results on 17 agentic systems show that current systems still fall short of users' expectations. To facilitate future research, our dataset and code are made publicly available at https://github.com/AGI-Eval-Official/DailyReport.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.CL) 2026-06-24 DOI: arXiv:2606.24459

An LLM-based Two-Stage Transformer Framework for Cross-Domain Bearing Fault Diagnosis with Limited Data

作者:

Jinghan Wang ↗Feng Cheng ↗Wentao Wu ↗Hang Li ↗Gaoliang Peng ↗Tianchen Liu ↗

Bearing fault diagnosis faces critical challenges when dataset heterogeneity, operating condition variations, and limited labeled data occur simultaneously in industrial environments. Existing approaches address these issues in isolation and rely on implicit feature alignment, limiting effectiveness under concurrent challenges. This paper proposes a knowledge-guided two-stage transfer learning framework that employs a lightweight GPT-2-style Transformer with causal self-attention for hierarchical feature extraction from vibration signals, establishing explicit pathways where pre-trained encoder weights and fault prototype embeddings serve as knowledge carriers from multi-source pre-training to target adaptation. The framework addresses the dual-shift challenge through multi-source learning for generalizable representations, prototype-based knowledge modulation for target adaptation, and taxonomy-adaptive classification for seamless transfer across heterogeneous fault categories. Experimental validation on four real-world datasets demonstrates 92.61% average accuracy with only 10% labeled target data, outperforming state-of-the-art methods by 17.24 percentage points, establishing a practical pathway toward cost-effective predictive maintenance in Industry 4.0 applications.

阅读与讨论 → 访问原文 →

21.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.03090

"Important You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

作者:

Hang Li ↗Fedor Filippov ↗Yuping Lin ↗Pengfei He ↗Kaiqi Yang ↗Yucheng Chu ↗Yingqian Cui ↗Hui Liu ↗Jiliang Tang ↗

arXiv:2606.03090v2 Announce Type: replace-cross Abstract: The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educators can deploy AG systems across diverse tasks using only natural language rubrics while achieving satisfactory grading performance. Despite these advantages, new security concerns may also arise. In particular, prompt injection (PI) attacks have recently become a major threat to LLM-based applications. In the context of AG, attackers can potentially exploit PI vulnerabilities to manipulate grading systems into assigning artificially high scores regardless of the actual answer quality. Such behavior poses serious risks to the fairness, reliability, and integrity of educational assessment. In this work, we study PI attacks in AG systems, and systematically investigate the effectiveness of such attacks in educational scenarios. We further evaluate the effectiveness of existing defensive strategies against these attacks. Through comprehensive experiments under rubric-based grading settings, we demonstrate that current LLM-based AG systems remain highly vulnerable to PI attacks. We hope that our findings raise awareness of this emerging threat and motivate future research toward secure, robust, and trustworthy LLM-based educational systems.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CV) 2026-06-15 DOI: arXiv:2606.14383

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

作者:

Haonan Qi ↗Jin Cao ↗Yongqi Zhang ↗Xintong Wang ↗Weidong Tang ↗Bin Chen ↗Chengfu Huo ↗Haojun Pan ↗Hengyu You ↗Jing Li ↗Yingde Wang ↗Liang Ding ↗…

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction – recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.

阅读与讨论 → 访问原文 →

23.

arXiv (CS.AI) 2026-06-16 DOI: arXiv:2508.00956

FOUNDv2: Learning Unified User Quantized Tokenizers for User Representation

作者:

Chuan He ↗Yang Chen ↗Bin Dou ↗Wuliang Huang ↗Baokun Wang ↗Yongchao Liu ↗Xing Fu ↗Yu Cheng ↗Chuntao Hong ↗Weiqiang Wang ↗Zhongle Xie ↗Jiajun Zheng ↗…

arXiv:2508.00956v3 Announce Type: replace-cross Abstract: User representation learning serves as a fundamental pillar for personalized services on large-scale web platforms. Despite its importance, conventional continuous embedding methods face significant challenges, including the lack of a unified paradigm for multi-source data integration, prohibitive storage overhead due to low information density, and the lack of multi-scale modeling granularity. To overcome these limitations, we introduce FOUNDv2, a comprehensive user representation scheme centered on the Unified User Quantized Tokenizer U2QT) framework. FOUNDv2 transforms heterogeneous user data into a standardized discrete token space through a robust two-stage architecture. Specifically, the framework first extracts compact feature representations and subsequently employs a multi-view RQ-VAE to discretize them into storage-efficient tokens using shared and source-specific codebooks. To empower these representations with predictive intelligence, we further design multi-scale alignment objectives to capture both fine-grained behavioral dependencies and macro-temporal periodicity. Extensive experiments on various benchmarks demonstrate that FOUNDv2 consistently outperforms task-specific baselines while achieving substantial reductions in storage and computational costs. Finally, the large-scale deployment of FOUNDv2 on Alipay validates its practical scalability and efficiency across diverse industrial scenarios. The main code is available at: https://github.com/chuanhe1999/FOUNDv2.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15007

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

作者:

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.15971

SAG: SQL-Retrieval Augmented Generation with Query-Time Dynamic Hyperedges

作者:

Yuchao Wu ↗Junqin Li ↗XingCheng Liang ↗Yongjie Chen ↗Yinghao Liang ↗Linyuan Mo ↗Guanxian Li ↗

Retrieval-Augmented Generation (RAG) offers an effective approach for large language models to access external knowledge. However, existing methods rely on dense similarity retrieval and face inherent limitations in handling structured constraints and multi-hop reasoning. Incorporating knowledge graphs partially alleviates these issues, but at the cost of semantic fragmentation, high maintenance overhead, and difficult incremental updates. This paper introduces SAG (SQLRetrieval Augmented Generation), a structured architecture for retrieval and agent systems. Instead of pre-building a global static graph, SAG converts each chunk into one semantically complete event and a set of indexing entities, then uses SQL join queries to dynamically link events that share entities into local hyperedges,constructing, at query time, a dynamically instantiated local index structure. This design avoids the need for global graph rebuilding and ongoing maintenance; the system naturally supports incremental writes, concurrent processing, and continuous scaling through its reliance on standard database infrastructure. Across HotpotQA, 2WikiMultiHop, and MuSiQue, three standard multi-hop benchmarks,SAG achieves the best results on 8 out of 9 Recall@K metrics, reaching 80.0% Recall@5 on MuSiQue, the benchmark with the highest multi-hop reasoning demands.SAG has also been deployed at a production scale of hundreds of millions of data items, with online retrieval latency kept within seconds. Project site and code are available at https://github.com/Zleap-AI/SAG-Benchmark.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络