Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

02.
arXiv (CS.CL) 2026-06-12

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

03.
arXiv (CS.CL) 2026-06-11

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start protection.This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

04.
arXiv (CS.LG) 2026-06-11

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

作者:

arXiv:2606.11251v1 Announce Type: new Abstract: Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz–96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.

05.
medRxiv (Medicine) 2026-06-10

Estimating COVID-19 Cumulative Incidence from Seroprevalence Surveys accounting for Time-Varying Seroreversion: A Fully Bayesian Methodology

Seroprevalence surveys reveal the extent of humoral immunity against pathogens such as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and under some circumstances represent cumulative incidence of prior infection. However, antibody waning - or seroreversion - biases these estimates by reducing assay sensitivity in a time-varying manner. Because assay sensitivity decays over time, naively using serosurveys can substantially bias estimates of SARS-CoV-2 cumulative incidence and fatality rates. The Bayesian assay-specific, time-varying sensitivity adjustment developed in this paper can reliably correct for this bias and account for the delay between infection and serosurvey. In seroprevalence studies conducted in the United States in 2020, adjusting for time-varying sensitivity increased cumulative incidence by up to 1.4-fold, with an adjustment of 1.08 for a national study. Our estimates contrast with a previously published 2-fold adjustment that did not account for assay design. This suggests that previous analyses overestimated cumulative incidence by applying seroreversion corrections that did not account for assay-specific effects, or underestimated cumulative incidence by not applying seroreversion corrections. These biases imply fatality rate underestimation and overestimation, respectively. Our model provides a framework for design-specific time-varying sensitivity corrections in seroprevalence surveys for other pathogens.

06.
arXiv (CS.LG) 2026-06-18

Protein-Based Fish Species Identification: Dataset, Models, and Insights from Native Bangladeshi Fish

arXiv:2606.18302v1 Announce Type: cross Abstract: Correct identification of fish species is highly significant for food security, economic development, and climate resilience in Bangladesh. Protein sequences directly reflect functional and evolutionary constraints which are important for species authentication and biodiversity monitoring. Yet there exists no benchmark for native Bangladeshi fish species identification from protein sequence. In this study, we addressed this gap by introducing the first curated dataset for nine native Bangladeshi fish species of 2845 high quality protein sequences. We also established the first protein sequence classification baseline for this domain through a systematic benchmarking of seven architectural paradigms. Moreover, we propose a realistic deployable novel hybrid architecture of MotifCNN and Transformer with Terminal-Aware Positional-Encoding (MotifCNN-Transformer+TA-PE). Our novel architecture achieves 79.80% accuracy with macro-F1 of 0.80. The highest 83.04% accuracy is achieved by finetuned protein language model ProtBERT that has 420M parameters and requires dual 16GB GPUs for inference. According to McNemar's test, ProtBERT's 3.24% accuracy gain over our MotifCNN-Transformer+TA-PE is statistically insignificant (p = 0.1120). Our novel architecture beats it among six of the nine classes in per class identification. Also our MotifCNN-Transformer+TA-PE is approximately 5x faster, 42x smaller, and supports 16x larger batch size than ProtBERT and has GPU free inference, making it more practical for deployment in resources constrained areas such as rural Bangladesh. Beyond this, our foundational work shows effects of phylogenetic relationships on sequence similarity and establishes pathways for fisheries management, food authentication and biodiversity conservation in South Asia's protein dependent economy.

07.
arXiv (CS.LG) 2026-06-16

Diversity-Driven Offline Multi-Objective Optimization via Nested Pareto Set Learning

arXiv:2606.15115v1 Announce Type: new Abstract: Multi-objective optimization (MOO) has emerged as a powerful approach to solving complex optimization problems involving multiple objectives. In many practical scenarios, function evaluations are unavailable or prohibitively expensive, necessitating optimization solely based on a fixed offline dataset. In this setting, known as offline MOO, the goal is to find out the Pareto set without access to the true objective functions. This setting suffers from the out-of-distribution (OOD) issue, where the surrogate model is not accurate for unseen designs. Due to the OOD issue, surrogate errors may cause the optimizer to select solutions that do not lie on the true Pareto front and are biased toward its extremes. To address this, this paper proposes Diversity-driven Offline Multi-Objective Optimization (DOMOO), which aims to find out a diverse and high-quality set of solutions. First, DOMOO incorporates an accumulative risk control module that estimates the potential risk of candidate solutions and alleviates the OOD issue between the training data and the generated solutions. In addition, a nested Pareto set learning (PSL) strategy is proposed to jointly learn preference and PSL parameters, then optimize them, enabling adaptation to diverse Pareto front geometries. To further enhance solution quality, we design a diversity-driven selection strategy that extracts a representative and well-distributed set of final solutions. To achieve this diversity-driven selection strategy, we propose $IGD_offline$, a tailored indicator for the offline setting that considers both diversity and convergence, and avoids the bias of hypervolume indicator. Extensive experiments on synthetic and real-world benchmarks show that DOMOO achieves the best average rank across tasks in both convergence and diversity among the compared methods.

08.
arXiv (CS.LG) 2026-06-16

Incentives and Evidence in Learned Service Orchestration

arXiv:2606.16555v1 Announce Type: cross Abstract: Reinforcement learning for service orchestration has been the subject of sustained research for over a decade, yet it is not used in production at scale. The usual explanation is that learned controllers degrade under delayed and noisy telemetry, workload shifts, and uncontrolled tenants. We test whether existing evidence supports that explanation. We evaluate three highly influential RL-based orchestration systems spanning resource allocation, DAG scheduling, and autoscaling, using pre-registered predictions about comparative degradation under production-relevant perturbations and paired inference with family-wise error correction. Across the tests, most predicted performance reversals do not occur. Diagnostic analyses show that these outcomes often reflect comparator collapse, artefact limitations, or evaluation choices rather than evidence that learned controllers tolerate the perturbations. One apparent advantage under observation lag is roughly fortyfold compared to a Kubernetes HPA-equivalent controller. Another widely cited result cannot be reconstructed from its released artefact, and the strongest reproducible margin is far smaller than the published results. Conclusions also reverse under changes in perturbation magnitude and evaluation mode. Based on these results and broader patterns in the literature, we identify an institutional problem. Publication and review incentives favour benchmark gains against convenient comparators, even when those gains provide little evidence of deployment performance. We argue that the problem is not solely technical. Rather, it is institutional, so learned orchestration needs production-grade comparators, registered perturbation models, separate operational metrics, and publication criteria that reward reproducible operational evidence. Without these changes, the literature can grow without establishing whether learning improves orchestration.

09.
arXiv (CS.CL) 2026-06-17

Learning from the Self-future: On-policy Self-distillation for dLLMs

On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitraryorder generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps by RLVR and opening a promising pathway for dLLM posttraining. The code is available at https://github.com/xingzhejun/d-OPSD.

10.
arXiv (CS.AI) 2026-06-16

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

arXiv:2504.11320v4 Announce Type: replace-cross Abstract: Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty is endogenous memory growth: generated tokens expand the Key-Value (KV) cache, and overflow can evict in-progress requests and waste prior computation. We formulate inference as a multi-stage online scheduling problem with endogenous memory growth, linear iteration times, and GPU-resident KV-cache constraints. We introduce a fluid model that characterizes equilibrium batch composition, memory requirement, and stability region. Guided by the fluid model, we design WAIT (Waiting for Accumulated Inference Threshold), a threshold-based admission rule for known output lengths, and Nested WAIT, which extends the rule to unknown output lengths by regulating how requests advance across decode-stage segments. Both algorithms approximate the fluid benchmark asymptotically under the stated memory conditions. Nested WAIT uses an additional safety buffer of moderate scale to hedge against memory-overflow-induced evictions under unknown output lengths. In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms and reduce latency especially in near-overloaded and overloaded regimes.

11.
arXiv (CS.CV) 2026-06-15

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images,and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

12.
arXiv (math.PR) 2026-06-12

Characterizing metric-space-valued processes: separating classes and weak invariance principles for measure-theoretic inference

arXiv:2606.13084v1 Announce Type: cross Abstract: This article investigates stochastic processes taking values in metric spaces that lack a topological vector space structure, a regime characterized by intricate interplay between topological, geometric, and temporal dependence structures. It is formally established that spaces admitting an isometric Hilbertian embedding constitute a strict subclass within the much broader class of metric spaces possessing the ball property. While traditional kernel methods are susceptible to geometric distortion when the underlying space cannot be isometrically embedded into a Hilbert space, we bypass such limitations by exploiting a fundamental structural property inherent to this broader class; namely, that Borel probability measures are uniquely determined by their values on balls. These separating classes provide the foundation for the subsequently introduced measure-theoretic inference methodology. We derive uniform convergence of a family of time-dependent random measures, alongside weak invariance principles for the corresponding nonstationary random fields. This framework explicitly exposes how dependence and geometric complexity influence sample path regularity. Furthermore, because the rapid decay of small-ball probabilities can prohibit the existence of limiting distributions for supremum-based discrepancy measures, we develop $L^p$-based alternatives. By directly leveraging the introduced convergence results, this approach circumvents the need for higher-order $U$-process formulations. Finally, for spaces that do admit an isometric Hilbertian embedding, and where $U$-processes naturally arise, we establish limit theory for both degenerate and nondegenerate multi-parameter $U$-processes, and demonstrate that local discrepancy tests maintain asymptotic stability under dynamic parameter regimes.

13.
arXiv (CS.CL) 2026-06-12

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for internal reasoning mechanisms. Evaluating two LRMs (Coconut and CODI) against controls lacking the proposed recurrence or curriculum, we find these patterns also appear in the controls and do not always causally affect behavior. Causal interventions reveal that latent-thought utilization is not binary but graded, scaling with a thought's causal effect on model behavior. Geometric analyses reveal this effect concentrates in low-rank directions whose step-to-step geometry grows more structured as their behavioral influence increases. Latent thoughts should therefore be treated as hidden computation, not hidden explanation: decodability, attention, or static structure alone cannot establish mechanism. LRM interpretability thus requires matched controls and causal tests.

14.
arXiv (CS.AI) 2026-06-18

IOAH3: Importance-Driven Adaptive Spatial Partitioning

arXiv:2606.18280v1 Announce Type: cross Abstract: We present IOAH3 (Importance-Oriented Adaptive H3 partitioning), a computational method for constructing data-driven spatial partitions of geo-referenced observation domains. Standard approaches to spatial aggregation adopt fixed areal units, such as administrative boundaries or uniform hexagonal grids at a single resolution, without regard to the informational content of the underlying observations in each region. This leads to the well-known modifiable areal unit problem: statistical and inferential results depend on the arbitrary choice of partition, and spatially concentrated phenomena are averaged out in coarse cells that obscure fine-scale structure. IOAH3 addresses this by constructing an adaptive partition in three stages: multi-source feature extraction and importance scoring via principal component analysis over road density, POI density, building density, and terrain roughness signals, with population and flood-hazard data entering as auxiliary inputs to cell filtering and spatial smoothness; spatial cell selection via Markov Random Field graph-cut optimisation, which jointly maximises per-cell importance while enforcing spatial contiguity; and data-driven hierarchical refinement of high-importance regions to finer H3 resolution levels, with neighbour-propagated support to avoid isolated fine-resolution islands. The resulting partitions serve as input to spatial inference pipelines and provide a principled resolution of the partition-sensitivity problem prior to any modelling step.

15.
arXiv (CS.AI) 2026-06-12

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

arXiv:2606.12924v1 Announce Type: new Abstract: We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini~2.5 to Llama~3.3~70B costs 0.16–0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

16.
arXiv (CS.AI) 2026-06-11

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

arXiv:2605.00545v2 Announce Type: replace-cross Abstract: Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects which also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

17.
arXiv (CS.CV) 2026-06-15

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction – recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86–94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15–34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.

18.
arXiv (CS.AI) 2026-06-12

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

arXiv:2606.12662v1 Announce Type: cross Abstract: Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and STOI~96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 G~MACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.

19.
arXiv (CS.AI) 2026-06-19

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

arXiv:2606.20227v1 Announce Type: new Abstract: Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.

20.
arXiv (CS.CL) 2026-06-17

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

21.
arXiv (CS.AI) 2026-06-17

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

作者:

arXiv:2606.17929v1 Announce Type: new Abstract: Computer-using agents drive real software through the screen – clicking and typing – but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program-states that check the screen, transitions that act-and on later runs replays it directly instead of invoking the agent 8.5-13x faster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off. PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task-catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate, worth 1.75-2.6 tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.

22.
arXiv (CS.CL) 2026-06-17

Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions

This Guideline presents a systematic and operationalizable annotation framework for representing legal argumentation structures in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and provide a reliable foundation for computational analysis. At the element level, the Guideline distinguishes between the non-propositional layer and the propositional layer. The non-propositional layer consists of two elements: Issue and Non-argumentative Component. At the propositional level, the Guideline defines four proposition types: General Normative Judgment, Particular Normative Judgment, General Factual Judgment, and Particular Factual Judgment. At the relational level, five relation types are defined to represent argumentative structures: Support, Attack, Joint, Match, and Identity. These relations capture positive and negative argumentative connections, conjunctive reasoning structures, correspondences between legal norms and case facts, and identity or semantic equivalence between propositions. The Guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent visualization of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure the reproducibility and reliability of annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this Guideline supports large-scale analysis of judicial reasoning and future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.

23.
arXiv (CS.LG) 2026-06-17

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

arXiv:2606.17120v1 Announce Type: new Abstract: Deep neural networks (DNNs) exhibit first order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize. But many open questions remain, among them the origin of so called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. We reproduce grokking-like delayed convergence across two orders of magnitude in escape time by deliberately trapping models in metastable phases. Using sparse sub-sampling we also reproduce the canonical grokking curve where test error eventually approaches the final training error. Our work suggests that the number of metastable states equals the number of learnable features – one per singular value of the data covariance – the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.

24.
arXiv (quant-ph) 2026-06-12

Hamiltonian-Aware ADAPT Variational Quantum Eigensolver for Molecular Ground-State Simulation

arXiv:2606.13118v1 Announce Type: new Abstract: Designing compact ansätze in Variational Quantum Eigensolver (VQE) is crucial for solving energetic problems of practical molecules on near-term quantum devices. However, existing Adaptive Derivative-Assembled Pseudo-Trotter (ADAPT) ansätze face two challenges: improper operator selection and accumulation of degraded operators. In this paper, we propose the Hamiltonian-Aware (HA) ADAPT-VQE algorithm to address these issues. First, we establish a novel excitation operator selection criterion. It breaks the local constraint of existing criteria by incorporating Hamiltonian information, prioritizes physically meaningful excitation operators, and incurs no extra classical or quantum computational overhead. Furthermore, we develop a problem-adaptive method for discriminating and pruning redundant excitation operators stemming from improper selection and inevitable degradation. This method balances redundant operator pruning and convergence guarantee, and is applicable to ansätze with arbitrary scales. Systematic numerical experiments on typical strongly correlated molecular systems demonstrate that our HA-ADAPT-VQE avoids energy plateaus and outperforms baseline algorithms in terms of energy error, ansatz size, and measurement cost. This work offers an efficient, robust ansatz construction paradigm, facilitating the development and practical deployment of large-scale VQE in quantum chemistry.

25.
arXiv (CS.CV) 2026-06-18

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm