Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-11

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

02.
arXiv (CS.CL) 2026-06-18

LLM Parameters for Math Across Languages: Shared or Separate?

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

03.
arXiv (CS.CV) 2026-06-16

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

04.
arXiv (math.PR) 2026-06-25

Particle Filtering for Non-Deterministic Electrocardiographic Imaging

arXiv:2509.19404v2 Announce Type: replace-cross Abstract: Electrocardiographic imaging (ECGI) aims to non-invasively reconstruct activation maps of the heart from temporal body surface potentials. While most existing approaches rely on inverse and optimization techniques that may yield satisfactory reconstructions, they typically provide a single deterministic solution, overlooking the inherent uncertainty of the problem stemming from its very ill-posed nature, the poor knowledge of biophysical features and the unavoidable presence of noise in the measurements. The Bayesian framework, which naturally incorporates uncertainty while also accounting for temporal correlations across time steps, can be used to address this limitation. In this work, we propose a low-dimensional representation of the activation sequence that enables the use of particle filtering, a Bayesian filtering method that does not rely on predefined assumptions regarding the shape of the posterior distribution, in contrast to approaches like the Kalman filter. This allows to produce not only activation maps but also probabilistic maps indicating the likelihood of activation at each point on the heart over time, as well as pseudo-probability maps reflecting the likelihood of a point being part of an earliest activation site. Additionally, we introduce a method to estimate the probability of the presence of a conduction lines of block on the heart surface. Combined with classical reconstruction techniques, this could help discriminate artificial from true lines of block in activation maps. We support our approach with a numerical study based on simulated data, demonstrating the potential of our method.

05.
arXiv (CS.CV) 2026-06-15

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.

06.
arXiv (CS.LG) 2026-06-17

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

arXiv:2606.17476v1 Announce Type: new Abstract: Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4\% in comprehensive score and 45.2\% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We released our code and dataset here: https://github.com/Hflying/MAPPO

07.
PLOS Medicine 2026-06-12

Comparison of count-based and clustering definitions of multimorbidity and their association with prevalence of multimorbidity, health profiles, and mortality: A cohort study of UK Biobank participants

by Gabriella C. Silva, Aurore Fayosse, Louis Jacob, Séverine Sabia, Archana Singh-Manoux, Benjamin Landré Background Multimorbidity, the presence of several chronic conditions, is linked to higher mortality and healthcare use and thus poses a major challenge for aging populations. While most studies rely on simple counts of conditions, clustering approaches have been proposed to describe patterns of co-occurring diseases. We aimed to evaluate the extent to which these methodological choices influence prevalence and association with health profiles and mortality. Methods and findings Using UK Biobank baseline data (n = 474,397), collected between 2006 and 2010, we compared six count-based definitions of multimorbidity based on different condition lists (extended, most prevalent, or body systems) and thresholds (≥2 versus ≥3 conditions). We also applied a clustering analysis to characterize subtypes of multimorbidity among participants with at least two chronic conditions. We compared prevalence and associations with concurrent health outcomes (polypharmacy, self-rated health, frailty, falls, surgery, chronic pain), blood-based measures (C-reactive protein, Cystatin-C, HDL, LDL Cholesterol, IGF-1), and 3- and 10-year mortality risks. Analyses were undertaken separately in men and women using multivariable regression models adjusted for sociodemographic characteristics and body mass index. Multimorbidity prevalence ranged from 1.0% (cluster-based) to 35.3% (count-based). Count-based definitions using lists with more conditions yielded higher prevalence. Higher thresholds identified more severe health profiles on all measured health outcomes, blood-based measures, but not higher mortality risks. Associations with blood-based measures were more pronounced using clustering, with the highest differences from the standard definition distributed across clusters. Odds ratios for 3-year mortality ranged from 1.44 [1.26; 1.64] to 4.60 [3.73; 5.62] for men and 1.35 [1.07; 1.69] to 3.83 [2.78; 5.14] for women. For 10-year mortality, they ranged from 1.42 [1.34; 1.50] to 3.86 [3.46; 4.30] in men and 1.29 [1.21; 1.39] to 3.33 [2.93; 3.77] for women, with clustering identifying groups with low prevalence and high mortality risks. Findings should be interpreted in light of the selected nature of the UK Biobank cohort and the cross-sectional assessment of several health indicators. Conclusion Operational definitions of multimorbidity substantially influence prevalence estimates, while associations with mortality appear more robust across count-based approaches. Clustering analyses provide complementary insights into heterogeneity within multimorbid populations. Future translational studies are warranted to determine how multimorbidity definitions can be optimized to ultimately improve clinical management and health outcomes in practice.

08.
Nature (Science) 2026-06-17

A 98-qubit trapped-ion quantum computer with all-to-all connectivity

Quantum computers require both high-fidelity operations and large qubit numbers to surpass classical capabilities1. Trapped-ion platforms have demonstrated the highest gate fidelities of any modality2–6 but scaling to larger qubit numbers while preserving performance has remained a central challenge. We report on Quantinuum Helios, a 98-qubit trapped-ion quantum processor based on the quantum charge-coupled device (QCCD) architecture7. Helios features 137Ba+ hyperfine qubits8,9, all-to-all connectivity enabled by a rotatable ion storage ring connecting two quantum operation regions by a junction10,11, speed improvements from parallelized operations12 and a new software stack with real-time compilation of dynamic programs13. Averaged over all operational zones in the system, we achieve average infidelities of 2.5(1) × 10−5 for single-qubit (1Q) gates, 7.9(2) × 10−4 for two-qubit (2Q) gates and 3.3(5) × 10−4 for state preparation and measurement (SPAM), none of which are fundamentally limited and probably able to be improved. These component infidelities are predictive of system-level performance in both random Clifford circuits and random circuit sampling (RCS), the latter demonstrating that Helios operates well beyond the reach of classical simulation and establishes a new frontier of fidelity and complexity for quantum computers14. A new quantum computer, Quantinuum Helios, which is a 98-qubit trapped-ion quantum processor built on the QCCD architecture, demonstrates performance well beyond classical capabilities and provides a path for scaling up quantum computing.

09.
arXiv (CS.AI) 2026-06-12

Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning

arXiv:2606.12805v1 Announce Type: cross Abstract: Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in fostering productive peer interaction. LLM conversational peer agents introduce new possibilities for mediating in-person group work, raising questions about how persona design, particularly their voice characteristics, shapes learners' perceptions, trust, and interactional dynamics. While prior work has examined agent accent effects in one-to-one settings, little is known about how these effects manifest in groups. We conducted a between-subjects mixed-methods study with 33 teachers examining how a GenAI voice agent with different accents (British, Indian, and African American) influenced collaboration and agent perception. Across surveys, group interaction analyses, and artifacts, we find that accent shaped participants' mental models and the roles the agent assumed in group interaction. The British-accented agent was largely treated as a tool and engaged in detached, utility-based ways, whereas Indian- and African American-accented agents were more readily anthropomorphized and integrated as peers. These role expectations influenced trust, engagement, and reliance over time. This work advances understanding of how GenAI's sociolinguistic design features shape group dynamics in CSCL, with implications for designing culturally inclusive AI partners in group learning.

10.
arXiv (CS.CL) 2026-06-12

M\"OVE: A Holistic LLM Benchmark for the German Public Sector

We present M\"OVE (Modelle für die \"Offentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. M\"OVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silverstandard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. M\"OVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

11.
arXiv (CS.CV) 2026-06-25

SAC$^2$-Net: Semantic Anchoring and Complementary-Consensus Fusion for Multimodal Micro-Expression Recognition

Micro-expression recognition (MER) is challenging due to subtle facial movements, limited data, and the ambiguous relationship between Action Units (AUs) and emotion categories. Optical flow and motion magnification have been widely used to describe subtle facial dynamics from different perspectives: the former captures local motion displacement, while the latter amplifies weak appearance changes. In this work, we observe that these two modalities often exhibit asymmetric failure patterns: one modality may become noisy, distorted, or uninformative, while the other still preserves discriminative AU-related evidence. This phenomenon reveals their complementarity, but also raises two key challenges for fusion: cross-modal heterogeneity and spatially varying modality reliability. Motivated by this observation, we propose SAC$^2$-Net, a Semantic Anchoring and Complementary-Consensus Network for multimodal MER, which first aligns visual modalities with semantic anchors and then performs reliability-aware fusion. To reduce cross-modal heterogeneity before fusion, we introduce Semantic Anchoring Soft Alignment (SASA), which converts activated AUs into textual prompts and uses them as stable semantic anchors to align motion-magnified and optical-flow representations. Unlike hard contrastive learning, SASA constructs hierarchical AU-aware soft labels to preserve semantic proximity among samples with overlapping or anatomically related AU patterns. Based on the aligned representations, Complementary-Consensus Fusion (CCF) first repairs unreliable local evidence through complementary exchange and then enforces a shared spatial focus through consensus refinement. Extensive experiments on five MER benchmarks show that SAC$^2$-Net achieves state-of-the-art or highly competitive performance across coarse-grained, fine-grained, large-scale, and cross-dataset evaluation settings.

12.
arXiv (CS.LG) 2026-06-12

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

arXiv:2501.08425v3 Announce Type: replace Abstract: In this paper we analyze the behaviour of the stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via a minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even if Fokker-Planck equations have a long history and a extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds of the MET. Finally, we address the asymptotic convergence of SGD, for a non-convex cost function and a degenerate diffusion matrix, that do not allow to use the standard approaches, and require new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in the Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?

13.
arXiv (CS.CL) 2026-06-11

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics – Faithfulness, Coverage, Informativeness, and Acuity – to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.

14.
bioRxiv (Bioinfo) 2026-06-24

A comprehensive analysis of calreticulin mutants reveals distinct biophysicochemical proprieties with a potential for refined targeted therapies

Calreticulin mutations in myeloproliferative neoplasms result in the replacement of the C-terminus acidic sequence with a positively charged tail that causes pathological activation of the thrombopoietin. The two canonical variants are Type-1 and Type-2. The remaining are mainly classified as Type-1 or Type-2 like based on the wild type sequence retained. Here, we performed in silico biophysicochemical analyses of 76 CALR exon 9 frameshift variants by their sequence and predicted biophysical properties, complemented by structural modeling of the mutant homodimers. Beyond confirming the Type-1 versus Type-2 distinction, we found that the Type 1-like variants form a continuum of charge architecture along which two reproducible subgroups can be identified, rather than sharply separated classes. This work refines the conventional mechanism-based classification into a charge-resolved framework and provides testable hypotheses linking novel-tail chemistry to receptor activation in CALR-mutant neoplasms and paves the way for improved targeted therapies based on individual mutants characteristics

15.
arXiv (CS.AI) 2026-06-11

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

arXiv:2606.11537v1 Announce Type: new Abstract: Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available: https://github.com/UBC-NLP/MoCA-Agent.

16.
arXiv (CS.AI) 2026-06-11

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

arXiv:2606.11272v1 Announce Type: cross Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings-such as healthcare, industrial IoT (IIOT), cybersecurity, and smart cities-data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

17.
arXiv (CS.LG) 2026-06-25

Low Variance Trust Region Optimization with Independent Actors and Sequential Updates in Cooperative Multi-agent Reinforcement Learning

arXiv:2606.25526v1 Announce Type: new Abstract: Cooperative multi-agent reinforcement learning assumes each agent shares the same reward function and can be trained effectively using the Trust Region framework of single-agent. Instead of relying on other agents' actions, the independent actors setting considers each agent to act based only on its local information, thus having more flexible applications. However, in the sequential update framework, it is required to re-estimate the joint advantage function after each individual agent's policy step. Despite the practical success of importance sampling, the updated advantage function suffers from exponentially high variance problems, which likely result in unstable convergence. In this work, we first analyze the high variance advantage both empirically and theoretically. To overcome this limitation, we introduce a clipping objective to control the upper bounds of the advantage fluctuation in sequential updates. With the proposed objective, we provide a monotonic bound with sub-linear convergence to $\epsilon$-Nash Equilibria. We further derive two new practical algorithms using our clipping objective. The experiment results on three popular multi-agent reinforcement learning benchmarks show that our proposed method outperforms the tested baselines in most environments. By carefully analyzing different training settings, our proposed method is highlighted with both stable convergence properties and the desired low advantage variance estimation. For reproducibility purposes, our source code is publicly available at https://github.com/giangbang/Low-Variance-Trust-Region-MARL.

18.
arXiv (CS.CL) 2026-06-16

Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

19.
arXiv (CS.CL) 2026-06-19

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (\underline{G}ranularity-\underline{R}egulated \underline{A}daptive \underline{C}omputational \underline{E}fficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1\% accuracy at matched compute.

20.
arXiv (quant-ph) 2026-06-15

Reaffirming a Challenge to Bohmian Mechanics

arXiv:2509.06584v4 Announce Type: replace Abstract: In our recent work, we reported the first measurement of the speed of tunnelling particles using a coupled waveguide system. The measured speed is operationally defined through a comparison of two orthogonal motions in a coupled waveguide system, is compatible with the standard definition of dwell time and with the Büttiker-Landauer tunnelling time, and does not presuppose a trajectory picture. Here we respond to objections raised in comments, referee reports, preprints, and articles. We distinguish two questions that are often conflated: whether Bohmian mechanics reproduces the measured density, and whether the standard guiding equation assigns the correct state of motion to the particles. The first point follows under the usual quantum equilibrium assumptions. The second is a separate physical assumption, since the standard guiding equation does not follow from the Schrödinger equation alone. We argue that, in the evanescent regime, the state of motion assigned by the standard guiding equation is in disagreement with the measured speed. To make the distinction explicit, we also present a bidirectional Bohmian model that reproduces the same stationary density while assigning finite speeds compatible with the speed inferred in the evanescent regime.

21.
arXiv (CS.AI) 2026-06-11

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

arXiv:2606.09426v2 Announce Type: replace Abstract: Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

22.
arXiv (CS.AI) 2026-06-15

Robustness without Wrinkles: Parallel Simulation and Robust MPC for Certified Deformable Manipulation

arXiv:2606.14188v1 Announce Type: cross Abstract: We present CORD-SLS, a real-time control method for safe deformable object manipulation, with a focus on ropes and cloth. At its core is a GPU-parallel differentiable simulator with contact smoothing which enables efficient gradient-based planning through intermittent contact. To robustly satisfy constraints under model and sensing uncertainty, we develop a real-time, GPU-parallel output-feedback robust model predictive control (MPC) algorithm that plans with this simulator. We further show that the simulator accelerates model-based RL for training neural manipulation policies. To improve real-world robustness, we use conformal prediction to calibrate visual-feedback and perception-error bounds for MPC, producing reachable tubes that enable high-probability safe control. We evaluate CORD-SLS on high-dimensional, contact-rich rope and cloth manipulation tasks in simulation and hardware, including obstacle avoidance, routing, folding, and smoothing. Across settings, CORD-SLS achieves millisecond-speed planning, exceeding baselines in safety, speed, and task success.

23.
arXiv (CS.AI) 2026-06-12

ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space

arXiv:2606.13282v1 Announce Type: new Abstract: As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

24.
arXiv (quant-ph) 2026-06-17

An energy-based uncertainty principle and low-energy state preparation

作者:

arXiv:2603.15495v2 Announce Type: replace Abstract: Preparing low-energy states of many-body Hamiltonians is a central challenge in quantum computing, quantum complexity, and condensed matter physics. Existing approaches often get trapped in suboptimal states such as high-energy eigenstates or, more generally, low-variance states that resist further energy reduction. In this work, we explore a different perspective: instead of optimizing with respect to a single Hamiltonian, we leverage the fact that many systems admit families of Hamiltonians that share similar low-energy subspaces but differ at higher energies. We show that this redundancy can be turned into an algorithmic resource by establishing an energy-based uncertainty principle, which implies that these Hamiltonians cannot simultaneously admit low-variance states at higher energies. This suggests a simple strategy of alternating energy-lowering steps across such Hamiltonians, which we investigate numerically on several models. We also introduce a sparse variant where the uncertainty principle yields quadratically larger variance at higher energies, leading to more pronounced energy change. Overall, this work suggests a range of open questions at the interface of random matrix theory, local Hamiltonians and low-energy state preparation, aimed at understanding when such approaches are practical and how they can be analyzed rigorously.

25.
arXiv (CS.AI) 2026-06-25

Attractive and Repulsive Pattern Control in Sequence Generation

arXiv:2606.24911v1 Announce Type: cross Abstract: Variable-order Markov models preserve local symbolic syntax by adapting context length, but long continuations can enter recurring high-order "tunnels": repeated suffixes, locally periodic passages, or copied fragments longer than the formal Markov order. This paper introduces signed pattern control for variable-order Markov generation with BP-Regular sampling. A weighted recurrence automaton computes an activation R for a chosen family of target patterns, and belief propagation samples exactly from P_beta(x) proportional to P_0(x) exp(beta R(x)). Negative coupling makes the target patterns costly during sampling; positive coupling rewards the same patterns and turns them into controlled attractors. The target family may be mined online from overactive generated material, supplied by a score or style vocabulary, or designed as an experimental probe. The main experiments use the online homeostatic case, choosing patterns that become overactive in the sampling history. On six duration-bearing monophonic sources, including Bach and Telemann material, the negative branch reduces generated 8-gram self-reuse, increases the effective number of generated 8-grams, and increases coverage of training-supported 4-gram contexts while preserving substantial lower-order support. A pitch-sequence replication on five Weimar Jazz Database solos gives the same anti-reuse signature outside Baroque material. The same signed mechanism also provides a positive branch for probing attractor basins, phase transitions, and hysteresis in the underlying variable-order model.