Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
bioRxiv (Bioinfo) 2026-06-10

HOMED enables hierarchical and multimodal optimization of DNA methylation deconvolution across tissues

Cellular heterogeneity is a major confounder in bulk DNA methylation data for epigenome-wide association studies. Existing reference-based DNAm deconvolution methods often ignore hierarchies among related cell types and may generalize poorly across datasets due to limited variability in reference profiles. We developed HOMED (Hierarchically Optimized Methylation Deconvolution), a framework that integrates cell-lineage hierarchies, single-cell RNA sequencing-guided deconvolution, and paired bulk RNA-seq/DNAm data for CpG signature optimization. Across simulated and real peripheral blood mononuclear cell, lung, and placental datasets, HOMED consistently yielded the highest PCCs and lowest RMSEs, outperforming existing scRNA-seq-guided DNAm deconvolution methods, improving accuracy, resolution, and cross-tissue generalizability.

02.
arXiv (CS.CL) 2026-06-16

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

作者:

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

03.
arXiv (CS.LG) 2026-06-19

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

arXiv:2606.20547v1 Announce Type: new Abstract: We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ – a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.

04.
medRxiv (Medicine) 2026-06-17

Treatment of Multi-Drug-Resistant Tuberculosis with Second-Line All-Oral Drugs in Ghana: Incidence of Adverse Events.

Introduction: The treatment of multidrug-resistant tuberculosis (MDR-TB) remains challenging due to the toxicity of second-line medications and suboptimal treatment outcomes. This study aimed to determine the incidence of adverse events and identify factors associated with these events in patients undergoing treatment for MDR-TB with second-line all-oral drugs in Ghana. Methods: This retrospective cohort study reviewed the medical records of 384 MDR-TB patients treated with second-line all-oral drugs at selected health facilities in Ghana, including the Greater Accra Regional Hospital, Eastern Regional Hospital, and Kumasi South Hospital. Data were extracted using the Kobo Collect tool, capturing patient demographics, baseline clinical and laboratory characteristics, treatment regimens, and adverse events. The study period spanned from 2020 to August 2024. Results: The study included a total of 384 MDR-TB patients, with a mean age of 45 years (SD = 15). The majority of patients were male (65.78%), and most were within the 45-64 years age group (33.85%), followed by those aged 25-44 years (31.25%). Regionally, the highest number of cases were reported from the Greater Accra Region (39.06%), followed by the Eastern Region (31.25%) and Kumasi South Hospital (29.69%). Approximately one in four patients (25%) presented with comorbidities, with HIV being the most common (19.5%). The most frequently reported adverse events were diarrhea (14%), dizziness (13.7%), and vomiting (12.3%). Most of these were mild to moderate in severity and tended to decrease as treatment progressed. Severe adverse events, such as leukopenia and acute kidney injury, were rare, occurring in less than 5% of patients. Over the course of treatment, gastrointestinal adverse events such as vomiting and nausea showed a significant decline, indicating possible patient adaptation or improved clinical management. Results from the multivariate Poisson regression analysis revealed that age and comorbidities were significant predictors of adverse events. Patients aged 65 years and above had a 56% lower risk of developing adverse events compared to younger patients (Adjusted Risk Ratio [aRR] = 0.44, 95% CI: 0.25-0.79, p = 0.005). Conversely, patients with comorbid conditions such as diabetes or hypertension were approximately 2.6 times more likely to experience adverse events compared to those without comorbidities (aRR = 2.65, 95% CI: 1.58-4.43, p < 0.001). The effect of sex was not statistically significant after adjustment (aRR = 1.03, 95% CI: 0.70-1.50, p = 0.86). At the end of the treatment period, 74.9% of patients achieved successful outcomes, including both those who were cured and those who completed treatment without being classified as cured. However, 25.1% had unsuccessful outcomes, which included treatment failure, relapse, or death. Conclusion: In conclusion, adverse events are common in the treatment of MDR-TB with second-line All-Oral drugs, with gastrointestinal adverse events being the most prevalent. These findings highlight the importance of monitoring and managing adverse events to optimize treatment outcomes for MDR-TB patients in Ghana.

05.
arXiv (CS.CL) 2026-06-16

HiMPO: Hindsight-Informed Memory Policy Optimization for Less-Entangled Credit in Long-Horizon Agents

Long-horizon agents rely on memory mechanisms to compress interaction history, but optimizing memory writing faces a distinct credit assignment challenge: a memory update may be rewarded or penalized due to downstream tool failures, noisy observations, or reasoning errors rather than its own contribution. This causally entangled credit can lead agents to discard useful evidence or preserve irrelevant information. We propose HiMPO, a Hindsight-Informed Memory Policy Optimization framework for assigning less-entangled credit to memory-writing actions in long-horizon agents. HiMPO first estimates the local utility of a memory update by comparing the task-relevant information recoverable from the previous and updated memories under the same pre-write state. It then uses hindsight relevance as a bounded retrospective filter that attenuates memory credit when local utility is not supported by the target outcome. The resulting memory-specific advantage is applied only to memory tokens, while trajectory-level rewards optimize the rest of the agent behavior. Across judge-based open-domain tasks and objective compressive-memory QA, HiMPO improves over strong memory-based and RL-based baselines while preserving compressed-context efficiency. Controlled interventions further show that HiMPO reduces blame leakage from tool-induced errors and improves attribution fidelity of memory updates.

06.
arXiv (CS.CV) 2026-06-16

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to fairly allocate the data value remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose AME (Attribution-Mapping-Execution) framework, a unified framework that integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.

07.
arXiv (CS.LG) 2026-06-16

Representation Costs in Data Science: Foundations and the Quasi-Banach Spaces of Deep Neural Networks

arXiv:2606.14954v1 Announce Type: cross Abstract: We develop a general framework for analyzing representation costs of parametric data-fitting methods through their parameter-space regularizers. From this abstract perspective, we define representation costs for arbitrary parametric models and reveal their induced (native) function spaces. This unifies recent function-space views of data-fitting methods. We also prove that many natural results hold in this abstract setting, including representer theorems for parametric methods on their native spaces. The framework also rigorously connects parametric methods with their equivalent nonparametric descriptions under sufficient overparameterization. Classical methods and their native spaces, such as kernel methods / reproducing kernel Hilbert spaces, wavelets / Besov spaces, and shallow neural networks / variation spaces emerge as special cases of our abstract framework. A byproduct of "axiomatizing" the study of representation costs is that we also immediately obtain new results for deep neural networks: For depth-$L$ feedforward ReLU networks, their induced native spaces are $p$-normable quasi-Banach spaces with $p = 2/L$. This reveals that the inductive bias of deep neural networks (as given by the representation cost) cannot be captured by norms for depths $L > 2$.

08.
arXiv (CS.CL) 2026-06-19

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization – within the training domains, across different domains, and from CoD to Ralph-loop settings – of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at \url{https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod}.

09.
arXiv (CS.LG) 2026-06-15

DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation

arXiv:2606.14192v1 Announce Type: new Abstract: Auto-bidding is a core component of real-time advertising systems, where decisions must optimize long-term performance under budget and cost constraints, while online exploration is prohibitively risky. Offline reinforcement learning and, more recently, Transformer-based sequence modeling have shown promise for learning bidding policies from logged data, but their unimodal and purely parametric formulations often collapse multiple effective bidding strategies into suboptimal averaged actions and perform unreliably under sparse or long-tail traffic. To mitigate these limitations, we propose DRIVE (Distributional and Retrieval-Augmented Bidding with Value Evaluation), a unified Transformer-based framework that decouples candidate action generation from decision making for offline auto-bidding. DRIVE combines distributional action modeling, retrieval-augmented candidate generation from high-quality historical decisions, and value-based evaluation to select the most promising bid at inference time. Extensive experiments on AuctionNet and additional offline reinforcement learning benchmarks demonstrate that DRIVE consistently improves bidding performance and generalizes well across multiple Transformer-based methods.

10.
Nature Medicine 2026-06-08

Post-adjuvant chemotherapy in ctDNA-positive patients with resected colorectal cancer: a randomized phase 3 trial

Tumor-informed circulating tumor DNA (ctDNA) enables detection of molecular residual disease (MRD) after curative resection of colorectal cancer (CRC), but whether early intervention improves outcomes remains uncertain. ALTAIR was a randomized, double-blind, phase 3 trial embedded in the CIRCULATE-Japan platform evaluating a post-adjuvant ctDNA surveillance strategy with treatment initiation upon molecular recurrence. Patients with resected stage 0–IV CRC who became ctDNA positive after completion of standard-of-care therapy and had no radiological evidence of disease were randomly assigned (1:1) to receive trifluridine/tipiracil (FTD/TPI) or placebo for 6 months. The primary endpoint was investigator-assessed disease-free survival (DFS). Between July 2020 and June 2023, 243 patients were randomized to FTD/TPI (n = 122) or placebo (n = 121). Median DFS was 9.30 months with FTD/TPI and 5.55 months with placebo (hazard ratio = 0.79, 95% confidence interval: 0.60–1.05, P = 0.107), and the primary endpoint was not met. FTD/TPI increased grade 3 or higher hematologic adverse events (73.0% versus 3.3%) without new safety signals. These findings indicate that post-adjuvant intervention with FTD/TPI did not significantly improve DFS in ctDNA-positive patients without radiological disease. ClinicalTrials.gov identifier: NCT04457297 . In the randomized, double-blind phase 3 ALTAIR trial, patients with resected colorectal cancer who became positive for circulating tumor DNA during post-adjuvant surveillance received trifluridine/tipiracil hydrochloride therapy, which did not significantly prolong disease-free survival compared with placebo.

11.
arXiv (quant-ph) 2026-06-16

High-dimensional coherence to entanglement transduction under canonical noise

arXiv:2606.16695v1 Announce Type: new Abstract: We develop an analytical framework for coherence-to-entanglement conversion in bipartite high-dimensional quantum systems, so-called qunits. An arbitrary coherent input qunit is coupled to an incoherent ancilla through a generalized controlled-shift operation, producing a maximally correlated bipartite state. By analyzing the partial transpose of the output state, we establish an exact dimension-independent connection between the input coherence and the generated entanglement. We then study how this conversion is affected by three standard noise processes applied after the conversion step: phase damping, global depolarizing noise, and independent amplitude damping. The resulting expressions show that these channels degrade entanglement in qualitatively different ways. Phase damping leads to a uniform attenuation of the entanglement generated from coherence, depolarizing noise introduces pairwise thresholds associated with entanglement sudden death, and amplitude damping produces an asymmetric decay governed by relaxation toward the ground state. For maximally coherent inputs, the general results reduce to simple closed-form behavior, allowing direct comparison of the three noise mechanisms as the system dimension increases. In particular, global depolarizing noise exhibits a dimension-dependent sudden-death threshold, while amplitude damping leads to a smooth suppression in the maximally coherent case. These results provide useful analytical benchmarks for high-dimensional resource conversion and for assessing noisy entanglement generation in qudit-based quantum-information settings.

12.
arXiv (CS.LG) 2026-06-15

How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

arXiv:2606.13733v1 Announce Type: cross Abstract: Multi-agent systems (MAS) were expected to overcome the limitation of single-agent systems (SAS) through collaboration. However, under typicality conditions on the task's constraint graph and bounded inter-agent communication, we prove that the success probability of a MAS is closely tied to the connectivity of task constraints, where each agent has limited information-processing capacity. Specifically, the success probability decays exponentially with an information bottleneck that emerges from partitioning the task's constraint graph among agents. We define this quantity as the minimum cut cost $C_{\min}$ of the potential constraint graph of each task. This information-theoretic bound applies to both open systems with external feedback and closed systems without. We validate our theory on both synthetic experiments and real-world empirical data from SWE-bench submissions. From our framework, effective MAS design should incorporate task-inherent constraints alongside engineering optimization, and when $\Cmin$ is high, practitioners should restructure tasks rather than simply scaling agents or communication.

13.
arXiv (CS.CV) 2026-06-16

GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.

14.
arXiv (CS.LG) 2026-06-12

Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

arXiv:2606.12559v1 Announce Type: cross Abstract: The ensemble Kalman filter (EnKF) is widely adopted for sequential data assimilation, but fails for solutions with discontinuities, such as shocks in compressible flows. Uncertainty in shock location induces multimodal ensemble statistics that violate the Gaussian assumptions underlying the EnKF, producing large-scale spurious oscillations in the analysis state. We introduce a feature-preserving latent-EnKF that performs the ensemble update in a learned low-dimensional latent space, where shock and flow features admit a smooth manifold representation, thereby preserving sharp features during EnKF analysis. The updated latent state is mapped back to physical state through a shared decoder for all ensemble members. The algorithm eliminates the member-specific ordered training and positivity flooring used in prior approaches. Numerical experiments on a Sod shock tube and Mach 2 shock interaction with a 2D cylinder, using sparse and noisy observations, show accurate feature recovery of shocks and contact discontinuities without spurious oscillations.

15.
arXiv (CS.CL) 2026-06-19

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

16.
arXiv (CS.AI) 2026-06-15

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

arXiv:2606.14409v1 Announce Type: cross Abstract: In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

17.
medRxiv (Medicine) 2026-06-15

VarEx: A Large Language Model Pipeline for Automated Extraction of Exposures, Outcomes, and Covariates from Epidemiologic Studies

Objective: Observational studies are essential for investigating risk factors for Alzheimer's disease and related dementias (ADRD), but inconsistent reporting and selection of covariates can contribute to residual confounding, omitted-variable bias, and reduced reproducibility. We developed and evaluated VAREX (Variable Extraction), a large language model (LLM)-based information extraction framework designed to automatically identify exposures, outcomes, and covariates from epidemiologic studies and populate structured evidence repositories. Materials and Methods: VAREX combines retrieval-augmented generation, biomedical language-model embeddings, semantic chunking, cross-encoder reranking, and prompt-engineered LLM workflows to extract epidemiologic variables from full-text biomedical articles. The framework was evaluated using a reference-standard corpus of observational studies examining blood pressure variability (BPV) and Alzheimer's disease-related dementias (ADRD), together with external validation datasets involving other exposure-outcome relationships. Extracted variables were compared with independently curated human reference standards using semantic matching and one-to-one assignment procedures. Covariates were additionally classified into ten epidemiologically relevant semantic categories. Results: In the primary BPV[-&gt;]ADRD corpus (10 studies), VAREX achieved a precision of 0.91, recall of 0.84, and F1-score of 0.87 for variable extraction. Covariate classification accuracy was 0.90, yielding a strict extraction-and-classification F1-score of 0.78. External validation datasets demonstrated comparable performance across diverse epidemiologic domains, with extraction F1-scores ranging from 0.73 to 0.85. Category-level performance was strongest for health behaviors (F1=0.96), sociodemographic variables (F1=0.90), and medication exposures (F1=0.89). Compared with published estimates of manual systematic-review effort, VAREX reduced processing time from approximately 61 minutes to 9 minutes per article, representing an 85.7% reduction in review time. Discussion: These findings demonstrate that LLM-based information extraction can accurately identify and classify epidemiologic variables across heterogeneous observational-study designs. Automated extraction enables scalable construction of structured repositories of exposures, outcomes, and covariates while substantially reducing the labor required for evidence synthesis and systematic reviews. Conclusion: VAREX provides an effective framework for automated extraction and classification of epidemiologic variables from the biomedical literature. By supporting large-scale evidence synthesis and structured knowledge resource development, VAREX may facilitate more rigorous observational research, improved confounder identification, and enhanced reproducibility in epidemiology.

18.
arXiv (CS.AI) 2026-06-12

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

arXiv:2604.08958v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.

19.
arXiv (CS.CL) 2026-06-12

M\"OVE: A Holistic LLM Benchmark for the German Public Sector

We present M\"OVE (Modelle für die \"Offentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. M\"OVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, and alignment with German constitutional values and knowledge about positions by German political parties. In total, we utilize ten German-language datasets, including gold- and silverstandard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. M\"OVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

20.
arXiv (CS.CL) 2026-06-11

M4FC: a Multimodal, Multilingual, Multicultural, Multitask Real-World Fact-Checking Dataset

Existing real-world datasets for multimodal fact-checking have multiple limitations: they contain few instances, cover on only one or two languages, focus only on one task, or rely on external news article sets for sourcing true claims. To address these shortcomings, we introduce M4FC, a new real-world dataset comprising 4,982 images paired with 6,980 claims. The images, verified by professional fact-checkers from 22 organizations, represent a diverse range of cultural and geographic contexts. Each claim is available in one or two out of ten languages. M4FC spans six multimodal fact-checking tasks: visual claim extraction, claimant intent prediction, fake image detection, image contextualization, location verification, and verdict prediction. We provide baseline results for all tasks and analyze how combining intermediate tasks affects verdict prediction performance. We make our dataset and code publicly available.

21.
arXiv (CS.LG) 2026-06-19

DADP: Domain Adaptive Diffusion Policy

arXiv:2602.04037v3 Announce Type: replace Abstract: Learning domain adaptive policies that can generalize to unseen transition dynamics, remains a fundamental challenge in learning-based control. Substantial progress has been made through domain representation learning to capture domain-specific information, thus enabling domain-aware decision making. We analyze the process of learning domain representations through dynamical prediction and find that selecting contexts adjacent to the current step causes the learned representations to entangle static domain information with varying dynamical properties. Such mixture can confuse the conditioned policy, thereby constraining zero-shot adaptation. To tackle the challenge, we propose DADP (Domain Adaptive Diffusion Policy), which achieves robust adaptation through unsupervised disentanglement and domain-aware diffusion injection. First, we introduce Lagged Context Dynamical Prediction, a strategy that conditions future state estimation on a historical offset context; by increasing this temporal gap, we unsupervisedly disentangle static domain representations by filtering out transient properties. Second, we integrate the learned domain representations directly into the generative process by biasing the prior distribution and reformulating the diffusion target. Extensive experiments on challenging benchmarks across locomotion and manipulation demonstrate the superior performance, and the generalizability of DADP over prior methods. More visualization results are available on the https://outsider86.github.io/DomainAdaptiveDiffusionPolicy/.

22.
arXiv (CS.AI) 2026-06-16

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

arXiv:2606.15247v1 Announce Type: cross Abstract: The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

23.
arXiv (CS.AI) 2026-06-11

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

arXiv:2606.11828v1 Announce Type: cross Abstract: Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

24.
arXiv (CS.AI) 2026-06-17

A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

arXiv:2606.17962v1 Announce Type: cross Abstract: Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92\% accuracy on strategy-synthesis outcomes.

25.
arXiv (quant-ph) 2026-06-19

Approximating optimal decoding of quantum LDPC codes with narrow frontiers

arXiv:2606.20513v1 Announce Type: new Abstract: We introduce the Frontier decoder, a pruned dynamic-programming decoder for sparse quantum decoding problems. Frontier processes error variables in a chosen order, merges prefixes with the same residual syndrome and logical label, and approximates logical-coset posterior masses by retaining only a narrow scored frontier. Without pruning, the recursion is exact ordered inference with exponential complexity. In the code-capacity setting, the decoder reaches thresholds close to optimal for the surface code and the color code. In the circuit-level noise model, it achieves state-of-the-art performance with a very small average retained list size: less than 100 for the gross code $[[144,12,12]]$ at a physical error rate of $0.001$. When the list size is constant, the decoder has linear complexity, suggesting the possibility of low-latency implementations.