Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-15

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scattered across leaderboards, papers, blog posts, evaluation harness logs, and custom repositories. Second, results are created by different evaluation frameworks, which produce divergent scores for nominally identical evaluations and record metadata inconsistently, hindering comparison, cross-community evaluation science, cost reduction, and reuse. We introduce Every Eval Ever, the first shared schema and community-crowdsourced repository for AI evaluation results. The schema standardizes how evaluations are represented in a unified, single JSON document. It is source-agnostic by design, ingesting results from evaluation harnesses and papers alike, and optionally stores per-instance outputs for fine-grained analysis. We contribute: (i) a community-governed metadata schema with a companion instance-level schema, the first standardization effort of its kind; (ii) automatic converters from popular formats, evaluation harnesses, and leaderboards to the unified schema; and (iii) a crowdsourced community database hosted on Hugging Face, currently spanning to date 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats.

02.
arXiv (CS.CV) 2026-06-16

A biological vision inspired framework for machine perception of abutting grating illusory contours

Higher levels of machine intelligence demand alignment with human perception and cognition. Deep neural networks (DNN) dominated machine intelligence have demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy that misaligns with human perception patterns. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed by this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to make a step towards human-level intelligence for DNN-based models.

03.
arXiv (CS.LG) 2026-06-15

A Water Efficiency Dataset for African Data Centers

arXiv:2412.03716v3 Announce Type: replace Abstract: Artificial intelligence (AI) computing and data centers consume large amounts of freshwater, both directly for cooling and indirectly for electricity generation. While most attention has been paid to developed countries such as the U.S., this paper presents the first-of-its-kind dataset that combines nation-level weather and electricity generation data to estimate water usage effectiveness for data centers in 41 African countries across five different climate regions. We also use our dataset to evaluate and estimate the water consumption of inference on two large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African countries. Our estimates suggest that writing a 10-page report using Llama-3-70B could consume as much as {0.66 liters} of water, while the water consumption by GPT-4 for the same task may go up to about {59 liters}. For writing a medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about {0.13 liters} and {2.9 liters} of water, respectively. All the numbers for generative model inference tasks are based on public information available in 2024, when we initially prepared the analysis. Since then, AI inference systems have improved substantially. For example, recent disclosures suggest that energy efficiency improved by more than 30x between May 2024 and May 2025. Accordingly, our 2024 estimates should be interpreted as historical reference values rather than as representative of current performance. Interestingly, given the same AI model, 9 of the 11 selected African countries consume less water than the global average, mainly because of lower water intensities for electricity generation.

04.
arXiv (CS.AI) 2026-06-15

Numbers Already Carry Their Own Embeddings

arXiv:2606.14108v1 Announce Type: cross Abstract: We introduce Adelic operation-preserved embeddings (AOE), a training-free representation that captures both a number's real value and its modular (p-adic) signatures. This construction preserves additive and multiplicative structure by design, turning numerical input into embeddings that "speak in the language of mathematics." Unlike prior approaches that rely on task-specific retraining, AOE is plug-and-play and drops seamlessly into existing architectures. On algebraic combinatorics benchmarks, it delivers consistent gains including the first-ever perfect accuracy on the Weaving Pattern task-while suggesting a principled path forward for overcoming the long-standing "number problem" in AI.

05.
arXiv (CS.AI) 2026-06-16

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

arXiv:2606.16149v1 Announce Type: new Abstract: Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy, developed through human-AI collaboration and augmenting with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare disease (a prevalence below 1 in 1,000,000, with ultra-rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed in the following: on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

06.
arXiv (quant-ph) 2026-06-16

A complexity theory for non-local quantum computation

arXiv:2505.23893v2 Announce Type: replace Abstract: Non-local quantum computation (NLQC) replaces a local interaction between two systems with a single round of communication and shared entanglement. Despite many partial results, it is known that a characterization of entanglement cost in at least certain NLQC tasks would imply significant breakthroughs in complexity theory. Here, we avoid these obstructions and take an indirect approach to understanding resource requirements in NLQC, which mimics the approach used by complexity theorists: we study the relative hardness of different NLQC tasks by identifying resource efficient reductions between them. Most significantly, we prove that $f$-measure and $f$-route, the two best studied NLQC tasks, are in fact equivalent under $O(1)$ overhead reductions. This result simplifies many existing proofs in the literature and extends several new properties to $f$-measure. For instance, we obtain sub-exponential upper bounds on $f$-measure for all functions, and efficient protocols for functions in the complexity class $\mathsf{Mod}_k\mathsf{L}$. Beyond this, we study a number of other examples of NLQC tasks and their relationships.

07.
arXiv (CS.AI) 2026-06-24

Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice

arXiv:2606.23716v1 Announce Type: cross Abstract: Legal AI benchmark research frequently invokes the assumption that large language models can improve access to justice, including for people who cannot access lawyers in order to understand and exercise their legal rights. We argue that current benchmarks are not equipped to support this assumption because they evaluate legal reasoning over inputs that have already been preprocessed by legal experts, which measures the upper bound of model performance. Access to justice depends on a lower bound: how models perform when inputs come from pro se litigants, whose prompts may contain noisy narratives, buried facts, omissions, folk-legal assumptions, and surface-level errors. These degradations are comparable to conditions under which LLMs are known to degrade in the general machine learning literature, including long-context sensitivity, underspecification, hallucination, and typographical perturbations. We connect evidence from pro se literature with this body of machine learning research and present a small perturbation experiment on LEXam, a legal benchmark, to illustrate the gap between these two bounds. If model development continues to focus on benchmarks that measure only the upper bound, this gap may remain hidden or even widen. We conclude by calling for legal benchmarks that directly measure robustness under pro se-like inputs so that access-to-justice claims about legal AI can become empirically testable.

08.
arXiv (CS.LG) 2026-06-16

TriAdReview: Triangular Adversarial Review Architecture for Multi-Model Technical Document Generation

arXiv:2606.15074v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for technical document generation, yet single-model outputs often suffer from over-engineering, security blind spots, and incomplete coverage. We propose TriAdReview, a triangular adversarial review architecture that employs two independent reviewer models (engineering and boundary perspectives) and a triangular judging mechanism to iteratively improve a generator model's output. We evaluate TriAdReview across five benchmark tasks - architecture design, code generation, proposal review, security audit, and requirements analysis - using three configurations: single model (baseline), dual model (single review), and triple model (full system). Results across 75 experiments (n=5 per cell) show that the triple model configuration achieves a 10.1% overall improvement over the single model baseline (26.2 vs. 23.8 out of 50; p

09.
arXiv (math.PR) 2026-06-11

Unbiased Derivative Estimation for Stationary Mean of Parameterized Markov chains

arXiv:2606.11487v1 Announce Type: cross Abstract: We propose a new approach to unbiased estimation of the gradients of the stationary means associated with parametrized families of Markov chains. Our estimators are particularly efficient when the Markov chains have slow mixing rate. Our approach does not require a specific parametrization except for an oracle to evaluate the transition density and its gradient at a given data point without any additional knowledge about the density function itself. It makes our estimator suitable for parametrizations associated with neural networks. The estimator can potentially achieve large improvement in terms of efficiency. Numerical experiments confirm the good performance predicted by the theory.

10.
arXiv (CS.CL) 2026-06-12

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

11.
medRxiv (Medicine) 2026-06-11

Genetic Susceptibility to Incisional Hernia: Evaluation of Hernia Polygenic Risk Scores

Objectives: Incisional hernia (IH) affects 13-30% of people after abdominal surgery, resulting in substantial morbidity and costs. While clinical risk factors have been studied extensively, genomic risk for IH is incompletely understood. We aimed to evaluate the impact of polygenic risk scores (PRS) on IH risk prediction. Methods] We created and evaluated three PRS for abdominal hernia, ventral hernia and latent hernia susceptibility for prediction of IH in an institutional biobank. The primary outcome was defined as the diagnosis or repair of an IH based on ICD-9/10-CM/PCS and CPT codes. Clinical covariates included age, sex, body mass index (BMI), smoking status, index procedure type, and perioperative surgical site infection. A phenome-wide association study (PheWAS) was performed to assess clinical associations with increased PRS. We then tested the ability of the PRS to improve prediction for IH by modeling clinical covariates with and without PRS in patients who underwent abdominal surgery. Model performance was assessed using 10 iterations of 5-fold cross-validation to estimate Brier scores and area under the receiver operating characteristic curve (AUROC), which were compared using cross-model Bayesian analysis of variance. Results: In 55,809 subjects, assessed PRS was significantly associated with incisional, umbilical, and ventral hernia on PheWAS, with 1.19 greater odds of developing IH per 1-SD increase in PRS (95% CI: 1.13-1.25, P < 0.001). Of 9,909 subjects who underwent qualifying abdominal surgery, 706 developed IH. In this cohort, the latent hernia susceptibility PRS was associated with a 16% increased hazard of developing IH per 1-SD increase (HR 1.16; 95% CI: 1.07-1.26; P < 0.001). Compared to a predictive model using clinical covariates (Brier score = 0.047, 95% CI: 0.046-0.048; AUROC = 0.660, 95% CI: 0.653-0.666), addition of the PRS showed similar Brier score and AUROC estimates (Brier score = 0.047, 95% CI: 0.046-0.048; AUROC: 0.667, 95% CI: 0.661-0.673) at five years. Cross-model Bayesian analysis demonstrated >99% probability of practical equivalence when trying to detect a difference of [&ge;] 0.02. Conclusion: All three PRS for hernia were independently associated with IH, suggesting that genomic factors contribute significantly to IH development. However, none of the three PRS meaningfully improved clinical IH risk prediction in patients who underwent abdominal surgery. This suggests that clinical comorbidities and surgical techniques may be equally as important as genomic architecture.

12.
medRxiv (Medicine) 2026-06-16

Infections and suicide and self-harm: a population-based matched cohort study

Background Infections have been associated with adverse mental health outcomes, including suicide, but evidence beyond severe or central nervous system infections is limited. We investigated associations between a range of acute infections and subsequent suicide/self-harm outcomes. Methods We conducted six infection-specific matched cohort studies using English primary care records from the Clinical Practice Research Datalink Aurum (2007-2024), linked to hospital admissions and mortality data. Adults ([&ge;]18 years) with a primary care record of infection (gastroenteritis, lower respiratory tract [LRTI], skin/soft-tissue [SSTI], urinary tract [UTI], sepsis, meningitis/encephalitis [positive control]) were matched (age, sex, practice, calendar period) to up to five comparators without infection. We estimated hazard ratios (HRs) for suicide/self-harm outcomes using Cox regression, stratified by matched set and implicitly adjusting for matching factors, with additional adjustment for deprivation, lifestyle factors, and comorbidities. We examined whether associations varied over time, by infection severity, antimicrobial treatment, sex, and prior mental health conditions. Findings Cohorts ranged from 18,192 individuals with meningitis/encephalitis (matched to 90,915 without) to 398,099 with SSTI (matched to 1,743,747). After adjustment, individuals with infection had a higher hazard of suicide/self-harm outcomes than comparators across all cohorts: sepsis (HR 1.79, 95% CI 1.65-1.93), gastroenteritis (1.62, 1.55-1.70), meningitis/encephalitis (1.56, 1.32-1.84), UTI (1.41, 1.33-1.50), SSTI (1.37, 1.31-1.43), and LRTI (1.37, 1.31-1.44). Risk was highest in the year post-infection, attenuating over time, and was higher among severe infections and those without prior mental health conditions. Interpretation Common acute infections recorded in primary care are associated with increased risk of suicide and self-harm, particularly following severe infections and in the year post-infection. Findings support suicide risk monitoring following acute infection, particularly among individuals without prior mental health conditions, and highlight infection prevention as a potentially modifiable strategy in vulnerable populations. Funding Wellcome and La Caixa. Copyright This work is licensed under a Creative Commons Attribution (CC BY) licence.

13.
arXiv (quant-ph) 2026-06-15

Nanostructure modelling with early fault tolerant quantum computers

arXiv:2606.06442v2 Announce Type: replace Abstract: Semiconductor nanostructures are central to many developing technologies. Notably, double quantum dots are especially important for semiconductor spin-qubit architectures, quantum sensing applications, and quantum-dot solar cells. Accurate modelling is highly desirable but conventional methods can struggle when dynamics involve more than two interacting electrons. In this work, we present a quantum simulation framework capable of addressing multi-electron double quantum dots. We adopt an efficiently scaling 1$^st$ quantised representation of the system and develop algorithms based on both Trotterisation and Qubitisation. Incorporating insights from classical simulations enables us to produce resource estimates that are more realistic than those obtained from theoretical error bounds. Using a standard surface code model with physical noise at $10^{-3}$, our results indicate that the ground-state energy of four electrons in a double quantum dot can be estimated in approximately 22 hours using 226k physical qubits, or an eight-electron system in 3.3 days with 314k qubits (with runtimes falling dramatically when more qubits are available). We anticipate that incorporating recent advances in surface code architectures may reduce these costs significantly further. Our results suggest that early fault-tolerant quantum computers may become valuable tools for designing mature-era quantum technologies.

14.
arXiv (CS.LG) 2026-06-15

Geometric Domain Adaptation via Optimal Transport for Linear Regression in R^2

arXiv:2606.14023v1 Announce Type: cross Abstract: Optimal Transport has become recently a powerful method for domain adaptation by aligning source and target distributions. We study a supervised domain adaptation problem where source and target domains are related by a rotation or a translation or a homothety in $\mathbb{R}^2$. We prove that the optimal transport map recovers the underlying map when using a $p-$norm cost with $p \geq 2$. Based on this insight, we develop a method combining $K-$means and optimal transport to estimate the underlying map, enabling adaptation of linear regression models when target data is scarce. Simulations demonstrate improved performance over baseline methods. Rather than relying on highly expressive deep learning architectures, we focus on classical machine learning models to emphasize interpretability and theoretical insight. This perspective allows us to explicitly characterize the role of optimal transport in recovering geometric transformations such as rotations, translations, and homotheties. Our contributions include a theoretical result linking optimal transport and rotations, translations and homothecies in $\mathbb{R}^2$, and a practical method for adaptation in linear regression offering both conceptual clarity and applied value in domain adaptation tasks in this space.

15.
arXiv (CS.AI) 2026-06-18

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

arXiv:2606.06133v2 Announce Type: replace-cross Abstract: TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.

16.
arXiv (CS.LG) 2026-06-16

Repeated Bilateral Trade: The Quest for Fairness

arXiv:2606.15369v1 Announce Type: new Abstract: We study repeated bilateral trade from a fairness perspective. At each round, a fresh seller-buyer pair arrives, and the platform posts a price before observing the traders' valuations. Trade occurs only if both agents accept the price. Rather than maximizing only the gain from trade, we consider platforms that seek balanced divisions of the generated surplus. We show that natural fairness desiderata lead to a one-parameter Rawls-to-Nash family of fair-gain objectives, obtained by aggregating the seller's and buyer's net gains through nonpositive Hölder means. Unlike the standard gain-from-trade objective and the Rawlsian fair-gain objective studied in prior work, our proposed objectives induce a new statistical structure in which expected rewards are recovered from threshold feedback through a two-dimensional singular-kernel integral identity. This leads to a nonstandard pure-exploration problem whose natural estimators are rectangular double sums with row-column dependence and singular weights. Assuming independent i.i.d. seller and buyer valuation sequences with arbitrary unknown marginals, we characterize the optimal learning rates for the whole Rawls-to-Nash family of fair-gain objectives, giving matching fixed-confidence sample-complexity and regret bounds up to polylogarithmic factors.

17.
arXiv (CS.LG) 2026-06-11

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

作者:

arXiv:2606.11251v1 Announce Type: new Abstract: Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz–96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.

18.
arXiv (quant-ph) 2026-06-12

A Quantum Algorithm for Random Number Generation

arXiv:2606.13034v1 Announce Type: new Abstract: We present a quantum algorithm for random number generation that achieves a provable quadratic speedup over classical Markov chain mixing, building on the Diaconis-Shahshahani Fourier analysis of the top-to-random card shuffle. The algorithm integrates three quantum primitives into a unified mixing circuit: the Quantum Fourier Transform (QFT), which diagonalizes the Markov transition operator; controlled phase rotations, which encode the shuffle eigenvalue spectrum; and the Grover diffusion operator, which acts as a quantum analogue of the Aldous-Diaconis strong uniform stopping time by reflecting amplitudes about their mean at each iteration. For an n-qubit register, the mixing time is O(\sqrt{n \log n}) iterations. Extending to m qudits of local dimension d reduces this to O(\sqrt{\log_d N}) iterations, where N = d^m, compared to the classical O(n \log n) bound. The qudit formulation further reduces QFT circuit depth from O(\log^2 N) to O(\log_d^2 N) gates per layer by encoding the same N-state space using m = \log_d N subsystems instead of \log_2 N qubits. We validate both variants on IBM superconducting hardware.

19.
arXiv (CS.LG) 2026-06-12

Is Spurious Correlation Removal Always Learnable?

arXiv:2606.12930v1 Announce Type: new Abstract: Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\dist(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

20.
arXiv (CS.AI) 2026-06-12

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

arXiv:2605.02249v2 Announce Type: replace Abstract: We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim to provide a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event model based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

21.
arXiv (CS.CL) 2026-06-11

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

22.
arXiv (CS.LG) 2026-06-16

Self-Supervised Learning of Iterative Solvers for Constrained Optimization

arXiv:2409.08066v3 Announce Type: replace Abstract: The real-time solution of parametric optimization problems is critical for applications that demand high accuracy under tight real-time constraints, such as model predictive control. To this end, this work presents a learning-based iterative solver for constrained optimization, comprising a neural network predictor that generates initial primal-dual solution estimates, followed by a learned iterative solver that refines these estimates to reach high accuracy. We introduce a novel loss function based on Karush-Kuhn-Tucker (KKT) optimality conditions, enabling fully self-supervised training without pre-solved optimizer solutions. Theoretical guarantees ensure that the training loss function attains minima exclusively at KKT points. A convexification procedure enables application to nonconvex problems while preserving these guarantees. Experiments on two nonconvex case studies demonstrate speedups of up to one order of magnitude compared to state-of-the-art solvers such as IPOPT, while achieving orders of magnitude higher accuracy than competing learning-based approaches.

23.
bioRxiv (Bioinfo) 2026-06-15

Multi-platform reassessment of human mitochondrial DNA methylation reveals signals consistent with technical artifacts

The existence and functional relevance of mitochondrial DNA methylation remain controversial. Here, we systematically profiled cytosine methylation and hydroxymethylation across human brain and blood tissues spanning healthy and malignant states using orthogonal sequencing approaches that avoid chemical conversion during library preparation. While nuclear DNA exhibited canonical methylation patterns, mitochondrial DNA consistently showed negligible signal, indistinguishable from background technical noise. By mapping cytosine-guanine sites between mitochondrial DNA and nuclear-embedded mitochondrial sequences, we demonstrate the potential of these nuclear counterparts to confound not only cytosine methylation but also hydroxymethylation measurements, corroborating and extending prior findings implicating nuclear contamination as a potential source of apparent mitochondrial epigenetic signals. Additional technical factors that inflate apparent mtDNA methylation signals were identified, including sequence context biases, flow cell chemistries, and coverage-dependent discrepancies between the heavy and light strands. Collectively, these results provide convergent evidence against the presence of biologically meaningful cytosine methylation or hydroxymethylation in mitochondrial DNA. These findings caution against interpreting apparent mtDNA methylation signals in human adult tissues as meaningful without rigorous orthogonal validation and comprehensive consideration of technical and analytical confounding factors.

24.
arXiv (quant-ph) 2026-06-19

Effects of interaction range on the mean-field dynamics of Bose polarons

arXiv:2606.20020v1 Announce Type: cross Abstract: We consider the three-dimensional Bose polaron problem in the regime of finite range interactions and competing length scales. Working in the reference frame of the impurity, we study both static and out of equilibrium properties of the system, in particular the transfer of momentum between the impurity and the host gas. We find that relaxation dynamics can occur via damped oscillations of the impurity velocity with simple dependence on the interaction strength. Furthermore, the equilibration process is sensitive to the type of the impurity-bath interaction. Specifically, interatomic forces describing ion-atom systems lead to much longer timescales and more pronounced oscillations in the strong coupling regime with respect to local interaction potentials. We also find that the effective masses can differ by a large amount between the two scenarios, even if the number of atoms in the polaron cloud remains similar for both cases.

25.
arXiv (CS.AI) 2026-06-18

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

arXiv:2602.23092v2 Announce Type: replace Abstract: The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.