Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-22

AI-driven Multimodal Representation Learning for Latent Mediation Structure Discovery of Socioeconomic Disadvantage, Psychosocial Factors, and Cardiometabolic Multimorbidity

作者:

Social disadvantage is associated with multimorbidity, but the pathways linking social conditions to disease burden remain poorly understood. We developed an AI-driven multimodal mediation framework that integrates socioeconomic, psychosocial, clinical, laboratory, behavioral, and genomic data from the All of Us Research Program. Modality-specific variational autoencoders were used to derive latent representations of each data domain, and mediation analyses were subsequently performed in latent space to evaluate indirect associations between socioeconomic disadvantage, psychosocial factors, and multimorbidity. The final analytic cohort included 20,804 participants with complete multimodal data. Across 800 exposure–mediator–outcome combinations, mediation signals were concentrated within a small number of latent dimensions. The strongest indirect association linked a socioeconomic disadvantage dimension, a psychosocial vulnerability dimension, and a cardiometabolic multimorbidity dimension (NIE = 0.002517). The psychosocial dimension was characterized by poorer mental health, greater loneliness, lower social well-being, and lower health literacy, whereas the outcome dimension was associated with hypertension, diabetes, hyperlipidemia, obesity, chronic kidney disease, and heart disease. Bootstrap analyses supported the stability of the leading pathway. These findings suggest that psychosocial vulnerability may contribute to the association between socioeconomic disadvantage and cardiometabolic multimorbidity. More broadly, the proposed framework illustrates how AI-based representation learning can be used to investigate complex relationships across high-dimensional multimodal health data.

02.
medRxiv (Medicine) 2026-06-18

Guiding the development of climate counterfactuals for health impact attribution studies

Climate change detection and attribution (D&A) methods have become vital for quantifying the influence of anthropogenic forcing on the Earth's systems, including human health. Health impact attribution (HIA) studies seek to disentangle climate-driven health effects from natural variability yet are often constrained by the availability of accessible counterfactual climate scenarios. This tutorial paper presents a flexible, reproducible framework for developing counterfactual climates without reliance on computationally intensive global circulation models. We provide practical, R-based methodologies for constructing both trend-based (temperature and non-temperature) and event-based counterfactual, using a variety of techniques including model residual detrending, data-driven decomposition (e.g., Singular Spectrum Analysis and Empirical Mode Decomposition) and stochastic weather generators. The tutorial also explores the incorporation of greenhouse gas concentrations as forcing variables, rather than global mean temperature anomalies. By operationalising these methods through worked examples and an open code repository, this paper aims to build capacity within the HIA community, enhance methodological transparency, and foster interdisciplinary collaboration between climate and health researchers.

03.
PLOS Medicine 2026-05-20

Associations between hematologic dynamics during pregnancy and obstetric complications: A retrospective observational study

by Veronica Tozzo, Rachel Petherbridge, Kaitlyn James, Sarah Hsu, Deepti Pant, Chloe Michalopoulos, Brody H. Foy, Tanayott Thaweethai, Christopher Mow, Jacqueline Maya, Carolina Batlle Camero, Lydia Shook, Kathryn J. Gray, Logan Mauney, John M. Higgins, Camille E. Powe Background Pregnancy alters hematologic state as measured by complete blood count (CBC), but the longitudinal changes in CBC indices that define healthy pregnancies are not well established. In a large cohort based at an academic health system in the United States, we aimed to define reference intervals and typical longitudinal changes in CBC indices during pregnancy. We then tested for associations between extreme CBC values for gestational age or extreme longitudinal changes in CBC indices and obstetric complications. Methods and findings We studied nine CBC indices in individuals with singleton pregnancies who delivered after 30 weeks’ gestation and presented for prenatal care prior to 20 weeks. The electronic health record (EHR)-based Maternal Health Cohort (Massachusetts General Hospital; 1998–2016) formed our discovery cohort of 45,992 pregnancies, 18% of which had relevant complications. We developed a validation cohort of 48,868, 27% with complications from EHR data in the Mass General Brigham healthcare system from 2016 to 2024. In pregnancies without complications in the discovery cohort, we derived gestational-age-specific reference intervals (2.5th–97.5th percentile) and established typical intra-pregnancy longitudinal changes. In the validation cohort, we then tested CBC values outside of the 26–29 weeks’ gestation reference interval and CBC rare changes (uncommon changes in magnitude and direction) between 7–14 and 26–29 weeks’ gestation for association with a composite outcome (hypertensive disorders of pregnancy, small for gestational age birthweight, preterm birth) and its individual components using generalized estimating equations. Derived reference intervals differed from those in the literature for mean red cell volume, mean red cell hemoglobin, red cell count, and mean red cell hemoglobin concentration; reference intervals for other indices were similar to those previously published. In validation, hematocrit, hemoglobin, and red cell count values above their gestational-age specific reference intervals were associated with increased risk of the composite obstetric outcome: odds ratios (ORs) of 1.4 (95% CI [1.2, 1.5] p 

04.
arXiv (CS.CV) 2026-06-18

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

05.
arXiv (quant-ph) 2026-06-19

Maximum entropy principle for quantum processes

arXiv:2506.24079v3 Announce Type: replace Abstract: The maximum entropy principle, as applied to quantum systems, is a fundamental prescript positing that for a quantum system for which we only have partial knowledge, the maximum entropy state consistent with the partial knowledge is a valuable choice as the system's state. An intriguing result is that in case the only prior knowledge is of a fixed energy, the maximum entropy state turns out to be the thermal state, a ubiquitous state in several arenas, especially in statistical mechanics. We extend the consequences of this principle from static quantum states to dynamic quantum processes. We establish that a quantum channel attains maximal output entropy under a fixed energy constraint if and only if it is an absolutely thermalizing channel, where the fixed output is the thermal state corresponding to that energy. Our results have potential implications for understanding the informational and thermodynamic utility of quantum channels under physical constraints. As an application, we examine the consequences for private randomness distillation from fixed energy constrained quantum processes.

06.
arXiv (CS.CL) 2026-06-17

Perceptual compensation for tonal context in self-supervised speech models

This study examines the extent to which the wav2vec2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptional compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier outputs between a purely self-supervised pre-trained model and a model fine-tuned for Mandarin ASR. No evidence of compensation was found in the embedding similarities of the purely pre-trained model. Probing classifiers showed some evidence of compensation in addition to the expected layer-wise improvements in categorization, but failed to replicate human performance on isolated test syllables. Our findings contrast with previous reports of sensitivity to phonological structure emerging through pre-training alone, and suggest that supervised objectives may be necessary to encourage the abstraction of at least some types of phonological regularities.

07.
arXiv (quant-ph) 2026-06-11

Scaling-optimal purification of noisy qubit unitary channels

arXiv:2606.12394v1 Announce Type: new Abstract: We consider the problem of purifying noisy qubit unitary channels. Given the ability to apply an unknown qubit unitary channel followed by depolarizing noise, we aim to construct a superchannel that purifies the noisy unitary back to the original unknown unitary. We first provide numerical evidence that sequential strategies can strictly outperform parallel strategies when the number of channel uses is finite, highlighting the fundamental distinction from state purification. We then provide a concrete $\mathrm{U}(2)$-covariant parallel protocol based on a novel entanglement-assisted quantum error-correcting code that suppresses the first-order noise strength as $O(1/n)$ with $n$ channel uses and show this scaling is asymptotically optimal in the low-noise regime, even when sequential strategies are allowed.

08.
medRxiv (Medicine) 2026-06-11

Incremental costs of transitioning from four to eight WHO-recommended antenatal care visits in Uganda: A costing analysis from a societal perspective

Background In 2016, the World Health Organization revised its antenatal care (ANC) recommendation from four to eight visits. For low- and middle-income countries like Uganda, where achieving even four visits remains a challenge, this transition has significant cost implications for both the health system and households. This study estimated the incremental costs of adopting the eight-visit model from a societal perspective. Methods The study was conducted in six government health facilities in southwestern Uganda. A micro-costing approach estimated health facility costs (personnel, equipment, consumables, and overhead). Costs incurred at patients end (transport, ultrasound, medical expenses, and time) were collected from 785 women using a questionnaire, with all costs in 2025 USD. Results For an average of 4.3 visits, total cost per woman was $100.1: facility costs $43.7 (43.7%), and patient costs $56.4 (56.3%). Transitioning to eight visits would increase total cost by $57.8 (57.8%), of which $36.4 (63.0%) would fall on households, equivalent to 68.8% of average monthly household income. Total costs would rise by 55.4% ($115.5 to $179.5) at Health Center IVs and 64.3% ($102.3 to $168.1) at Health Center IIIs, with facility costs up 43.4% and 62.9% and patient costs up 61.2% and 65.7%, respectively. Conclusion Transitioning to eight ANC visits would impose a large financial burden on households, with the incremental patient cost equivalent to more than two-thirds of average monthly household income. Equitable implementation requires improving availability of medicines and diagnostics, subsidizing transport, exploring telemedicine or community-based models, and improving efficiency at lower-tier health centers.

09.
arXiv (CS.AI) 2026-06-19

Thermodynamic Measure of Intelligence

arXiv:2606.20231v1 Announce Type: new Abstract: Can intelligence be measured? We propose that intelligence can be defined as the lawful amplification of rare but valid futures: a system increases the probability of outcomes that would be unlikely under passive dynamics but remain admissible under the constraints of the domain. We start with the premise that an intelligent system must model the world and its own place within it. Because the system is part of the world it models, this leads naturally to recursive self-simulation: the system represents futures in which its own actions are part of the trajectory. Our central results give a necessity statement and a conditional near-sufficiency statement connecting this architecture to a precise thermodynamic measure of lawful amplification of rare-valid futures: high rare-valid lift is impossible unless the internal simulation identifies rare-valid futures with high fidelity; conversely, when rare-valid fidelity is high and the simulation contains an effective policy, the achievable lift approaches the actuation-limited optimum. Thus recursive self-simulation is not merely a plausible feature of intelligence but, under the stated assumptions, is necessary and nearly sufficient for high thermodynamic intelligence. The resulting framework makes intelligence measurable on a universal scale, from passive matter and feedback controllers, large language models, and humans as text generators to Maxwell-demon-like information engines.

10.
arXiv (CS.CL) 2026-06-18

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features towards language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features towards CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation of two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.

11.
arXiv (CS.AI) 2026-06-17

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

arXiv:2606.17856v1 Announce Type: new Abstract: Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user queries are abstract and semantically sparse at the entity level, and (ii) suffers from brittle multi-hop reasoning, where noisy activations can derail entity-to-entity transitions and corrupt the inferred relation chain, yielding unreliable conclusions. To this end, we propose \texttt{FlowRAG}, a semantic-aware retrieval framework that improves both semantic recall and explicit reasoning. Specifically, \texttt{FlowRAG} constructs a quad-level heterogeneous graph over passages, summaries, sentences, and entities, where summary nodes serve as a coarse semantic hub. At retrieval time, a dual-granularity activation module combines summary–query alignment with sentence-level matching to activate relevant entities under paraphrase and abstraction robustly. We then introduce a frequency-aware weighted flow module that routes relevance through entity–passage links weighted by within-passage term frequency, pruning noisy connections and extracting high-confidence reasoning paths as an explicit logic skeleton for generation. Extensive experiments show that \texttt{FlowRAG} obtains state-of-the-art performance on complex reasoning benchmarks.

12.
arXiv (CS.AI) 2026-06-11

AutoMine Solution for AV2 2026 Scenario Mining Challenge

arXiv:2606.11874v1 Announce Type: new Abstract: With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.

13.
arXiv (CS.CL) 2026-06-16

Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences

In this report, we present LOGOS (Language Of Generative Objects in Science), a scientific generative language model that unifies heterogeneous tasks across the natural sciences within a single autoregressive framework based on a shared scientific grammar. It encodes diverse scientific objects and their spatial interactions as token sequences over a common vocabulary. By representing spatial contact and constraint patterns as discrete tokens, the model captures complex structural interactions in a purely sequential manner, without relying on explicit coordinates or geometric neural networks. This unified representation enables a wide range of downstream tasks to be formulated consistently as next-token prediction in the same grammar space, creating strong alignment between continued multi-domain pre-training and downstream objectives. Across diverse tasks, LOGOS consistently matches or outperforms domain-specific baselines, providing preliminary evidence for the feasibility of "one model fits all" in the natural sciences. We train LOGOS models at different scales (1B, 3B, and 8B parameters) and find a consistent positive correlation between model size and performance. This suggests that the future of AI for Science (AI4S) may not lie in building an independent technical stack that is separated from large language models (LLMs). Instead, it may depend on deeply aligning scientific foundation models with LLMs through shared architectures, shared training paradigms, and shared inference infrastructure, so that LLMs can truly become a new entry point for AI4S. We release the model weights and associated resources to facilitate further research.

14.
arXiv (CS.AI) 2026-06-16

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arXiv:2606.15862v1 Announce Type: new Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

15.
arXiv (CS.CL) 2026-06-11

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

16.
arXiv (CS.CL) 2026-06-16

From Argument Components to Graphs: A Multi-Agent Debate with Confidence Gating for Argument Relations

Large Language Models (LLMs) are increasingly assessed and utilized in the field of Argument Mining (AM), thanks to their strong general reasoning capabilities. However, standard training-free models often miss sophisticated details, specifically in contexts where two parts of the text have to be analyzed together. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses for the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend and evaluate this framework on the Argument Relation Identification and Classification (ARIC) task, reformulating it as a debate over component pairs. Besides that, we introduce a confidence gating mechanism that enables debating only on the uncertain cases and accepting the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we demonstrate that the selective debate achieves the highest Macro F1 among all training-free methods, while debate over all samples degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class was more damaging to supervised fine-tuning than to inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability absent from both single-agent and supervised classifiers.

17.
arXiv (CS.CV) 2026-06-16

Beer-Lambert Guided Representation Learning for Unsupervised Anomaly Detection in Sub-THz Food Inspection Images

Food manufacturing requires reliable inspection systems to detect foreign material contamination and maintain product safety. Sub-THz transmission imaging provides material-dependent attenuation characteristics that are useful for detecting low-density contaminants in food products. However, existing unsupervised anomaly detection methods mainly rely on RGB-pretrained visual representations, which may not adequately capture the transmission behavior of Sub-THz images. This paper proposes a Beer-Lambert guided representation learning framework for unsupervised anomaly detection in Sub-THz food inspection images. The proposed method introduces an attenuation decomposition module as an auxiliary regularization module that constrains student representations through attenuation reconstruction during training. In addition to the conventional one-class setting, we introduce a Leave-One-Food-Out protocol to evaluate generalization capability under unseen food categories. Experimental results on the Inline-Food-Inspection-THz dataset show that the proposed method improves overall anomaly detection performance over the baseline method.

18.
arXiv (CS.LG) 2026-06-17

Variational autoencoders with latent high-dimensional steady geometric flows for dynamics

arXiv:2410.10137v5 Announce Type: replace Abstract: We develop Riemannian approaches to variational autoencoders (VAEs) for PDE-type ambient data with regularizing geometric latent dynamics, which we refer to as VAE-DLM, or VAEs with dynamical latent manifolds. We redevelop the VAE framework such that manifold geometries, subject to our geometric flow, embedded in Euclidean space are learned in the intermediary latent space developed by encoders and decoders. By tailoring the geometric flow in which the latent space evolves, we induce latent geometric properties of our choosing, which are reflected in empirical performance. We reformulate the traditional evidence lower bound (ELBO) loss with a considerate choice of prior. We develop a linear geometric flow with a steady-state regularizing term. This flow requires only automatic differentiation of one time derivative, and can be solved in moderately high dimensions in a physics-informed approach, allowing more expressive latent representations. We discuss how this flow can be formulated as a gradient flow, and maintains entropy away from metric singularity. This, along with an eigenvalue penalization condition, helps ensure the manifold is sufficiently large in measure, nondegenerate, and a canonical geometry, which contribute to a robust representation. Our methods focus on the modified multi-layer perceptron architecture with tanh activations for the manifold encoder-decoder. We demonstrate, on our datasets of interest, our methods perform at least as well as the traditional VAE, and oftentimes better. Our methods can outperform this and a VAE endowed with our proposed architecture, frequently reducing out-of-distribution (OOD) error between 15% to 35% on select datasets. We highlight our method on ambient PDEs whose solutions maintain minimal variation in late times. We provide empirical justification towards how we can improve robust learning for external dynamics with VAEs.

19.
arXiv (CS.CV) 2026-06-16

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

20.
arXiv (CS.CV) 2026-06-19

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

21.
arXiv (CS.CL) 2026-06-12

EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

22.
arXiv (CS.AI) 2026-06-17

Prototype-Based Semantic Consistency Alignment for Domain Adaptive Retrieval

arXiv:2512.04524v4 Announce Type: replace-cross Abstract: Domain adaptive retrieval aims to transfer knowledge from a labeled source domain to an unlabeled target domain, enabling effective retrieval while mitigating domain discrepancies. However, existing methods encounter several fundamental limitations: 1) neglecting class-level semantic alignment and excessively pursuing pair-wise sample alignment; 2) lacking either pseudo-label reliability consideration or geometric guidance for assessing label correctness; 3) directly quantizing original features affected by domain shift, undermining the quality of learned hash codes. In view of these limitations, we propose Prototype-Based Semantic Consistency Alignment (PSCA), a two-stage framework for effective domain adaptive retrieval. In the first stage, a set of orthogonal prototypes directly establishes class-level semantic connections, maximizing inter-class separability while gathering intra-class samples. During the prototype learning, geometric proximity provides a reliability indicator for semantic consistency alignment through adaptive weighting of pseudo-label confidences. The resulting membership matrix and prototypes facilitate feature reconstruction, ensuring quantization on reconstructed rather than original features, thereby improving subsequent hash coding quality and seamlessly connecting both stages. In the second stage, domain-specific quantization functions process the reconstructed features under mutual approximation constraints, generating unified binary hash codes across domains. Extensive experiments validate PSCA's superior performance across multiple datasets.

23.
arXiv (CS.CL) 2026-06-16

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a set of comprehensive tree of skills along with skill-augmented training data, enabling models to effectively learn and utilize skills. Besides, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration, avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization over challenging benchmarks.

24.
arXiv (CS.CV) 2026-06-11

CountZES: Counting via Zero-Shot Exemplar Selection

Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across domains.

25.
arXiv (CS.AI) 2026-06-16

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

arXiv:2606.15709v1 Announce Type: new Abstract: Jordan faces severe water scarcity with 50\% of water produced is lost to leakage, theft and metering issues also known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1~L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst – confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.