Academic Intelligence · Curated Daily

Explore the frontiers of global academic research

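AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.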

01.
medRxiv (Medicine) 2026-06-17

Accounting for Human Movement to Improve Exposure-Health Models

Background. Current exposure-health models rely on averaged, residential-based environmental exposures, failing to account for human movement. This aggregation can lead to exposure misclassification and biased exposure-response estimates, potentially distorting our understanding of the true health effects of environmental conditions. We developed exposure disaggregation regression models that explicitly account for human movement when linking environmental exposures to health outcomes. Methods. By weighting pixel-level exposures according to distance from home as a simple proxy for human movement, our model linked disaggregated environmental exposures to individual-level health outcomes. Weights were either fixed a priori or derived from a latent distance-decay power parameter learned from the data. We additionally evaluated model performance under a nonlinear exposure-response relationship. Model performance was assessed across multiple sample sizes (N = 1,114; 50,000; and 100,000). A simulation study examined parameter recovery using bias, empirical standard error (EmpSE), and credible interval coverage. As a case study, Demographic and Health Surveys (DHS) data from Albania were used to link acute respiratory infection (ARI) outcomes among children under five to pixel-level NDVI within a 3 km buffer around DHS cluster centroids, and the proposed models were applied to these data. Results. Across all models (fixed-weight, learned-weight, and restricted cubic spline models), parameter recovery improved with increasing sample size. At N = 1,114, estimates were biased and imprecise, with incorrect effect direction for exposure-response parameters (e.g., learned-weight β1 bias = -0.79; EmpSE = 2.61; coverage = 0.88). In contrast, the models accurately recovered parameters at larger sample sizes, including the latent distance-decay parameter (bias = -0.02; EmpSE = 0.15; coverage = 0.95 at N = 100,000), demonstrating their ability to reliably learn movement-based exposure weights when sufficient data were available. Conclusion. Instead of relying on arbitrarily-sized buffers, this statistical framework provides a novel method for studying environmental exposure-health relationships whilst accounting for human movement. With sufficiently large sample sizes, it can accurately estimate the influence of disaggregated environmental exposures on individual-level health and help address exposure misclassification arising from residential-only metrics. This methodological framework remains scalable, interpretable, and adaptable to other exposures and outcomes, offering a foundation for future work that integrates richer mobility-informed exposure-health research.
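
The distance-based weighting described in the Methods can be illustrated with a minimal sketch. The abstract does not give the functional form of the weights, so the power-law form `(1 + d) ** (-alpha)`, the function names, and the toy NDVI values below are all assumptions made for illustration, not the authors' model.

```python
def distance_decay_weights(distances_km, alpha):
    """Normalized weights proportional to (1 + d)^(-alpha).

    Assumption: the (1 + d) offset is used here only to avoid division by
    zero at the home pixel; the abstract does not specify the exact form.
    """
    raw = [(1.0 + d) ** (-alpha) for d in distances_km]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_exposure(pixel_values, distances_km, alpha):
    """Weighted mean of pixel-level exposures for one individual."""
    weights = distance_decay_weights(distances_km, alpha)
    return sum(w * x for w, x in zip(weights, pixel_values))


# Hypothetical NDVI values at pixels 0, 1, 2 and 3 km from home.
ndvi = [0.2, 0.4, 0.6, 0.8]
dist = [0.0, 1.0, 2.0, 3.0]

flat = weighted_exposure(ndvi, dist, alpha=0.0)   # alpha = 0: plain buffer average
steep = weighted_exposure(ndvi, dist, alpha=2.0)  # larger alpha: pixels near home dominate
```

With `alpha = 0` every pixel in the buffer counts equally, which corresponds to the conventional buffer average; increasing `alpha` pulls the exposure toward the value at the home location. In the learned-weight variant, a parameter playing the role of `alpha` would be estimated from the data rather than fixed.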

02.
bioRxiv (Bioinfo) 2026-06-20

Seed variation impacts clustering stability in Single-Cell RNA-Seq and can be mitigated by StAbility-BasEd-Reassignment (SABER)

Single-cell RNA-seq clustering is commonly treated as reproducible once a random seed is fixed, yet the choice of seed itself may alter cell assignments and downstream interpretation. We systematically quantified seed-induced clustering variability by running Louvain and Leiden clustering across 100 seeds in Seurat and Scanpy on 28 single-cell RNA-seq datasets from the Human Cell Atlas and IMMUcan. Using Element-Centric Consistency, we found that seed choice affected a substantial fraction of cells, with Scanpy showing more unstable assignments than Seurat on average, 40.46% versus 26.78% unstable cells, respectively. This increased stability came at a marked computational cost: Seurat required approximately 19-fold higher median memory than Scanpy. Seed-dependent clustering variability also propagated to cell-type annotation, particularly among transcriptionally related populations including macrophage/monocyte, endothelial/epithelial and T/NK cell states. To mitigate this instability, we developed StAbility-BasEd Reassignment (SABER), a Scanpy-based framework that identifies seed-sensitive cells across repeated clusterings and reassigns them to stable cluster cores using cosine similarity. SABER improved clustering quality while preserving annotation concordance and reduced median memory usage 3.5-fold compared with Seurat-Louvain. Our results identify seed choice as an underappreciated source of variability in single-cell analysis and provide a scalable strategy to improve clustering robustness.
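
The reassignment idea (moving seed-sensitive cells to stable cluster cores by cosine similarity) can be sketched as follows. This is not the SABER implementation: the function names, the use of a centroid to represent each stable core, and the toy embeddings are assumptions, and the `stable` flags are taken as given rather than computed from repeated clusterings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def reassign_unstable(embeddings, labels, stable):
    """Keep stable cells; move each unstable cell to the most similar stable core."""
    cores = {}
    for vec, lab, ok in zip(embeddings, labels, stable):
        if ok:
            cores.setdefault(lab, []).append(vec)
    centroids = {lab: centroid(vecs) for lab, vecs in cores.items()}

    result = []
    for vec, lab, ok in zip(embeddings, labels, stable):
        if ok:
            result.append(lab)
        else:
            result.append(max(centroids, key=lambda c: cosine(vec, centroids[c])))
    return result


# Hypothetical 2-D embeddings; the last cell is flagged as seed-sensitive.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.3]]
labels = ["A", "A", "B", "B", "B"]
stable = [True, True, True, True, False]
new_labels = reassign_unstable(emb, labels, stable)
```

In this toy case the unstable cell starts in cluster B but points in nearly the same direction as the core of cluster A, so it is moved to A.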

03.
arXiv (CS.CV) 2026-06-12

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

04.
arXiv (CS.CL) 2026-06-11

Context-Aware Multimodal Claim Verification in Spoken Dialogues

Every day, millions absorb claims from podcasts and streams that no fact-checker ever sees. Spoken misinformation is built through conversation, where credibility comes not from facts alone but from how claims are framed, reinforced, or left unchallenged across turns. Yet fact-checking has focused on isolated text, leaving dialogue audio under-studied. We introduce MAD2, a new Multi-turn Audio Dialogues benchmark for spoken claim verification, containing 1,000 two-speaker dialogues with 3,368 check-worthy claims and approximately 10 hours of audio, and propose calibrated multimodal fusion of a context-aware audio encoder and a dialogue-aware text model. Across settings, adding dialogue context improves verification, but the gains depend on scenario type. Using only preceding context often matches offline performance, supporting live-moderation settings, and audio contributes most when transcript-based models are destabilized by additional context. Overall, conversational structure matters more for verification than misinformation framing.

05.
arXiv (CS.CL) 2026-06-11

One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection

Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating exploitable gaps for jailbreak attacks. Current jailbreak defenses are largely developed and evaluated in dominant languages, and their effectiveness is limited by the scarcity of aligned multilingual supervision and the representation dispersion caused by language variation. To address this issue, we propose MLJailDe, a multilingual jailbreak detection framework designed to improve both multilingual robustness and cross-lingual generalization. MLJailDe first introduces a multilingual back-translation data augmentation algorithm to construct a semantically consistent and functionally effective dataset spanning 11 languages, consisting of 2,232 benign and 1,239 jailbreak samples. On this basis, MLJailDe employs relative-distance constraints to reduce cross-lingual representation dispersion and encourage jailbreak prompts with similar intent to form consistent clusters across languages, while an imbalance-aware classification objective is further used to alleviate class imbalance and learn more reliable multilingual decision boundaries. Experimental results show that MLJailDe outperforms state-of-the-art baselines across multiple languages, achieving an F1 score of 98.5%, and obtains an average F1 score of 97.1% on unseen languages, demonstrating strong effectiveness and cross-lingual generalization.

06.
arXiv (CS.CV) 2026-06-19

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

07.
arXiv (CS.AI) 2026-06-12

AgentRivet: an automated system for producing Rivet routines from journal publications

arXiv:2606.13535v1 Announce Type: cross Abstract: Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

08.
arXiv (CS.AI) 2026-06-16

Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

arXiv:2606.16952v1 Announce Type: cross Abstract: The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between "true disclosures", where the system directly reproduces a user's information, and "phantom disclosures", where the system incidentally generates a user's data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training: only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.

09.
Nature (Science) 2026-06-17

Fast formation to reinforce lithium-rich cathodes

Formation in lithium-ion battery manufacturing typically involves low-rate charge–discharge cycles to establish stable electrode–electrolyte interfaces, a time-consuming process (refs. 1–4). Here, our findings on lithium-rich layered oxide cathodes challenge the necessity of conventional formation, which can even shorten battery lifespan. Fast formation, on the other hand, reduces production cost and enhances capacity and stability. Multiscale synchrotron-based techniques show that residual lithium ions after the initial charge are critical for subsequent structural evolution and cycling performance. Deep lithium de-intercalation causes severe structural degradation and capacity loss due to the inherently fragile lithium-deficient matrix. By contrast, the residual lithium ions from fast formation enhance reversibility through a self-pinning effect, preventing pernicious lattice deformation and reinforcing the ion-storage framework. Adjusting the initial charge current density from 0.2 C to 2 C improves reversible capacity by 20% and extends cycle life by more than 36%. This approach can also be extended to other electrode systems, providing insights for more-efficient battery production. Fast formation in lithium-ion batteries outperforms conventional slow formation, lowering costs and improving battery capacity, stability and cycle life, offering broader application to electrode systems.

10.
arXiv (math.PR) 2026-06-16

Phase Transition in Convex Relaxations for Graph Alignment

arXiv:2606.15581v1 Announce Type: cross Abstract: We study the graph alignment problem for correlated Gaussian Orthogonal Ensemble (GOE) matrices, where the goal is to recover a hidden vertex permutation given two correlated symmetric Gaussian matrices $(A, B)$ with correlation $1/\sqrt{1+\sigma^2}$. While the maximum likelihood estimator is information-theoretically optimal, its computation, which reduces to a quadratic assignment problem, is intractable. Motivated by this, we analyze convex relaxations based on minimizing $\|AX - XB\|_F$ over the set of doubly stochastic matrices and the unit hypercube. We show that when the correlation parameter satisfies $\sigma = o(n^{-1/2}/\log^4 n)$, the solution of either relaxation $(X^\star)$ concentrates around the ground-truth permutation matrix $(\Pi^\star)$, i.e., $\|X^\star-\Pi^\star\|_F^2 = o(n)$, implying recovery of all but a vanishing fraction of vertices after simple post-processing. Combined with existing lower bounds, our results precisely characterize that $\|X^\star-\Pi^\star\|_F^2$ transitions from $o(n)$ for $\sigma = \tilde{o}(n^{-1/2})$ to $\Omega(n)$ for $\sigma = \tilde{\Omega}(n^{-1/2})$. In doing so, our analysis significantly tightens prior results and extends them beyond doubly stochastic relaxations.
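
For reference, the doubly stochastic relaxation named in the abstract can be written out explicitly. The symbol $\mathcal{B}_n$ for the set of doubly stochastic matrices and the all-ones vector $\mathbf{1}$ are notation introduced here, not taken from the abstract.

```latex
\min_{X \in \mathcal{B}_n} \; \|AX - XB\|_F,
\qquad
\mathcal{B}_n = \bigl\{ X \in \mathbb{R}^{n \times n} :
  X \mathbf{1} = \mathbf{1},\;
  X^{\top} \mathbf{1} = \mathbf{1},\;
  X_{ij} \ge 0 \bigr\}.
```

Restricting $X$ to permutation matrices, which are the extreme points of $\mathcal{B}_n$, gives the quadratic assignment problem mentioned in the abstract; enlarging the feasible set to $\mathcal{B}_n$ makes the problem convex.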

11.
arXiv (CS.AI) 2026-06-15

Closing the Reflection Gap: A Free Calibration Bonus for Agentic RL

arXiv:2606.14211v1 Announce Type: new Abstract: LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we find a persistent reflection gap: LLM agents tend to mis-assess their own outputs after observing concrete environment feedback – even for questions they correctly answered – and standard RL barely helps due to a credit-assignment mismatch. To close this gap, we propose RefGRPO, a simple yet effective fix that augments standard RL algorithms with two key ingredients: a free calibration bonus computed by contrasting the agent's own reflection with the actual outcome (requiring no additional reward model, LLM judge, or external annotation), and a dynamic schedule on its coefficient. Compared to standard RL baselines, our method simultaneously improves reflection calibration (e.g., reduces underconfidence rate $44.4\% \to 7.7\%$) and task accuracy (e.g., $75.1\% \to 76.5\%$) on text-to-SQL across five benchmarks. The resulting calibrated reflection turns the agent into its own verifier grounded in environment feedback, which further enables (i) better self-improvement that uses reflections as pseudo-rewards without outcome supervision, and (ii) more effective test-time selective prediction by committing only to rollouts flagged as correct.

12.
arXiv (CS.AI) 2026-06-18

A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors

arXiv:2606.19026v1 Announce Type: cross Abstract: Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

13.
arXiv (math.PR) 2026-06-16

Hua-Chen New Theory of Economic Optimization

arXiv:2504.19134v4 Announce Type: replace-cross Abstract: Between 1957 and 1985, Chinese mathematician Loo-Keng Hua pioneered economic optimization theory through three key contributions: establishing the fundamental theorem of economic stability, proving the uniqueness of equilibrium solutions in economic systems, and developing a consumption-integrated model 50 days before his death. Since 1988, Mu-Fa Chen has been working on Hua's theory. He introduced stochastics, namely Markov chains, to economic optimization theory. He updated and developed Hua's model and arrived at a new model (Chen's model), which has become the starting point of a new economic optimization theory. Chen's theory can be applied to economic stability testing, bankruptcy prediction, product ranking and classification, economic prediction and adjustment, and economic structure optimization. Chen's theory can also provide efficient algorithms that are programmable and intelligent. Stochastics is the cornerstone of Chen's theory. There is no overlap between Chen's theory and either existing mathematical economics or the developments in economics that were awarded Nobel Prizes in Economics between 1969 and 2024. The features that distinguish Chen's theory from existing theories are that it is quantitative, calculable, predictable, optimizable, programmable, and potentially intelligent. This survey provides a theoretical overview of the newly published monograph [5rw24]. Specifically, the invariant of the economic structure matrix, also known as Chen's invariant, was first published in this survey.

14.
arXiv (CS.AI) 2026-06-15

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

arXiv:2606.14157v1 Announce Type: cross Abstract: Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $\lambda^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
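
The "Sinkhorn forward pass" referred to in the abstract maps a cost matrix to an entropic optimal transport plan. The sketch below shows the standard Sinkhorn iteration in plain Python on toy data; the function name, the regularization value `eps`, and the cost and marginal values are assumptions, and the inverse step (fitting the cost so that the plan matches observed flows) is not shown.

```python
import math


def sinkhorn(cost, row_marg, col_marg, eps=0.5, iters=500):
    """Entropic OT plan P = diag(u) K diag(v), with K = exp(-cost / eps)."""
    n, m = len(row_marg), len(col_marg)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_marg[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marg[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical toy problem: 2 origin schools, 2 destination schools.
cost = [[0.0, 2.0],
        [2.0, 0.0]]
origins = [0.6, 0.4]        # share of learners leaving each origin
destinations = [0.5, 0.5]   # share of learners arriving at each destination
plan = sinkhorn(cost, origins, destinations)
```

Each step of the loop is a composition of differentiable operations, which is why a cost model can in principle be trained by backpropagating through it in an automatic-differentiation framework.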

15.
arXiv (CS.LG) 2026-06-15

Side-Channel Attacks Bypass Protection in 3D Printers

arXiv:2606.13952v1 Announce Type: cross Abstract: Active Motor Noise Cancellation (AMNC) ships in commercial fused deposition modeling (FDM) 3D printers as a hardware countermeasure against acoustic side-channel attacks that target intellectual property (IP). We present the first empirical evaluation of a deployed AMNC countermeasure, using a public dataset of synchronized acoustic and vibration recordings from two AMNC-equipped Bambu Lab printers across 12 object classes. AMNC fully neutralizes the acoustic channel: classification accuracy is indistinguishable from the 8.33% random baseline. The vibration channel, which AMNC does not target, still leaks. With summary statistics the leak is coarse and amplitude-driven (vibration accuracy approximately 31% pooled, 36-47% within-printer), while the waveform shape carries essentially nothing (frequency-only features at chance). A full-sequence temporal model that ingests the ordered evolution of the print raises accuracy to approximately 61%, and an order-shuffling control (approximately 33%) shows that a substantial component is genuinely sequential and tied to print progression. The leak is device-specific: a classifier trained on one printer transfers near chance to the other. We conclude that AMNC is an acoustic-only defense: vibration remains a partial, geometry-correlated side channel it does not address, but one that does not, on this dataset, support full geometric reconstruction; reconstruction-grade attacks would require the magnetic or power channels AMNC also leaves untouched. We release all code.

16.
arXiv (CS.CV) 2026-06-16

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

17.
arXiv (CS.CV) 2026-06-16

LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

18.
arXiv (CS.LG) 2026-06-19

CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training

arXiv:2510.18784v3 Announce Type: replace Abstract: Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with the quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches the accuracy achieved at 4 bits (W4A4) with the prior best method. The official implementation can be found at https://github.com/IST-DASLab/CAGE .
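
As background for the abstract above, the plain straight-through estimator (STE) that CAGE builds on can be sketched on a single scalar weight. This shows only the baseline STE update, not CAGE's curvature-aware correction, whose form the abstract does not give; the quadratic loss, step size, learning rate, and function names are assumptions.

```python
def quantize(w, step):
    """Uniform quantizer: round w to the nearest multiple of step."""
    return round(w / step) * step


def ste_step(w, target, step, lr):
    """One SGD step on the loss (q - target)^2 using the straight-through estimator.

    The forward pass uses the quantized weight q. In the backward pass the
    quantizer is treated as the identity (dq/dw = 1), so the gradient with
    respect to w is taken to be dL/dq.
    """
    q = quantize(w, step)
    grad_q = 2.0 * (q - target)
    return w - lr * grad_q


# Hypothetical setup: latent weight starts at 0.1, quantization grid of 0.25.
w = 0.1
for _ in range(50):
    w = ste_step(w, target=0.8, step=0.25, lr=0.05)
q_final = quantize(w, 0.25)
```

The latent weight `w` keeps moving even though the loss only sees its quantized value, and the quantized weight settles on the grid point 0.75 near the target 0.8 while a residual loss remains.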

19.
arXiv (CS.AI) 2026-06-18

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

arXiv:2604.06367v2 Announce Type: replace-cross Abstract: Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models suffer from limited autonomous exploration capabilities to reliably solve website security and privacy tasks, and struggle with specific task categories and websites. Crucially, we identify stateful UI elements as a primary reason for agent failure, with toggles causing more than 45% task failure across many models.

20.
arXiv (CS.LG) 2026-06-11

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

arXiv:2606.12299v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

21.
arXiv (CS.AI) 2026-06-19

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

arXiv:2606.20526v1 Announce Type: new Abstract: Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfactual semantics for DeepProbLog programs. Using neural materialization, we reduce fixed-context neural predicates to ordinary ProbLog choices, apply Single World Intervention Programs (SWIPs), and compute counterfactuals by weighted model counting (WMC) over a single transformed program. Under finite grounding and unique-supported-model assumptions, DeepSWIP is exact relative to the learned materialized FCM. The standard quotient-WMC form of ProbLog conditionals identifies active neural probabilities and explains intervention cleaning, calibration sensitivity, and rare-evidence instability. Experiments on MPI3D confirm the transformation against a DeepTwin construction on 12,000 queries and, as predicted, show a 2.14$\times$ inference speedup from avoiding the Twin's endogenous duplication. A SUMO HOV experiment shows that neural calibration degradation biases plug-in estimates, while a correctly scoped randomized-policy AIPW estimator removes most first-order bias for population mean and ATE estimands. Code is at https://github.com/saibib/deep_SWIP.
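
The quotient form of a conditional query, P(q | e) = WMC(q and e) / WMC(e), can be made concrete with a brute-force enumeration over independent probabilistic facts. This is a minimal sketch, not the DeepSWIP implementation and not the ProbLog API: the helper names `wmc` and `conditional` and the two-fact example are assumptions, and real systems use knowledge compilation rather than enumeration.

```python
from itertools import product

def wmc(probs, formula):
    """Weighted model count: total weight of the worlds that satisfy `formula`.

    `probs` maps each independent fact to its probability of being true;
    `formula` takes a dict {fact: bool} and returns a bool.
    """
    names = list(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total

def conditional(probs, query, evidence):
    """P(query | evidence) as a quotient of two weighted model counts."""
    denom = wmc(probs, evidence)
    if denom == 0.0:
        raise ZeroDivisionError("evidence has zero probability")
    return wmc(probs, lambda w: query(w) and evidence(w)) / denom

# Hypothetical facts; in a neural setting these weights would be network outputs.
probs = {"a": 0.3, "b": 0.5}
evidence = lambda w: w["a"] or w["b"]
query = lambda w: w["a"]
print(conditional(probs, query, evidence))
```

Because the evidence count sits in the denominator, small errors in the fact probabilities are amplified when the evidence is rare, which is one way to read the rare-evidence instability mentioned in the abstract.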

22.
arXiv (CS.AI) 2026-06-16

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

arXiv:2606.15810v1 Announce Type: cross Abstract: Large language models deployed as commercial APIs are vulnerable to model extraction attacks, while existing defenses either act too late or degrade utility for legitimate users. We propose Knowledge Trap, a defense that redirects extraction attacks toward low-transferability knowledge through a Honeypot Knowledge Graph (HKG) and breadcrumb-guided exploration. Instead of blocking queries or perturbing outputs, Knowledge Trap consumes the attacker's limited query budget on knowledge with negligible downstream utility while preserving benign-user performance. Experiments in medical and financial domains show that Knowledge Trap reduces surrogate Agreement by 6.2% on average without degrading legitimate-user accuracy, outperforming existing defenses that impose measurable user impact. These results suggest that defending knowledge-space traversal is a practical direction for mitigating LLM extraction attacks.

23.
arXiv (CS.LG) 2026-06-17

Learning Credal Ensembles via Distributionally Robust Optimization

arXiv:2602.08470v3 Announce Type: replace Abstract: Credal predictors are models that are aware of epistemic uncertainty and produce a convex set of probabilistic predictions. They offer a principled way to quantify predictive epistemic uncertainty (EU) and have been shown to improve model robustness in various settings. However, most state-of-the-art methods mainly define EU as disagreement caused by random training initializations, which mostly reflects sensitivity to optimization randomness rather than uncertainty from deeper sources. To address this, we define EU as disagreement among models trained with varying relaxations of the i.i.d. assumption between training and test data. Based on this idea, we propose CreDRO, which learns an ensemble of plausible models through distributionally robust optimization. As a result, CreDRO captures EU not only from training randomness but also from meaningful disagreement due to potential distribution shifts between training and test data. Empirical results show that CreDRO consistently outperforms existing credal methods on tasks such as out-of-distribution detection across multiple benchmarks and selective classification in medical applications.
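
A credal prediction, read as the convex hull of several models' probability vectors, can be summarized by per-class lower and upper probabilities. The sketch below shows only that summary step: it is not CreDRO's training procedure or its uncertainty measure, and the function names and the interval-width score are assumptions for illustration.

```python
def credal_bounds(ensemble_probs):
    """Per-class lower and upper probabilities over an ensemble's predictions.

    Each element of `ensemble_probs` is one model's probability vector. Because
    a class probability is linear in the prediction, its minimum and maximum
    over the convex hull are attained at the ensemble members themselves.
    """
    n_classes = len(ensemble_probs[0])
    lower = [min(p[c] for p in ensemble_probs) for c in range(n_classes)]
    upper = [max(p[c] for p in ensemble_probs) for c in range(n_classes)]
    return lower, upper

def epistemic_width(ensemble_probs):
    """Hypothetical disagreement score: the widest per-class probability interval."""
    lower, upper = credal_bounds(ensemble_probs)
    return max(u - l for l, u in zip(lower, upper))

# Three hypothetical ensemble members predicting over three classes.
predictions = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]
print(credal_bounds(predictions), epistemic_width(predictions))
```

A width near zero means the members agree. Wider intervals flag inputs where the plausible models diverge, which is the signal used for tasks such as out-of-distribution detection or abstaining in selective classification.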

24.
arXiv (math.PR) 2026-06-12

Sticky CIR process with potential: invariant measure and exact sampling

arXiv:2605.13648v4 Announce Type: replace Abstract: We study the sticky Cox–Ingersoll–Ross (CIR) process in one dimension, a diffusion on $[0,\infty)$ with a sticky boundary condition at the origin, arising as the marginal process in a sparse Bayesian inference framework based on Hadamard–Langevin dynamics. For the parameter range $\delta\in(1,2)$, in which the origin is accessible but not absorbing, we prove well-posedness of the process and uniqueness of its invariant measure, which is a mixture of a point mass at zero and a weighted gamma-type density on the interior. We derive an explicit Green's function for the resolvent in terms of confluent hypergeometric functions, and use this to construct an exact sampler for the invariant measure in the zero-potential case. For a non-trivial potential $G$, we establish existence and uniqueness of the tilted invariant measure via a Girsanov change of measure, and develop two sampling algorithms: a Metropolis–Hastings corrected sampler that targets the invariant measure exactly, and a cheaper, biased unadjusted Langevin algorithm (ULA) for a boundary-clamped variant, for which we prove a first-order expansion of the stationary bias with an explicit constant: the leading error is a rank-one transfer of mass $K_\star h|\log h|$ onto the atom, so the total-variation bias is of exact order $h|\log h|$ – independent of $\delta$ – whenever the potential has nonzero boundary drift. Numerical experiments confirm the predicted behaviour: the Metropolis–Hastings sampler achieves the target invariant measure at all step sizes, while the ULA bias follows the proven first-order law, including its constant.

25.
arXiv (CS.CV) 2026-06-11

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide – the most ubiquitous data in pathology – into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.