Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-19

Influence-Guided Concolic Testing of Transformer Robustness

arXiv:2509.23806v2 Announce Type: replace-cross Abstract: Concolic testing for neural networks alternates concrete execution with constraint solving to search for inputs that flip model decisions. We present a concolic tester for Transformer classifiers that uses SHAP estimates to rank pending path predicates by their impact on the current prediction. To support self-attention with multiple heads in execution backed by SMT solving, we implement attention semantics in pure Python that are compatible with the solver and make the softmax boundary explicit by concretizing exponentiation arguments. We evaluate our method on CIFAR-10 across three compact Transformer classifiers, ResNet18, and VGG16 under a one-pixel budget and a 900s horizon. Across the 500 model–input pairs in this matched comparison, our method achieves 60% success, compared with 15% for a differential evolution baseline that treats the model as a black box. In the primary two-layer Transformer branch-ordering study, SHAP-based predicate prioritization raises success from 56% to 60% and reduces median attack time by 51%. These results show that influence-guided path exploration can make concolic testing a practical way to find adversarial examples in Transformer models.

02.
arXiv (CS.CL) 2026-06-16

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Authors:

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

03.
arXiv (quant-ph) 2026-06-12

Efficient certification of intractable quantum states with few Pauli measurements

arXiv:2511.07300v2 Announce Type: replace Abstract: Efficient verification of quantum computational resources is crucial as experiments advance toward fault-tolerance. Universal quantum computation can be achieved by consuming resource states through simple Pauli measurements, yet a significant gap remains between states that are easy to certify and those required for universality. We focus on Clifford-enhanced Product States, a class of resource states obtained by applying Clifford circuits to a product of single-qubit, potentially magic, states. While essential for universal computation, the certification of such states has previously relied on query oracles that are \#P-hard to implement, leaving their efficient, oracle-free verification an open challenge. In this work, we demonstrate that such classically intractable resource states can be efficiently verified using only Pauli measurements. Our protocol achieves sample- and time-efficiency in both i.i.d.\ and adversarial settings. This work fills a gap in Pauli-based certification, providing a new practical pathway to verify resource states that drive universal Pauli-based quantum computation.

04.
arXiv (CS.AI) 2026-06-16

Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems

arXiv:2606.15242v1 Announce Type: cross Abstract: Skills are becoming the capability layer through which LLM agents turn plans into actions, but their use introduces security risks such as data leakage, unauthorized operations, and tool misuse. Existing vetting usually evaluates each skill in isolation, while real agent tasks often invoke multiple skills in a shared execution context. This creates Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path. We introduce SCR-Bench to evaluate this risk in controlled, sandboxed skill environments. Rather than relying only on textual intent or surface behavior, SCR-Bench records downstream state changes and path-level outcomes across composed skill executions. It contains three sub-benchmarks: SCR-CapFlow for capability-flow composition, SCR-TrustLift for trust-transfer composition, and SCR-AuthBlur for authorization-confusion composition. Across SCR-Bench, composed paths expose risks that are largely absent under isolated evaluation. In SCR-CapFlow, attack success rate reaches 33.6 percent under composition, compared with near-zero isolated baselines. In SCR-TrustLift, attack success rate exceeds 96.5 percent on four of five backends. In SCR-AuthBlur, the risky-approval rate increases by 71.8 percent relative to the L0 isolated baseline under the L1 context setting. These results show that agent skill security should be assessed at the level of activated paths rather than isolated artifacts. SCR and SCR-Bench provide a foundation for path-aware risk evaluation and defense in LLM agent skill ecosystems. Benchmark: https://github.com/saint-viperx/SCR_Bench.

05.
arXiv (CS.AI) 2026-06-15

Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

arXiv:2606.13934v1 Announce Type: new Abstract: Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition - toy programmatic settings, multihop reasoning, multilingual factual recall - we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.

07.
arXiv (CS.LG) 2026-06-16

Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection

arXiv:2510.24043v4 Announce Type: replace Abstract: This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.

08.
arXiv (quant-ph) 2026-06-15

Fulde-Ferrell superfluids in an asymmetric three-component Fermi Gas

arXiv:2602.24006v2 Announce Type: replace-cross Abstract: An asymmetric three-component Fermi gas, featuring Raman-induced spin-orbit coupling between the first and second components and contact interaction only between the first and third components, introduces both spin-orbit coupling and population imbalance-two mechanisms known to stabilize the Fulde-Ferrell superfluids.We systematically study Fulde-Ferrell superfluids in an asymmetric three-component Fermi gas { in two dimensions and at zero temperature} by finding the global minima of the thermodynamic potential. We reveal a new class of composite Fulde-Ferrell superfluids that emerges when strong spin-orbit coupling generates a double-well structure in momentum space within the lower spin-orbit-coupled band. The key features of these composite superfluids are identified.

09.
medRxiv (Medicine) 2026-06-24

Repetitive Transcranial Magnetic Stimulation over Primary Somatosensory Cortex for Upper Limb Function in Stroke: An Exploratory Randomized Controlled Trial

Background: Stroke often causes Upper Limb (UL) functional impairments. The Primary Somatosensory Cortex (S1) plays an important role in motor learning. Repetitive Transcranial Magnetic Stimulation (rTMS) over S1 could enhance UL recovery. We aimed to explore its preliminary effects on UL motor activity and function post-stroke. Methods: An exploratory parallel-group randomized controlled trial in people with chronic stroke (>3 months) and moderate hemiparesis was conducted. Participants received 20 sessions of active or sham 5Hz rTMS over affected S1, with Robot-Assisted Therapy and Task-Oriented Training, 5 days/week for 4 weeks. The primary endpoint was UL motor activity (Action Research Arm Test, ARAT). Secondary measures were the UL Fugl-Meyer Assessment (UL-FMA) and sensory outcomes. Results: The baseline-adjusted mean difference (MD) in ARAT was 4.05 points [0.78, 7.33], favoring active stimulation. Secondary measures did not favor active stimulation (UL-FMA: MD = 2.62 [-1.51, 6.76]; sensory outcomes showed no between-group differences). Conclusion: High-frequency rTMS over S1 may enhance UL motor activity (ARAT), but no evidence for motor impairment (UL-FMA) or sensory domains was found. Compensation rather than restoration may underlie this improvement. Stimulation targets should match the intended recovery domain, although larger trials are needed to confirm these preliminary findings.

10.
arXiv (CS.AI) 2026-06-11

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

arXiv:2606.12240v1 Announce Type: cross Abstract: Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.

11.
arXiv (CS.CL) 2026-06-24

LangMAP: A Language-Adaptive Approach to Tokenization

Language-specific tokenizers improve tokenization quality and the downstream performance of models on those languages. However, using such a tokenizer comes at a cost: either a new model must be trained from scratch, or the vocabulary of an existing pretrained model must be adapted. We propose Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, a tokenization scheme that extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Notably, LangMAP can be used when training a multilingual language model from scratch or to adapt a pretrained model's tokenizer to individual languages without changing its vocabulary. While language labels are required at training time, a key feature of the algorithm is that it then performs language-specific tokenization at inference without knowledge of the input's language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and, for all coding languages tested, alignment with abstract syntax tree (AST) leaf boundaries. In fine-tuning experiments, results are mixed: LangMAP improves target-language grammatical acceptability (MultiBLiMP) on the languages tested; its benefits are less consistent on knowledge-related tasks (Global-PIQA, Belebele).

12.
arXiv (CS.AI) 2026-06-19

One Probe Won't Catch Them All: Towards Targeted Deception Detection

arXiv:2602.01425v2 Announce Type: replace Abstract: Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.

13.
arXiv (CS.LG) 2026-06-25

How Modular Is a Frontier Mixture-of-Experts? A Pre-registered Causal Test in Which Apparent Expert Modularity Mostly Dissolves

arXiv:2606.25092v1 Announce Type: new Abstract: Sparse Mixture-of-Experts (MoE) models route each token to a few of many experts, inviting the hypothesis that experts form functional modules tied to capabilities or languages. We test this causally on Command A+, a frontier open-weights MoE (218B total / 25B active; 128 experts, 8 active, +1 shared). We build a routing-mass atlas, pre-register six family-to-axis hypotheses before any intervention, and ablate each family at inference time against a size-matched random-expert null, measuring whether it selectively breaks its own axis (worst off-target effect at most one third of on-target). Crucially, we test the same families under four metrics and a held-out, independent-corpus run with bootstrap confidence intervals. Our finding is cautionary: robust functional modularity is rare and measurement-dependent. Of six pre-registered families, only one, the Arabic-language family, is a clean selective module that survives an independent corpus and a conservative statistical bar (1/6; a more permissive pre-registered point rule admits 3/6, but that count is threshold-sensitive). Every other family has a real causal effect yet fails selectivity, and its apparent modularity flips with the measurement: with the corpus, the metric, and the statistical bar. A positive control on Qwen3-30B-A3B recovers its published disjoint structure, confirming the method detects modularity when present. The verdict reproduces on the un-quantized BF16 model, ruling out a 4-bit quantization artifact. We conclude that ablation-based modularity verdicts are not safe unless the corpus, metric, and statistical bar are controlled. We release the atlas and ablation data.

14.
arXiv (quant-ph) 2026-06-12

A Robust Strontium Tweezer Apparatus for Quantum Computing

arXiv:2601.16564v2 Announce Type: replace-cross Abstract: Neutral atoms for quantum computing applications show promise in terms of scalability and connectivity. We demonstrate the realization of a versatile apparatus capable of stochastically loading a 5x5 array of optical tweezers with single $^{88}$Sr atoms featuring flexible magnetic field control and excellent optical access. A custom-designed oven, spin-flip Zeeman slower, and deflection stage produce a controlled flux of Sr directed to the science chamber. In the science chamber, featuring a vacuum pressure of $3 \times 10^{-11}$ mbar, the Sr is cooled using two laser cooling stages, resulting in $\sim 3 \times 10^5$ atoms at a temperature of 5(1) $\mu$K. The optical tweezers feature a $1/e^2$ waist of 0.81(2) $\mu$m, and loaded atoms can be imaged with a fidelity of $\sim 0.997$ and a survival probability of $0.99^{+0.01}_{-0.02}$. The atomic array presented here forms the core of a full-stack quantum computing processor targeted for quantum chemistry computational problems.

15.
arXiv (CS.LG) 2026-06-19

Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses

arXiv:2606.19891v1 Announce Type: new Abstract: We study adversarial bandit optimization in which the loss functions may be non-convex and non-smooth. In each round, the learner selects an action and observes only the loss incurred at that action. The loss consists of an underlying convex and $\beta$-smooth component and an adversarial perturbation that may be chosen after observing the learner's action. The perturbations are subject to a global budget controlling their cumulative magnitude over time. This framework extends the globally budgeted, post-action perturbation model from underlying linear losses to general convex and $\beta$-smooth losses. For this broader class, we establish expected regret guarantees that explicitly characterize the effect of the perturbation budget. To establish these guarantees, we modify a standard bandit optimization algorithm and develop an analysis that controls the additional regret caused by the perturbations. In the absence of perturbations, our results reduce to regret guarantees for the standard bandit convex optimization setting with $\beta$-smooth losses.

16.
medRxiv (Medicine) 2026-06-16

Care Delivery Gap framework: a proof-of-concept patient-reported measure of guideline-referenced care-process omissions in sickle cell disease

Abstract Background:Sickle cell disease (SCD) is concentrated in sub-Saharan Africa, where delivery of guideline-referenced care remains challenging. Current evaluation approaches rely largely on access indicators and clinical outcomes, which do not directly measure care delivery. We developed the Care Delivery Gap (CDG) framework, a patient-reported approach for identifying care-process omissions, and conducted a proof-of-concept study to assess feasibility and explore variation across income strata. Methods: We conducted a cross-sectional framework-development study involving a proof-of-concept sample of 52 individuals with SCD or caregivers recruited through clinics and moderated SCD communities across Africa, North America, and Europe between June 2025 and March 2026. The CDG framework assessed patient-reported omissions in specialist involvement, follow-up continuity, cardiovascular screening, and biochemical surveillance. Analyses were descriptive. Results: Substantial multi-domain care-process omissions were identified despite high reported healthcare engagement. Across geographic income strata, cardiovascular screening was reported by 4/35 (11%) LMIC versus 16/17 (94%) HIC participants, and regular follow-up within the preceding 12 months by 14/35 (40%) versus 16/17 (94%), respectively. High CDG scores, representing 1 omissions across three or four domains, occurred in 20/35 (57%) LMIC compared with 1/17 (6%) HIC participants. Similar disparities were observed across specialist review and vitamin B12 surveillance domains. Conclusion: A structured patient-reported framework identified multi-domain omissions in guideline-referenced SCD care, including among individuals reporting healthcare access. The divergence between access indicators and reported care delivery suggests that service contact alone may not reflect care quality. The framework provides a feasible foundation for future process-level quality measurement in high-burden settings.

17.
arXiv (CS.LG) 2026-06-16

Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data

arXiv:2502.19544v3 Announce Type: replace Abstract: Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two techniques: i) experience rehearsal and ii) execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves nearly twice the aggregate score of learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

18.
arXiv (CS.CL) 2026-06-11

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

We introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1\% vs.~24.4\% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29\% WER and 6.97\% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.

19.
Nature Medicine 2026-06-10

Dual-target gene therapy in Parkinson’s disease: a multicenter phase 1 trial

Authors:

Restoring striatal dopamine synthesis is a promising gene therapy strategy for Parkinson’s disease. Previous adeno-associated virus-mediated aromatic L-amino acid decarboxylase (AADC) monotherapies remain dependent on exogenous levodopa, whereas multigene delivery is constrained by strict adeno-associated virus packaging limits. A ‘dual approach’ targeting the two rate-limiting enzymes, tyrosine hydroxylase (TH) and AADC, offers the potential for autonomous dopamine synthesis. We report the 12-month primary safety and tolerability outcomes of a multicenter, open-label, dose-escalation, phase 1 trial evaluating BBM-P002, a new adeno-associated virus vector—AAVT42—codelivering constitutively active TH and AADC. Ten participants with moderate-to-advanced Parkinson’s disease were enrolled and received bilateral intraputaminal infusions across doses of 4.0 × 1011 vg (Cohort 1; n = 1), 6.0 × 1011 vg (Cohort 2; n = 2), 1.0 × 1012 vg (Cohort 3; n = 2) and 1.2 × 1012 vg (Cohort 4; n = 5). The trial achieved its primary outcome, as BBM-P002 demonstrated a favorable safety and tolerability profile within 12 months post-treatment. No dose-limiting toxicities or drug-related serious adverse events occurred. A total of 23 adverse events were reported, all judged unrelated to BBM-P002 and primarily mild and transient. Systemic toxicity and clinically meaningful immunogenicity were absent. In conclusion, intraputaminal delivery of BBM-P002 was safe and well tolerated in this phase 1 trial, supporting continued clinical development. ClinicalTrials.gov registration: NCT05822739 . Phase 1 results reveal that BBM-P002, a dual-target gene therapy co-delivering TH and DDC, is safe and well tolerated in Parkinson’s disease, with 12-month motor improvements signaling therapeutic potential.

20.
arXiv (CS.AI) 2026-06-18

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

arXiv:2606.18988v1 Announce Type: new Abstract: Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end to end black–box paradigms. These methods suffer from a severe lack of interpretability failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle, cross modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step–by–step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization(VAC–GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy–to–hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi dimensional, process aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

21.
arXiv (CS.AI) 2026-06-12

GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models

arXiv:2606.12821v1 Announce Type: new Abstract: Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workflows remain unvalidated: no benchmark evaluates agents operating through structured tool calling against real APIs. We introduce the GeoNatureAgent Benchmark, the first benchmark for environmental analysis agents that operate via structured tool calls to a production-style geospatial API. It comprises 93 tasks across 18 categories, covering municipality analysis, multi-turn conversation, spatial reasoning, cross-indicator synthesis, error handling and recovery, ranking, comparison, multilingual understanding, habitat analysis, and task rejection. Tasks are evaluated against an open, self-hostable API serving three environmental indicators across Spain and Portugal via sixteen tools. We evaluate seven LLMs (Claude Sonnet 4, DeepSeek V3.2, GLM-5, Gemini 2.5 Pro, Qwen3-235B, GPT-OSS-120B, Llama 4 Scout) under three temperature-1.0 seeds, reporting capability and per-case cost as orthogonal axes. We find: (1) Claude Sonnet 4 leads at 60.8% +/- 0.8%, followed by DeepSeek V3.2 at 56.3% +/- 3.1%, with no other model above 51%; (2) the cost-accuracy Pareto frontier is occupied mostly by open-weight models, with DeepSeek V3.2 offering 93% of Claude's capability at 11x lower cost ($0.011/case); (3) comparison tasks remain universally unsolved (0% on close-value comparisons), exposing systematic reasoning limits; and (4) structured tool calling against a real API is more discriminative than general-purpose GIS benchmarks, with accuracies 25-35 points lower. We further show extensibility by integrating BigEarthNet V2 land cover for Portugal alongside Spanish CO2 and erosion indicators. The benchmark, harness, and self-hostable API are publicly available.

22.
arXiv (CS.LG) 2026-06-19

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

arXiv:2606.20108v1 Announce Type: cross Abstract: Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning ``what is degradation" from human-annotated labels, EFIQA learns ``what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

23.
arXiv (CS.LG) 2026-06-11

Deep Learning of Solver-Aware Turbulence Closures from Nudged LES Dynamics

arXiv:2604.23874v3 Announce Type: replace-cross Abstract: The differentiable physics paradigm may be leveraged as an a-posteriori approach for discovering turbulence closure models by embedding a neural network parameterization directly inside the solver and optimizing it given potentially sparse target data. This addresses a key limitation of a-priori learning where direct numerical simulation (DNS) data is used to approximate the subgrid stress with the assumption of a low-pass filter. Closures trained in this a-priori manner frequently lead to unstable deployments due to the mismatch between the assumed filter and the effect of numerical discretizations and coarse-graining. In comparison, while typically stable during deployment, a-posteriori learning incurs high computational costs due to the need to backpropagate through a large eddy simulation (LES) solver. Furthermore, a-posteriori methods are challenging to apply broadly since they require significant modification of existing solvers. Finally, both approaches are limited when generalization is desired across different numerical schemes with their implicit filtering characteristics. In this work, we present a deep-learning approach for turbulence closure modeling built on the continuous data assimilation framework. Our approach enables the a-priori training of closures using sparsely observed DNS data without modifying or differentiating through the LES solver, while preserving stability during deployment for the recovery of invariant statistics. We focus on the model's ability to adapt to different discretizations by explicitly conditioning it on the numerical scheme. We use two- and three-dimensional canonical cases to test our framework and show that the learned correction systematically tracks the discretization error of the coarse solver.

24.
arXiv (CS.CV) 2026-06-24

Spherical-to-ERP Epipolar Rectification for Single-Axis Disparity in 360 Stereo

Omnidirectional stereo images provide full-surround perception but violate the geometric assumptions of classical disparity estimation: in spherical or fisheye views, epipolar correspondences follow curved great-circle paths, producing two-dimensional displacements that cannot be treated as single-axis disparity before geometric rectification. In this work, we adopt a standard spherical-to-equirectangular (ERP) projection as a preprocessing step, which straightens epipolar curves and restores a one-dimensional disparity structure - horizontal for left-right rigs and vertical for top-bottom rigs. Building on our previously introduced RAFT + Epipolar-Aligned Channel Selection (EACS) framework, originally developed for rectilinear and ERP stereo, we examine whether the same modular pipeline remains accurate when the input originates from spherical stereo imagery. After ERP projection, dense optical flow from RAFT is reduced to disparity by retaining only the baseline-aligned flow component. Experiments on synthetic fisheye stereo datasets show that this spherical-to-ERP-to-RAFT+EACS pipeline produces accurate, smooth, and structurally consistent disparity maps at real-time speed. These findings confirm that established ERP preprocessing can be effectively combined with our earlier RAFT+EACS method to enable practical, interpretable, and efficient disparity estimation from spherical stereo, providing a straightforward pathway for extending conventional stereo pipelines to 360 imaging.

25.
arXiv (CS.LG) 2026-06-25

What's in an Earth Embedding? An Explainability Analysis of Location Encoders

arXiv:2606.24997v1 Announce Type: new Abstract: Geographic implicit neural representations (INRs) learn to map any coordinate on Earth to a location embedding, implicitly encoding geospatial data into the weights of a neural network. Location embeddings are widely used off the shelf as general-purpose geospatial representations, yet users lack principled tools to audit what geographic or semantic information these embeddings capture. In this work, we analyze the information content of geographic INRs through their location embeddings. We decompose these embeddings into human-interpretable features$\unicode{x2014}$namely, (i) sparse latent concepts, (ii) natural language concepts, and (iii) visual features. The latent concept embeddings are learned using sparse autoencoders. To recover natural language concepts, we apply sparse linear concept embeddings (SpLiCE) over a predefined geospatial dictionary. Finally, visual features are extracted using saliency maps derived from CLIP Surgery. We show that location embeddings can be decomposed into human-interpretable representations while retaining high reconstruction capability, revealing interpretable geographic structures such as forests, deserts, and urban features. Across methods, sparse decompositions expose systematic differences in encoded information, ranging from urban structures to broader biome and climate signals, and pretraining-space saliency maps further highlight complementary features such as roads and landmarks. We hope this work provides a first step toward interpretable geospatial representations.