Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-24

DivRL: Disentangled Self-Similarity Rewards for Diverse Subject-Driven Generation

Subject-driven image generation faces an "Identity-Diversity Paradox", where strong identity preservation often leads to rigid and low-diversity outputs. We propose a post-training framework called DivRL that jointly optimizes identity consistency and structural diversity simultaneously by leveraging disentangled visual features from a robust similarity model. Specifically, we introduce a Negative Self-Similarity Measure (nSSM) to quantify structural diversity, and Visual Semantic Matching (VSM) to evaluate identity consistency. We propose an "Explore-and-Suppress" strategy that treats VSM as a gated constraint: the model freely explores structurally diverse configurations, and only samples that violate the identity threshold are penalized via a quadratic hinge loss. This converts identity preservation from a competing objective into a feasibility constraint, allowing nSSM and VSM to improve jointly. Experiments demonstrate that our method effectively pushes the model to generate both consistent and diverse images and improves structural diversity while maintaining comparable identity consistency through a gated optimization formulation.

02.
medRxiv (Medicine) 2026-06-16

Diurnal variation in brain-derived tau and five other blood-based biomarkers for dementia and their association with cognitive performance

Blood-based biomarkers of dementia are a promising scalable tool for early diagnosis, tracking disease progression, and evaluating therapeutic efficacy. Utility of these biomarkers will not only be dependent on the reliability of their association with pathology but also contingent on their ability to track cognitive status. Previously, we demonstrated diurnal variation in several biomarkers (amyloid beta (A{beta}) 42 and 40, 42/40 ratio, glial fibrillary acidic protein (GFAP), neurofilament light (NfL), and phosphorylated-Tau 217 (p-Tau217)) which has implications for their reliability. Here, we extend these observations to a larger cohort, include brain-derived tau (BD-Tau), which is assumed to be produced exclusively in the brain, and report endocrine measures of circadian rhythmicity. We not only assessed whether these biomarkers vary with time of day, but also whether they associate with daytime function and whether these associations vary with cognitive domain and number of repeated assessments. Data collected in 20 PLWA (72.4{+/-}5.9 years, mean{+/-}SD) and 19 controls (68.9{+/-}9.8 years) were analysed. Participants completed 14 days of home monitoring and one laboratory assessment of sleep and daytime function: mood, daytime sleepiness, reaction time, immediate and delayed memory recall, everyday memory errors. During the 27-hour residential laboratory session, 3-hourly blood samples were collected and analysed for the six blood-based biomarkers of dementia as well as melatonin and cortisol. Rhythmicity of melatonin and cortisol did not differ between groups. P-Tau217 and GFAP (p

03.
medRxiv (Medicine) 2026-06-15

Two Blood-based Endotypes Reveal Divergent Clinical Outcomes of Fibrotic Hypersensitivity Pneumonitis

Rationale: Fibrotic hypersensitivity pneumonitis (fHP) is an antigen-driven, life-threatening interstitial lung disease characterized by heterogeneous radiologic features, clinical outcomes, and treatment responses. Objectives: To identify blood-based fHP endotypes that inform mechanism, prognosis and therapeutic response. Methods: We performed integrative analyses of multi-compartment transcriptomic data derived from whole blood, peripheral blood mononuclear cells, bronchoalveolar lavage, and surgical lung biopsies, alongside circulating plasma proteomics. Multiple clustering algorithms were cross-compared to ensure robustness and reproducibility of endotypes identification. Immune cell composition was inferred using bulk RNA-seq deconvolution and annotated with BAL single-cell RNA-seq. Pathway activities were characterized using Gene Set Enrichment Analysis. Transplant-free survival (TFS) was evaluated for endotype and corticosteroid exposure by Kaplan-Meier methods, with hazard ratios analyzed using multivariable Cox proportional hazards models. Results: Two molecular endotypes, lymphocytic-associated (L-fHP) and non-lymphocytic-associated (N-fHP), were identified and validated. L-fHP showed enrichment of adaptive immune signaling and lymphocyte predominance, whereas N-fHP demonstrated myeloid-cell activation with neutrophil and macrophage predominance. Corticosteroid exposure was associated with worse TFS in L-fHP but not in N-fHP after adjusting for age, sex, and baseline pulmonary function. Compared to L-fHP, N-fHP had poorer baseline pulmonary function, faster 12-month FVC decline, and shorter TFS. N-fHP also exhibited elevated neutrophil-associated markers, including matrix metalloproteinase-9, across paired transcriptomic and proteomic datasets, supporting a neutrophil-driven, cross-compartment disease process. Conclusion: Multi-omic, multi-compartment analysis identifies two reproducible fHP endotypes with distinct clinical outcomes and corticosteroid responses, supporting a precision medicine approach beyond current clinical and radiologic classification.

04.
arXiv (CS.CL) 2026-06-24

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

Authors:

We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secret), and a reliability dimension where every turn must follow a strict JSON schema and an illegal action is silently discarded. The engine is private and each match uses a fresh random map seed and opponent, mitigating the data contamination that affects public benchmarks. Models receive a (near) rule-only prompt with no build-order advice (two tactical seed phrases were present during data collection; see Section 2.7). We benchmark 15 reasoning models across 54 matches and 5,258 actions. Findings: (1) the nuclear rush dominates (78% on the rules-coherent v0.11+ sub-corpus; 85% corpus-wide) with a sole-launcher signature that is largely mechanical under secret-simultaneous launch rules, not a cognitive deterrence failure; (2) military conquest is rare but faster (12.3 vs 18.9 turns); (3) diplomacy is prolific yet almost never consummated; (4) ~58% of illegal actions are fog/state errors, making the illegal-action rate a measure of belief-tracking; (5) – the least established, and the only one we label exploratory – a weak link associates reliability with winning. The corpus is small, unbalanced and not side-swapped, so the ranking is a preliminary descriptive view, not a contribution. Beyond ranking, the turn-by-turn traces of actions and messages make the corpus a lens on how LLMs reason under adversarial uncertainty – their belief-tracking, spontaneous deception, and per-model cognitive "personas" – which we frame as a future research direction. We release the replay format, an isometric viewer and all replays; engine source on request.

05.
arXiv (math.PR) 2026-06-16

An Algebraic Matrix Spencer Theorem

arXiv:2606.16005v1 Announce Type: new Abstract: We develop an algebraic approach to matrix discrepancy based on the representation theory of finite-dimensional C$^*$-algebras. As an application, we resolve a substantial structured special case of the Matrix Spencer conjecture. In particular, we show that for every family of contractions $A_1,\ldots,A_n$ that are contained in a finite-dimensional $C^*$-algebra $\mathcal A$ with $dim_{\mathbb C} (\mathcal A) \lesssim n$, there exists signs $x\in\{\pm1\}^n$ such that $\|\sum_{i=1}^n x_i A_i\| \le O(\sqrt n)$. As a noteworthy special case, our main result also resolves the Group Spencer conjecture of (Bandeira'24). We furthermore prove that Matrix Spencer continues to hold for low-rank perturbations of matrix families coming from an $C^*$-algebra of small dimension.

06.
medRxiv (Medicine) 2026-06-12

Immunologically Optimized Zmp1 Peptides Reveal a Translational Serological Biomarker Platform for Tuberculosis Diagnosis Across Disease Manifestations

Tuberculosis (TB) diagnosis remains challenging, particularly for extrapulmonary TB (EPTB), where invasive sampling, low bacillary burden, and suboptimal sensitivity of nucleic acid-based tests in peripheral specimens hinder timely detection. Here, we report an immunology-driven strategy for biomarker discovery and development of a peptide-based serological assay targeting Mycobacterium tuberculosis zinc metalloprotease-1 (Zmp1). Leveraging fundamental principles of adaptive immunity that antigenic regions containing overlapping B-cell and CD4 T-helper cell epitopes would preferentially generate high antibody titers through linked recognition and cognate T-cell help, we used an immunoinformatics pipeline to identify two nested immunodominant peptide regions within Zmp1 (Mtb-Zp-NT and Mtb-Zp-CT) enriched for overlapping B- and T-cell epitopes. The diagnostic potential of these peptides was evaluated through ELISA-based serological assays. A blinded pilot study (N=137) demonstrated a clear discrimination between active TB and TB-recovered individuals. The assay was subsequently validated in an expanded cohort (N=875) by screening 6,086 individuals, which identified 457 TB-positive cases. The cohort included pulmonary TB (PTB), EPTB, TB-recovered individuals, household contacts, non-specific infections, and healthy controls. Receiver operating characteristic analyses, supported by DeLong and bootstrap comparisons, revealed superior diagnostic performance of the peptide-based assays relative to full-length Zmp1. Mtb-Zp-CT exhibited the highest accuracy (AUC=0.93; specificity >90%), while Mtb-Zp-NT also demonstrated strong discriminatory power (AUC{approx}0.89). These findings establish that the immunologically optimized Zmp1 peptides are highly promising serological biomarkers for TB and EPTB. More broadly, they demonstrate how mechanistically informed epitope selection can accelerate translation of pathogen-specific immune signatures into sensitive, minimally invasive, and potentially point-of-care diagnostic platforms for resource-limited settings.

07.
medRxiv (Medicine) 2026-06-24

Clinical care site data integration reveals heterogeneity in EHR phenotyping and healthcare utilization patterns

Objective: Genomic research using electronic health record (EHR)-linked biobanks is influenced by heterogeneity in the clinical settings (care sites) where encounters occur. We developed two methods leveraging care site data: ClinicScan identifies where phenotype documentation occurs, and ClinicWAS identifies specialty utilization patterns associated with a risk factor. Materials and Methods: We extracted care sites for each clinical encounter at an academic medical center and mapped each to a clinical specialty. ClinicScan summarizes the specialty distribution of a user-specified diagnosis; ClinicWAS fits a logistic regression for each care site to identify specialty encounters associated with a user-specified risk factor. We applied ClinicScan to depression to test whether requiring a psychiatry encounter strengthened the association between a polygenic risk score (PRS) and a depression phenotype, and ClinicWAS to a coronary heart disease (CHD) PRS to identify sites enriched for high-risk patients. Results: Across 64,983,257 encounters, 2,544 care sites mapped to 57 specialties. Most depression diagnoses occurred in primary care (30.3%) and psychiatry (19.8%). Requiring a psychiatry encounter strengthened the PRS-phenotype association (OR=1.30, 95% CI 1.26-1.35) versus two or more diagnosis codes alone (OR=1.21, 95% CI 1.19-1.24). CHD ClinicWAS identified 19 associated care sites, including 5 catheterization labs. Men and women with high genetic risk (PRS[≥]95th percentile) underwent catheterization for CHD 3.1 (1.5-4.6) and 4.6 (2.5-6.7) years earlier than normal-risk participants, respectively. Discussion: Care site data capture phenotype heterogeneity that otherwise distorts EHR-based phenotypes and obscures high-risk subpopulations. Conclusion: Clinical care site data are an under-utilized resource in EHR-linked biobanks.

08.
arXiv (quant-ph) 2026-06-16

Linear algebra at exponential scale via tensor network dimension reduction

arXiv:2606.15350v1 Announce Type: cross Abstract: Many problems in modern scientific computing are challenging because of a curse of dimension, where their mathematical formulation involves objects whose dimension is exponential in the nominal "size" of the problem. Tensor networks can provide a compact representation for exponentially large vectors and matrices that arise in applications, but these representations do not always lead to reliable algorithms. This paper develops and analyzes techniques for randomized dimension reduction of tensor network data. These techniques support a suite of efficient algorithms for provably solving exponential-scale linear algebra problems, including trace estimation and eigenvalue approximation. The paper includes several stylized illustrations from quantum many-body physics with ambient dimension up to $2^{200}$.

09.
arXiv (CS.AI) 2026-06-15

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

arXiv:2604.14892v3 Announce Type: replace-cross Abstract: Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias, they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.

10.
bioRxiv (Bioinfo) 2026-06-24

An atlas-scale generative model for unified representation learning of bulk RNA-seq data

Public bulk RNA-seq repositories contain hundreds of thousands of samples, creating opportunities for large-scale representation learning, but integration across studies remains challenging because of heterogeneous annotations, experimental protocols, and technical variation. While pre-trained foundation models are now widely available for single-cell RNA-seq, comparable resources for bulk RNA-seq remain scarce, motivating a model that learns a unified, tissue-aware representation directly from bulk data. We trained a supervised variational autoencoder (VAE) on a compendium of 118,263 bulk RNA-seq samples that we assembled from TCGA, GTEx, and ARCHS4 and mapped to 42 tissue categories. The model classifies tissue of origin at 94.9% balanced accuracy (weighted F1 96.2%) and compresses 16,115 genes into a 121-dimensional latent space. Tissue identity is the primary organizing axis of the latent space, while source effects remain secondary. To assess the impact of data volume, we constructed training sets at three different scales (38K, 75K, and 118K samples). Our results demonstrated that reconstruction fidelity improved incrementally with each expansion of the dataset, but with diminishing returns. We validated the model on an independent cohort of 734 paediatric tumour samples from TARGET, achieving 84.6% agreement with the expected tissue of origin. The trained model and code are available at GitHub (https://github.com/BIMSBbioinfo/flexynesis_tissue_vae_manuscript) with an interactive web application.

11.
arXiv (CS.LG) 2026-06-17

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

arXiv:2602.03420v2 Announce Type: replace-cross Abstract: Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

12.
arXiv (CS.CL) 2026-06-11

Scenario-based Probing and Steering Cultural Values in Large Language Models–Extended Version

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.

13.
arXiv (quant-ph) 2026-06-19

On chip, multifunctional quantum sensing using single spins in a van der Waals crystal

arXiv:2606.19978v1 Announce Type: new Abstract: Nanoscale thermometry and magnetometry are in high demand across a wide range of scientific and technological applications. In this context, optically addressable spins in solids have emerged at the forefront of on-chip quantum sensing. However, simultaneous quantum sensing of multiple parameters (e.g., temperature and magnetic field) using the same spin sensor remains challenging due to cross-sensitivity to multiple physical quantities. Here, we demonstrate independent dual sensing of temperature and magnetic field using single quantum emitters in hexagonal boron nitride (hBN). We experimentally verify the independent response of the zero-phonon line (ZPL) position to temperature and of optically detected magnetic resonance (ODMR) to magnetic fields. Furthermore, we demonstrate local temperature sensing of a microcircuit while simultaneously measuring an external magnetic field. Our results establish quantum emitters in hBN as a robust platform for multifunctional quantum sensing under realistic operating conditions.

14.
arXiv (CS.CV) 2026-06-24

VisChronos: Revolutionizing Image Captioning Through Real-Life Events

This paper aims to bridge the semantic gap between visual content and natural language understanding by leveraging historical events in the real world as a source of knowledge for caption generation. We propose VisChronos, a novel framework that utilizes large language models and dense captioning models to identify and describe real-life events from a single input image. Our framework can automatically generate detailed and context-aware event descriptions, enhancing the descriptive quality and contextual relevance of generated captions to address the limitations of traditional methods in capturing contextual narratives. Furthermore, we introduce a new dataset, EventCap (https://zenodo.org/records/14004909), specifically constructed using the proposed framework, designed to enhance the model's ability to identify and understand complex events. The user study demonstrates the efficacy of our solution in generating accurate, coherent, and event-focused descriptions, paving the way for future research in event-centric image understanding.

15.
arXiv (CS.CL) 2026-06-16

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge and report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter: ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"). Critically, the panel shares no judge with the prior release (five paid commercial APIs replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, a reliability-quantified benchmark whose central classification axis is shown stable across corpus expansion and judge-panel replacement.

16.
arXiv (CS.AI) 2026-06-18

Robust Regularized Policy Iteration under Transition Uncertainty

arXiv:2603.09344v3 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.

17.
arXiv (math.PR) 2026-06-18

Geometric obstructions to Lipschitz transport between weighted Hessian $\mathrm{CD}(\kappa,\infty)$ manifolds

arXiv:2606.11085v2 Announce Type: replace Abstract: We construct a weighted Riemannian manifold $(\mathbb R^2,g,\mu)$ satisfying $\mathrm{CD}(1/2,\infty)$, the curvature-dimension condition, with the following property: if $\gamma$ denotes a centered Gaussian measure on $\mathbb R^2$, then there is no Lipschitz map $T:(\mathbb R^2,\|\cdot\|) \to (\mathbb R^2,g)$ satisfying $T_\#\gamma=\mu$. Building on this, we prove a Weyl-type asymptotic law for the eigenvalues of the weighted Laplacian $-\Delta_{g,\mu}$ and show that they are asymptotically negligible when compared to the eigenvalues of $-\Delta_{\gamma}$. These results give strong counterexamples to two questions of E. Milman and complement the recent counterexample of Aryan.

18.
arXiv (math.PR) 2026-06-17

Optimal Impulse Control for Cyber Risk Management

arXiv:2410.17706v2 Announce Type: replace-cross Abstract: We explore an optimal impulse control problem wherein an electronic device owner strategically calibrates protection levels against cyber attacks. Utilizing epidemiological compartment models, we qualitatively characterize the dynamics of cyber attacks within the network. We determine the optimal protective measures against effective hacking by formulating and solving a stochastic control problem with optimal switching. We demonstrate that the value function for the cluster owner constitutes a viscosity solution to a system of coupled variational inequalities associated with a fully coupled reflected backward stochastic differential equation (BSDE). Furthermore, we devise a comprehensive algorithm alongside a verification procedure to ascertain the optimal timing for network protection across various cyber attack scenarios. Our findings are illustrated through numerical approximations employing deep Galerkin methods for partial differential equations (PDEs). We visualize the optimal protection strategies in the context of two distinct attack scenarios: (1) a constant cyber attack, (2) an exogenous cyber attack strategy modeled with a Poisson process.

19.
arXiv (CS.AI) 2026-06-11

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

arXiv:2606.11272v1 Announce Type: cross Abstract: Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings-such as healthcare, industrial IoT (IIOT), cybersecurity, and smart cities-data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

20.
arXiv (CS.CV) 2026-06-25

HiT-JEPA: A Hierarchical Self-supervised Trajectory Embedding Framework for Similarity Computation

The representation of urban trajectory data plays a critical role in effectively analyzing spatial movement patterns. Despite considerable progress, the challenge of designing trajectory representations that can capture diverse and complementary information remains an open research problem. Existing methods struggle in incorporating trajectory fine-grained details and high-level summary in a single model, limiting their ability to attend to both long-term dependencies while preserving local nuances. To address this, we propose HiT-JEPA (Hierarchical Interactions of Trajectory Semantics via a Joint Embedding Predictive Architecture), a unified framework for learning multi-scale urban trajectory representations across semantic abstraction levels. HiT-JEPA adopts a three-layer hierarchy that progressively captures point-level fine-grained details, intermediate patterns, and high-level trajectory abstractions, enabling the model to integrate both local dynamics and global semantics in one coherent structure. Extensive experiments on multiple real-world datasets for trajectory similarity computation show that HiT-JEPA's hierarchical design yields richer, multi-scale representations. Code is available at: https://anonymous.4open.science/r/HiT-JEPA.

21.
arXiv (CS.CV) 2026-06-12

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

22.
arXiv (CS.CL) 2026-06-12

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

23.
arXiv (CS.AI) 2026-06-19

Computational Identifiability

arXiv:2606.19361v1 Announce Type: cross Abstract: Identification conditions describe the computability of a target query or parameter of interest as a function of the type and amount of information available. In causal identification, this information is often expressed in the form of a causal graph, and data are observed or collected for some subset of variables in the graph. Target queries may be for a single effect alone or for a class of effects in a given model. The derivation of an identification algorithm then defines mathematically the process by which the desired causal effect(s) can be uniquely determined, theoretically, in expectation. Identifiability in expectation, or 'theoretical identifiability,' generally assumes asymptotic properties, infinite data, or other mathematically idealized conditions. In this paper, we explore a fundamental distinction between this theoretical, idealized notion of identifiability and a proposed alternative that is computation-bound. The framework we propose - 'computational identifiability' - is to instead define a finite computational search procedure for an empirical estimator. If this process finds an estimator empirically, within a desired error tolerance, then identifiability is satisfied, conditional on the specified assumptions of the search (i.e., a prior distribution over the parameters) and conditional on the search procedure itself. Through several experiments, we demonstrate how this framework allows us to answer fine-grained, practical identification questions, such as identification with small finite samples, with ambiguous graphical criteria, with mixed observational-interventional data, and across counterfactual data and estimands. Code is available at https://github.com/lbynum/metadentify.

24.
arXiv (CS.CV) 2026-06-25

OrthoTrack: Continuous 6-DoF UAV Trajectory Estimation Anchored in Public Orthophotos

Continuous 6-DoF pose estimation is essential for autonomous UAV operations. Yet, existing visual odometry and SLAM methods accumulate drift and yield only relative, up-to-scale trajectories. Single-frame geo-localization, in turn, discards temporal continuity and remains too slow for real-time use. We present OrthoTrack, a training-free system that estimates continuous 6-DoF UAV trajectories using only publicly available orthophotos and surface models as a map prior. OrthoTrack matches keyframes against the orthophoto and lifts correspondences to metric 3D via the surface model. It then propagates these map-anchored correspondences to intermediate frames with optical flow, producing absolute, metrically scaled poses at every frame without GPS or post-hoc alignment. We also introduce the MovingDrone Dataset, a large-scale benchmark pairing photorealistic UAV sequences with dense 6-DoF ground truth and co-registered multi-modal geodata including multi-temporal orthophotos. On MovingDrone and real-world benchmarks, OrthoTrack runs in real time on a single GPU. It outperforms all baselines by a large margin, even those receiving oracle scale and alignment. By relying on publicly available geodata, OrthoTrack enables deployment to new regions without site-specific adaptation.

25.
arXiv (CS.CV) 2026-06-16

Context-Aware RL for Agentic and Multimodal LLMs

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query–answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query–context–answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.