Academic Intelligence · Curated Daily

Explore the Frontiers of Global Academic Research

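AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.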

01.
arXiv (CS.CV) 2026-06-11

Slots, Transitions, Loops: Learning Composable World Models for ARC

ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

02.
arXiv (CS.LG) 2026-06-17

Damage Adaptation in Seconds for Architected Materials

arXiv:2606.17394v1. Adaptation to damage and in-situ physical repairs is essential for long-term robot autonomy, yet challenging outside of narrowly defined and well-anticipated bounds. In this work we proprioceptively adapt to catastrophic damage in soft-actuated systems in under one minute. Architected materials are well equipped for adaptation: actuator failure occurs gradually rather than acutely, and damage can be described in a low-dimensional, discrete coordinate space. Surprisingly, latent damage representations combined with a simple yet robust ensemble method are sufficient for adapting to unseen damage in real time. Moreover, we identify conditions under which exponential sample complexity collapses to linear sample complexity for learned representations of architected materials, a concrete advantage over rigid components or continuum soft mechanisms. We demonstrate LEAP, our method for adaptive proprioception, via a tracing task for a 6DoF soft wrist based on Handed Shearing Auxetic (HSA) actuators. Our algorithm is able to adapt to cuts, burns, and actuator repairs, enabling simulation-free real-time adaptation that is critical for realizing the promise of soft robots outside the lab. Videos and more information are available at https://murpheylab.github.io/leap.

03.
arXiv (CS.AI) 2026-06-12

A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)

arXiv:2605.02249v2. We investigate the belief revision problem in epistemic planning, i.e., what the beliefs of all agents in a multi-agent system will be after an agent gains the belief in some state property. Based on the standard representation in epistemic planning of agents' beliefs via a single multi-agent Kripke model, we generalize the classical AGM belief revision postulates to the multi-agent setting, with the aim of providing a formal framework for evaluating dynamic epistemic reasoning frameworks in which the beliefs of all agents as the result of actions are computed. As an example of a simple operator that satisfies all of the generalized AGM postulates, we present generalized full-meet multi-agent belief revision. We moreover define a generalization of the standard postulates for iterated revision, present a more sophisticated, event-model-based revision operator, and discuss the potential issues in defining an epistemic operator on Kripke models that can satisfy all of the generalized postulates for iterated multi-agent belief revision.

04.
arXiv (CS.LG) 2026-06-16

The Data Manifold under the Microscope

arXiv:2606.15760v1. A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and a sandbox for probing theoretical assumptions. To illustrate its use, we present two application studies, namely assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking the layer-wise geometry of a $\beta$-VAE, highlighting the behavior of current bounds and the value of controlled benchmarks for guiding and validating future theory. A reference implementation is available at https://github.com/koulakis/manifold-microscope.
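
As a rough illustration of what a finite-difference geometric estimator does on densely sampled data, the sketch below estimates the curvature of a planar circle, where the exact answer (1/radius) is known. It is not the paper's estimator: the `circle` and `curvature_fd` helpers, the step size `h`, and the choice of a 2D curve are all assumptions made for this example.

```python
import math

def circle(radius, t):
    """Point on a circle of the given radius at parameter t (radians)."""
    return (radius * math.cos(t), radius * math.sin(t))

def curvature_fd(curve, t, h=1e-3):
    """Estimate the curvature of a planar curve at t with central differences.

    Uses kappa = |x'y'' - y'x''| / (x'^2 + y'^2)^(3/2).
    """
    x0, y0 = curve(t - h)
    x1, y1 = curve(t)
    x2, y2 = curve(t + h)
    # First derivatives (central difference).
    dx, dy = (x2 - x0) / (2 * h), (y2 - y0) / (2 * h)
    # Second derivatives (central second difference).
    ddx, ddy = (x2 - 2 * x1 + x0) / h**2, (y2 - 2 * y1 + y0) / h**2
    return abs(dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5

radius = 2.0  # hypothetical ground-truth geometry
kappa = curvature_fd(lambda t: circle(radius, t), t=0.7)
print(f"estimated curvature: {kappa:.6f}, analytic: {1 / radius:.6f}")
```

For a circle the reach equals the radius, so the same estimate also gives a reach of roughly 1/kappa in this toy case; real data manifolds need the dense, axis-aligned sampling the abstract describes before such differences are meaningful.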

05.
arXiv (CS.CL) 2026-06-19

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT
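
The recall@10 figures above are retrieval metrics over embedding similarity. A minimal sketch of how recall@k can be computed from cosine similarities follows; the toy 2-D vectors, the `gold` indices, and the function names are hypothetical and stand in for real verse embeddings, which this example does not load.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def recall_at_k(queries, candidates, gold, k):
    """Fraction of queries whose gold candidate index ranks in the top k."""
    hits = 0
    for query, gold_index in zip(queries, gold):
        ranked = sorted(
            range(len(candidates)),
            key=lambda i: cosine(query, candidates[i]),
            reverse=True,
        )
        if gold_index in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy "embeddings" (assumed data, for illustration only).
candidates = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.7, 0.7)]
queries = [(0.9, 0.1), (-0.2, 1.0)]
gold = [0, 3]  # index of the true parallel for each query

print(recall_at_k(queries, candidates, gold, k=1))
print(recall_at_k(queries, candidates, gold, k=2))
```

In this toy set the second query's true parallel ranks second, so it counts as a hit only once k reaches 2.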

06.
arXiv (CS.CL) 2026-06-15

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
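
The abstract does not specify how the peak-entropy scheduler works. One plausible reading, sketched here purely as an assumption, is to score each generation step by the Shannon entropy of the student's next-token distribution and spend a fixed audit budget on the highest-entropy steps. The function names, the `budget` parameter, and the toy distributions are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def peak_entropy_positions(step_distributions, budget):
    """Return the indices of the `budget` highest-entropy steps, in order."""
    scored = sorted(
        enumerate(step_distributions),
        key=lambda item: entropy(item[1]),
        reverse=True,
    )
    return sorted(index for index, _ in scored[:budget])

# Toy next-token distributions for four generation steps (assumed data).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident continuation
    [0.40, 0.30, 0.20, 0.10],  # uncertain: a possible reasoning fork
    [0.90, 0.05, 0.03, 0.02],  # fairly confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
]
audit = peak_entropy_positions(steps, budget=2)
print(audit)
```

Under this reading, only the selected positions would be sent to the teacher for chunk-level verification, which is what keeps the supervision sparse.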

07.
arXiv (CS.CV) 2026-06-16

CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8x anisotropy. On brain MRI, CRIS achieves 32.921 +/- 0.436 dB PSNR and 0.9631 +/- 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 +/- 0.004, ASSD 0.245 +/- 0.014 mm, HD99 1.275 +/- 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4x, 27.123 dB/0.734 on EPFL at 8x, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3–7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36–31.14 dB and 0.977–0.932 vs. 33.07–27.85 dB and 0.951–0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at https://github.com/adi-hatav/CRIS.

08.
arXiv (CS.CL) 2026-06-17

LLMs Infer Cultural Context but Fail to Apply It When Responding

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

09.
medRxiv (Medicine) 2026-06-22

Exploring the association of Obesity on Cold and Warm Autoimmune Hemolytic Anemia in San Joaquin Valley: A Retrospective Cross-Sectional Study

The relationship between obesity and specific autoimmune diseases has been well established, largely owing to obesity's role in promoting pro-inflammatory states. However, little literature documents the association between obesity and autoimmune hemolytic anemia (AIHA). This study therefore aims to assess any correlation between elevated body mass index (BMI) and AIHA. Here we present a retrospective cross-sectional study conducted over a four-year period across four medical centers, during which a new electronic medical record was implemented. The study included 25 patients who had a previously documented history of AIHA from another facility, were direct antiglobulin test (DAT) positive with indicators of hemolysis, or were DAT positive with monomer specific antisera. Each patient's BMI was recorded at the time of presentation to the hospital. However, for patients with a prior history of AIHA or those transferred from another facility, the BMI recorded closest to the time of AIHA diagnosis was used as an adjunct. Our results show an association between elevated BMI (>25) and AIHA; however, various confounding variables should be taken into consideration, and further research is needed to establish a causal relationship.

10.
bioRxiv (Bioinfo) 2026-06-15

RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views

Genes can be described through multiple heterogeneous biological views, including genomic sequence, transcript sequence, protein sequence, textual knowledge, and single-cell expression context, yet existing gene embeddings remain largely modality-specific and difficult to compare or reuse when many views are unavailable. We study a narrower but practically important question: whether pretrained embeddings from these distinct sources can be organized into a shared gene representation interface that remains usable under severe missing-modality conditions. To investigate this question, we introduce RepGene, a lightweight single-branch framework that combines modality adapters, a shared encoder, presence-aware fusion, and self-supervised cross-view objectives to map five biological views into one latent space. Our goal is not to claim a new multimodal learning principle or to establish superiority over all simpler fusion strategies, but to provide an initial technical instantiation for testing whether such a shared interface is feasible in a fixed-feature setting. Under a two-stage protocol in which RepGene is trained self-supervised on frozen upstream embeddings and evaluated by downstream linear probing, we find preliminary evidence that the learned representation is broadly competitive in the full-modality setting and remains informative when only partial modality subsets are observed at inference time. The strongest signal in our study is robustness under missing views: average performance changes are often limited when one modality is removed, and even single-view inference remains non-trivial in the evaluated benchmark regime. These results do not resolve unified biological representation learning, and they should be interpreted in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure. We therefore position RepGene as a feasibility study and a starting point for stronger comparisons, broader benchmarks, and leakage-aware validation.

11.
arXiv (CS.CV) 2026-06-18

RUB: Evaluating Residual Knowledge in Unlearned Models

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted into this framework as long as they conform to the generic UMA framework. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

12.
arXiv (CS.CV) 2026-06-16

Dehaze-GaussianImage: Zero-Shot Dehazing via Efficient 2D Gaussian Splatting Representation

Existing single image dehazing methods are often constrained by computational redundancy in pixel-level optimization and the lack of physical interpretability in implicit neural networks. These limitations hinder the balance between representation efficiency and reconstruction fidelity. To address these issues, we propose Dehaze-GaussianImage, the first zero-shot framework that introduces 2D Gaussian Splatting (2DGS) into the image dehazing domain to break the traditional pixel-grid processing paradigm. Distinct from static convolutional neural networks (CNNs) or Transformers, our approach models hazy images as continuous and dynamically evolvable anisotropic Gaussian fields. Specifically, we propose a novel reconstruction-decoupling zero-shot learning strategy that embeds the atmospheric scattering model into the Gaussian parameter space. This strategy drives Gaussian primitives to adaptively split, clone, and prune during optimization, achieving geometric-level decoupling of the transmission medium and clear textures. Furthermore, explicit structure-preserving constraints are introduced to suppress artifacts commonly caused by traditional physical priors. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in a fully unsupervised manner with minimal parameters, highlighting the potential of explicit Gaussian representation for low-level vision tasks.

13.
bioRxiv (Bioinfo) 2026-06-13

Reinforcement learning-driven unified generative framework for multi-objective RNA codon design

Current RNA codon design methods are limited by inefficient long-sequence processing and poor generalizability, often relying on a decoupled "generate-or-optimize" paradigm. We introduce RNARL, a reinforcement learning-driven framework that unifies sequence generation with multi-objective optimization. RNARL directly learns to generate high-performance sequences, effectively optimizing sequences over 3,900 nucleotides and demonstrating superior performance and universality across six species and five RNA types. RNARL thus establishes an effective and generalizable framework for RNA codon design. Finally, a user-friendly web platform is freely available to facilitate its application for RNA therapeutic design.

14.
arXiv (CS.CV) 2026-06-19

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
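
Accuracy, macro-F1, and balanced accuracy can diverge sharply when classes are imbalanced, which is why the abstract reports all three. The sketch below computes them from scratch on a tiny, made-up label set; the class names and predictions are assumptions for illustration and have no connection to the paper's data.

```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[truth]["fn"] += 1
            counts[pred]["fp"] += 1
    return counts

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = []
    for c in labels:
        tp, fn = counts[c]["tp"], counts[c]["fn"]
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced example: 6 "elephant" clips, 2 "bird" clips.
labels = ["elephant", "bird"]
y_true = ["elephant"] * 6 + ["bird"] * 2
y_pred = ["elephant"] * 5 + ["bird"] + ["bird", "elephant"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred, labels), balanced_accuracy(y_true, y_pred, labels))
```

Here plain accuracy is 0.75 while macro-F1 and balanced accuracy both fall to about 0.67, because the rare class is wrong half the time and counts equally in the averaged metrics.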

15.
arXiv (CS.AI) 2026-06-15

History of the Muddy Children Puzzle

arXiv:2606.13703v1. The Muddy Children Puzzle is a puzzle about knowledge and ignorance that has inspired the development of epistemic logic. Who came up with it first? This is unclear. We trace the origin of the Muddy Children Puzzle through logical and literary publications over the past two centuries. The puzzle has inspired numerous variations, such as those involving numbers or coloured hats. We also present a novel hats puzzle involving self-reference.

16.
arXiv (CS.CL) 2026-06-11

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

Large language models (LLMs) are widely used to tackle complex tasks with autonomous workflows. Recently, reusable natural language skills have emerged as a popular paradigm to inject procedural knowledge into LLM applications. Since popular skills are often invoked repeatedly, placing their full text in every context significantly increases prefill cost and latency. While text compression techniques have the potential to solve this problem, most existing methods are designed to compress factual knowledge in documents instead of procedural knowledge, making them insufficient for skill compression. In this paper, we argue that an effective skill compression method should: 1) preserve logical dependencies among workflows and tool protocols, 2) enable lightweight, offline compression for frequently updated community skills, and 3) be adaptable to varying complexities across skills. To address this, we present SKIM (SKIll coMpression), an adaptive multi-resolution soft token compression framework for procedural skills. Depending on the complexity of each skill, SKIM creates different numbers of soft tokens that not only improve the efficiency of LLM inference, but also preserve the effectiveness of skill usage. Experiments indicate that SKIM compresses skills to 30 to 60 percent of their original token length while preserving task performance better than existing compression methods. We have released our code at https://github.com/bebr2/SKIM.

17.
bioRxiv (Bioinfo) 2026-06-12

A Graph-based QSAR Modeling Pipeline for Predicting In vitro PubChem Assays and In vivo Human Hepatotoxicity: Mechanistic Analysis of Caspase-3/7 Activation

Background: Caspase-3 and -7 are key effector caspases in the apoptotic pathway, a form of programmed cell death, and their activities serve as a well-established biomarker for evaluating environmental chemical toxicity and informing chemical risk assessment. Loss of mitochondrial membrane potential is a key event in the activation of Caspase-3/7 signaling and the subsequent induction of apoptosis. Therefore, simultaneous assessment of mitochondrial membrane potential and Caspase-3/7 activity enables elucidation of the mechanisms and pathways through which apoptosis is initiated. Rapid and accurate assessment of the potential toxicity of environmental chemicals and drugs remains a major challenge. Quantitative Structure Activity Relationship (QSAR) modeling has been widely used for toxicity prediction. Graph-based approaches encode compounds directly as molecular graphs, allowing structure-activity relationships to be learnt from molecular topology without the information loss in binary fingerprints. While advanced graph models such as graph transformers (GTs) have shown outstanding performance in many domains, they have not been fully leveraged in QSAR modeling on Caspase and mitochondrial toxicity.
Methods: We propose a QSAR modeling pipeline that encompasses assay data preprocessing, feature representations (fingerprints and molecular graphs), and benchmarking machine learning (ML) models, including classic ML models, graph neural networks (GNNs), GTs, and their consensus ensembles. Based on in vitro Caspase and mitochondrial assays in PubChem, we applied the pipeline to predict Caspase-3/7 activation and mitochondrial membrane potential (MMP). Beyond in vitro assays, we also built in vivo QSAR models for the FDA Drug-Induced Liver Injury (DILI) gold standard on human hepatotoxicity. Moreover, mechanistic analysis of Caspase-3/7 activation was conducted by comparing it with MMP disruption to identify chemical substructures that may be responsible for dual activations. We also investigated cell-line-specific responses by identifying structural motifs that selectively induce Caspase-3/7 activation in individual cell lines.
Results: Experimental evaluations show that GTs and GNNs outperformed classic ML models when the number of active compounds is large, as for MMP disruption, while classic ML models and GTs performed well on highly imbalanced data with limited active compounds, as for Caspase-3/7 activation. For DILI prediction, the full consensus model achieved the highest AUC of 0.69 and Graphormer had the highest F1 score of 0.79, both surpassing the previous best model (AUC 0.63, F1 0.65) by a large margin. Our mechanistic analysis shows that phenolic compounds bearing a para-hydroxyphenyl motif, as well as members of the lipophilic chain family with long alkyl chains, can trigger the collapse of MMP, leading to the activation of caspases-3 and -7. Human embryonic kidney (HEK293) was the only cell line with a distinct structural motif: 1,1-dichloroethane and chlorobenzene. Human neuroblastoma (SK-N-SH) is uniquely impacted by an epoxide fragment and rat hepatoma (H-4-II-E) is uniquely impacted by a tetramethylcyclohexene motif and an acetaldehyde fragment.
Conclusions: The proposed pipeline for QSAR modeling, including data preprocessing, feature representations, and incorporation of advanced graph ML approaches, is highly effective in predicting not only Caspase-3/7 activation and membrane potential collapse, but also FDA DILI human hepatotoxicity. As future research directions, we will leverage extra information, e.g., biological activity and findings in existing toxicity literature, and recent advances in large language models and agentic AI to further improve the predictive performance and enable a sensitive and specific framework for assessing human hepatotoxicity of environmental compounds.

18.
arXiv (quant-ph) 2026-06-19

Quantum Dynamics from Lax Pair Theory: A Reconstruction from Spectrum Preservation

arXiv:2606.19664v1. We reconstruct unitary quantum dynamics from a minimal axiomatic foundation built on Hilbert-space observables and isospectral evolution. The only dynamical assumption is that physical time evolution is a continuous one-parameter flow of Hermitian observables that preserves their spectra, i.e. the possible outcomes of measurement. We show that this assumption is already sufficient to force the Lax form of quantum dynamics. The Heisenberg equation, the time-dependent and time-independent Schrödinger equations, conservation laws, and good quantum numbers then follow as theorems rather than postulates. In this formulation, Lax pair theory supplies the missing dynamical bridge between the measurement structure of a Hilbert space and standard quantum evolution: the Hamiltonian is not assumed, but emerges as the generator required for an isospectral observable flow.
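
The link between isospectral flow and the Heisenberg equation can be illustrated with the standard textbook argument below. It is a sketch and not the paper's own derivation: it assumes the spectrum-preserving flow can be written as conjugation by a differentiable unitary family $V(t)$, and the symbols $V$, $B$, and $H$ are introduced here for illustration.

```latex
% Isospectral flow as unitary conjugation (assumed form)
A(t) = V(t)\,A(0)\,V(t)^{\dagger}, \qquad V(t)\,V(t)^{\dagger} = \mathbb{1}

% Differentiating, and using \dot V V^{\dagger} + V \dot V^{\dagger} = 0,
% gives the Lax form with an anti-Hermitian generator B
\frac{dA}{dt} = \dot V A(0) V^{\dagger} + V A(0) \dot V^{\dagger}
             = [\,B(t),\, A(t)\,],
\qquad B(t) = \dot V(t)\,V(t)^{\dagger}, \quad B^{\dagger} = -B

% Writing B = \frac{i}{\hbar} H with H Hermitian recovers the Heisenberg equation
\frac{dA}{dt} = \frac{i}{\hbar}\,[\,H,\, A\,]
```

The Hamiltonian appears here only as a name for the Hermitian operator $-i\hbar B$, which matches the abstract's point that the generator is forced by the isospectral flow and need not be postulated.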

19.
arXiv (CS.CV) 2026-06-16

OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

20.
arXiv (CS.CV) 2026-06-11

i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

Diffusion models have consistently driven progress in text-to-image generation. However, it is challenging to attribute recent progress to specific modeling and data choices: state-of-the-art open-weight models provide limited ablations, and do not disclose their training data and full training details. The research community needs fully open (weights, data, and code) models as a foundation for further research; yet existing fully open models still fall significantly short of leading models in performance. In this project, we conduct a systematic investigation of the modeling and data design choices in text-to-image diffusion training and inference with 300+ controlled experiments totaling 700K+ TPU v6e hours. Our experiments highlight several empirical findings (e.g., equal weighting is a strong default for mixing curated datasets) and simple design decisions (e.g., larger text encoder adapters improve performance with minimal added parameters) for training strong models. Guided by these insights, we train i1, a 3B-parameter text-to-image diffusion model using only publicly available datasets. i1 is competitive with leading models on five representative benchmarks (GenEval, DPG, PRISM, CVTG-2K, and LongText), and outperforms the best existing fully open model by 29.5 absolute percentage points on average. We provide the i1 checkpoints, training and inference code, and the data processing pipeline. Together, our findings and the i1 recipe establish a practical foundation for future open research in text-to-image diffusion models. Our code is available at https://github.com/zlab-princeton/i1.

21.
arXiv (CS.CV) 2026-06-17

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

22.
arXiv (CS.CL) 2026-06-18

Dual Dimensionality for Local and Global Attention

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of each token's distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g., 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
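
The distance-dependent dimensionality described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the reduced representation is obtained, so the sketch assumes it is the first `d_low` coordinates of each key and value, and that scores are scaled by the square root of the dimensionality actually used. The function names `dar_attention` and `kv_cache_floats` are hypothetical.

```python
import math


def dar_attention(query, keys, values, window, d_low):
    """Attention output for the newest position with distance-dependent dims.

    Assumptions (not from the abstract): distant tokens keep only their first
    d_low coordinates, and scores are scaled by sqrt of the dimension used.
    """
    n, d = len(keys), len(query)
    scores, outs = [], []
    for i, (k, v) in enumerate(zip(keys, values)):
        distance = n - 1 - i  # 0 for the most recent token
        dim = d if distance < window else d_low
        dot = sum(q * x for q, x in zip(query[:dim], k[:dim]))
        scores.append(dot / math.sqrt(dim))
        # Zero-pad the truncated value so all outputs share length d.
        outs.append(list(v[:dim]) + [0.0] * (d - dim))
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * o[j] for w, o in zip(weights, outs)) for j in range(d)]


def kv_cache_floats(seq_len, d, window, d_low):
    """Number of floats stored for keys plus values under this scheme."""
    local = min(seq_len, window)
    distant = seq_len - local
    return 2 * (local * d + distant * d_low)


# With d = 64, a 128-token local window, and d_low = 16 (1/4 of d):
print(kv_cache_floats(1024, 64, 128, 16))   # reduced cache
print(kv_cache_floats(1024, 64, 1024, 16))  # full-dimensional cache
```

In this toy setting the cache for a 1024-token sequence shrinks to roughly a third of the full-dimensional size, because only the 128 most recent tokens keep all 64 dimensions.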

23.
medRxiv (Medicine) 2026-06-16

Diurnal variation in brain-derived tau and five other blood-based biomarkers for dementia and their association with cognitive performance

Blood-based biomarkers of dementia are a promising scalable tool for early diagnosis, tracking disease progression, and evaluating therapeutic efficacy. Utility of these biomarkers will not only be dependent on the reliability of their association with pathology but also contingent on their ability to track cognitive status. Previously, we demonstrated diurnal variation in several biomarkers (amyloid beta (Aβ) 42 and 40, 42/40 ratio, glial fibrillary acidic protein (GFAP), neurofilament light (NfL), and phosphorylated-Tau 217 (p-Tau217)) which has implications for their reliability. Here, we extend these observations to a larger cohort, include brain-derived tau (BD-Tau), which is assumed to be produced exclusively in the brain, and report endocrine measures of circadian rhythmicity. We not only assessed whether these biomarkers vary with time of day, but also whether they associate with daytime function and whether these associations vary with cognitive domain and number of repeated assessments. Data collected in 20 PLWA (72.4±5.9 years, mean±SD) and 19 controls (68.9±9.8 years) were analysed. Participants completed 14 days of home monitoring and one laboratory assessment of sleep and daytime function: mood, daytime sleepiness, reaction time, immediate and delayed memory recall, everyday memory errors. During the 27-hour residential laboratory session, 3-hourly blood samples were collected and analysed for the six blood-based biomarkers of dementia as well as melatonin and cortisol. Rhythmicity of melatonin and cortisol did not differ between groups. P-Tau217 and GFAP (p

24.
arXiv (CS.CL) 2026-06-15

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.

25.
arXiv (CS.LG) 2026-06-16

Photon: Federated LLM Pre-Training

arXiv:2411.02908v2. Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly-connected GPUs if they can effectively be used for pre-training. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to centralized; and (3) Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64x-512x less. Our proposal is robust to data heterogeneity and converges twice as fast as previous methods like DiLoCo. This surprising data efficiency stems from a unique approach combining small client batch sizes with extremely high learning rates, enabled by federated averaging's robustness to hyperparameters. Photon thus represents the first economical system for global internet-wide LLM pre-training.
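
The abstract credits federated averaging for the method's robustness. As a rough illustration of that building block only, the sketch below runs generic federated averaging on a toy least-squares problem. It is not Photon's system: the loss, the data, the learning rate, and the function names `local_sgd` and `fedavg_round` are all assumptions chosen to keep the example small.

```python
def local_sgd(weights, data, lr, steps):
    """Run a few SGD steps on the loss 0.5 * (w.x - y)^2 for one client."""
    w = list(weights)
    for step in range(steps):
        x, y = data[step % len(data)]  # cycle through the client's samples
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def fedavg_round(global_w, client_data, lr, steps):
    """One round: clients train locally, the server averages their weights.

    Each client is weighted by its number of samples. Only model weights are
    exchanged, once per round, which is why communication is infrequent.
    """
    updates = [local_sgd(global_w, data, lr, steps) for data in client_data]
    sizes = [len(data) for data in client_data]
    total = sum(sizes)
    return [
        sum(n * u[j] for n, u in zip(sizes, updates)) / total
        for j in range(len(global_w))
    ]


# Toy data: both clients hold samples of y = 2 * x (hypothetical example).
clients = [
    [([1.0], 2.0)],
    [([2.0], 4.0), ([1.0], 2.0)],
]
w = [0.0]
for _ in range(50):
    w = fedavg_round(w, clients, lr=0.1, steps=5)
print(w)  # approaches [2.0]
```

Each client takes several local steps between synchronizations, so the number of communication rounds is a fraction of the number of gradient steps.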