Academic Intelligence · Curated Daily

Explore the Frontiers of Global Academic Research

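AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.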

01.
arXiv (CS.CV) 2026-06-15

SinGeo: Unlock Single Model's Potential for Robust Cross-View Geo-Localization

Robust cross-view geo-localization (CVGL) remains challenging despite the surge in recent progress. Existing methods still rely on field-of-view (FoV)-specific training paradigms, where models are optimized under a fixed FoV but collapse when tested on unseen FoVs and unknown orientations. This limitation necessitates deploying multiple models to cover diverse variations. Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions – implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. SinGeo employs a dual discriminative learning architecture that enhances intra-view discriminability within both ground and satellite branches, and is the first to introduce a curriculum learning strategy to achieve robust CVGL. Extensive evaluations on four benchmark datasets reveal that SinGeo sets state-of-the-art (SOTA) results under diverse conditions, and notably outperforms methods specifically trained for extreme FoVs. Beyond superior performance, SinGeo also exhibits cross-architecture transferability. Furthermore, we propose a consistency evaluation method to quantitatively assess model stability under varying views, providing an explainable perspective for understanding and advancing robustness in future CVGL research. Codes will be available upon acceptance.

02.
arXiv (CS.CL) 2026-06-12

Language Model Circuits Are Sparse in the Neuron Basis

The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques which decompose the neuron basis into more interpretable units of model computation, such as sparse autoencoders (SAEs). However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end gradient-based attribution pipeline for circuit tracing on the MLP neuron basis, which surfaces causally effective neurons on a variety of tasks. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city-state-capital task from (Lindsey et al., 2025), we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g. mapping a city to its state), and can be steered to change the model's output. This work thus advances automated interpretability of language models without imposing additional training costs.

03.
arXiv (CS.CV) 2026-06-16

PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.
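The abstract does not spell out the puzzle-unpuzzle operator. As a rough illustration of what "trading spatial resolution for channel dimensionality" can look like, here is a minimal 2D sketch that assumes the operator behaves like a space-to-depth rearrangement. The names `puzzle` and `unpuzzle` and the block size `r` are hypothetical, and the paper's 3D operator may differ.

```python
def puzzle(img, r):
    """Rearrange an H x W grid into r*r channels of size (H/r) x (W/r).

    Assumption: a space-to-depth style rearrangement, not the paper's exact operator.
    """
    h, w = len(img), len(img[0])
    assert h % r == 0 and w % r == 0, "grid must be divisible by the block size"
    return [
        [[img[i * r + di][j * r + dj] for j in range(w // r)] for i in range(h // r)]
        for di in range(r)
        for dj in range(r)
    ]


def unpuzzle(channels, r):
    """Exact inverse of puzzle: scatter every channel back to its pixel offsets."""
    hs, ws = len(channels[0]), len(channels[0][0])
    img = [[None] * (ws * r) for _ in range(hs * r)]
    for c, channel in enumerate(channels):
        di, dj = divmod(c, r)  # channel index encodes the offset inside each block
        for i in range(hs):
            for j in range(ws):
                img[i * r + di][j * r + dj] = channel[i][j]
    return img


image = [[row * 4 + col for col in range(4)] for row in range(4)]
channels = puzzle(image, 2)
restored = unpuzzle(channels, 2)
```

Because the rearrangement only moves values around, it is lossless: `restored` equals `image`, while each of the four channels has a quarter of the spatial size.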

04.
arXiv (CS.CL) 2026-06-18

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

05.
arXiv (CS.LG) 2026-06-16

Deep Learning-Based Lunar Crater Terrain Relative Navigation

arXiv:2606.14776v1. Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by a consensus-based outlier removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft is up to 5 km off its actual location, TRN can recover, reducing the navigation error to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.

06.
arXiv (CS.AI) 2026-06-19

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

arXiv:2606.20381v1. FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
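To make the contrast between non-uniform and uniform 4-bit grids concrete, the sketch below enumerates the non-negative magnitudes of a tiny floating-point format and the gaps between neighbours. It assumes an exponent bias of 1 for both formats and the usual subnormal convention; the helper `fp4_grid` is hypothetical. It illustrates only the bin geometry and does not reproduce the paper's Shrinkage Bias analysis.

```python
def fp4_grid(exp_bits, man_bits, bias):
    """Non-negative values of a small float format with subnormals and no inf/NaN.

    Assumption: IEEE-like encoding with the given bias (illustrative only).
    """
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                values.append(2 ** (1 - bias) * frac)        # subnormal range
            else:
                values.append(2 ** (e - bias) * (1 + frac))  # normal range
    return sorted(values)


def gaps(grid):
    """Spacing between consecutive representable values."""
    return [b - a for a, b in zip(grid, grid[1:])]


e2m1 = fp4_grid(exp_bits=2, man_bits=1, bias=1)
e1m2 = fp4_grid(exp_bits=1, man_bits=2, bias=1)

print("E2M1:", e2m1, "gaps:", gaps(e2m1))
print("E1M2:", e1m2, "gaps:", gaps(e1m2))
```

Under these assumptions the E2M1 gaps widen with magnitude (0.5, then 1, then 2), whereas every E1M2 gap is 0.25, which is the sense in which the latter is a uniform grid.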

07.
bioRxiv (Bioinfo) 2026-06-13

Testing the reliability of AI-generated protein structures

Although AlphaFold2 and its competitors have demonstrated remarkable abilities to predict protein structure, more work is needed to explore the limitations of these methods. Here we investigated the reliability of AlphaFold2 and ColabFold by creating a set of realistic but false protein sequences, using ColabFold to predict their structure, and then asking how often the program produces a high-scoring structure for a sequence that does not represent a protein. We determined that AlphaFold2 has a very small but non-zero false positive rate, estimated here at approximately 1 in 435 if one uses a threshold pLDDT score of 70 to define positive predictions. We also discovered, serendipitously, that some high-scoring sequences in the human genome were not false positives, but instead were previously unknown and un-annotated pseudogenes. These latter findings indicate that some well-established human annotations of protein-coding genes may have incorrectly extended the 5-prime untranslated regions too far. They also suggest that the false positive rate of AlphaFold2 is low enough that almost any high-scoring structure, even in a noncoding region, is worthy of further investigation.

08.
arXiv (CS.CL) 2026-06-19

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

Streaming Retrieval-Augmented Generation (Streaming RAG) reduces user-perceived latency by issuing tool queries in parallel with ongoing user input, before the utterance is complete. Reported gains are aggregate, yet the mechanism's benefit is fundamentally query-intrinsic: speculation can only help when the correct tool query becomes determinable before the user stops speaking or typing. We isolate and measure this property – tool-intent stabilization, the point in the input stream at which a speculative query's retrieval converges to the answer-bearing result. On the CRAG benchmark (1371 validation questions) we (i) measure the distribution of stabilization, (ii) derive a model-agnostic bound $H$ on the portion of tool latency that can be hidden behind the user's remaining input, as a function of tool latency $L$ and input cadence $\delta$, (iii) validate against a working streaming pipeline that realized savings meet or exceed this bound, and (iv) identify which query properties predict early versus late stabilization. The study requires no model training and runs on commodity CPU hardware. We find that at a realistic operating point ($L$=600ms, $\delta$=3w/s, $\theta$=0.8), 73.9% of queries across the full benchmark admit substantial latency hiding – a blended figure that mixes sufficiency stabilization on the 21.3% of questions where gold evidence is verbatim-present and BM25-retrievable (95.2% streamable on this favorable slice) with a grounding-free top-1-settling fallback on the remainder. On the favorable slice, $\phi_{\mathrm{suf}}$ is bracketed to [0.26, 0.281] by exact and relaxed grounding – both early. Question type produces a significant but coarse early/late split (Kruskal-Wallis p=0.017, $\epsilon^2$=0.04), directly informing when a learned speculative trigger is worth its cost.

09.
arXiv (CS.CL) 2026-06-18

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 under 70ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).
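The traversal described above can be sketched as follows. This is a toy model, not the tool's implementation: the dictionary layout, the `kind`/`var`/`negated`/`refs` keys, and the function name `rung_paths` are hypothetical stand-ins for parsed localId/refLocalId data, and the graph is assumed to be acyclic.

```python
def rung_paths(nodes):
    """Collect, for each coil, the contact conjunctions on every path from the left rail.

    nodes maps a localId to a dict; "refs" lists the upstream localIds (refLocalId).
    """
    # Invert the back-references into forward edges so we can walk rail -> coil.
    forward = {node_id: [] for node_id in nodes}
    for node_id, node in nodes.items():
        for ref in node.get("refs", []):
            forward[ref].append(node_id)

    paths = {}  # coil variable -> list of conjunctions (each a list of literals)

    def dfs(node_id, conjunction):
        node = nodes[node_id]
        if node["kind"] == "contact":
            literal = ("!" if node.get("negated") else "") + node["var"]
            conjunction = conjunction + [literal]
        elif node["kind"] == "coil":
            paths.setdefault(node["var"], []).append(conjunction)
            return
        for successor in forward[node_id]:
            dfs(successor, conjunction)

    for node_id, node in nodes.items():
        if node["kind"] == "leftPowerRail":
            dfs(node_id, [])
    return paths


# Toy rung: (start OR motor) AND NOT stop -> motor (a classic latch).
ladder = {
    0: {"kind": "leftPowerRail"},
    1: {"kind": "contact", "var": "start", "refs": [0]},
    2: {"kind": "contact", "var": "motor", "refs": [0]},
    3: {"kind": "contact", "var": "stop", "negated": True, "refs": [1, 2]},
    4: {"kind": "coil", "var": "motor", "refs": [3]},
}
result = rung_paths(ladder)
```

Each inner list is one conjunction of contacts; the coil's condition is the disjunction of its conjunctions, so the toy rung yields `start AND !stop` or `motor AND !stop`.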

10.
arXiv (CS.AI) 2026-06-15

Universal Manipulation Exoskeleton: Learning Compliant Whole-body Policies with Real-time Torque Feedback

arXiv:2606.14218v1. For robots to work safely in household environments, they need to be compliant and react to torque and force feedback during contact. However, the majority of existing data collection pipelines still lack the ability to capture force and torque data for learning active compliant policies. In this paper, we present Universal Manipulation Exoskeleton (UME), an upper-limb exoskeleton that provides real-time haptic torque feedback while recording whole-arm configurations and joint torque signals for teleoperation. With transparent torque feedback, human operators can even unsheathe kinematically constrained objects while blindfolded. UME is low-cost, lightweight, and portable. Equipped with an embedded IMU, it enables teleoperation for mobile manipulation. With our proposed universal retargeting algorithm, UME can teleoperate a range of robots, including the 7DoF OpenArm, 7DoF Franka, and 6DoF X-ARM. We demonstrate that this combination of capabilities enables learning bimanual, whole-body, and active compliant policies that operate effectively in highly constrained spaces. The learned robust autonomous policies achieve high success rates across a variety of tasks, including long-horizon mobile manipulation, force-mediated box flipping, visually occluded box pushing, and space-constrained tabletop manipulation. Videos, code, and additional information can be found at https://ume-exo.github.io.

11.
arXiv (CS.LG) 2026-06-16

Machine Learning-Driven Chemical Reactor Network Modeling of the Sandia-D Flame

arXiv:2606.14729v1. Turbulent combustion simulations are crucial for many scientific and engineering systems. However, the high cost to fully resolve the complex multiscale and multiphysics behavior makes direct simulation typically infeasible. The equivalent reactor network (ERN) approach attempts to improve computational efficiency by replacing a multidimensional turbulent simulation with a series of much cheaper 0-D and 1-D chemical reactors, providing a surrogate model that retains detailed chemistry at the cost of simplified flow physics. However, their development remains a challenge, often requiring either expert analysis, or automated approaches that sacrifice accuracy. In this work, we develop an automated machine-learning-assisted framework for constructing ERNs of the Sandia-D turbulent methane/air flame. Principal component analysis is first used to reduce high-dimensional thermochemical computational fluid dynamics (CFD) data to a low-dimensional latent space, where k-means clustering identifies physically interpretable flame regions used to initialize a reactor-network graph. This initialization is then refined using finite-difference gradient descent wrapped around non-differentiable Cantera reactor simulations. Across 30 RANS simulations spanning a range of pilot temperatures and inlet methane compositions, the optimized 7-reactor ERN achieves a maximum-temperature $R^2$ score of 0.7945 while preserving a $\sim6000\times$ speedup over the CFD solver. Outlet CO prediction remains more challenging, with a final $R^2$ score of $-0.4183$, but improves substantially from the unoptimized clustering initialization. These results show that unsupervised thermochemical feature extraction can provide effective physics-informed initializations for ERN construction, while gradient-based refinement can significantly improve predictive accuracy without manual reactor-network design.
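The refinement step wraps gradient descent around a simulator that exposes no derivatives. A minimal sketch of that idea, using central finite differences, is shown below. The quadratic `toy_loss` is a hypothetical stand-in for a Cantera reactor-network evaluation, and the step size, learning rate, and iteration count are arbitrary assumptions rather than the paper's settings.

```python
def finite_difference_descent(loss, x0, lr=0.1, eps=1e-4, steps=200):
    """Minimize a black-box loss using central-difference gradient estimates."""
    x = list(x0)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            up = x[:]
            up[i] += eps
            down = x[:]
            down[i] -= eps
            # Two extra loss evaluations per parameter replace an analytic gradient.
            grad.append((loss(up) - loss(down)) / (2 * eps))
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def toy_loss(params):
    """Hypothetical black box standing in for 'run the reactor network, compare to CFD'."""
    return (params[0] - 1.5) ** 2 + (params[1] + 0.5) ** 2


optimum = finite_difference_descent(toy_loss, [0.0, 0.0])
```

Each iteration costs two loss evaluations per parameter, so the approach is practical only when the parameter count is small and each simulation is cheap, which fits a network of a handful of reactors.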

12.
arXiv (CS.CV) 2026-06-17

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

13.
arXiv (CS.CL) 2026-06-19

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
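The two geometric constraints named above can be illustrated on plain vectors. The sketch below is an assumption-laden toy, not the GEMS implementation: it uses Gram-Schmidt for the orthogonalization step and rescales the steered vector back to the original norm as one possible reading of "norm-preserving". The function names and the example vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def orthogonalize(directions):
    """Gram-Schmidt: return unit vectors that are mutually orthogonal."""
    basis = []
    for direction in directions:
        v = list(direction)
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        length = norm(v)
        if length < 1e-12:
            raise ValueError("directions must be linearly independent")
        basis.append([vi / length for vi in v])
    return basis


def steer(hidden, directions, weights):
    """Add weighted orthogonalized directions, then restore the original norm."""
    mixed = list(hidden)
    for weight, b in zip(weights, orthogonalize(directions)):
        mixed = [m + weight * bi for m, bi in zip(mixed, b)]
    scale = norm(hidden) / norm(mixed)  # assumes the mixed vector is non-zero
    return [m * scale for m in mixed]


hidden_state = [3.0, 4.0, 0.0]
semantic_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # deliberately non-orthogonal
steered = steer(hidden_state, semantic_dirs, weights=[1.0, 1.0])
```

The steered vector changes direction but keeps the original length of 5, so repeated injections across layers would not inflate the activation norm in this toy setting.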

14.
arXiv (CS.CV) 2026-06-16

Ellipse Meets Bit-Planes: A Novel Approach to RNFL based Glaucoma Detection Using Advanced Image Processing and Deep Learning

This work proposes an integrated pipeline for automatic glaucoma detection from readily available colour fundus images, based on an adaptive algorithm for ellipse-based polar transformation that enhances the analysis of the Retinal Nerve Fiber Layer (RNFL) as the primary biomarker for observing glaucomatous changes, regardless of optic disc and macula position. Utilizing this transformation, we introduce two distinct frameworks tailored to different operational needs. The first framework, a deep learning-inspired feature fusion approach, achieves a 99.3% detection rate, ideal for settings where high precision is essential, despite higher computational demands. The second framework employs a novel image-processing algorithm based on bit-plane slicing, offering 92.31% accuracy and optimized for environments requiring rapid inference with minimal resource consumption. Both frameworks provide scalable and cost-effective solutions for early glaucoma detection. This study highlights the potential of RNFL-based diagnostic tools in addressing the global challenge of glaucoma, particularly in underserved regions.

15.
Nature (Science) 2026-06-17

Lethal plague outbreaks in Lake Baikal hunter-gatherers 5,500 years ago

Plague is among the most devastating diseases in human history. However, early strains of the plague-causing bacterium Yersinia pestis lacked virulence factors that are required for the bubonic form until around 3,800 years ago. Consequently, the morbidity and mortality of early plague strains remain unclear. Here we describe early plague strains that are associated with two phases of outbreaks among mid-Holocene hunter-gatherers near Lake Baikal in southeast Siberia, beginning from about 5,500 years ago. These outbreaks occur across four hunter-gatherer cemeteries, with a 39% detection rate for plague infection. By reconstructing kinship pedigrees, we show that small familial groups were affected, consistent with human-to-human spread of disease, and that the first outbreak occurred within a single generation. The infections appear to have resulted in acute mortality, especially among children (aged 8 to 11 years). We further note functional differences, including in the ypm superantigen locus, which is also present in present-day Yersinia pseudotuberculosis. The new strains diverge ancestrally to known Y. pestis and constrain the timing of its emergence, indicating that this happened before approximately 5,700 years ago. These findings show that plague outbreaks happened earlier than previously thought and were indeed lethal. We contend that the occurrence of outbreaks among mid-Holocene hunter-gatherer communities well outside the sphere of Late Neolithic Europe challenges the notion that higher population densities and lifestyle changes during the Neolithic agricultural transition were prerequisites for plague epidemics. Analyses of ancient DNA from hunter-gatherers near Lake Baikal in southeast Siberia around 5,500 years ago indicate that highly virulent Yersinia pestis emerged earlier than previously estimated, far from the next known cases of infection in Late Neolithic Europe.

16.
arXiv (CS.CL) 2026-06-16

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $\alpha$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $\alpha$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.
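The combined reliability score is stated to be a harmonic mean of ICC and $\alpha$. The short check below reproduces the reported figures from the two components. It assumes the deployed-corpus "combined" value uses the same formula as the held-out one, which the abstract implies but does not state outright.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers; it is pulled toward the smaller one."""
    return 2 * a * b / (a + b)


held_out = harmonic_mean(0.79, 0.74)   # ICC and alpha on the held-out set
deployed = harmonic_mean(0.80, 0.76)   # ICC and alpha on the 6,858 deployed quotes

print(f"held-out combined: {held_out:.2f}")
print(f"deployed combined: {deployed:.2f}")
```

Both values round to the figures quoted in the abstract (0.76 and 0.78). A harmonic mean penalizes disagreement between the two reliability measures more than an arithmetic mean would.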

17.
bioRxiv (Bioinfo) 2026-06-18

Structure-Based Immunoinformatics Design of a CTB-Adjuvanted Multi-Epitope Mucosal Vaccine Against Helicobacter pylori

Background: Helicobacter pylori colonizes the gastric mucosa of nearly half of the global population and is classified as a Group I carcinogen by the World Health Organization due to its strong association with gastric cancer. The growing prevalence of antibiotic-resistant H. pylori strains significantly compromises current therapeutic strategies, emphasizing the urgent need for effective prophylactic approaches. Research design and methods: In this study, a novel multi-epitope vaccine was designed targeting H. pylori, incorporating epitopes from four key virulence proteins: BabB, SabB, SabA, and VacA. Using an immunoinformatics-guided structural vaccinology approach, B- and T-cell epitopes were predicted, prioritized based on immunogenicity, conservation, population coverage, and non-homology to human proteins, and assembled into the final vaccine construct. To enhance immunogenicity and specifically stimulate mucosal immune responses, the cholera toxin B subunit (CTB) was fused at the N-terminus via an EAAAK linker, a novel application in H. pylori multi-epitope vaccines. The PADRE universal epitope and additional linkers were incorporated to optimize epitope presentation and helper T-cell activation. Results: Comprehensive evaluations of physicochemical, antigenic, allergenic, and toxic properties were conducted, followed by secondary and tertiary structure modeling, refinement, and validation. Conformational B-cell epitopes were mapped, and molecular docking, binding affinity analysis, energy minimization, and molecular dynamics simulations confirmed structural stability and receptor interactions. Codon optimization and in silico cloning predicted efficient expression in Escherichia coli, while immune simulations suggested robust humoral and cellular responses. Conclusions: This study presents a promising multi-epitope vaccine candidate against H. pylori, offering a rational framework for future experimental validation and potential clinical application.

18.
bioRxiv (Bioinfo) 2026-06-19

FeatureMSEA: Metabolic Feature-based Metabolite Set Enrichment Analysis

Liquid chromatography-mass spectrometry (LC-MS) untargeted metabolomics detects thousands of metabolic features, but converting these chemical signals into metabolite set-level biological knowledge remains challenging. This is because most features lack unambiguous metabolite identities. Conventional metabolite set enrichment analysis (MSEA) generally requires identified metabolites and metabolite-level ranked inputs, leaving much of the untargeted feature space unused. Here, we present FeatureMSEA, a feature rank-based framework for metabolite set enrichment directly from metabolic features with ambiguous annotations. FeatureMSEA integrates multi-evidence feature-to-metabolite annotation, feature rank-based enrichment scoring, permutation-based inference, and iterative leading-edge-guided annotation refinement, with an optional LLM-assisted module for post-enrichment interpretation. In null comparisons of randomly split healthy samples, FeatureMSEA detected no significant metabolite sets, whereas metabolite-set spike-in simulations showed recovery of implanted signals. In a cerebrospinal fluid metabolomics study of Huntington's disease, FeatureMSEA identified dysregulated metabolite sets related to amino acid metabolism, mitochondrial energy metabolism, and neuroactive signaling. MS/MS-based annotation analysis further showed that FeatureMSEA refinement reduced annotation ambiguity and prioritized chemically consistent candidate metabolites. In summary, FeatureMSEA provides a general framework for extracting metabolite set-level biological insights from LC-MS untargeted metabolomics in which confident metabolite identification remains incomplete.

19.
arXiv (CS.AI) 2026-06-16

Cognitive Trajectory Modeling: Quantifying Human-AI Co-Creation through Cognitively Grounded Interaction Trajectories

arXiv:2606.15358v1. Co-creative AI research increasingly seeks methods capable of representing how interaction dynamics evolve through time. While many existing approaches focus on observable interaction characteristics, interaction metrics, behavioral coding schemes, or activity traces, these methods often struggle to capture higher-order interaction dynamics, including how collaborative processes reorganize, stabilize, regulate, and evolve through time. This paper introduces Cognitive Trajectory Modeling (CTM) as a cognitive theory of interaction dynamics that conceptualizes cognition, interaction, and creative processes as temporally organized trajectories unfolding across cognitively meaningful attractor landscapes. CTM builds upon the theoretical foundations of the Enactive Model of Creativity and Creative Sense-Making (CSM), revisiting the role of sense-making curves and cognitive trajectories in representing co-creative interaction dynamics. We formalize this perspective through the Cognitive Trajectory Principle, which states that temporal representations are only theoretically interpretable as cognitive trajectories when their underlying states possess directional cognitive meaning. Building on this principle, CTM generalizes the notion of cognitive trajectories beyond any particular coding scheme and provides a broader framework for modeling interaction dynamics through trajectories unfolding across meaningful attractor landscapes. We further distinguish cognitive trajectories from interaction traces and situate CTM within a broader hierarchy of cognitive, interaction, and domain dynamics. More broadly, we argue that understanding co-creative systems requires methods capable of modeling how cognition and interaction dynamics unfold through time. CTM provides a foundation for studying interaction dynamics across co-creative AI and human-AI interaction.

20.
arXiv (CS.CV) 2026-06-16

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

21.
bioRxiv (Bioinfo) 2026-06-15

oxo-flow: compiled, memory-safe bioinformatics workflow orchestration

Bioinformatics analyses depend on workflow engines to coordinate dozens of computational tools across complex dependency chains. The most widely adopted engines, namely Snakemake, Nextflow, the Common Workflow Language (CWL), and the Workflow Description Language (WDL), run on interpreted or just-in-time (JIT) compiled language runtimes, incurring hundreds of milliseconds of startup latency and providing no compile-time safety guarantees from the host language. We developed oxo-flow, a workflow engine written in Rust that compiles to a single native binary. On an Apple M5 processor, oxo-flow parses, validates, and dry-runs a production-scale workflow in roughly 22 milliseconds, before Snakemake or Nextflow have finished loading their runtime environments. Peak memory usage is 16 megabytes, representing six- to seven-fold reductions relative to Snakemake and Nextflow. Dry-run latency is essentially independent of workflow size: a hundred-fold increase in rule count adds approximately 0.4 milliseconds. oxo-flow integrates 31 command-line tools, a REST interface with 60 endpoints, an embedded web application, and native cluster submission into a single 10-megabyte binary. It provides per-rule environment isolation across seven backends, checkpoint-based fault tolerance with cryptographic output verification, and a formal installation and operational qualification protocol for regulated laboratory environments. Ten curated workflows and three demonstration pipeline repositories are available. oxo-flow is freely available under Apache License 2.0 at https://github.com/Traitome/oxo-flow.
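As a rough illustration of what a dry run over a dependency chain involves, the sketch below orders a set of rules without executing them, using Python's standard `graphlib` module. It is not oxo-flow's implementation (which is written in Rust); the function `dry_run` and the rule names are hypothetical and chosen only for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def dry_run(rules):
    """Return an execution order for the rules without running anything.

    `rules` maps each rule name to the set of rule names it depends on.
    Raises ValueError if the dependency graph contains a cycle.
    """
    try:
        return list(TopologicalSorter(rules).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"cyclic dependency: {exc.args[1]}") from exc


# Hypothetical three-step pipeline: align -> call_variants -> annotate.
rules = {
    "align": set(),
    "call_variants": {"align"},
    "annotate": {"call_variants"},
}
order = dry_run(rules)
```

Validating the graph before any tool is launched is what lets a dry run report a cyclic or missing dependency early, independent of how long the actual tools take to run.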

22.
arXiv (CS.CL) 2026-06-17

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
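For readers unfamiliar with the Dyck language, the following minimal sketch computes the ground-truth variables mentioned above (nesting depth and top-of-stack position) for a bracket sequence. It assumes two bracket types and is only an illustration of the hierarchical ground truth, not the paper's probing or intervention code; `dyck_trace` is a hypothetical helper name.

```python
# Map each closing bracket to its matching opening bracket.
PAIRS = {")": "(", "]": "["}


def dyck_trace(seq):
    """Return, for each position, (depth, top_of_stack_index).

    depth is the number of unmatched opening brackets after reading the
    token; top_of_stack_index is the position of the most recent unmatched
    opening bracket, or None if the stack is empty.
    Raises ValueError if the sequence is not a valid Dyck prefix.
    """
    stack = []  # positions of currently unmatched opening brackets
    trace = []
    for i, tok in enumerate(seq):
        if tok in PAIRS.values():
            stack.append(i)
        elif tok in PAIRS:
            if not stack or seq[stack[-1]] != PAIRS[tok]:
                raise ValueError(f"mismatched bracket at position {i}")
            stack.pop()  # last-in, first-out: close the most recent opener
        else:
            raise ValueError(f"unexpected token {tok!r} at position {i}")
        trace.append((len(stack), stack[-1] if stack else None))
    return trace


trace = dyck_trace("([])")
# trace == [(1, 0), (2, 1), (1, 0), (0, None)]
```

The top-of-stack index is the position a stack-like attention head would need to attend to in order to predict the correct closing bracket.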

23.
arXiv (CS.AI) 2026-06-11

When Context Returns: Toward Robust Internalization in On-Policy Distillation

arXiv:2606.11627v1. Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
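The consistency regularizer described above can be sketched for a single token position as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, "forward KL" is assumed to mean KL(anchor ‖ context-conditioned), and plain Python has no gradients, so the stop-gradient is indicated only by a comment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(anchor_probs, probs, eps=1e-12):
    """KL(anchor || probs): expectation under the anchor distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(anchor_probs, probs)
    )


def consistency_penalty(no_context_logits, with_context_logits):
    """Penalize the context-conditioned output for drifting from the anchor."""
    # Assumption: in an autodiff framework the anchor would be wrapped in a
    # stop-gradient so that only the context-conditioned branch is updated.
    anchor = softmax(no_context_logits)
    conditioned = softmax(with_context_logits)
    return forward_kl(anchor, conditioned)


same = consistency_penalty([2.0, 0.0, 0.0], [2.0, 0.0, 0.0])
drift = consistency_penalty([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

The penalty is zero when the two outputs agree and grows as the context-conditioned distribution moves away from the no-context one, which is the behavior a context-removability objective needs.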

24.
arXiv (CS.CL) 2026-06-12

PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation

Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced either by rule-based traffic managers or by learned models trained toward a single behavioral mode. Recent work introduces style variation through post-hoc labels on observational data or LLM-inferred reward weights, but these signals act as proxies for what a style should reward rather than demonstrations of humans explicitly asked to drive in that style. We introduce PersonaDrive, a pipeline that conditions a vision-language-action (VLA) driving agent on retrieved demonstrations from a style-instructed human driving dataset, in which participants drive CARLA leaderboard routes under aggressive, neutral, and conservative instructions on a driver-in-the-loop rig. The pipeline has three stages: (i) offline triplet mining over per-style human driving data using a combined image-text similarity score; (ii) training a lightweight retrieval head that fuses frozen visual features with a small control encoder over per-style databases; and (iii) fine-tuning a single VLA backbone to treat retrieved context points as in-context behavioral demonstrations during waypoint prediction. At inference, the same backbone is conditioned on any style by swapping which per-style database the retrieval head queries, so selecting a style requires no per-style retraining while enabling human-style, style-diverse non-ego agents for closed-loop simulation. On Bench2Drive, PersonaDrive (no style) improves the driving score by 4.6% over SimLingo and 2.5% over HiP-AD, and under style conditioning attains the highest driving score in every style within a roughly 2% band (its weakest style surpassing the strongest baseline, DMW, by 5.4%), while average speed and acceleration rise by 18% and 25% from the conservative to the aggressive instruction.
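The inference-time style switch described above amounts to querying a different per-style database with the same retrieval function. The toy sketch below illustrates that idea only; cosine similarity over hand-written vectors stands in for the learned retrieval head, and the database contents, field names, and the `retrieve` function are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, style, databases, k=1):
    """Return the k demonstrations most similar to the query for one style."""
    entries = databases[style]  # swapping the style swaps the database
    ranked = sorted(
        entries,
        key=lambda e: cosine(query, e["embedding"]),
        reverse=True,
    )
    return [e["demo"] for e in ranked[:k]]


# Hypothetical per-style databases of (scene embedding, demonstration) pairs.
databases = {
    "aggressive": [
        {"embedding": [1.0, 0.1], "demo": "fast-merge"},
        {"embedding": [0.0, 1.0], "demo": "late-brake"},
    ],
    "conservative": [
        {"embedding": [1.0, 0.1], "demo": "slow-merge"},
        {"embedding": [0.0, 1.0], "demo": "early-brake"},
    ],
}
query = [1.0, 0.0]
aggressive_demo = retrieve(query, "aggressive", databases)
conservative_demo = retrieve(query, "conservative", databases)
```

Because only the database lookup changes, the same scene yields different in-context demonstrations per style while the model consuming them stays fixed.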

25.
arXiv (CS.CL) 2026-06-18

Approximate Structured Diffusion for Sequence Labelling

Sequence labelling, a core task of Natural Language Processing (NLP), consists of assigning a label to each token of an input sentence. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (e.g., label bigrams), which can limit their expressivity and hurt performance when long-range dependencies are required. We show that diffusion can be leveraged to train a CRF conditioned on an entire label sequence, with the caveat that the conditioning is on a noisy version of the labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy, with a 16.5% error reduction for POS-tagging.
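To make the "label bigram" decision span concrete, here is a minimal sketch of exact Viterbi decoding for a standard linear-chain CRF, where each transition score depends only on adjacent labels. It illustrates the baseline setting the abstract refers to, not the proposed diffusion-based method; the scores in the example are made up.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence of a linear-chain CRF.

    emissions[t][y]   : score of label y at position t
    transitions[a][b] : score of the label bigram (a, b)
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for y in range(n_labels):
            # Best previous label for ending in y at position t.
            best_prev = max(
                range(n_labels), key=lambda a: score[a] + transitions[a][y]
            )
            ptrs.append(best_prev)
            new_score.append(
                score[best_prev] + transitions[best_prev][y] + emissions[t][y]
            )
        score = new_score
        backptrs.append(ptrs)
    # Follow back-pointers from the best final label.
    y = max(range(n_labels), key=lambda a: score[a])
    path = [y]
    for ptrs in reversed(backptrs):
        y = ptrs[y]
        path.append(y)
    return path[::-1]


# Made-up scores: two labels, three tokens; switching labels costs 5.
emissions = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]]
sticky = [[0.0, -5.0], [-5.0, 0.0]]
free = [[0.0, 0.0], [0.0, 0.0]]
path_sticky = viterbi(emissions, sticky)  # [1, 1, 1]
path_free = viterbi(emissions, free)      # [0, 1, 1]
```

The bigram transition scores change the decoded first label, but no score in this model can relate labels that are more than one position apart, which is the expressivity limit the abstract points to.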