Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-12

QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS

arXiv:2401.08301v2 Announce Type: replace-cross Abstract: In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.

02.
bioRxiv (Bioinfo) 2026-06-11

TMO: ASYMMETRIC CROSS-MODAL ATTENTION FOR LEARNINGCELL-STATE-DEPENDENT REGULATORY LAGS FROM SINGLE-CELL MULTIOMIC DATA

Abstract Background: Single-cell multi-omics technologies simultaneously measure chromatin accessibility (ATAC) and gene expression (RNA), providing a unique window into the temporal ordering of regulatory events during differentiation. However, most computational models treat the two modalities symmetrically, ignoring the directional relationship between chromatin and transcription, and existing lag-aware methods estimate a single global lag per gene, failing to capture cell-state-dependent dynamics. Methods and Results: We introduce Temporal Multi-Omics (TMO), a deep learning framework that learns signed, cell-state-conditional regulatory lags ({Delta}{tau}) using asymmetric cross-modal attention. TMO projects RNA and ATAC into 50 latent components each, tokenises each cell as a sequence of 100 tokens, and uses a two-pass transformer in which a data-driven lag prior - derived from a sliding-window cross-correlation function - directly biases attention asymmetrically. On four independent 10x Multiome datasets (mouse brain, human brain, mouse kidney, human PBMC), the asymmetric model achieves Lag Concordance Scores (LCS) of 0.988-0.999, compared to 0.048-0.108 for an architecturally identical symmetric baseline. A stratified 80/20 held-out experiment confirms that the learned component-lag ordering generalises to unseen cells (held-out LCS 0.85-0.99). Clustered {Delta}{tau} heatmaps show positive {Delta}{tau} (ATAC-led priming) in early pseudotime and negative {Delta}{tau} (RNA-led, activity-dependent regulation) in late pseudotime; the ATAC-RNA correlation heatmap exhibits a U-shaped pattern indicative of developmental decoupling. Components with the most positive {Delta}{tau} are enriched for chromatin organization and stem cell differentiation (FDR < 0.05), while those with the most negative {Delta}{tau} are enriched for synaptic signalling and immune activation. Ablating the cell-state information from the lag predictor reduces the LCS and collapses per-component temporal dynamics (KS p [&le;] 0.039 in all four tissues), proving that TMOs dynamic lag patterns depend on cell-state conditioning. Independent ChIP-seq validation for four transcription factors (PAX5, Pax6, ASCL1, Hnf4) confirms highly significant separation between target genes and expression-matched background (p < 10-4 in all cases). Two Multiome Perturb-seq screens provide causal validation: SMARCB1 knockout shows a directional trend (1.5-fold target shift, p = 0.056, n = 147 perturbed cells), and SMARCE1 knockout reaches statistical significance (p = 0.0089, n = 3,394 perturbed cells). Gene-level cross-correlation independently validates that the regulatory lag signal is present in the raw data, and TMO further identifies rare, statistically significant biphasic gene programs where the regulatory direction reverses across pseudotime. Conclusions: TMO is the first method to make regulatory lag a learnable, cell-state-conditional, and architecturally encoded parameter. It is scalable, interpretable, and open-source, providing a powerful tool for studying regulatory timing in development, disease, and perturbation screens.

03.
arXiv (quant-ph) 2026-06-15

Quantum-Classical Hierarchical Equations of Motion

Authors:

arXiv:2606.14363v1 Announce Type: new Abstract: We develop a quantum-classical hierarchical equations of motion (QC-HEOM) approach for simulating non-Markovian open quantum systems. The method combines the ensemble-averaged classical path reference of the quantum-classical path integral formalism with a hierarchy of auxiliary quantum influence functionals. By incorporating thermal fluctuations through an ensemble average over reference trajectories, the hierarchy is required to represent only the residual quantum memory associated with the imaginary part of the bath response function. Consequently, unlike conventional hierarchical equations of motion, QC-HEOM does not require Matsubara or Padé expansions of the thermal kernel and exhibits only weak temperature dependence of the hierarchy size. Furthermore, because thermal fluctuations are supplied through reference classical trajectories, the framework naturally extends beyond harmonic baths and enables the incorporation of anharmonic and molecular environments through externally generated trajectories. We derive the formalism and demonstrate its exactness for a harmonic bath. Applications to an asymmetric spin-boson model and the seven-site Fenna–Matthews–Olson complex illustrate the accuracy of QC-HEOM. It reproduces benchmark quasi-adiabatic path integral and hierarchical equations of motion results while requiring substantially fewer auxiliary objects, particularly at low temperatures. These results establish QC-HEOM as an efficient framework for treating residual quantum memory in quantum-classical descriptions of open-system dynamics. The separation of thermal fluctuations from residual quantum memory through the use of Wigner trajectories provides an approximate route toward hierarchical treatments of complex anharmonic environments that are inaccessible to conventional HEOM approaches.

04.
bioRxiv (Bioinfo) 2026-06-18

Deciphering shared and divergent tissue architectures from cross-species spatial transcriptomics

Authors:

The integration of spatial transcriptomics (ST) data across species is essential for cross-species and translational studies, but remains challenging due to molecular divergence and anatomical differences between organisms. We present STACAME, a graph attention autoencoder-based framework to decipher shared and divergent tissue architectures from cross-species ST data by explicitly modeling both orthologous and species-specific genes. STACAME aligns ST slices in a spatially aware manner, identifies homologous and species-specific domains, and enables a suite of downstream comparative analyses. We demonstrate its utility by integrating ST datasets from diverse tissues, including hippocampus, isocortex, embryo, breast, liver, and cerebellum, across multiple species such as human, macaque, marmoset, mouse, and zebrafish. STACAME supports cross-species spatial domain alignment, the detection of shared and divergent spatially variable genes, development alignment and comparison, and the 3D integration of tissue architecture. This flexible approach facilitates the translation of findings from model organisms to humans, providing a unified computational platform for cross-species spatial transcriptomics.

05.
arXiv (CS.LG) 2026-06-15

A Low-Rank Subspace Analysis of LLM Interventions

arXiv:2606.14388v1 Announce Type: new Abstract: Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space, and study how interventions influence across behaviors. Across multiple instruction-tuned models (7B-70B) and across refusal, jailbreak, and sycophancy settings, we find that different behaviors share internal representations, and intervening on one behavior alters others in asymmetric ways. Some behaviors act as upstream control points whose interventions propagate broadly across other behaviors, while others remain more isolated. We relate these effects to two geometric quantities: (i) the overlap between behavior subspaces, measured as the average squared cosine of principal angles, and (ii) the angle between each behavior subspace and the decision subspace (capturing the model's final decision e.g., refuse vs. comply). Empirically, intervention effects on other behaviors tend to be larger for behavior pairs with higher subspace overlap, and for source behaviors whose subspaces lie closer (smaller angle) to the decision subspace. These findings highlight a challenge for targeted behavior control: behaviors are difficult to modify independently, as interventions can propagate through shared representations and asymmetric interactions.

06.
arXiv (quant-ph) 2026-06-12

The Pound-Drever-Hall Method for Superconducting-Qubit Readout

arXiv:2512.03138v3 Announce Type: replace Abstract: Scaling quantum computers to large sizes requires the implementation of many parallel qubit readouts. Here we present an ultrastable superconducting-qubit readout method using the multi-tone self-phase-referenced Pound-Drever-Hall (PDH) technique, originally developed for use with optical cavities. In this work, we benchmark PDH readout of a single transmon qubit, using room-temperature heterodyne detection of all tones to reconstruct the PDH signal. We demonstrate that PDH qubit readout is insensitive to microwave phase drift, displaying $0.73^\circ$ phase stability over 2 hours, and capable of single-shot readout in the presence of phase errors exceeding the phase shift induced by the qubit state. We show that the PDH sideband tones do not cause unwanted measurement-induced state transitions for a transmon qubit, leading to a potential signal enhancement of at least $14$~dB.

07.
arXiv (CS.LG) 2026-06-12

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

arXiv:2601.09693v3 Announce Type: replace Abstract: Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for predefined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

08.
arXiv (CS.LG) 2026-06-11

DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics

arXiv:2606.11651v1 Announce Type: new Abstract: Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

09.
arXiv (CS.AI) 2026-06-15

Patcher: Post-Hoc Patching of Backdoored Large Language Models

arXiv:2606.02995v2 Announce Type: replace-cross Abstract: Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.

10.
arXiv (CS.CV) 2026-06-11

Cross-Modal Benchmarking for Robotic Perception in Natural Environments

Natural environments present a complex challenge to robotics perception systems. Current models, particularly vision foundation models, are largely trained on structured, urban environments leading to weaknesses in their perception for field robotics tasks. We showcase the limitations of current models using our recently released WildCross benchmark, a new cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF pose and synchronized dense lidar submaps. In this work, we provide an expanded analysis of the benchmark results from the recent WildCross benchmark, with particular emphasis on expanded metric depth estimation experiments. Access to the code repository and dataset for this work can be found at https://csiro-robotics.github.io/WildCross.

11.
arXiv (CS.AI) 2026-06-17

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

arXiv:2606.09004v2 Announce Type: replace Abstract: Feature engineering remains a cornerstone of tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for its automation, giving rise to LLM-powered Automated Tabular Feature Engineering (LATTE). However, the field lacks standardized, cost-aware evaluation platforms, and the combinatorial explosion of design choices obscures true algorithmic progress. To bridge these gaps, we systematically deconstruct 15 representative LATTE methods into a unified 6-dimensional taxonomy. Based on this abstraction, we introduce LATTEArena, a standardized, modular, and extensible benchmarking framework that decouples monolithic pipelines into reusable execution blocks. By distilling the massive combinatorial space, we evaluate 24 core LATTE configurations across 7 research questions. Our head-to-head benchmarking goes beyond predictive accuracy to quantify token efficiency and execution robustness, yielding 17 empirical findings on cost-effectiveness trade-offs. Furthermore, we provide 3 concrete recommendations for optimal real-world deployment. By enabling controlled component-level comparisons, LATTEArena shifts the paradigm from ad-hoc prompt engineering to systematic context management. All code, datasets, and over 4,000 execution logs are publicly available to foster a dynamic, community-driven benchmark. Our framework, leaderboard, and all artifacts are hosted on the LATTEArena project website at https://goodenhak.github.io/LATTEArena.

12.
arXiv (CS.CV) 2026-06-16

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighed imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

13.
arXiv (CS.CV) 2026-06-19

LooseControlVideo: Directorial Video Control using Spatial Blocking

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error; 2x improvement in Rigid Motion Consistency; and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide good geometric prior for complex, multi-agent video authoring.

14.
arXiv (CS.CV) 2026-06-17

Attention Alignment Between Humans and Vision-Language Models

Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2$\times$2 factorial of CNN vs.\ ViT encoders crossed with LSTM vs.\ Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs.\ Transformer decoders increased alignment by 40–50 percentage points (80–87\% vs.\ 40–59\% of the human noise ceiling). In contrast, CNN vs.\ ViT encoders contributed a secondary 5–20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85–87\%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.

15.
arXiv (CS.CV) 2026-06-15

PhysVLA: Towards Physically-Grounded VLA for Embodied Robotic Manipulation

Vision-Language-Action (VLA) models excel at mapping visual inputs and natural language instructions directly to robotic control policies. However, because they are trained primarily to fit behavioural demonstration data, they do not explicitly enforce fundamental physical principles such as rigid-body dynamics or contact constraints. This exposes a critical physics gap: standard temporal smoothing applied on top of single-step or chunked VLAs trades trajectory quality for added failures that short-term memory cannot resolve. To bridge this gap, we introduce PhysVLA (Physics-VLA), a plug-and-play, inference-time framework designed to wrap any frozen VLA backbone without retraining, fine-tuning, or weight access, with less than 1 ms of overhead per control step. PhysVLA intercepts the predicted control action, captures only the simulator or system state, and applies a dual-layered correction: (i) a phase-aware finite-state machine that structures discrete task segments (approach, grasp, transport, and place), and (ii) a selective Euler-Lagrange gate that activates only when a dynamics oracle detects kinodynamic inconsistency. Evaluated across OpenVLA, OpenVLA-OFT, Force-VLA, and Generalist-VLA on LIBERO-Spatial with a 7-DoF Franka Panda, the framework delivers absolute success rate increases of up to 17% and stability increases of up to 19% with no per-task regressions, improves trajectory efficiency by up to 15% across all four backbones, and shows up to a 10x improvement in trajectory jerk robustness on a Robosuite Lift cross-simulator sweep. We further validate the framework on a real Agilex Piper arm with a pick-and-place task, confirming that PhysVLA transfers to physical hardware without retraining, with success-rate improvements of up to 50%, establishing physical awareness as a composable, backbone-agnostic runtime module.

16.
arXiv (CS.LG) 2026-06-17

Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach

arXiv:2510.19528v2 Announce Type: replace-cross Abstract: We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning - a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes within this context. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence across phases. The analysis establishes high-probability regret bounds determined by two interpretable quantities, thereby providing a formal bridge between offline pre-training and online fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret reductions compared with both UCBVI and prior methods while remaining competitive with related approaches.

17.
bioRxiv (Bioinfo) 2026-06-12

DNA Compression with Genomic Language Models: Tokenization, Benchmarking, and an Information-Content Map

Lossless compression and probabilistic sequence modeling are two faces of the same coin: a model that assigns high probability to a sequence can encode it in few bits via arithmetic coding. We exploit this duality to evaluate genomic language models as compressors of DNA, using compression primarily as an objective probe of generative sequence modeling rather than as a deployable storage system. We release DNAGPT2, a family of ten GPT-2-small models pretrained for one epoch on a single A40 using the DNABERT2 multi-species corpus that differ only in byte-pair encoding vocabulary size. Coupled with arithmetic coding, the best model reaches 1.47 bits per base (bpb) on the T2T human genome, fourth in the Cobilab compression benchmark and ahead of every general-purpose compressor. Our results suggest that NLP-style tokenization choices may be suboptimal for DNA: a 32-token BPE vocabulary compresses better than larger vocabularies. We also find that, in this benchmark, published long-context genomic LMs underperform a much shorter-context BPE GPT-2; we discuss in Section 5 that this is not a controlled context-length ablation, since the compared models also differ in architecture, training data, parameter count, and tokenization. Finally, we compute a per-nucleotide information-content map of the human genome and show that exons, introns, intergenic regions, and Alu repeats have statistically distinct information profiles.

18.
bioRxiv (Bioinfo) 2026-06-11

inquiSTR: a toolkit for accurate and efficient population-scale tandem repeat genotyping and analysis

Tandem repeats are highly mutable genomic elements linked to human traits and diseases. Profiling large catalogs of tandem repeats from population-scale long-read sequencing data requires accurate and efficient tools. We introduce inquiSTR, a command-line toolkit for fast genome-wide tandem repeat length genotyping. inquiSTR, with efficient parallel processing and low-memory streaming algorithms, genotypes a genome-wide repeat catalog of 1.78 million loci in less than two minutes. Benchmarking shows high accuracy and significantly faster performance compared to existing tools and truth sets. inquiSTR also provides methods for downstream analyses such as population structure inference, association testing, and outlier detection.

19.
arXiv (CS.CV) 2026-06-16

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .

20.
arXiv (CS.LG) 2026-06-19

How to sketch a learning algorithm

Authors:

arXiv:2604.07328v3 Announce Type: replace Abstract: How does the choice of training data influence an AI model? This broad question is of central importance to interpretability, privacy, and basic science. At its technical core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ and failure probability $\delta$ in the deep learning setting. Our precomputation and prediction algorithms are only $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\tilde{O}(\log(1/\delta)/\varepsilon^2)$ models. Our proof is based on an assumption that we call stability. In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.

21.
arXiv (CS.LG) 2026-06-16

A Multimodal Approach to Alzheimer's Diagnosis: Geometric Insights from Cube Copying and Cognitive Assessments

arXiv:2512.16184v2 Announce Type: replace Abstract: Early and accessible detection of Alzheimer's disease (AD) remains a critical clinical challenge, and cube-copying tasks offer a simple yet informative assessment of visuospatial function. This work proposes a multimodal framework that converts hand-drawn cube sketches into graph-structured representations capturing geometric and topological properties, and integrates these features with demographic information and neuropsychological test (NPT) scores for AD classification. Cube drawings are modeled as graphs with node features encoding spatial coordinates, local graphlet-based topology, and angular geometry, which are processed using graph neural networks and fused with age, education, and NPT features in a late-fusion model. Experimental results show that graph-based representations provide a strong unimodal baseline and substantially outperform pixel-based convolutional models, while multimodal integration further improves balanced classification performance and discriminative ability. SHAP-based interpretability analysis identifies specific graphlet motifs associated with corner integrity and edge continuity as key predictors, closely aligning with clinical observations of distorted cube drawings in AD. Together, these findings establish graph-based analysis of cube-copying behavior as an interpretable, non-invasive, and scalable framework for Alzheimer's disease screening.

22.
arXiv (CS.CL) 2026-06-12

PRISM: Prosody-Integrated Multi-Agent Reasoning Framework for Empathetic Spoken Dialogue

Empathetic spoken dialogue systems require not only semantically appropriate responses but also emotionally aligned prosodic expression. However, cascade pipelines often discard acoustic cues during speech-to-text conversion, while end-to-end speech models lack interpretable control over emotion and knowledge integration. To address these challenges, we propose PRISM, a multi-agent framework for empathetic spoken dialogue that decouples speech perception, response generation, and speech synthesis into coordinated components. PRISM introduces a prosody-to-language translation mechanism to stabilize large language model reasoning and enables on-demand invocation of external knowledge tools for empathetic dialogue generation. Experimental results demonstrate that PRISM achieves consistent improvements in empathy, prosodic appropriateness, and text response generation quality across objective and subjective metrics. Our code is available at: https://github.com/Bxzfrm/PRISM.

23.
arXiv (quant-ph) 2026-06-16

Temporal modulation as a resource: enhanced frequency estimation in continuous variable systems

arXiv:2606.15108v1 Announce Type: new Abstract: Frequency estimation, a cornerstone of quantum metrology, has been significantly enhanced by advanced quantum sensing strategies. However, most protocols rely either on static or time-independent encoding mechanisms, inherently limiting their achievable precision scaling, or on control strategies requiring changing the Hamiltonian and/or implementing feedback mechanisms. To overcome this, we investigate a simpler dynamical encoding protocol where the quantum oscillator is driven by a general continuous temporal frequency modulation $\Omega(t) = \omega_0 f(t)$. We analytically demonstrate that for a given modulation profile $f(t)$ and its corresponding time-integral $F(t)$, the quantum Fisher information (QFI) scales as $\mathcal{O}(F(t)^2)$. This enhancement stems from the fact that temporal encoding fundamentally alters the mechanism of dynamical phase accumulation. Crucially, when evaluated under the energy and evolution-time constraints, this framework reveals a genuine precision enhancement over the conventional time-independent baseline. By analyzing explicit polynomial and exponential modulations, we establish that arbitrary precision scaling can be deterministically engineered, with ultimate bounds that are asymptotically saturable via optimal homodyne detection. Our framework provides a universal paradigm for exploiting time-dependent quantum control in next-generation sensors.

24.
arXiv (CS.CV) 2026-06-19

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate and verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.

25.
arXiv (CS.CV) 2026-06-11

Seeing What Matters: Perceptual Wrapper with Common Randomness for 3D Gaussian Splatting

While 3D Gaussian Splatting (3DGS) achieves impressive real-time rendering, it frequently struggles to synthesize high-frequency textures, a limitation heavily exacerbated in memory-constrained and rate-distortion-optimized (RDO) pipelines. To address this, we propose a versatile 2D perceptual wrapper that enhances the rendered outputs of existing 3DGS representations in a content- and view-dependent manner. Our method leverages a lightweight synthesis network conditioned on pseudo-random Gaussian noise to synthesize perceptually plausible textures. Supervised by Wasserstein Distortion, the network learns to match local feature statistics rather than strictly enforcing pixel-wise reconstruction fidelity, effectively mitigating the blurriness inherent in standard frameworks. We demonstrate the broad applicability of our plug-and-play approach across vanilla, memory-constrained, and RDO 3DGS methods. Comprehensive subjective and objective experiments confirm that our method significantly improves over existing baselines, yielding superior perceptual quality at sharply reduced file or model sizes.