Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-17

Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity

arXiv:2606.17830v1 Announce Type: cross Abstract: Neural network parameter spaces are inherently non-injective, as distinct parameter configurations can realize identical functions through functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings that fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants–sinusoidal and rotary positional encodings (RoPE)–we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity, and through an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings crucially depend on the positional encoding.

02.
arXiv (CS.LG) 2026-06-18

QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization

arXiv:2605.04267v2 Announce Type: replace Abstract: Interactive multi-objective optimization systems face a budget allocation dilemma: one can spend resources on expensive objective evaluations or on eliciting decision-maker preferences that identify the relevant region of the Pareto set. Moreover, preference elicitation itself spans modalities with different information content and cognitive burden, ranging from cheap, noisy pairwise preference statements (PS) to richer but costlier indifference adjustments (IA). We study cost-aware optimization under an unknown scalarization and introduce QUIVER (Query-Informed Value Estimation for Regret), a surrogate-assisted evolutionary multi-objective optimizer that adaptively chooses between objective evaluations and heterogeneous preference queries. At each step, QUIVER selects the next action by maximizing the expected decision-quality improvement per unit total cost. Across DTLZ and WFG benchmarks under synthetic decision-maker models, QUIVER achieves the lowest final utility regret on challenging WFG problems (utility regret of 2.14 on WFG4, 2.82 on WFG9: a 25% improvement over baselines), outperforming all single-modality baselines. We analyze how the optimal mix of PS and IA adapts to problem difficulty: on easy problems (DTLZ2), QUIVER selects 80\% PS queries; on hard problems (WFG9), it shifts to 35% IA queries. This adaptive modality selection demonstrates cost-aware preference learning in action.

03.
arXiv (CS.CL) 2026-06-16

ROMPAR: Morphological Completion and Demographic Unlearning for Romanian-Accented Speech Recognition

Automated transcription of parliamentary proceedings faces significant hurdles due to demographic bias, dialectal variation, and technical artifacts such as utterance truncation during segmentation. This paper introduces the ROManian PARliamentary Speech Corpus (ROMPAR) dataset, a 17.80-hour corpus of Romanian and Moldavian parliamentary speech, featuring double-annotated ground truth and explicit labels for reconstructed word fragments. To build a robust ASR system, we propose a multi-task adversarial training framework that enforces demographic invariance across age, gender, and dialect. We address the inherent instability of adversarial objectives in generative architectures by introducing an exponential decay mechanism for the adversarial coefficients. Furthermore, we implement an LLM-guided decoding strategy with position-dependent weighting to facilitate morphological completion of truncated terminal words. Our results demonstrate that the proposed framework significantly reduces WER and achieves an F1-score of 96.6% in morphological reconstruction.

04.
arXiv (CS.AI) 2026-06-16

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

arXiv:2606.14801v1 Announce Type: cross Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

05.
arXiv (CS.LG) 2026-06-18

Sequential Hiring of Contingent Workers Through Learning-Based Optimization

arXiv:2606.18438v1 Announce Type: cross Abstract: In this paper, we study a sequential workforce management problem in a contingent labor setting with uncertainty in both worker production and labor supply. A firm seeks to maximize cumulative profit by maintaining an active team of fixed size while learning worker productivity over time. We emphasize two critical operational frictions in this problem: replacing workers is costly, and workers may not be available immediately for hiring because of, for example, prior job commitments, scheduling constraints, or onboarding procedures. Thus, hiring decisions take effect only after a random delay. We formulate this problem as a stochastic multi-play bandit with costly switching and delayed actions, and develop a learning-based hiring policy, DR-UCB (DelayedReplacement-UCB), that makes replacement and hiring decisions sequentially through learning cycles. In each cycle, the policy uses real-time production data to determine when to initiate workforce changes and which workers to replace and hire. We show that the leading-order regret of the proposed policy matches its lower bound in its dependence on the time horizon. Our numerical experiments show that DR-UCB outperforms benchmark policies.

06.
arXiv (CS.LG) 2026-06-19

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Authors:

arXiv:2606.19367v1 Announce Type: new Abstract: Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $\lambda$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $\lambda(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $\lambda(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at https://github.com/tiexinding/NPM-Weibull-public.

07.
arXiv (CS.CV) 2026-06-11

Precision-Aware Illumination-Disentangled Vision Transformer for Spacecraft 6D Pose Estimation

Vision sensors provide a lightweight solution for spacecraft proximity operations, but monocular spacecraft 6D pose estimation remains difficult under illumination variation, specular reflection, shadowing, weak texture, and background interference. These factors make local visual evidence spatially unreliable and can destabilize pose regression. This article proposes a Precision-Aware Illumination-Disentangled Vision Transformer (PAID-ViT) for robust spacecraft pose estimation.The proposed model separates pose-relevant structure tokens from illumination-sensitive appearance tokens, estimates patch reliability before pose aggregation, and uses foreground mask supervision to preserve silhouette cues. A parameter-free geometric recovery module converts normalized crop coordinates, log-depth, and a continuous 6D rotation representation into camera-frame rotation and translation. Experiments on SPEED+ V2, the SPEED+ validation/lightbox/sunlamp evaluation configuration used in this study, suggest that PAID-ViT reduces translation error and improves robustness in the challenging sunlamp domain, while ablation studies support the complementary roles of illumination disentanglement, reliability-aware token aggregation, mask supervision, and training-side regularization.

08.
arXiv (CS.LG) 2026-06-12

Hierarchical Successor Representation for Robust Transfer

arXiv:2602.12753v2 Announce Type: replace Abstract: The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.

09.
arXiv (CS.CL) 2026-06-11

Agreement in Representation Space for Open-Ended Self-Consistency

Self-consistency improves LLM reasoning by sampling multiple outputs and selecting the most consistent answer, but existing formulations largely rely on exact matching and therefore remain limited to tasks with categorical outputs. In this work, we study self-consistency in open-ended generation tasks such as code synthesis and text summarization. We hypothesize that consistency can be understood as a geometric property of the generation space, where semantically compatible generations concentrate in similar regions of representation space. To study this hypothesis, we introduce Embedding-Based Agreement (EBA), a simple training-free operationalization that estimates agreement by clustering sampled generations in embedding space. Through experiments on mathematical reasoning, code generation, and summarization, we show that agreement in representation space provides a robust and scalable signal of self-consistency for open-ended tasks. In particular, EBA consistently outperforms random selection and exhibits more stable scaling behavior than recent selection approaches based on LLM evaluation or uncertainty estimation. We further show that these agreement signals remain stable across model families and embedding spaces, even with native hidden representations. Finally, our analysis shows that the geometric location occupied by sampled generations is strongly correlated with generation quality: generations concentrated near central regions of representation space tend to correspond to more reliable outputs, whereas peripheral generations are substantially less accurate. Overall, our findings support viewing self-consistency as a property of the geometric organization of sampled generations rather than exact symbolic overlap.

10.
arXiv (CS.LG) 2026-06-16

{\alpha}-Fair Insurance Pricing: A Fairness Continuum

arXiv:2606.14898v1 Announce Type: new Abstract: Fairness in insurance pricing remains a long-standing and deeply debated puzzle. On one hand, insurers, driven by profitability considerations, set premiums that differentiate across individual risks to achieve actuarial fairness. On the other hand, insurance serves a critical societal function by pooling risks across a population, motivating cross-subsidization among groups to promote solidarity fairness. The tension between these two competing notions of fairness makes insurance pricing inherently complex, particularly in modern settings where granular data allow for increasingly fine risk differentiation and regulators face growing pressure to protect vulnerable groups. To address this challenge, we propose an $\alpha$-Fair Individual Solvent Premium ($\alpha$-FISP) framework for insurance pricing that explicitly captures the trade-off between actuarial and solidarity fairness while guaranteeing solvency, a fundamental requirement in insurance operations. We formulate the pricing problem as a constrained optimization task, where actuarially fair premiums are adjusted subject to budget constraints on cross-subsidization within each risk class. This formulation naturally yields a family of solutions parameterized by $\alpha$, tracing a continuum between purely actuarial and purely solidarity-based pricing and enabling decision-makers to select an operating point along this fairness spectrum. We derive theoretical guarantees for the proposed framework. Numerical experiments show that $\alpha$-FISP is computationally tractable and aligns well with the U.S. regulatory regimes featuring heterogeneous state-level fairness requirements.

11.
medRxiv (Medicine) 2026-06-18

Expert in Ultrasound Skills: Feasibility of an IMU-video platform to describe technical profiles during focused cardiac ultrasound. Pilot study

Background: Focused cardiac ultrasound (FoCUS) is operator dependent and requires coordinated probe manipulation, image interpretation and iterative visual feedback. Existing assessment approaches often emphasize final image quality or expert rating. We developed Expert in Ultrasound Skills (EXUS) , a platform that synchronizes transducer-mounted inertial measurement unit (IMU) data with ultrasound video, and evaluated its technical feasibility during FoCUS acquisition. Methods: This observational pilot study included 6 operators performing two repetitions of a four-view FoCUS protocol, yielding 12 analytical sessions and 48 planned acquisitions. Feasibility was defined by acquisition completion, video availability, start/stop events, fused IMU-video windows, temporal coverage, complete human label entries and IMU integrity. A 100-image Likert rating task was used to summarize pairwise inter-rater agreement for still-frame image quality assessment. Results: All 48 planned acquisitions were completed with video, start/stop events, fused windows and complete human label entries. Temporal coverage was at least 90% in 47/48 acquisitions. IMU integrity endpoints exceeded the 80% threshold: 43/48 acquisitions had no extreme IMU-derived artifact, 43/48 had no active-segment IMU restart and 44/48 had no complete motion flatline. Mean pairwise exact agreement for the Likert task was 38.9%, with mean quadratic-weighted Cohen's kappa of 0.564. Post hoc profiles varied across duration, visual quality, mechanical load and motor efficiency. Conclusions: EXUS was technically feasible for synchronized IMU-video capture during FoCUS. The pilot supports multimodal acquisition data as a way to describe technical profiles and generate formative feedback hypotheses, but the post hoc indices are not validated competency measures. Keywords: focused cardiac ultrasound; point-of-care ultrasound; inertial measurement unit; medical education; deliberate practice

12.
arXiv (CS.LG) 2026-06-18

Shrinkage priors for Bayesian Substitute Confounders

arXiv:2606.18535v1 Announce Type: cross Abstract: Multi-cause observational studies contain information about unmeasured confounding through the dependence structure among causes. However, literal imputation of the unobserved confounder is often more complex than learning a lower-dimensional substitute score that preserves the shared assignment variation needed for stable causal adjustment. The deconfounder (Wang and Blei, 2019) and related substitute confounder methods exploit this idea, but flexible assignment models can fit the joint distribution of the causes while producing scores that over-encode the treatment vector, collapse overlap, or capture single-cause variation. We develop a Bayesian factor assignment framework for learning sparse substitute confounders that retain coarse multi-cause dependence with shrinkage priors. The theory is stated at the level of posterior concentration, factor score contraction, and overlap-preserving assignment geometry and therefore does not rely on a particular shrinkage prior. Under these conditions, the proposed regression-adjusted estimators are consistent for mean potential outcomes when the corresponding latent variable identification assumptions hold. Shrinkage priors provide a natural tool for latent structural learning: they favour low-dimensional factors supported by multiple causes, discourage effectively single-cause factors, and induce an ordering of the latent factors through progressive shrinkage. Synthetic experiments illustrate the roles of signal strength, outcome validity, and geometry-aware regularization. In an Alzheimer's Disease Neuroimaging Initiative (ADNI) baseline analysis, sparse substitute scores recover much of the adjustment obtained by directly conditioning on invasive cerebrospinal-fluid biomarkers, while collapse diagnostics identify when fitted factors reduce to individual observed measurements.

13.
arXiv (CS.CL) 2026-06-15

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streaming behavior before compressing the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Our study shows that model compression trades performance for memory footprint, and we highlight an operating point where FP16 reduces model size by half with essentially unchanged real-time factor, at a cost of a 40\% relative DER increase against the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that can enable reliable human communication in time-critical contexts.

14.
arXiv (CS.CV) 2026-06-16

MVEB: Massive Video Embedding Benchmark

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.

15.
arXiv (CS.CV) 2026-06-17

Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation

Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at https://github.com/cymasuna/CERS.

16.
medRxiv (Medicine) 2026-06-10

Towards the Virtual Amyotrophic Lateral Sclerosis Patient: Inferring Cortical Excitability through Whole-Brain Dynamical Modeling

Amyotrophic lateral sclerosis (ALS) is increasingly recognized as a multisystem neurodegenerative disorder in which motor-neuron degeneration is accompanied by widespread alterations in cortical dynamics. Among its most reproducible neurophysiological signatures is cortical hyperexcitability, yet how this local excitability imbalance shapes distributed whole-brain activity remains poorly understood. Here, we combined source-reconstructed resting-state MEG data, tractography-informed whole-brain modeling, and simulation-based inference to investigate whether ALS-related alterations in large-scale brain dynamics can be mechanistically explained by changes in cortical excitability. First, we characterized empirical brain dynamics using complementary features spanning regional activity amplitude and variability, functional connectivity, and avalanche-based metrics. These analyses revealed significant alterations in ALS patients relative to healthy controls, as well as associations with clinical impairment and disease staging. To mechanistically interpret these changes, we employed a reduced Wong-Wang whole-brain model in which local recurrent excitation modulates emergent large-scale neural dynamics. Simulations showed that increasing excitability systematically reproduced the empirical dynamical signatures observed in ALS. We then applied a simulation-based inference framework to estimate latent excitability parameters directly from empirical observations. Whole-brain model inversion revealed increased excitability in ALS patients compared with controls. The recovered excitability parameter was associated with disease staging, supporting its clinical relevance as a model-derived descriptor of ALS progression. Finally, by extending the model to estimate frontal and non-frontal excitability separately, we found that ALS-related alterations were predominantly associated with increased frontal excitability, whereas non-frontal regions appeared comparatively less affected. The recovered parameters related to disease staging. Together, these findings provide a mechanistic framework linking altered large-scale brain dynamics in ALS to selective cortical hyperexcitability, explaining how local excitability changes can give rise to global network reorganization. More broadly, they show how computational model inversion can recover latent multiscale pathophysiological processes from empirical neural recordings, offering a non-perturbative alternative to complex experimental paradigms typically required to causally probe local-to-global mechanisms.

17.
medRxiv (Medicine) 2026-06-22

Regional Service-System Conditions Associated with Facility-Linked Home-Based Specialist Care in Japan: A Claims-Based Ecological Study of Home Dialysis

Authors:

Background Complex chronic care is increasingly delivered in patients' homes while remaining linked to specialist facilities for training, monitoring, and backup care. Home dialysis provides a useful case because peritoneal dialysis (PD) and home hemodialysis (HHD) share a home-facility delivery structure but differ in technical and operational requirements. This study examined regional service-system conditions associated with the presence and scale of PD and HHD in Japan. Methods This ecological study used publicly available claims, administrative, census, and geospatial data harmonized to 334 Secondary Medical Areas. Regional indicators were organized into four domains: dialysis service delivery, implementation support for home-based care, hospital backup capacity, and living and sociodemographic context. Diffusion was examined using claims-based indicators of regional presence and post-presence scale, analyzed separately for PD and HHD with Firth penalized logistic regression and zero-truncated negative binomial regression, respectively. Results PD was observed in 271 regions and HHD in 109. Patterns of associated regional conditions differed by modality and stage. PD was associated mainly with existing dialysis-service organization, whereas HHD was associated with broader regional supports, including home-care delivery, living infrastructure, transition support, and hospital-system indicators. Conditions associated with presence differed from those associated with scale. Cross-modality associations suggested that shared regional factors may shape the distribution of both modalities. Conclusions Regional conditions for home dialysis diffusion in Japan differed by modality and stage. PD was linked mainly to existing dialysis-service organization, whereas HHD was linked to multi-domain regional support for technically demanding home treatment. Under standardized reimbursement, local service-system capacity may remain important for modality- and stage-specific diffusion of home dialysis.

18.
arXiv (CS.AI) 2026-06-15

Metabolic cost of information processing in Poisson variational autoencoders

arXiv:2602.13421v2 Announce Type: replace-cross Abstract: Computation in biological systems is fundamentally energy-constrained, yet standard theories of computation treat energy as freely available. Here, we argue that variational free energy minimization under a Poisson assumption offers a principled path toward an energy-aware theory of computation. Our key observation is that the Kullback-Leibler (KL) divergence term in the Poisson free energy objective becomes proportional to the prior firing rates of model neurons, yielding an emergent metabolic cost term that penalizes high baseline activity. This structure couples an abstract information-theoretic quantity – the *coding rate* – to a concrete biophysical variable – the *firing rate* – which enables a trade-off between coding fidelity and energy expenditure. Such a coupling arises naturally in the Poisson variational autoencoder (P-VAE) – a brain-inspired generative model that encodes inputs as discrete spike counts and recovers a spiking form of *sparse coding* as a special case – but is absent from standard Gaussian VAEs. To demonstrate that this metabolic cost structure is unique to the Poisson formulation, we compare the P-VAE against Grelu-VAE, a Gaussian VAE with ReLU rectification applied to latent samples, which controls for the non-negativity constraint. Across a systematic sweep of the KL term weighting coefficient $\beta$ and latent dimensionality, we find that increasing $\beta$ monotonically increases sparsity and reduces average spiking activity in the P-VAE. In contrast, Grelu-VAE representations remain unchanged, confirming that the effect is specific to Poisson statistics rather than a byproduct of non-negative representations. These results establish Poisson variational inference as a promising foundation for a resource-constrained theory of computation.

19.
arXiv (CS.LG) 2026-06-16

A Spatio-Temporal Expert Prefetching Framework for Efficient MoE-based LLM Inference

arXiv:2606.15453v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) based large language models (LLMs), such as Qwen and DeepSeek, have recently emerged as an effective approach to improving model capacity without proportionally increasing computational cost. By replacing the conventional feed-forward network in dense LLMs with a set of experts and activating only a subset of them for each input token, MoE models significantly increase the total number of parameters while keeping the per-token computation relatively manageable. However, this dynamic and irregular expert activation pattern also introduces substantial expert loading overhead during inference, since the required experts must be fetched on demand according to token-dependent routing results. As a result, expert loading latency becomes a major source of performance and energy inefficiency. To this end, we first perform a comprehensive analysis of expert selection behavior in various MoE-based LLMs and applications, including language understanding and code generation. Our analysis reveals that, within each application domain, expert requests exhibit strong correlation across both adjacent MoE layers and consecutive decoding tokens, making future expert activations predictable. Based on this insight, we propose ST-MoE, a spatio-temporal expert prefetching framework that proactively stages experts ahead of use to overlap expert loading with ongoing computation. ST-MoE combines a lightweight runtime prediction mechanism that preserves the original routing behavior with a reconfigurable hardware design that efficiently supports dynamic expert prefetching. The combined effect of the prediction mechanism with the supporting hardware significantly improves MoE inference performance and energy efficiency while preserving model inference accuracy.

20.
arXiv (CS.CV) 2026-06-15

Pix2Pix-Hybrid: Structure-Guided Conditional Synthesis of Hajj Crowd Images with Multi-Channel Conditioning and Weak Attribute Supervision

Developing accurate crowd-counting models for Hajj pilgrimage scenes remains challenging because domain-specific annotated images are scarce and data collection during large gatherings raises privacy concerns. To address these limitations, this paper proposes Pix2Pix-Hybrid (P2P-H), a hybrid conditional GAN for structure-guided Hajj crowd-image synthesis and data augmentation. P2P-H builds on Pix2Pix and employs a U-Net generator conditioned on eight input channels that jointly encode structural cues (edges and grayscale) and contextual attributes (crowd density and time of day). To capture detailed textures in dense scenes, the framework integrates two multi-scale PatchGAN discriminators operating at different resolutions. The training procedure combines adversarial, perceptual, and feature-matching objectives with adaptive data augmentation and stabilization strategies. The model was trained on 993 real Hajj frames collected from 60 publicly available video sources, with conditioning attributes derived automatically to reduce manual labeling effort. Using this framework, we constructed CrowdH, a synthetic dataset of 10,000 high-resolution Hajj crowd images. Experimental results show that P2P-H improves structure-preserving conditional synthesis quality compared with Pix2Pix and StyleGAN2-ADA baselines and shows favorable transfer to other crowd datasets. To assess downstream utility, we further constructed CrowdH-Mix-469, an annotated mixed real-synthetic dataset comprising 384 real Hajj images and 85 selected synthetic images,and evaluated five crowd-counting models under real-only and real-plus-synthetic training. The selected synthetic data reduced MAE across all five models, with the strongest gain observed for CSRNet.

21.
arXiv (CS.LG) 2026-06-16

HRIR-Former: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer

arXiv:2603.27998v2 Announce Type: replace-cross Abstract: Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose HRIR-Former, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase preprocessing is unnecessary.

22.
arXiv (quant-ph) 2026-06-16

Electronic Band Structure of Silicon Determined via a Variational Adiabatic Eigensolver: Theory and Experiment

arXiv:2606.16604v1 Announce Type: new Abstract: This work addresses the critical challenge of excited-state preparation for semiconductor band structure calculations. We introduce a variational adiabatic eigensolver (VAE) protocol that combines adiabatic evolution with variational optimization to prepare high-fidelity eigenstates on noisy intermediate-scale quantum (NISQ) devices. Applying a momentum-space truncation, we accurately compute the electronic band structure of silicon – an idealized infinite periodic system – using only a modest number of qubits. Our approach employs multi-qubit parameterized circuits and a phase-based loss function, overcoming limitations of conventional methods. These limitations include the circuit-construction difficulty in traditional adiabatic approaches and the reduced accuracy of variational quantum eigensolvers for excited states. Through rigorous numerical simulation and experimental implementation on a superconducting quantum processor, we successfully prepare silicon's valence-band and conduction-band eigenstates. Single-shot readout yields state fidelities exceeding 96%, and the measured energy expectations agree with theoretical band energies within 0.5 eV. Further refinement via single-frequency oscillation fitting reduces the energy deviation to below 0.01 eV. This framework provides a robust and practical pathway for precisely determining electronic structures in quantum materials.

23.
arXiv (CS.CV) 2026-06-15

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.

24.
arXiv (CS.LG) 2026-06-18

Unsupervised Diffusion Solver for Combinatorial Optimization via Combinatorial Adjoint Matching

arXiv:2605.30920v2 Announce Type: replace Abstract: Diffusion-based neural solvers have shown strong promise for combinatorial optimization (CO), but existing methods typically rely on supervised training with large collections of near-optimal solutions. In this work, we extend adjoint-based trajectory optimization methods to discrete combinatorial domains. We formulate diffusion-based CO as a stochastic control problem over Continuous-Time Markov Chains and introduce discrete adjoint dynamics for propagating optimization signals through discrete generative trajectories. Building on this formulation, we propose Combinatorial Adjoint Matching (CAM), an unsupervised training framework for discrete diffusion solvers with structured and low-variance trajectory-level optimization signals. Empirically, CAM consistently outperforms existing unsupervised diffusion baselines and achieves performance competitive with strong supervised diffusion solvers and even traditional solvers across diverse combinatorial optimization problems. Our code is available at https://github.com/Shengyu-Feng/CAM.

25.
arXiv (CS.LG) 2026-06-19

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

arXiv:2606.19451v1 Announce Type: new Abstract: We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.