Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-12

What Type of Inference is Active Inference?

arXiv:2606.04935v2 Announce Type: replace Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that full EFE-based planning outperforms ablations that omit either the planning correction or the epistemic corrections.

02.
bioRxiv (Bioinfo) 2026-06-11

DeePEn - A Depth sensitive benchmark for Protein Engineering

Recent progress in modeling techniques and high-throughput screening has significantly enhanced the accessibility of protein engineering. Nevertheless, further progress gets hindered by the lack of robust benchmarks that capture the practical challenges for real-world protein engineering. Here, we introduced DeePEn, a Depth-sensitive benchmark for Protein Engineering that quantifies a models generalization capabilities when predicting protein fitness at increasing mutational distance from the wildtype or training data. We defined distance as the number of simultaneous point mutations, i.e., single amino acid variants (SAVs), moving from wild-type to mutant (edit distance in computer science jargon). Specifically selecting four deep mutational scanning (DMS) datasets with sufficient multi-mutation data points from ProteinGym, we assessed recent predictive models, including general and biophysics-informed protein Language Models (pLMs), and a non-transformer neural network. Our results highlight how the performance of all models deteriorates with increasing mutational distance and that no single metric sufficiently captures the diverse requirements of protein engineering. To overcome these shortcomings, DeePEn provides a readily available resource for multi-metric benchmarking that focuses on the prediction of distant variants.

03.
arXiv (CS.LG) 2026-06-16

How to Score Experts for One-Shot MoE Expert Pruning: A Unified Formulation and Selection Principle

arXiv:2606.15716v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) language models reduce per-token computation through sparse expert activation, yet deployment still requires storing the full expert pool, making one-shot expert pruning a practical approach for reducing memory usage. Although effective, existing criteria are largely heuristic, and no single criterion is universally optimal. Thus, establishing a principle for selecting pruning criteria suited to different deployment objectives remains an important yet largely underexplored problem in one-shot expert pruning. To this end, we introduce a unified formulation for one-shot MoE expert pruning organized around three factors: routing frequency, gate weighting, and activation strength. The formulation yields a criteria selection principle: task-agnostic pruning should favor routed-token-averaged, gate-free activation-based criteria, whereas task-specific pruning can benefit from retaining routing-frequency and gate-weight information. Beyond this principle, the formulation also provides a systematic view of existing heuristic criteria and gives rise to two new task-agnostic criteria, Mean Activation Norm (MAN) and Mean Squared Activation Norm (MSAN). Across four representative MoE models and 16 diverse benchmarks, MAN and MSAN are consistently strong in the task-agnostic setting, obtain the top-two average ranks, and improve average performance by up to 8.8 points over the strongest baseline.

04.
arXiv (quant-ph) 2026-06-16

Optical Creation of Synthetic Microgravity for Quantum Degenerate Gases

arXiv:2606.14985v1 Announce Type: cross Abstract: Microgravity environments provide unique opportunities for ultracold-atom experiments by enabling long interrogation times and reduced acceleration-induced dynamics. However, their realization has largely been restricted to specialized facilities such as drop towers, sounding rockets, and space-based laboratories. Here we realize synthetic microgravity for quantum degenerate gases using optically engineered force landscapes that compensate Earth's gravity to the milli-g level while maintaining continuous confinement of the atomic ensemble. These force landscapes are generated by dynamically painted optical dipole potentials and calibrated in situ through Bloch oscillations in a vertical optical lattice, enabling precise control of the residual acceleration. We use this capability to demonstrate matter-wave beam splitting with arm separations of several hundred microns. We further implement a Bloch-band atom interferometer in which interaction-induced dephasing is strongly suppressed through controlled three-dimensional expansion in the synthetic microgravity potential. This reduction of mean-field effects restores near-$\sqrt{N}$ scaling of interferometric sensitivity for large quantum degenerate ensembles. Our results establish a versatile platform for realizing synthetic microgravity with trapped quantum gases in terrestrial laboratories, bringing the advantages of microgravity experiments to continuously operating systems and opening new opportunities for quantum sensing, matter-wave interferometry, and precision measurements.

05.
arXiv (CS.LG) 2026-06-12

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

arXiv:2606.12486v1 Announce Type: new Abstract: Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.

06.
arXiv (CS.LG) 2026-06-12

Adaptive generative moment matching networks for improved learning of dependence structures

arXiv:2508.21531v2 Announce Type: replace-cross Abstract: An adaptive bandwidth selection procedure for the mixture kernel in the maximum mean discrepancy (MMD) for fitting generative moment matching networks (GMMNs) is introduced, and improved learning of copula random number generators is demonstrated. Based on the relative error of the training loss, the number of kernels is increased during training; additionally, the relative error of the validation loss is used as an early stopping criterion. While training time remains similar, adaptively training GMMNs (AGMMNs) significantly increases training performance, which is shown based on validation MMD trajectories, samples and validation MMD values. Superiority of AGMMNs over GMMNs and parametric copula models is also demonstrated in terms of three applications. First, convergence rates of estimators based on quasi-random versus pseudo-random samples from copulas are investigated in dimensions as large as 100 for the first time. Second, replicated validation MMDs, as well as Monte Carlo and quasi-Monte Carlo applications demonstrate the improved training of AGMMNs for a copula model implied by the 50 constituents of the S&P 500 index after deGARCHing. Last, both the latter dataset and 50 constituents of the FTSE 100 are used to demonstrate that the improved training of AGMMNs indeed translates to an improved model prediction.

07.
arXiv (CS.CV) 2026-06-16

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

08.
arXiv (quant-ph) 2026-06-12

From 2D Yang-Mills to Calogero-Sutherland via a colored particle

arXiv:2606.13388v1 Announce Type: cross Abstract: We study Yang-Mills theory coupled to a particle on a cylinder, where gauge invariance and compactness reduce the dynamics to a finite dimensional quantum system. In the Abelian case, this yields a model equivalent to the Landau problem on a torus, with a degenerate ground state structure. We generalize this construction to non-Abelian gauge groups and show that, for SU(N), the system reduces to a one dimensional quantum many body problem with a singular Calogero-Sutherland-type interaction.

09.
arXiv (CS.CL) 2026-06-12

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits to enter latent mode and to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

10.
arXiv (CS.CV) 2026-06-18

Optimizing Incomplete, Large-Scale and Sparse Multi-Graph Matching in Bioimaging

Multi-graph matching is a fundamental problem in computer vision. Our work is motivated by a challenging application in bioimaging, where dozens or even hundreds of 3D microscopy images of worms must be brought into correspondence. Existing datasets do not cover this large-scale regime, and virtually all existing methods are inapplicable because they assume a complete or dense problem setting. To support further research, our first contribution is a new large-scale dataset based on problem instances from bioimaging. Our second contribution is a comprehensive analysis of the two main multi-graph matching paradigms: direct and permutation synchronization-based formulations. We argue, in part by proof, that practical large-scale methods must explicitly address problem sparsity and incompleteness. Since standard permutation synchronization approaches fail in this setting, we further introduce a sparse permutation synchronization paradigm. Our final contribution is GREEDA, a general method for sparse and incomplete problems that can be instantiated across cost orders and paradigms. While our paper focuses on objective functions up to quadratic order, GREEDA is inherently generalizable to arbitrary orders. On larger, sparse instances, GREEDA outperforms competing methods in both objective value and runtime. For example, for moderately-sized problems based on 30 worm images GREEDA produces a high-quality solution within 2 minutes, whereas competitors require at least half an hour and yield far worse results. On smaller dense problems, GREEDA remains on par with leading methods while being an order of magnitude faster.

11.
arXiv (CS.LG) 2026-06-18

Beyond Algorithms: Conceptual Innovation in Medical Imaging AI

arXiv:2606.19270v1 Announce Type: cross Abstract: Artificial intelligence has driven rapid progress in medical imaging research, producing increasingly sophisticated algorithms and steady improvements on benchmark tasks. However, this algorithm-centric trajectory has also revealed a growing imbalance: while computational methods advance rapidly, the conceptual foundations that define imaging tasks, evaluation metrics, and clinical meaning sometimes remain underexamined. In this Perspective, we distinguish algorithmic innovation, which focuses on improving computational implementations and performance within a fixed problem definition, from conceptual innovation, which reframes what problems are posed, how success is measured, and why an approach is clinically relevant. We argue that prevailing incentive structures, training pathways, and publication norms disproportionately reward algorithmic novelty, particularly for early-career researchers, while at times undervaluing conceptual contributions that are essential for scientific maturation and clinical translation. Through representative examples from medical imaging AI, we show how insufficient conceptual grounding can lead to misaligned objectives, fragile generalization, and limited real-world impact. We conclude with actionable recommendations for researchers, mentors, reviewers, and journals to better recognize, support, and integrate conceptual innovation alongside algorithmic advances.

12.
arXiv (CS.LG) 2026-06-15

Federated Learning for Feature Generalization with Convex Constraints

arXiv:2606.14416v1 Announce Type: new Abstract: Federated learning (FL) often struggles with generalization due to heterogeneous client data. Local models are prone to overfitting their local data distributions, and even transferable features can be distorted during aggregation. To address these challenges, we propose FedCONST, an approach that adaptively modulates update magnitudes based on the parameter strength of the global model. This prevents over-emphasizing well-learned parameters while reinforcing underdeveloped ones. Specifically, FedCONST employs linear convex constraints to ensure training stability and preserve locally learned generalization capabilities during aggregation. A Gradient Signal to Noise Ratio (GSNR) analysis further validates the effectiveness of FedCONST in enhancing feature transferability and robustness. As a result, FedCONST effectively aligns local and global objectives, mitigating overfitting and promoting stronger generalization across diverse FL environments, achieving state-of-the-art performance.

13.
arXiv (CS.CL) 2026-06-15

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Language Models. In contrast to conventional approaches dependent on model training and require offline reinforcement learning policy models for user groups, UP-NRPA enables dynamic customization of dialogue strategies through an adaptive mechanism. This is achieved by leveraging real-time user feedback alongside personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning. In collaborative and non-collaborative dialogue benchmarks, UP-NRPA demonstrated considerable benefits, achieving an impressive 100% success rate in multiple dialogue tasks. Particularly in negotiation tasks, the sale-to-list ratio (SL) increased by 56.41%. This demonstrates that UP-NRPA can adapt to diverse user needs without requiring a training mechanism, enabling the dialogue system to adapt to user characteristics.

14.
arXiv (CS.AI) 2026-06-16

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

arXiv:2606.08090v2 Announce Type: replace-cross Abstract: Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.

15.
arXiv (CS.CV) 2026-06-12

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.

16.
arXiv (CS.CV) 2026-06-15

BoRAD: Bootstrap your Own Representations for Multi-class Anomaly Detection

Reconstruction-based anomaly detection is attractive for industrial inspection, but scaling it from category-specific training to a one-for-all setting is challenging. A single model must reconstruct diverse normal appearances without copying abnormal details, which exposes two coupled failure modes: identical shortcut, where anomalies pass through the reconstruction path, and mis-reconstruction, where normal categories are confused with one another. We propose BoRAD, a label-free training framework that treats this as a representation-capacity allocation problem. BoRAD uses a shared learnable prototype bank to impose two complementary regularizers: spatial prototype alignment contracts local within-prototype variation to suppress anomaly copying, while prototype-relative global alignment preserves between-prototype structure and improves sensitivity to abnormal angular deviations. The prototype bank and prediction heads are used only during training; inference remains a standard teacher-student feature discrepancy pass, with no class labels, negative pairs, memory retrieval, or prototype lookup. BoRAD achieves competitive one-for-all anomaly detection performance, including 86.2\% mAD on MVTec AD, 80.7\% mAD on VisA and 73.1\% mAD on Real-IAD. Diagnostic analyses further show reduced anomaly leakage, improved normal-category separability, and stronger anomaly-normal score separation.

17.
arXiv (quant-ph) 2026-06-19

Resolving problems with the continuum limit in coherent-state path integrals

arXiv:2602.02466v2 Announce Type: replace Abstract: The paper solves the problem of continuum limit in bosonic thermal coherent-state path integrals. For this purpose, exact discrete versions of the path integral are constructed for three different orderings of the Hamiltonian: normal, anti-normal and symmetric (Weyl order). Subsequently, their different continuum versions are checked on the harmonic oscillator, to choose the symmetric ordering as a possibly correct choice for all polynomial Hamiltonians. Spotted mathematical subtleties in the simple case serve as a clue to the general solution. Finally, a general justification for the symmetric order is provided by deriving the continuum path integral starting from the exact discrete case using a renormalization procedure in the imaginary time frequency domain. While the role of Weyl order has already been found, the paper provides the missing proof of its suitability for every polynomial Hamiltonian and simplifies the previously established construction by referring only to creation and annihilation operators (without position and momentum operators).

18.
medRxiv (Medicine) 2026-06-17

Determinants of non-utilization of insecticide-treated nets among children under five in Rwanda: analyses of the 2024 Rwanda malaria indicator survey

Background Insecticide-treated nets (ITNs) are effective for preventing malaria among children under five years, who bear a disproportionate burden of malaria. This study assessed the prevalence and determinants of ITN non-utilization among children under five in Rwanda using data from the 2024 Rwanda Malaria Indicator Survey (RMIS).Methodology This cross-sectional study utilized nationally representative data from the 2024 RMIS. Analyses were restricted to children under five residing in households that owned at least one ITN. The outcome was non-utilization of ITN, defined as not sleeping under an ITN the night preceding the survey. Survey-weighted descriptive statistics were used to estimate the prevalence of ITN non-utilization. Factors associated with non-utilization were identified using a survey-weighted Poisson regression model. Adjusted prevalence ratios (aPRs), 95% confidence intervals and p-values were reported.Results A total of 1,979 children were included in the study. The weighted prevalence of ITN non-utilization among children under five years was 20.11% (95% CI: 17.81 - 22.63). After adjusting for other factors, children aged 2 - 3 years were associated with an 83% higher prevalence of ITN non-utilization compared with those aged [&le;]1 year (aPR = 1.83, 95% CI: 1.423 - 2.352, p < 0.001). Compared with households that owned only one ITN, children in households with three or more ITNs were associated with a 76% lower prevalence of ITN non-utilization (aPR = 0.24, 95% CI: 0.171 - 0.332, p < 0.001). Children living in households with 5 - 7 members were associated with an 87% higher prevalence of ITN non-utilization compared with those in households with 1 - 4 members (aPR = 1.87, 95% CI: 1.476 - 2.358, p < 0.001).Conclusion The findings suggest that ITN utilization among children is influenced not only by household access to nets but also by household composition and dynamics that shape the allocation and use of available preventive resources.

19.
arXiv (CS.LG) 2026-06-19

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

arXiv:2606.19966v1 Announce Type: cross Abstract: Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.

20.
arXiv (CS.LG) 2026-06-18

Smoothness-Based Derandomization of PAC-Bayes Bounds

arXiv:2606.19105v1 Announce Type: new Abstract: We study PAC-Bayes derandomization for smooth loss functions. Our goal is to obtain generalization bounds that hold with high probability for deterministic predictors by exploiting smoothness properties of both the loss and the predictor class. We show that passing from the Gibbs predictor to the deterministic predictor at the posterior mean has a precise cost, given by the generalization gap of the Jensen gap class. We control this class through its Rademacher complexity, leading to bounds for deterministic predictors that involve flatness quantities expressed in terms of parameter Jacobians and Hessians of the score map. The framework applies to both bounded and unbounded smooth loss functions, and we specialize the results to linear predictors and smooth neural networks. Finally, the Jacobian and Hessian quantities appearing in the theory motivate a practical regularizer. For BatchNorm networks, we compute this regularizer with respect to effective BatchNorm weights obtained by folding the BatchNorm transformation into the adjacent affine weights. Experiments on CIFAR-10 illustrate the behavior of this regularizer under different batch sizes.

21.
arXiv (CS.AI) 2026-06-12

Foresight: Iterative Reasoning About Clues that Matter for Navigation

arXiv:2606.12550v1 Announce Type: cross Abstract: Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: https://amrl.cs.utexas.edu/foresight

22.
arXiv (CS.AI) 2026-06-16

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

arXiv:2510.04212v4 Announce Type: replace-cross Abstract: The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.

23.
arXiv (CS.CV) 2026-06-19

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an ``ink-budget'' regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

24.
arXiv (CS.CV) 2026-06-12

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.

25.
arXiv (CS.CL) 2026-06-11

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.