Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-16

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page are available at https://zyshen021.github.io/HOSTPG/.

02.
arXiv (CS.CV) 2026-06-16

Semantic Editing with Coupled Stochastic Differential Equations

Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pre-trained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI. Project page: https://z-jianxin.github.io/syncSDE-release/. Code: https://github.com/Z-Jianxin/syncSDE-release.

03.
arXiv (quant-ph) 2026-06-17

Unveiling Hierarchical Invariants in Multiphoton Linear Optics

arXiv:2506.12857v2 Announce Type: replace Abstract: Linear optical networks driven by quantum states of light are important building blocks of photonic quantum technologies. They access large bosonic Hilbert spaces through multiphoton interference. At the same time, their dynamics are generated by single-particle mode transformations, thereby defining a highly structured subset of multiphoton unitaries and setting boundary on linear optics capability. To elucidate this boundary, we reveal an underlying fine-grained symmetry structure that partitions the multiphoton operator space into invariant subspaces and generates a hierarchy of invariants. We experimentally confirm the conservation of high-order invariants and demonstrate their operational utility in characterizing state reachability and the metrological capability of multiphoton probes. Our framework provides a symmetry-based perspective for understanding and harnessing structured multiphoton dynamics across photonic quantum technologies.

04.
arXiv (math.PR) 2026-06-16

Risk or Replace: Efficient Asymptotics for Data-Driven Maintenance

arXiv:2606.14706v1 Announce Type: cross Abstract: Condition-based maintenance (CBM) is an approach that plans interventions for deteriorating systems according to their observed operational state. CBM reduces unplanned downtime and extends usable lifetime. We study a heterogeneous population of components that degrade over time according to a stochastic processes with non-negative and i.i.d. increments that are characterized by component-specific parameters that remain unobservable to the decision maker. We rely on degradation data to estimate these parameters and determine replacement actions at equidistant epochs. The goal is to minimize the long-run average cost, which incorporates fixed replacement costs, failure costs, and operating costs. This problem can be formulated as a high-dimensional partially observable Markov decision process (POMDP), which is generally intractable. We develop a tractable, data-driven CBM policy that estimates the optimal policy of a hypothetical Oracle that has full information of the underlying degradation parameters and call this policy the Estimated Oracle's Optimal Policy (EOP). We introduce a scaling regime where both the failure thresholds and cost parameters increase proportionally, reflecting practical settings in which component lifetimes and maintenance costs are large relative to the time between two consecutive CBM decision moments. We show that the regret of the EOP, defined as the difference between its long-run average cost and that of the Oracle, converges to zero in the scaling regime when the parameter estimator is consistent. Across extensive experiments using both real and simulated data, the EOP achieves very low regret and, whenever the optimal POMDP policy can be computed exactly, a negligible optimality gap.

05.
medRxiv (Medicine) 2026-06-15

Therapeutic efficacy study on shoulder impingement syndrome in swimmers: a network meta-analysis

Shoulder impingement syndrome (SIS), including subacromial impingement and rotator cuff tendinitis, is commonly caused by repetitive swimming movements and associated shoulder joint dysfunction. Despite numerous available treatment options, no consensus exists on the most effective treatment option. Therefore, this systematic review and network meta-analysis aimed to investigate treatment methods for SIS in swimmers. Using a frequentist framework and Cochrane PICOS principles, we compared SIS treatments, constructed network evidence diagrams, and assessed heterogeneity. A total of 45 studies were included in the qualitative synthesis, and 42 contributed to the network meta-analysis, comprising 1752 participants, 9 treatment categories, and outcome measures. For pain outcomes, some adjunctive interventions combined with exercise showed favorable ranking probabilities, although several estimates were accompanied by wide confidence intervals. For shoulder range-of-motion outcomes, taping, acupuncture, manual therapy, and sport-specific training showed favorable effects in selected comparisons, particularly for external and internal rotation. According to surface under the cumulative ranking curve (SUCRA) rankings, exercise combined with medium-frequency therapy ranked highly for pain reduction, whereas exercise combined with acupuncture or extracorporeal shock wave therapy ranked highly for shoulder flexion. Exercise combined with taping ranked highly for external rotation, and exercise combined with manual therapy ranked highly for internal rotation. However, the interpretation of ranking results should remain cautious because uncertainty and inconsistency were present in some comparisons. Exercise-based rehabilitation appears to remain central to the management of SIS in swimmers. Several adjunctive interventions showed favorable findings for selected outcomes, especially pain relief and shoulder rotational function. However, the available evidence was affected by heterogeneity, inconsistency, and imprecision across some treatment comparisons. More rigorously designed swimmer-specific randomized controlled trials are needed before firm treatment hierarchies can be established. Trial registration: The protocol for this systematic review is registered with PROSPERO (www.crd.york.ac.uk/PROSPERO; registration number: CRD42024498851). The first submission of PROSPERO was on January 15, 2024, and it was revised and updated on March 25, 2026.

06.
arXiv (CS.AI) 2026-06-17

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

arXiv:2606.17767v1 Announce Type: cross Abstract: Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

07.
arXiv (CS.CL) 2026-06-18

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

08.
arXiv (CS.CL) 2026-06-16

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

09.
arXiv (CS.AI) 2026-06-11

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

arXiv:2606.00140v2 Announce Type: replace-cross Abstract: While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

10.
arXiv (CS.LG) 2026-06-17

The Morse Transform for Discrete Shape Analysis

arXiv:2503.04507v2 Announce Type: replace-cross Abstract: The geometry of an object plays a vital role in modulating its interactions with the physical world. It nevertheless remains difficult to describe geometric information numerically for the purposes of statistical inference or classification tasks. Here, we introduce a new topological transform which leverages directional piecewise-linear Morse theory to quantify the geometry of an embedded object by cataloguing critical points across multiple height-functions. The output of this Morse transform records both the heights and the local topological type (peak, trough or saddle) of the critical points that characterise the underlying shape, retaining finer information than the Euler characteristic transform whilst naturally prioritising a shape's outermost regions. Crucially, this output can be further compressed into a rich but compact feature vector. We benchmark the Morse feature vector as a descriptor for ligand-based virtual screening (LBVS), which intrinsically depends on the shape of molecules. Under a common gradient-boosted tree classification pipeline, Morse descriptors achieve the highest mean AUROC when compared to other topological transform descriptors and to standard shape-based LBVS descriptors.

11.
arXiv (CS.CV) 2026-06-18

Cosmos 3: Omnimodal World Models for Physical AI

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI – effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

12.
arXiv (CS.CL) 2026-06-19

DeFrame: Debiasing Large Language Models Against Framing Effects

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing – differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") – as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

13.
arXiv (quant-ph) 2026-06-12

More efficient Clifford+T synthesis for small-angle rotations and application to Trotterization

arXiv:2605.31544v2 Announce Type: replace Abstract: Clifford+T synthesis of rotation gates is an important routine in fault-tolerant quantum compilation. While Clifford+T synthesis is scalable, it has a high overhead of tens of T gates per rotation in practice, translating to high resource estimates for many fault-tolerant algorithms. However, these well-known results, including those using probabilistic mixtures [Quantum 7, 1208 (2023)], are independent of the rotation angle $\theta$, requiring $O(\log 1/\delta)$ T gates. We show that it is possible to do much better for small angles, reducing the T cost to $\tilde{O}(\theta^2/\delta)$, and returning to existing $O(\log1/\delta)$ results in the worst case. This is particularly important since many algorithms, such as Trotterization, are dominated by small-angle rotations. Further, we perform a detailed theoretical and numerical study of quasi-probabilities, which can further reduce the total T cost of large circuits by orders of magnitude with only a small overhead in sample complexity. We also develop a scheme based on quasi-probability mixtures of Clifford+T fallback channels. We derive new $\theta$-dependent formulas that can be used for resource estimation of fault-tolerant quantum algorithms. As an application of our results, we show that the gate cost of Trotterization circuits compiled to a Clifford+T gate set is constant in the small Trotter step size limit, and can be reduced by orders of magnitude even for large step sizes. The cost of fault-tolerant Trotterization for a variety of applications should be re-examined in light of these results. Our work dispels the widely-stated claim that Clifford+T rotation synthesis has a high cost independent of $\theta$, and further develops a scalable quasi-probability method for rotation synthesis. We also expect our results to bring forward useful early fault-tolerant quantum computing by reducing required magic state resources.

14.
arXiv (CS.LG) 2026-06-19

A graph neural network surrogate model for mesh-based crashworthiness prediction of vehicle panel components

arXiv:2503.17386v2 Announce Type: replace-cross Abstract: Crashworthiness is a key performance measure in the design of safety-critical vehicle panel components such as B-pillars. Finite element (FE) simulations are widely used to evaluate crash responses but remain computationally expensive for large-scale, nonlinear impact scenarios, particularly when integrated into iterative design and optimisation processes. Although machine learning-based surrogate models have been developed for rapid crashworthiness analysis, they exhibit limitations in detailed representation of complex 3-dimensional components. Graph Neural Networks (GNNs) have emerged as a promising solution for processing data with complex structures. However, existing GNN models often lack sufficient accuracy and computational efficiency to meet industrial demands. This paper proposes Recurrent Graph U-Net (ReGUNet), a graph-based surrogate model for crashworthiness analysis of vehicle panel components. By representing FE meshes in graph form, the model naturally accommodates complex irregular structural geometries. Its hierarchical architecture improves computational efficiency and accuracy, while the introduction of recurrence enhances stability of temporal predictions over multiple time steps. A side-impact case study of hot-stamped steel B-pillars with varying geometries is used to generate training dataset. The trained model demonstrates high accuracy in predicting the dynamic deformation behaviour and crashworthiness indicators of previously unseen component designs. ReGUNet achieves over a 52% reduction in the average deformation prediction error relative to baseline methods, together with markedly improved computational efficiency. ReGUNet provides rapid and reliable crashworthiness assessments, which in turn accelerates the design cycle of vehicle panel components.

15.
arXiv (CS.AI) 2026-06-11

Towards a Bridge Layer Between Bibliographic and Formalized Mathematical Knowledge

作者:

arXiv:2606.11430v1 Announce Type: cross Abstract: Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean mathlib), preventing unified access between published results and their formalizations. We propose a relational bridge-database that aligns publication metadata with formal artifacts, providing an interoperability layer between mathematical literature and machine-verifiable proofs. We introduce a paper-level formalization score that measures how much of a publication is covered in formal systems. As a feasibility study, we show how such scores can be estimated via cross-document alignment between informal texts and Lean formalizations, enabling large-scale analysis of formalization coverage. This framework is a first step toward integrating bibliographic and formal mathematical ecosystems into scalable, machine-actionable knowledge graphs linking publications to formal proof objects.

16.
arXiv (CS.LG) 2026-06-11

Modelling magnetic material properties with uncertainty-aware neural networks

arXiv:2606.11870v1 Announce Type: cross Abstract: Machine learning is increasingly applied to accelerate the discovery of novel materials by exploring large compositional and structural design spaces. Yet, the scarcity of high-quality data and the frequent need for out-of-distribution prediction introduce substantial uncertainty, making the assessment of model reliability essential. In this work, we investigate uncertainty quantification as a means to evaluate model confidence in the context of permanent magnet research. In a first study, we benchmark classical and modern machine learning models for predicting intrinsic magnetic properties, focusing on the quality of their uncertainty estimates. We apply Gaussian negative log-likelihood loss and dropout-based Bayesian approximation as practical strategies for estimating predictive uncertainty. In a second study, we transfer these architectural features for uncertainty estimation to a more complex task: predicting coercivity from microstructural information using a graph neural network. Together, these studies demonstrate that uncertainty quantification not only enhances the trustworthiness of predictions but is also transferable across different modeling tasks.

17.
medRxiv (Medicine) 2026-06-10

Cortical activity during narrative discourse production in individuals with post-stroke aphasia and controls measured via functional near-infrared spectroscopy

Introduction: Aphasia is an acquired language disorder with a significant negative functional impact. Much of the research on aphasia has focused on word-level language comprehension and production. Further evaluation of discourse-level tasks, both at behavioral and neural levels, will allow for an ecologically valid understanding of the functional implications of language impairment in this population. Method: This study evaluated bilateral frontal, temporal, and parietal cortical activity during computer-based narrative production in 14 young neurotypical individuals, 17 individuals with post-stroke aphasia, and 15 age-matched neurotypical participants using functional near-infrared spectroscopy (fNIRS). Oxygenated hemoglobin (HbO) was measured during narrative production following short video clips and compared to HbO during counting aloud. In addition, behavioral measures quantifying in-task performance were correlated with averaged HbO values. Results: Young neurotypical individuals showed greater cortical activity in bilateral language regions for narrative production compared to counting aloud. In contrast, people with aphasia showed positive condition-related effects in the right frontal ROI and the age-matched group showed positive condition-related effects in the left frontal and right precentral ROIs. Each group showed different patterns in relationships between cortical activity and discourse performance measures. Conclusion: Overall, young participants showing more consistent condition-related effects for narrative discourse production than individuals with aphasia and age-matched controls. This study shows the potential for fNIRS to evaluate cortical activity for ecologically valid language tasks in individuals with post-stroke aphasia.

18.
medRxiv (Medicine) 2026-06-23

Post Hoc Localization of Beam F3 Stimulation Targets: An MRI-Derived Geodesic Approach for Refined TMS E-Field Simulations

Background: Transcranial magnetic stimulation (TMS) targeting the left dorsolateral prefrontal cortex (dlPFC) is an established treatment option in major depressive disorder. One of the most common approaches for targeting the dlPFC is the Beam F3 method, which determines the stimulation site (F3Beam) as a function of external cranial measurements. Precise knowledge of the individual stimulation site is essential for imaging-based analyses of TMS effects. However, due to the method's reliance on individual anatomy, retrospective identification of F3Beam targets across cohorts is challenging, limiting the analysis of existing datasets. We developed a scalable method to reconstruct subject-specific F3Beam target locations for e-field simulations based on structural imaging. Methods: High-resolution three-dimensional (3D) T1-weighted MRI was used to generate individual scalp meshes via the ''Simulation of Non-Invasive Brain Stimulation'' (SimNIBS) software. Subject-specific anatomical distances and coordinates of interest were measured geodesically using a Python-based script to reconstruct the individual F3Beam targets. Validation included a retrospective comparison between digital geodesic measurements and manual cranial measurements in 20 patients and a prospective comparison with MR-visible scalp markers in 2 healthy controls. To assess the impact of our targeting algorithm on e-field simulations, volumetric e-field maps based on three potential targets (F3Beam, F3MNI, F3Geo) were generated in SimNIBS and compared using voxel-wise statistics in SPM12. Results: Retrospective analysis revealed a systematic bias towards higher in vivo measurements compared to digital geodesic measurements, though deviations in the final distances determining F3Beam (xBeam and yBeam) were minimal ({Delta}xBeam: 0.11 {+/-} 0.08 cm; {Delta}yBeam: 0.14 {+/-} 0.21 cm). Prospective validation demonstrated that F3Beam coordinates better matched in vivo coil positions than group-template-derived targets (F3MNI). Group-level analysis showed method-dependent clustering of coil positions with corresponding voxel-wise e-field differences. Conclusions: Individualized geodesic measurements may enable accurate, scalable and retrospective identification of Beam F3 targets and coil orientations. This approach may yield more accurate e-field simulations than group-template based targeting and provides a practical method for retrospective analysis of existing TMS treatment cohorts. This could be leveraged to identify response predictors or imaging-based biomarkers of treatment response.

19.
arXiv (CS.CL) 2026-06-24

What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning

Despite the impressive performance of vision-language models (VLMs) on downstream tasks, their ability to understand and reason about causal relationships in visual inputs remains unclear. Robust causal reasoning is fundamental to solving complex high-level reasoning tasks, yet existing benchmarks often include a mixture of reasoning questions, and VLMs can frequently exploit object recognition and activity identification as shortcuts to arrive at the correct answers, making it challenging to truly assess their causal reasoning abilities. To bridge this gap, we introduce VQA-Causal and VCR-Causal, two new benchmarks specifically designed to isolate and rigorously evaluate VLMs' causal reasoning abilities. Our findings reveal that while VLMs excel in object and activity recognition, they perform poorly on causal reasoning tasks, often only marginally surpassing random guessing. Further analysis suggests that this limitation stems from a severe lack of causal expressions in widely used training datasets, where causal relationships are rarely explicitly conveyed. We additionally explore fine-tuning strategies with hard negative cases, showing that targeted fine-tuning can improve model's causal reasoning while maintaining generalization and downstream performance. Our study highlights a key gap in current VLMs and lays the groundwork for future work on causal understanding.

20.
arXiv (CS.AI) 2026-06-15

Learning What to Predict: Downstream-Guided Task Design for Continued Pretraining

arXiv:2601.22108v2 Announce Type: replace-cross Abstract: Continued pretraining is optimized with fixed self-supervised tasks but selected by downstream performance, creating a coarse feedback loop in which practitioners evaluate checkpoints, change data mixtures or objectives, and restart runs, while individual updates remain blind to target capabilities. We ask whether a small set of verifiable downstream examples can provide step-level feedback without directly supervising the learner. We introduce V-pretraining, which decouples a learner trained only with a self-supervised loss from a lightweight task designer that constructs targets or views for unlabeled batches. Given the current learner and batch, V-pretraining scores a candidate construction by predicting the first-order reduction in downstream loss after the induced self-supervised update. The designer maximizes this value; the learner then applies the update with targets or views detached, so downstream labels never update learner parameters. We instantiate V-pretraining as adaptive top-K soft targets for language modeling and learned views or masks for self-supervised vision. Across both modalities, V-pretraining improves target capabilities without degrading generalization. Under wall-clock-matched continued pretraining, it improves GSM8K Pass@1 for Qwen models using 1,024 GSM8K examples only as feedback, including a +7.4 point single-run gain for Qwen2.5-0.5B. In vision, it improves DINOv3 transfer to ADE20K semantic segmentation and NYUv2 depth estimation while preserving ImageNet linear accuracy, suggesting that feedback-guided task construction can improve target capabilities without collapsing general-purpose representations.

21.
arXiv (CS.LG) 2026-06-18

Investigating Faithfulness in Large Audio Language Models

arXiv:2509.22363v4 Announce Type: replace Abstract: Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness\footnote{The benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

22.
arXiv (CS.AI) 2026-06-12

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

arXiv:2606.13276v1 Announce Type: cross Abstract: Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

23.
arXiv (CS.LG) 2026-06-17

Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations

arXiv:2606.17180v1 Announce Type: new Abstract: This chapter discusses how a data-driven machine learning approach can reproduce key aspects of the physical behavior of multiphase flows in complex geological formations. We propose an end-to-end graph neural surrogate tailored to CO$_2$ plume migration forecasting in geological storage. The method is evaluated on the SPE11A benchmark, a well-known industry test case designed to assess CO$_2$ storage scenarios and characterized by sharp gas-water interfaces, strong advective transport, and rapid convective mixing with fingering development. The benchmark is reformulated as a graph in which nodes represent computational cells and edges encode transmissibility-based interactions enriched with geometric attributes. Directional transport arising from grid geometry, permeability contrasts, and geological heterogeneity is captured through an anisotropic message-passing mechanism, where interaction weights are computed via geometry-conditioned edge embeddings, biasing message aggregation toward physically relevant transport directions. Temporal evolution is modeled in latent space using an autoregressive residual formulation trained with multi-step supervision. The proposed model produces competitive forecasts of gas saturation and liquid-phase density, which are key indicators for CO$_2$ storage monitoring, with cumulative errors that remain moderate over extended forecasting horizons.

24.
arXiv (CS.AI) 2026-06-18

Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

arXiv:2606.18393v1 Announce Type: cross Abstract: Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25{\deg} CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.

25.
arXiv (CS.CV) 2026-06-11

ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating the need for methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that enables face verification directly from intermediate representations without modifying or retraining the backbone model, and thus, reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively across depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance, with exiting at layer 10 yielding up to a 20% speedup while incurring only a 1.5 drop in verification performance on benchmarks such as IJB-C. Also, we propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.