Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-16

ShipNet: A Geometric Deep Learning Surrogate for Real-Time Ship Hydrodynamics

arXiv:2606.15356v1 Announce Type: cross Abstract: Accurate prediction of hydrodynamic performance is central to ship design, yet high-fidelity computational fluid dynamics remains prohibitively expensive for large-scale parametric exploration. This motivates the development of data-driven surrogate models that provide rapid approximations to hydrodynamic predictions at substantially reduced cost. We present ShipNet, a geometric deep-learning surrogate that predicts both hull-surface pressure distributions and far-field free-surface wave patterns directly from hull geometry and speed. The network employs a regularized dynamic graph convolutional backbone on hull point clouds, with a multi-head decoder for simultaneous near-body pressure and free-surface elevation outputs. Training data consist of 420 inviscid free-surface simulations generated using a potential-flow panel method for two parent yacht hulls, each parameterized into 70 variants and evaluated at three speeds. ShipNet predicts the per-point pressure coefficient and a two-dimensional wave elevation map using a composite loss that combines point-wise regression and image-structure terms. On a geometry-held-out test set, ShipNet achieves R^2=0.98 for hull pressure and R^2=0.91 for wave fields. Inference requires approximately 0.15s per case, yielding over a 550x speedup relative to the potential-flow solver on conventional hardware. Limitations include the restricted geometry and speed ranges and the inviscid training data, while future work will extend the model to high-fidelity viscous simulations with physics-informed regularization.
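
The figures quoted in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the dataset size and the solver time implied by the stated speedup, and shows the standard coefficient of determination. The function name `r_squared` and the sample pressure values are illustrative assumptions, not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Dataset size stated in the abstract: 2 parent hulls x 70 variants x 3 speeds.
n_simulations = 2 * 70 * 3

# A 550x speedup at 0.15 s per inference implies a solver time of at
# least about 82.5 s per case.
implied_solver_seconds = 0.15 * 550

# Made-up pressure coefficients, used only to exercise r_squared.
cp_true = [0.10, -0.25, 0.40, -0.05]
cp_pred = [0.12, -0.20, 0.38, -0.07]
score = r_squared(cp_true, cp_pred)
```

A perfect prediction gives a score of exactly 1, and the score drops as the residual sum of squares grows relative to the variance of the targets.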

02.
arXiv (CS.CV) 2026-06-16

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Vision-language models (VLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. We develop FUSE, a probabilistic framework for capturing two complementary sources of uncertainty in vision-language modeling: (i) aleatoric embedding-level uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of VLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure can be used to reliably predict the model's output correctness for downstream applications. We demonstrate that our method outperforms baselines and achieves SOTA uncertainty calibration.

03.
arXiv (CS.CL) 2026-06-16

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder. This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that are more semantically aligned with LMs under reduced frame rates. Experimental results show that the proposed LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level.

04.
arXiv (CS.AI) 2026-06-16

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

arXiv:2604.05859v2 Announce Type: replace Abstract: We study Contextual Multi-Armed Bandits (CMABs) for non-episodic decision-making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive, and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost-performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms' embeddings to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases.
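
To make the "lightweight numerical bandit on text embeddings" idea concrete, here is a minimal sketch of disjoint LinUCB, a standard linear contextual bandit. It is not the paper's LLMP-UCB algorithm, and the class name `LinUCBArm`, the `alpha` parameter, and the two-dimensional toy "embeddings" are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class LinUCBArm:
    """One arm of disjoint LinUCB: ridge regression plus a confidence bonus."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (assumed hyperparameter)
        # Inverse of A = I + sum(x x^T); starts as the identity matrix.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _a_inv_times(self, v):
        return [dot(row, v) for row in self.a_inv]

    def mean(self, x):
        theta = self._a_inv_times(self.b)  # ridge estimate A^{-1} b
        return dot(theta, x)

    def ucb(self, x):
        bonus = self.alpha * math.sqrt(dot(x, self._a_inv_times(x)))
        return self.mean(x) + bonus

    def update(self, x, reward):
        # Sherman-Morrison rank-one update of A^{-1} after adding x x^T.
        v = self._a_inv_times(x)
        denom = 1.0 + dot(x, v)
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= v[i] * v[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def choose_arm(arms, x):
    """Pick the arm with the highest upper confidence bound for context x."""
    return max(range(len(arms)), key=lambda k: arms[k].ucb(x))

# Toy run: arm 0 pays off on the first feature, arm 1 does not.
arms = [LinUCBArm(dim=2), LinUCBArm(dim=2)]
for _ in range(10):
    arms[0].update([1.0, 0.0], 1.0)
    arms[1].update([1.0, 0.0], 0.0)
best = choose_arm(arms, [1.0, 0.0])
```

With Matryoshka-style embeddings, one would presumably pass a prefix of the embedding (for example `x[:d]`) as the context, which changes `dim` and therefore the size of the confidence bonus; that is one plausible reading of dimensionality acting as an exploration lever.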

05.
arXiv (math.PR) 2026-06-19

The central heat trace on large compact classical groups

arXiv:2511.08288v2 Announce Type: replace-cross Abstract: We study the large-$N$ asymptotics of the central trace of the heat kernel on compact classical groups. For every classical family $G_N\subset \mathrm{GL}_N(\mathbb{C})$, we prove a full large-$N$ asymptotic expansion, using a highest weights/partitions correspondence adapted to the large-rank regime, under which the eigenvalues of the Laplace–Beltrami operator stabilize as observables in the algebra of shifted symmetric functions. Then, we prove a random surface representation of the trace in terms of ramified coverings of the torus. We provide two independent applications: an explicit large-rank counting law for the Casimir spectrum, with exponential Hardy–Ramanujan-type growth in contrast with the polynomial behavior of Weyl's law at fixed rank, and a rigorous probabilistic formulation of the Yang–Mills/Hurwitz duality on a two-dimensional torus initiated by Gross and Taylor, completing a previous work of the authors. We also extend this duality to a Yang–Mills/Gromov–Witten duality by expressing the coefficients of the central heat trace as explicit functionals of the generating function of Gromov–Witten invariants.

06.
arXiv (math.PR) 2026-06-17

Decay of correlations and zeros for the hard-core model

arXiv:2603.17858v2 Announce Type: replace Abstract: In a recent paper the last author proved that absence of complex zeros of the partition function of the hard-core model near a parameter $\lambda>0$ implies a form of correlation decay called strong spatial mixing. In this paper we investigate the reverse implication. We introduce a strengthening of strong spatial mixing that we call very strong spatial mixing (VSSM). Our main result is that if VSSM holds at a parameter $\lambda>0$ for a family of graphs, this implies that the partition function has no zeros near that parameter for each graph in the family. We also demonstrate that a closely related variant of very strong spatial mixing does not imply zero-freeness. As a consequence of our main result, we moreover obtain that VSSM implies spectral independence. Our proof relies on transforming the problem to the analysis of an induced non-autonomous dynamical system given by Möbius transformations.
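
For readers unfamiliar with the object under study: the hard-core partition function of a graph is the polynomial that sums $\lambda^{|I|}$ over all independent sets $I$. The brute-force sketch below computes its coefficients for a small graph; the function names are illustrative, and the enumeration is exponential in the number of vertices, so it is only meant for toy examples.

```python
from itertools import combinations

def hard_core_coefficients(n_vertices, edges):
    """Return [c_0, c_1, ...] where c_k counts independent sets of size k.

    The hard-core partition function is Z(lam) = sum_k c_k * lam**k.
    """
    edge_set = {frozenset(e) for e in edges}
    coeffs = [0] * (n_vertices + 1)
    for k in range(n_vertices + 1):
        for subset in combinations(range(n_vertices), k):
            independent = all(frozenset(pair) not in edge_set
                              for pair in combinations(subset, 2))
            if independent:
                coeffs[k] += 1
    return coeffs

def partition_function(coeffs, lam):
    """Evaluate Z at a real or complex fugacity lam."""
    return sum(c * lam ** k for k, c in enumerate(coeffs))

# Path on three vertices 0 - 1 - 2.
path_coeffs = hard_core_coefficients(3, [(0, 1), (1, 2)])
z_at_one = partition_function(path_coeffs, 1.0)
```

For this path, Z(λ) = 1 + 3λ + λ², whose two zeros are negative reals, so there are no zeros near any positive λ. Zero-freeness statements of the kind discussed in the abstract concern such neighborhoods uniformly over a whole family of graphs.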

07.
arXiv (math.PR) 2026-06-16

Uniform integrability of the distance to the nearest leaf in random trees

arXiv:2606.15339v1 Announce Type: new Abstract: We study the distance from the root to the nearest leaf, the analogous quantity for a uniformly chosen vertex, and its protection number, in size-conditioned simply generated trees. We prove a uniform exponential tail bound for each of these quantities, valid for arbitrary offspring distributions. As a consequence, these random variables are uniformly integrable of every order. This yields convergence of all moments to those of the corresponding local limit. The argument is probabilistic and unified across the three quantities.
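
The quantities in this abstract are easy to compute on a concrete tree. The sketch below assumes the usual definition of the protection number of a vertex, namely the distance to the nearest leaf among its descendants; for the root this coincides with the distance from the root to the nearest leaf. The dictionary-based tree representation and the function name are choices made for this example.

```python
def protection_numbers(children, root):
    """Map each vertex to its distance to the nearest descendant leaf.

    `children` maps a vertex to the list of its children; leaves may map
    to an empty list or be absent from the dictionary.
    """
    result = {}
    # Iterative post-order traversal, so deep trees do not hit the
    # recursion limit.
    stack = [(root, False)]
    while stack:
        vertex, expanded = stack.pop()
        kids = children.get(vertex, [])
        if not kids:
            result[vertex] = 0  # a leaf is at distance 0 from itself
        elif expanded:
            result[vertex] = 1 + min(result[k] for k in kids)
        else:
            stack.append((vertex, True))
            stack.extend((k, False) for k in kids)
    return result

# Small rooted tree: r -> a, b; a -> c; c -> d; b -> e.
tree = {"r": ["a", "b"], "a": ["c"], "c": ["d"], "b": ["e"]}
numbers = protection_numbers(tree, "r")
root_to_nearest_leaf = numbers["r"]
```

Averaging `numbers` over all vertices gives the expected protection number of a uniformly chosen vertex of this particular tree.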

08.
medRxiv (Medicine) 2026-06-15

GLLaucoMed: A Secure LLM-Powered Agentic Workflow for Automated Medication Extraction from Free-Text Glaucoma Clinical Notes

Purpose: To evaluate the efficacy of large language models (LLMs) in extracting medication-related information from glaucoma clinical notes in the electronic health record (EHR). Design: Cross-sectional. Subjects: 1,250 subjects in the Bascom Palmer Ophthalmic Repository. Methods: Extracted clinical notes from glaucoma-related encounters between 2014 and 2024 were labeled by two glaucoma specialists with a third serving as an adjudicator. Graders were asked to label current topical medications (CTM), proposed changes to topical medications (ΔTM), current oral medications (COM), and proposed changes to oral medications (ΔOM) in a structured fashion. The dataset was split into development (10%), validation (10%), and test (80%) sets stratified by clinician. Development and validation sets were used to engineer and refine prompts, and the held-out test set was used for model assessment. Five LLMs (Claude Opus 4.6, DeepSeek-V3.2, GPT 5.2, Grok 4.1, and Qwen3.6-35B-A3B) were accessed via Microsoft Azure AI Foundry within a HIPAA-compliant environment. Inter-grader agreement was assessed with Gwet AC1. LLM performance was initially assessed in a binary fashion with F1 scores, and the degree of text match among positive cases was evaluated using exact match accuracy and Jaccard Index (JI). Main Outcome Measures: F1 score, exact match accuracy, JI. Results: Gwet AC1 for intergrader agreement was 0.799, 0.888, 0.985, and 0.988 for CTM, ΔTM, COM, and ΔOM, respectively. F1 scores for CTM were 0.985, 0.971, 0.978, 0.968, and 0.970 for Claude, DeepSeek, GPT, Grok, and Qwen, respectively; for ΔTM: 0.905, 0.826, 0.897, 0.842, 0.855, respectively; for COM: 0.923, 0.887, 0.899, 0.906, 0.894, respectively; for ΔOM: 0.958, 0.815, 0.937, 0.835, 0.940, respectively. Among positive cases, the range of exact match accuracies for CTM (N=1354) was 0.730-0.882 and the range of JIs was 0.809-0.918. For ΔTM (N=404), exact match accuracy range was 0.619-0.780 and JI range was 0.668-0.827. For COM (N=47), exact match accuracy range was 0.766-0.872 and JI range was 0.765-0.870. For ΔOM (N=25), exact match accuracy range was 0.583-0.920 and JI range was 0.583-0.922. Conclusions: The GLLaucoMed pipeline demonstrated high performance in extracting and standardizing medication data from unstructured clinical notes, including both current medications and proposed changes. Claude and GPT exhibited the strongest performance.
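
The three outcome measures named in this abstract can be illustrated on a single note. The sketch below assumes medications are compared as sets of normalized names; the study may define exact match and the Jaccard Index at a different granularity (for example over tokens), and the drug lists and confusion counts here are invented for the example.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from binary confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def normalize(meds):
    """Lowercase and strip names so trivial formatting differences are ignored."""
    return {m.strip().lower() for m in meds}

def exact_match(gold, pred):
    """True when the two medication sets are identical after normalization."""
    return normalize(gold) == normalize(pred)

def jaccard_index(gold, pred):
    """|intersection| / |union| of the two medication sets."""
    g, p = normalize(gold), normalize(pred)
    if not g and not p:
        return 1.0
    return len(g & p) / len(g | p)

# Illustrative note-level example (drug lists are invented for this sketch).
gold = ["Latanoprost", "Timolol"]
pred = ["latanoprost", "dorzolamide"]
ji = jaccard_index(gold, pred)   # one shared item out of three distinct items
is_exact = exact_match(gold, pred)
f1 = f1_score(tp=8, fp=2, fn=2)
```

Exact match is all-or-nothing per note, while the Jaccard Index gives partial credit, which is why the reported JI ranges sit at or above the exact match ranges for most categories.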

09.
medRxiv (Medicine) 2026-06-15

Automated AI-Based Ventricular Subcompartment Segmentation and Volumetry in Idiopathic Normal Pressure Hydrocephalus

Purpose In idiopathic normal pressure hydrocephalus (iNPH), longitudinal monitoring of ventricular size is important for diagnosis and treatment follow-up. This study aimed to validate a fully automated AI model for CT ventricular volumetry with subcompartments and to compare AI-derived volume changes with routine radiology assessments. Methods This retrospective, single-center study included 88 patients with iNPH and 456 non-contrast-enhanced head CT examinations. The model was trained on 38 manually labeled CT scans with 12 ventricular subcompartments. Outcomes included segmentation accuracy, correspondence between AI-derived longitudinal ventricular volume changes and radiology report categories (decreased, unchanged, increased), radiologist detection thresholds for ventricular change, and paired pre- and postoperative volume changes in 22 patients with ventriculoperitoneal shunt. Results Mean segmentation accuracy was high (Dice, 0.83). 91% of 100 segmentations were rated as excellent by an expert neuroradiologist. AI-derived ventricular volume changes corresponded well to radiology report categories (median total ventricular volume changes of -17% in cases reported as decreased, 0% in unchanged cases, and +22% in increased cases; all p < 0.001). Radiologists reported ventricular volume change in 50% of cases at an AI-measured relative volume change of +/-6%, and in 90% of cases at +21% for enlargement and -18% for decrease. After shunt placement, ventricular volume decreased by -8% (median), with the largest relative reductions observed in the right temporal and occipital horns. Conclusions Automated AI-based ventricular segmentation on CT enables accurate and reproducible assessment of ventricular volume changes in iNPH and complements routine radiological evaluation for longitudinal and postoperative monitoring.
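
Two quantities carry most of the results above: the Dice coefficient for segmentation overlap and the relative volume change between scans. A minimal sketch of both follows; the flattened toy masks and the volumes in millilitres are invented for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks of equal length."""
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    return 2 * overlap / total if total else 1.0

def relative_volume_change(volume_before, volume_after):
    """Percentage change of the follow-up volume relative to the earlier scan."""
    return 100.0 * (volume_after - volume_before) / volume_before

# Flattened toy masks (1 = voxel labelled as ventricle).
manual = [1, 1, 1, 0, 0, 1]
automatic = [1, 1, 0, 0, 1, 1]
dice = dice_coefficient(manual, automatic)

# Invented volumes in millilitres for two consecutive scans.
change = relative_volume_change(150.0, 124.5)
```

A negative percentage indicates a decrease in ventricular volume, matching the sign convention used in the abstract.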

10.
arXiv (CS.LG) 2026-06-17

Amortizing Maximum Inner Product Search with Learned Support Functions

arXiv:2603.08001v2 Announce Type: replace Abstract: Maximum inner product search (MIPS) is a crucial subroutine in machine learning, requiring the identification of a vector taken within a database (the keys) that best aligns with a given query. We propose amortized MIPS: a regression-based approach that trains neural networks to directly predict MIPS solutions, amortizing the cost of repeatedly solving MIPS for queries drawn from a known distribution over a fixed key database. Our key insight is that the MIPS value function is the support function of the set of keys, a well-studied convex function whose gradient yields the optimal key. This motivates two complementary amortized models: SupportNet, an input-convex neural network trained to regress the support function, and KeyNet, a vector-valued network that directly regresses the optimal key. SupportNet can serve as a cluster router, steering queries toward relevant database partitions, while KeyNet can be used as a drop-in replacement for the original query, fed directly to off-the-shelf indexing pipelines. Our experiments on the BEIR benchmark show that, for document embeddings, learned SupportNets and KeyNets significantly improve IVF match rates when accounting for compute effort, whether measured in FLOPs, number of probes, or wall-clock time. Our code is available at: https://github.com/apple/ml-amips.
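
The key insight stated above, that the MIPS value is the support function of the key set and that its gradient is the optimal key, can be checked numerically on a tiny database. The sketch below uses an exact linear scan and a finite-difference gradient; it does not reproduce SupportNet or KeyNet, and the keys, query, and function names are assumptions for illustration.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def mips(query, keys):
    """Exact MIPS by linear scan: returns (best value, best key)."""
    best_key = max(keys, key=lambda k: inner(query, k))
    return inner(query, best_key), best_key

def support_function(query, keys):
    """h(q) = max over keys of <q, k>, which is the MIPS value."""
    return mips(query, keys)[0]

def finite_difference_gradient(query, keys, eps=1e-6):
    """Numerical gradient of h at the query (valid where the argmax is unique)."""
    base = support_function(query, keys)
    grad = []
    for i in range(len(query)):
        shifted = list(query)
        shifted[i] += eps
        grad.append((support_function(shifted, keys) - base) / eps)
    return grad

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
query = [2.0, 1.0]
value, best_key = mips(query, keys)
grad = finite_difference_gradient(query, keys)
```

Here the numerical gradient agrees with `best_key` up to finite-difference error, which is the property a network regressing the support function can exploit: differentiating the learned function with respect to the query yields a candidate key.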

11.
medRxiv (Medicine) 2026-06-22

Brain-gut axis imaging, motion correction with 11C-carfentanil total-body PET

Background: Mu-opioid receptors (MORs) are expressed throughout the body including in the brain and gastrointestinal (GI) tract. Total-body PET imaging of the brain and GI tract offers a promising approach for cross-sectional in vivo evaluation of the MOR brain-GI axis. However, intestinal motility and bladder filling introduce motion throughout the GI tract over the scan window. Here we establish analysis methodology to account for motion for dynamic imaging of the brain-GI axis, to further characterize peripheral MORs throughout the body and provide a framework for semi-automatic total-body PET modeling. Methods: Four subjects underwent 90-min dynamic [11C]-carfentanil (cfn) total-body PET acquisitions at baseline, after intravenous naloxone (central antagonist) administration, and after orally administered loperamide (peripheral agonist and P-glycoprotein substrate). Thalamic MOR availability was measured using the Logan reference tissue model. Using CT-based segmentation, the GI tract was subdivided into anatomical segments, in addition to other peripheral organs (e.g., liver, psoas muscle). Frame-by-frame semi-automatic motion correction was performed with three distinct reference frames (11-14 min post-injection, p.i., 35-40 min p.i., and 85-90 min p.i.). The performance of these three reference frames was compared with manual correction. Compartment modeling and Logan graphical analysis were performed to estimate relevant kinetic parameters (K1, VT, VTLogan). Results: Across the 4 subjects and regions, kinetic parameter estimates were highly correlated (r>0.7) for K1, VT and VTLogan when comparing semi-automatic (reference frame at 35-40 min p.i.) and manual correction. With semi-automatic motion correction, graphical-based estimation of VTLogan in the gastrointestinal tract was significantly decreased with loperamide relative to baseline (p

12.
arXiv (CS.LG) 2026-06-12

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

arXiv:2606.13060v1 Announce Type: new Abstract: Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies such as organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly in the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical parameters of solubility significantly constrains machine learning efficacy. In this work, we transfer a foundation model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool, easily integrated with high-throughput labs, for ranking and screening possible solvent substitutes. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

13.
medRxiv (Medicine) 2026-06-15

Efficacy of Painhunting Therapy for Event-Related Depression: A Randomized Controlled Trial with Crossover Replication

Background. Depression affects an estimated 332 million people worldwide and is a leading cause of disability, with up to 80% of major depressive episodes preceded by an identifiable adverse life event [17,18]. First-line treatments target symptoms rather than the precipitating event and are resource-intensive: standard CBT averages roughly 12 sessions, and antidepressant discontinuation carries relapse rates near 35% at six months [8]. These limitations create a clear rationale for brief, structured interventions that address the cognitive and somatic sequelae of adverse life events directly. Painhunting therapy is one such intervention, in which each session targets a discrete adverse event through a structured incident-processing procedure. Methods. We conducted a two-arm, parallel-group, single-site randomised controlled trial comparing Painhunting therapy (Arm A, immediate; n=42) with a waitlist control (Arm B, delayed; n=42) in adults with PHQ-9 >= 9 and active psychological distress related to an adverse life event. After the primary endpoint at T2 (approximately two weeks post-randomisation), Arm B crossed over to active treatment, with T3 as the post-crossover endpoint at approximately four weeks. The primary outcome was PHQ-9 at T2 (between-arm contrast); secondary outcomes were ICG, GAD-7, WHO-DAS 2.0 (12-item), and the Global Impression of Change (GIC). Pre-specified analyses included intention-to-treat, per-protocol, and single-exclusion sensitivity populations. Results. Eighty-four participants were randomised (198 applications, 134 completed screening questionnaire, 119 passed psychometric screening). At T2, mean PHQ-9 was 2.32 (SD 2.59) in Arm A and 16.56 (SD 6.76) in Arm B, yielding an ITT between-arm Cohen d = 2.78 (95% CI 2.19-3.76, p < 0.001). Within-arm paired reductions during each arm's active-treatment window reproduced this magnitude (Arm A T0 to T2 change 14.71, Morris d = 2.80; Arm B T2 to T3 change 14.19, Morris d = 2.77, eligible n=26). 
Treatment gains were durable at the T4 follow-up (week 8). Aligning each arm to its own end-of-treatment timepoint, the off-treatment drift to week 8 was almost identical between arms: Arm A rose 0.78 points from T2 to T4 (2.19 to 2.97, n=37) and Arm B rose 1.59 points from T3 to T4 (4.74 to 6.33, n=27), the latter falling to 0.77 points once a single documented relapse case (R59) is excluded (4.81 to 5.58, n=26). This small off-treatment rebound then stabilised rather than continuing: Arm A was essentially unchanged from T3 to T4 (change +0.05), with concordant maintenance on ICG, GAD-7, and WHO-DAS. At T4, 68% of Arm A and 41% of Arm B remained in remission (PHQ-9 < 5). Secondary measures (ICG, GAD-7, WHO-DAS) moved in the same direction and to comparable magnitude at every timepoint. The waitlist window in Arm B showed essentially no change on any measure (PHQ-9 change 0.22, p = 0.81). Sensitivity analyses excluding six sub-threshold T2 cases, the single treated-in-error case (R82), the R59 relapse case, and one late T2 submitter left all conclusions unchanged. Conclusions. Painhunting therapy produced large and statistically robust reductions in depression, complicated grief, anxiety, and functional disability over a brief course of three to four sessions, with effect sizes substantially exceeding benchmarks reported for established first-line psychotherapies including CBT and EMDR. Critically, these gains persisted at the week-8 follow-up: depression scores in the immediate-treatment arm were essentially unchanged from four weeks to eight weeks post-randomisation, indicating that the benefit reflects durable change rather than a transient post-session dip. Treatment-window concordance between arms, durability of gains at one month off-treatment, and the flat waitlist trajectory together strengthen the evidence for genuine efficacy rather than spontaneous remission. 
Baseline covariates including therapeutic alliance, treatment expectancy, self-efficacy, age, and sex showed near-zero associations with outcome, reducing the plausibility of allegiance bias or expectancy effects as primary drivers. The differential retention between arms (88% vs 64% at T3) is attributable to the waitlist design and is discussed as a limitation. These findings support proceeding to a confirmatory active-comparator trial against manualized CBT. Trial registration: ClinicalTrials.gov NCT07490691, prospectively registered.
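
The headline between-arm effect size can be approximately reconstructed from the summary statistics in the abstract. The sketch below assumes the pooled-standard-deviation form of Cohen's d and 42 participants per arm (the randomised arm size); the abstract does not state which estimator or analysed sample size was used, so this is a plausibility check rather than a reproduction.

```python
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_2 - mean_1) / math.sqrt(pooled_var)

# T2 PHQ-9 summary statistics quoted in the abstract; n = 42 per arm is
# an assumption, since the analysed n at T2 is not stated.
d = cohens_d(mean_1=2.32, sd_1=2.59, n_1=42,
             mean_2=16.56, sd_2=6.76, n_2=42)
```

Under these assumptions the result lands close to the reported ITT value of 2.78. Note that the two arms have very different standard deviations, so the pooled estimate is dominated by the spread in the waitlist arm.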

14.
arXiv (quant-ph) 2026-06-11

Holographic Complexity, Extremality, and Cosmic Censorship

arXiv:2604.20170v2 Announce Type: replace-cross Abstract: We propose a holographic complexity origin for the third law of black-hole mechanics and weak cosmic censorship. In both complexity equals action and complexity equals volume prescriptions, the relative complexity between subextremal and extremal AdS black holes diverges logarithmically. For overcharged RN-AdS, explicit calculations in both prescriptions show that the near-singularity action terms are power-law divergent or finite, while the maximal-volume contribution is finite. Thus, the extremal-to-naked relative complexity also diverges, obstructing finite-time transitions.

15.
arXiv (CS.LG) 2026-06-16

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

arXiv:2605.01702v2 Announce Type: replace Abstract: Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We further extend this result: given $\phi_1,\dots,\phi_n$, $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.

16.
arXiv (CS.AI) 2026-06-11

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

arXiv:2606.11400v1 Announce Type: cross Abstract: Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
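
The two steps described above, building a steering vector from contrasting instructions and reading out the position of the largest attention change, can be sketched on toy numbers. Everything below is an assumption about details the abstract leaves open: the steering vector is taken as a difference of mean activations, "maximal change" is read as the largest increase, and all values are invented.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(acts_instruction_a, acts_instruction_b):
    """Contrast of mean activations under two instructions, same audio."""
    mean_a = mean_vector(acts_instruction_a)
    mean_b = mean_vector(acts_instruction_b)
    return [a - b for a, b in zip(mean_a, mean_b)]

def apply_steering(hidden, vector, scale=1.0):
    """Add the scaled steering vector to one hidden state."""
    return [h + scale * v for h, v in zip(hidden, vector)]

def locate_event(attn_base, attn_steered):
    """Index of the audio position whose attention weight increased the most."""
    deltas = [s - b for b, s in zip(attn_base, attn_steered)]
    return max(range(len(deltas)), key=lambda t: deltas[t])

# Toy activations and attention weights, invented for illustration.
vec = steering_vector([[1.0, 2.0], [3.0, 2.0]], [[0.0, 1.0], [2.0, 1.0]])
steered_hidden = apply_steering([0.5, 0.5], vec, scale=2.0)
position = locate_event([0.25, 0.25, 0.25, 0.25], [0.10, 0.15, 0.60, 0.15])
```

In a real model the attention weights would come from a chosen layer and head over the audio-token positions, and the returned index would be mapped back to a time interval in the recording.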

17.
arXiv (CS.CV) 2026-06-15

$\mu_0$: A Scalable 3D Interaction-Trace World Model

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.

18.
arXiv (CS.AI) 2026-06-19

Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination

arXiv:2606.20258v1 Announce Type: cross Abstract: The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.

19.
arXiv (CS.LG) 2026-06-12

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

arXiv:2606.12478v1 Announce Type: new Abstract: Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query–key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
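
As a toy illustration of what pairwise couplings add, the sketch below computes exact marginals of a small Ising-type model by enumerating all configurations. The binary encoding in {0, 1}, the form of the energy, and the use of marginals as attention weights are assumptions made for this example; the abstract does not specify the paper's parameterization.

```python
import math
from itertools import product

def ising_marginals(fields, couplings):
    """Exact P(s_i = 1) when P(s) is proportional to
    exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j), with s_i in {0, 1}."""
    n = len(fields)
    weights_sum = 0.0
    marginals = [0.0] * n
    for state in product((0, 1), repeat=n):
        log_weight = sum(h * s for h, s in zip(fields, state))
        log_weight += sum(couplings[i][j] * state[i] * state[j]
                          for i in range(n) for j in range(i + 1, n))
        weight = math.exp(log_weight)
        weights_sum += weight
        for i, s in enumerate(state):
            marginals[i] += weight * s
    return [m / weights_sum for m in marginals]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

fields = [0.5, -1.0, 0.0]  # stand-ins for query-key scores
no_coupling = [[0.0] * 3 for _ in range(3)]
# Positions 0 and 1 cooperate through a positive coupling J_01.
coupled = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

independent = ising_marginals(fields, no_coupling)
cooperative = ising_marginals(fields, coupled)
```

With all couplings set to zero the marginals reduce to independent sigmoids of the local fields, while a positive coupling raises the probability of attending to position 1 whenever position 0 is attended. Exact enumeration costs 2^n, which is why sampling-based training is of interest at realistic sequence lengths.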

20.
arXiv (CS.CV) 2026-06-19

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.
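
As a rough illustration of partial matching with a dustbin, the sketch below runs plain Sinkhorn iterations on a cost matrix augmented with one extra row and column. It assumes a single constant dustbin cost and the commonly used augmented marginals; the abstract describes token-conditional dustbins and a cost that also accounts for geometry, neither of which is modelled here. All names and values are hypothetical.

```python
import math


def sinkhorn_with_dustbin(cost, dustbin_cost=1.0, eps=0.1, iters=300):
    """Entropic OT between n patch tokens and m hub tokens plus a dustbin.

    cost: n x m list of lists (lower means a better match).
    Returns the (n + 1) x (m + 1) transport plan; the last row and
    column are the dustbin.
    """
    n, m = len(cost), len(cost[0])
    augmented = [list(row) + [dustbin_cost] for row in cost]
    augmented.append([dustbin_cost] * (m + 1))
    kernel = [[math.exp(-c / eps) for c in row] for row in augmented]
    # Each real token carries mass 1; the dustbins can absorb the rest.
    row_mass = [1.0] * n + [float(m)]
    col_mass = [1.0] * m + [float(n)]
    u = [1.0] * (n + 1)
    v = [1.0] * (m + 1)
    for _ in range(iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m + 1))
             for i in range(n + 1)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n + 1))
             for j in range(m + 1)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m + 1)]
            for i in range(n + 1)]


def matching_confidence(plan):
    """Mass each patch token sends to real hubs rather than the dustbin."""
    return [sum(row[:-1]) for row in plan[:-1]]


# Patch 0 matches hub 0, patch 1 matches hub 1, patch 2 matches nothing.
toy_cost = [[0.0, 5.0], [5.0, 0.0], [5.0, 5.0]]
plan = sinkhorn_with_dustbin(toy_cost)
confidence = matching_confidence(plan)
```

In this toy case the third patch token is routed almost entirely to the dustbin, so its confidence is close to zero and it would contribute little to any confidence-gated message passing.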

21.
medRxiv (Medicine) 2026-06-16

A Poisson Process Life Expectancy framework for optimising patient lifetime during chemotherapy

Cancer therapy balances two competing objectives: treatment efficacy against the tumour and the risk of treatment-related severe adverse events, including patient death. Most existing optimal control theory (OCT) formulations optimise heuristic cost functionals that lack direct clinical interpretability. In clinical practice, treatment efficacy and patient tolerability are primarily assessed through survival metrics and adverse event rates. Here we introduce the Continuous Lifetime Payoff (CLP), a novel OCT objective functional that directly links treatment decisions to patient survival. It explicitly incorporates tumour dynamics, tumour eradication, and patient mortality from tumour progression, drug-related toxicity and age. We fit age-related mortality from life tables and infer parameters from simulated survival data. The CLP provides a clinically grounded framework for optimising chemotherapy regimens.
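
The abstract does not state the CLP functional itself. As background only, the sketch below shows the generic relation such a framework can build on: if death is the first event of a Poisson process with hazard lambda(t), survival is S(t) = exp(-integral of lambda from 0 to t) and expected lifetime is the integral of S(t). The hazard values and function names are hypothetical and are not taken from the paper.

```python
import math


def expected_lifetime(hazard, horizon, dt=0.01):
    """Approximate the integral of S(t) on [0, horizon] by the trapezoid rule.

    hazard: function t -> total mortality rate (per unit time).
    S(t) = exp(-cumulative hazard), with the cumulative hazard also
    accumulated by the trapezoid rule.
    """
    steps = int(round(horizon / dt))
    cumulative_hazard = 0.0
    lifetime = 0.0
    prev_survival = 1.0
    prev_hazard = hazard(0.0)
    for k in range(1, steps + 1):
        t = k * dt
        h = hazard(t)
        cumulative_hazard += 0.5 * (prev_hazard + h) * dt
        survival = math.exp(-cumulative_hazard)
        lifetime += 0.5 * (prev_survival + survival) * dt
        prev_survival, prev_hazard = survival, h
    return lifetime


def total_hazard(t):
    """Hypothetical competing risks: tumour, toxicity, and age (per year)."""
    tumour, toxicity, age = 0.05, 0.03, 0.02
    return tumour + toxicity + age


years = expected_lifetime(total_hazard, horizon=200.0)
```

With a constant total hazard the result approaches 1 / hazard, which gives a quick sanity check before substituting time-varying or dose-dependent rates.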

22.
arXiv (CS.LG) 2026-06-19

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

arXiv:2606.19372v1 Announce Type: cross Abstract: We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
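
For readers unfamiliar with the headline metric, MARD (mean absolute relative difference) is the average of |predicted - reference| / reference, expressed as a percentage. The sketch below computes it on made-up readings; the numbers are purely illustrative and are not data from the paper.

```python
def mard_percent(predicted, reference):
    """Mean absolute relative difference, in percent.

    predicted, reference: equal-length sequences of glucose values
    (for example in mg/dL); reference values must be positive.
    """
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(r <= 0 for r in reference):
        raise ValueError("reference values must be positive")
    total = sum(abs(p - r) / r for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)


# Hypothetical paired readings (video-based estimate vs. biosensor).
estimates = [110.0, 90.0, 150.0, 60.0]
biosensor = [100.0, 100.0, 120.0, 80.0]
score = mard_percent(estimates, biosensor)
```

Because each error is divided by the reference value, the same absolute error weighs more heavily at low glucose levels, which is one reason MARD is usually reported alongside a Clarke Error Grid analysis.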

23.
medRxiv (Medicine) 2026-06-16

Adherence to Red Reflex and Vision Screening Recommendations: A Deep Dive into Primary Care Implementation Gaps

Introduction: Early childhood vision screening is critical for detecting amblyopia and other vision-threatening conditions. Despite screening recommendations during well-child visits, rates remain low. Red reflex assessment is recommended to identify serious ocular pathology, yet its use in primary care is not well described. We examined rates and drivers of vision screening in pediatric primary care. Methods: We conducted a retrospective review of electronic health records for children aged 3 to 5 years attending well-child visits in 2022 in one of three representative primary care clinics within a university health system. Outcomes were documented red reflex and functional vision tests. We evaluated associations with patient demographics and clinic site using multivariable logistic regression. Results: Among 1,003 visits, 21.1% (n=212) had a documented red reflex assessment, and 60.8% (n=610) had a functional vision test. Younger children (ages 3 and 4 vs. 5 years) had higher odds of red reflex assessment [adjusted odds ratio (aOR) 9.00 and 8.64] and lower odds of a functional vision test (aOR 0.47 and 0.59). Females had higher odds of red reflex assessment (aOR 1.53). Other/Multiracial children had lower odds of red reflex assessment than Non-Hispanic White children (aOR 0.48). Screening rates varied significantly by clinic site. Conclusions: Visual function and red reflex assessment are inconsistently performed in pediatric primary care, with particularly low rates of red reflex documentation. Screening rates varied between clinics and were affected by age. These findings highlight missed opportunities for early detection of vision-threatening conditions and identify targets for improving adherence to pediatric vision screening recommendations.

24.
arXiv (CS.CV) 2026-06-16

Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data. This paper investigates a black-box privacy attack, the membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate that MMs can also be vulnerable to privacy attacks. Although researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of multi-modal VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) across three benchmark datasets (COCO, CC3M, and NoCaps). Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset show that MIA success in NEURO VLMs drops by 24% in mean ROC-AUC, while model utility (similarity between generated and reference captions) remains similar in terms of MPNet and ROUGE-2 metrics. This shows that neuro VLMs are comparatively more resilient against privacy attacks without significantly compromising model utility. Our extensive evaluation with the PaliGemma 2 and ViT-GPT2 models on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on the resilience of neuro VLMs to privacy threats.

25.
arXiv (CS.CV) 2026-06-12

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.