Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-11

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.

03.
arXiv (CS.AI) 2026-06-24

DTT-BSR+: A Generative-Regression Cascade for Music Source Restoration

arXiv:2606.24127v1 Announce Type: cross Abstract: Music source restoration (MSR) requires jointly addressing source unmixing and the inversion of non-linear production effects. Current methods struggle to achieve accurate target signal reconstruction while maintaining semantic consistency. To address this limitation, we propose DTT-BSR+, a two-stage cascade MSR system that decouples distribution fitting from signal reconstruction into separate stages. A generative DTT-BSR separator in the first stage produces stems matching the prior of clean sources, and a modified Demucs network in the second stage enhances the first stage output using time-domain and multi-resolution spectral losses. DTT-BSR+ improves multi-mel signal-to-noise ratio (MMSNR) over the single-stage DTT-BSR across all stems, and surpasses the state-of-the-art X-LANCE MSR system on five stems. We also reveal through Fréchet Audio Distance (FAD) decomposition an implicit trade-off between signal reconstruction accuracy and semantic distribution fitting across stems.

04.
bioRxiv (Bioinfo) 2026-06-12

PeptiDIA: A Machine Learning Framework for Enhanced Peptide Identification in Fast-Gradient Data-Independent Acquisition Proteomics

Data-independent acquisition (DIA) mass spectrometry has become increasingly prevalent in proteomics as advances in instrumentation, chromatography, and computational analysis have enabled robust proteome identification across complex biological samples. However, analytical depth achieved with fast chromatographic gradients remains lower than that obtained using long-gradients, reflecting a throughput-depth trade-off. Here, we present PeptiDIA, a machine learning framework that enhances peptide identification in fast-gradient DIA data by leveraging paired fast and long-gradient acquisitions from identical samples. PeptiDIA processes DIA-NN outputs generated at relaxed false discovery rate thresholds to obtain expanded candidate peptide pools and trains gradient-boosted decision tree models using long-gradient identifications as reference labels. The model integrates DIA-NN features with engineered peptide descriptors and applies isotonic regression to calibrate probabilities, enabling controlled peptide recovery relative to the long-gradient reference. Applied to human and murine datasets spanning six tissues acquired on an Orbitrap Exploris 480, PeptiDIA increased peptide identifications by 25-34% at 1% target reference-discordance rate (RDR) and increased the number of protein groups containing at least one rescued peptide by 15-17%. Overall, PeptiDIA improves the identification depth of fast-gradient DIA-NN workflows without altering acquisition strategies. The framework is available as a web application and command-line tool at https://github.com/Jordano700/PeptiDIA.

05.
arXiv (CS.CV) 2026-06-11

Spatially Selective Self-Training for Unsupervised Building Change Detection

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.

06.
medRxiv (Medicine) 2026-06-16

Usability testing with a prototype user interface of an Artificial Intelligence driven air-Safety Tool (AISaT)

Involving end-users in the development of an AI tool is an important facilitator to its implementation. Usability testing was therefore conducted with a prototype user interface of an Artificial Intelligence driven air-Safety Tool (AISaT) to capture the perspectives and user experiences of AISaT from 10 staff members across two hospitals working within estates, infection prevention and control, and clinical areas, to inform the development of next iterations of AISaT. The perspectives shared could be grouped under improvements to the understand-ability; content; navigation; visibility; usability; workflow; ownership; and frequency of use of the tool. There were key areas that can and will be easily improved within AISaT, however there were areas that required a deeper level of critical reflection, such as incorporating data on more existing variables in a room (i.e., existing ventilation) and whether all patients should be assumed as infectious and breathing heavily. The research team must consider if the target audience of end users and recommended frequency of AISaT use will be pre-defined by the tool developers, or whether this level of detail should be left to each individual hospital to decide.

07.
arXiv (CS.CV) 2026-06-19

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1–100x, pseudo-labels across thresholds 0.90–0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

08.
PLOS Medicine 2026-06-23

Comparisons of core component delivery in cardiac rehabilitation programs by country income classification and decade based on the 2025 Global Audit Update: A survey study

by Gabriela Lima de Melo Ghisi, Rachael P. Carson, Karam Turk Adawi, Rongjing Ding, Warner M. Mampuya, Mariya P. Jiandani, Jimena Martinez, Monserrat Cruz Rivero, Claudia V. Anchique, Dinah L. van Schalkwijk, Jonathan Gallagher, Buket Akinci, Dion Candelaria, Jirapa Champaiboon, Daniel F. Quesada-Chaves, Tone M. Norekvål, Iwona Szadkowska, Borut Jug, Evangelia Kouidi, Marta Supervia, Won-Seok Kim, Chamila Mettananda, Lilian Mbau, Gulsim T. Aimakova, Sherry L. Grace, on behalf of the ICCPR Global Cardiac Rehabilitation Audit Update Investigators Background Cardiovascular disease (CVD) remains a leading global health burden. Cardiac rehabilitation (CR) is essential to reducing morbidity and improving patient outcomes. Since the COVID-19 pandemic, CR delivery worldwide has evolved, yet these changes have not been systematically charactemkjrized. The objective of this study was to characterize globally: (1) the delivery of core CR components, including risk factors assessed, patient education practices, and program resources; (2) differences in these elements by country income classification and relative to the initial 2016 Global CR Audit. Methods and findings A cross-sectional Audit update was conducted. Program-level data were collected from May 1st to September 1st 2025 using a REDCap survey adapted from previous Audits. Eligible respondents were leads of phase II/post-discharge CR programs providing at least an initial assessment, structured aerobic exercise, and ≥1 additional core component. ICCPR associations and local leaders supported program identification. Main outcomes were core components delivered (10 assessed), risk factors assessed (14 assessed), patient education dose (hours/patient/program), and program resources (17 assessed). Generalized linear mixed models (GLMM) tested differences by income classification and (when applicable) changes since 2016. Of 7,025 programs identified globally, 1,505 (62% median country response rate) initiated a survey from 90/113 (80%) countries with CR. The median number of core components offered was 8/program (p25, p75 = 6, 10), with upper-middle income countries offering significantly more components overall (median = 9), and also high-income countries offering more than low-income countries (8 versus 6, p 

09.
arXiv (CS.AI) 2026-06-16

A Formal Framework for Declarative Agentic AI in Business Process Analysis

arXiv:2606.15291v1 Announce Type: new Abstract: Agentic AI opens new opportunities for automating Business Process (BP), enabling autonomous decision-making and dynamic adaptation. However, realising this potential requires BP entities and their interactions to be defined with formal precision. This paper presents a formal framework for Agentic BP analysis through the AGO methodology. AGO captures the modelling perspective in terms of who is acting (Agents), why it is carried out (Goals), and what the relevant entities are (Objects). Grounded in set theory and mathematical logic, we formally define the AGO entity types and their interactions, organising all definitions into a BP Knowledge Base (BPKB). The resulting BPKB supports structured querying, incremental updates, and automatic generation of BP workflows, while ensuring soundness and completeness of the derived paths.

10.
arXiv (CS.CL) 2026-06-11

FinTradeBench: A Financial Reasoning Benchmark for LLMs

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

11.
arXiv (CS.CV) 2026-06-16

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human–AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of $84.19$, and ablations further support the value of rationale-privileged teacher guidance.

12.
arXiv (CS.AI) 2026-06-16

Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

arXiv:2606.16515v1 Announce Type: cross Abstract: Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal – a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $\psi$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $\psi_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $\psi(s_t)$ to $\psi(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $\psi$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.

13.
arXiv (CS.CV) 2026-06-19

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9\% and FVD by 9.5\%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.

14.
arXiv (CS.CV) 2026-06-16

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

15.
arXiv (CS.LG) 2026-06-16

Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

arXiv:2606.16505v1 Announce Type: cross Abstract: Understanding speaker confidence is crucial in educational settings, as it can enhance personalised feedback and improve learning outcomes. This study introduces a novel framework for detecting speaker confidence by integrating human-engineered features with embeddings from the Whisper encoder. To address data limitations, a pseudo-labelling technique is employed to expand the labelled dataset, allowing the model to learn from both human-annotated and model-generated labels. The framework combines traditional speech features including pitch, volume, rate of speech, and the presence of disfluencies and stress, with Whisper embeddings, and uses a co-attention mechanism to fuse these representations and achieve an overall accuracy of 75%. This study contributes to advancing speech analysis, enabling applications that support personalised learning and speaking skill development.

16.
arXiv (CS.AI) 2026-06-16

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

arXiv:2606.16480v1 Announce Type: cross Abstract: Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.

17.
bioRxiv (Bioinfo) 2026-06-10

A Unified Spatial AI Framework for Cross-Domain Tissue-State Analysis in Trauma, Oral, and Cardiovascular Pathology

Authors:

Objective: To develop a cross-domain spatial AI framework for identifying conserved tissue-state organisation across trauma, oral disease, and cardiovascular tissue using spatial transcriptomic data. Methods: Four public spatial transcriptomic datasets spanning wound healing, periodontitis, oral squamous cell carcinoma, and cardiac tissue were integrated using recurrence modelling, graph-based spatial learning, fuzzy tissue-state analysis, and tensor decomposition. Cross-domain coupling, spatial fragmentation, recurrence structure, and permutation-based topological validation were evaluated. Results: Six conserved fuzzy tissue states were identified, dominated by extracellular matrix remodelling, fibroblast/stromal activation, endothelial signalling, and inflammatory pathways. Latent embedding analysis demonstrated strong overlap between trauma and oral domains, while cardiovascular tissue exhibited more compact spatial organisation. Oral inflammatory tissue showed the highest fragmentation, whereas cardiovascular tissue demonstrated greater recurrence coherence. Tensor decomposition identified conserved stromal-remodelling programmes across domains. Permutation testing confirmed significantly elevated graph modularity and reduced spatial entropy relative to null distributions. Conclusion: The proposed framework identified conserved spatial tissue-state architecture linking wound healing, oral pathology, and cardiovascular tissue despite differences in tissue origin, pathology, and acquisition technology. Significance: These findings demonstrate the potential of spatial AI for investigating conserved stromal and inflammatory microenvironmental organisation across clinically related disease systems and may support spatial biology research in trauma–oral–systemic health.

18.
arXiv (quant-ph) 2026-06-12

Exotic critical states as fractional Fermi seas in the one-dimensional Bose gas

arXiv:2602.17656v2 Announce Type: replace-cross Abstract: Critical quantum field theories occupy a central position in modern theoretical physics for their inherent universality stemming from long-range correlations. As an example, the Tomonaga-Luttinger liquid (TLL) describes a wealth of one-dimensional quantum systems at low temperatures. Its behavior is deeply rooted in the emergence of an effective Fermi sea, leading to power-law correlations and Friedel oscillations. A promising direction to realize systems exhibiting novel universal behavior beyond TLL is through the generalization of the underlying Fermi sea. In this Letter, we show that fractional Fermi seas with reduced occupancy arise in an integrable Bose gas driven out of equilibrium by cyclic changes in interactions from repulsive to attractive. The correlation functions feature signatures of criticality incompatible with a conventional TLL, suggesting a novel critical phase. Our predictions, based on Generalized Hydrodynamics, are directly relevant to cold atoms.

19.
medRxiv (Medicine) 2026-06-12

Does the method matter? Evaluating the effectiveness, efficiency and ease of hearing-aid gain self-adjustment

In conventional hearing-aid personalisation, clinicians cannot hear what their patients hear, and patients cannot often reliably detect or describe what they hear. Self-adjustment avoids this issue but requires user controls that adjust hearing-aid signal processing parameters to be effective, efficient and easy. In this study, we explored (a) the roles of interface complexity and stimulus type in the self-adjustment of hearing-aid gain, and (b) how well individuals can adjust one sound to match another to assess the same interfaces and stimuli. Adult hearing-aid users with mild to moderate symmetrical sensorineural hearing loss repeatedly adjusted the gain (a) to their preference from individual prescription (n = 41) and (b) to match their previous preferences from a random starting point (n = 32) using three interfaces representing different bass/mid/treble configurations and three stimuli (music, speech and speech-in-noise). The large interindividual variability in self-adjusted gains clustered into three patterns of deviation from initial prescription: increased relative bass, overall gain reduction, and close to initial prescription. There were no substantial effects of interface nor stimulus on self-adjustment reliability (median {sigma} = 2.8 dB), whereas absolute sound-matching error increased with increasing interface complexity and centre frequency. Neither individual matching accuracy nor questionnaire responses predicted either self-adjusted gains or reliability. Overall, these results show that many - but not all - hearing-aid users can adjust gains with reasonable reliability, and while it can be difficult to predict the behaviour from the individual, the individual applies a similar self-adjustment behaviour across different interfaces and stimuli.

20.
arXiv (CS.CL) 2026-06-16

RASST: Retrieval-Augmented Simultaneous Speech Translation

Simultaneous speech translation produces target text incrementally from partial speech input. Recent speech large language models have markedly improved SST quality but still struggle with rare and domain-specific terminology. Retrieval augmentation has helped in automatic speech recognition and neural machine translation, but extending it to SST is non-trivial: retrieval must be fast and accurate under partial speech, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which addresses both challenges. For accurate cross-modal retrieval under partial input, RASST trains a lightweight speech-text retriever that produces chunkwise terminology hints for the Speech LLM via multi-scale retrieval. To use these hints correctly, we synthesize training data that teaches the Speech LLM to decide whether and when to apply each retrieved term. Experiments on ACL 60/60 dev set and the ESO test set show that RASST improves terminology accuracy by nearly 40% and overall translation quality by up to 3 BLEU points, with negligible computational overhead.

21.
PLOS Computational Biology 2026-05-29

Structural and dynamic basis of NOD2 tandem CARD association and NOD1/2–RIP2 signaling complexes

by Jitendra Maharana, Aritra Bej, Debasish Biswal, Debashis Panda, Arjun Sharma NOD1 and NOD2, founding members of the NOD-like receptor (NLR) family, play a crucial role in host defense against bacterial infections. Recognition of peptidoglycan-derived ligands triggers ATP-dependent oligomerization of the NACHT domain, exposing the CARD domains that recruit the adaptor protein RIP2 via CARD–CARD interactions to activate the NF-κB signaling cascade. Although NOD1/2-RIP2 interactions and RIP2CARD filament assembly are established, the precise interfaces that stabilize hetero–CARD filaments remain poorly defined. Here, we integrate in silico structural modeling with molecular dynamics (MD) simulations to elucidate structurally compatible arrangements of NOD1–RIP2 and NOD2–RIP2 hetero–CARD filaments. Our results reveal that NOD1CARD subunits form a structurally compatible homomeric scaffold via canonical (type-I–III) interfaces, accommodating multiple tiers of RIP2CARD rings at both filament termini. Meanwhile, the NOD2 tandem CARDs adopt multiple discrete conformations, reflecting a more intricate structural mechanism. In stable filament conformations, tandem CARDs converge at the type-II interface, with RIP2CARD rings stacking onto CARDa (top-down) and CARDb (bottom-up) interfaces, highlighting the structural role of NOD2CARDb in RIP2-mediated CARD–CARD interaction. In silico mutagenesis, involving charge-reversal and alanine scanning of key interfacial residues, disrupts NOD1–RIP2 and NOD2–RIP2 interactions at both top-down and bottom-up interfaces, leading to rapid interface destabilization within 0.1–0.4 μs of simulation. Together, these results reveal conserved and receptor-specific mechanisms governing NOD1/2–RIP2 CARD–CARD interactions and provide deeper structural and dynamic insights into the complex structural mechanisms for NLR-mediated inflammatory signaling.

22.
arXiv (CS.CV) 2026-06-16

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

23.
arXiv (CS.CV) 2026-06-24

VisChronos: Revolutionizing Image Captioning Through Real-Life Events

This paper aims to bridge the semantic gap between visual content and natural language understanding by leveraging historical events in the real world as a source of knowledge for caption generation. We propose VisChronos, a novel framework that utilizes large language models and dense captioning models to identify and describe real-life events from a single input image. Our framework can automatically generate detailed and context-aware event descriptions, enhancing the descriptive quality and contextual relevance of generated captions to address the limitations of traditional methods in capturing contextual narratives. Furthermore, we introduce a new dataset, EventCap (https://zenodo.org/records/14004909), specifically constructed using the proposed framework, designed to enhance the model's ability to identify and understand complex events. The user study demonstrates the efficacy of our solution in generating accurate, coherent, and event-focused descriptions, paving the way for future research in event-centric image understanding.

24.
arXiv (CS.AI) 2026-06-16

Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection

arXiv:2606.16532v1 Announce Type: cross Abstract: Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.

25.
arXiv (CS.AI) 2026-06-16

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

arXiv:2309.07401v2 Announce Type: replace-cross Abstract: Deep neural networks (DNNs) show great promise for solving partial differential equations (PDEs), but their deep architectures introduce complex, large-scale, non-convex optimization challenges. Nonlinear PDEs, like the viscous Burgers' equation, compound these difficulties due to steep gradients and shock-like solutions. To address this, we propose a two-stage multi-grade deep learning (TS-MGDL) method. In the first stage, shallow networks are trained progressively grade by grade to fit the target function from low- to high-frequency components; previously learned grades are frozen, and each new residual block is trained solely to minimize the remaining approximation error. The second stage unfreezes and retrains selected layers using the first-stage network as initialization, achieving an interpretable, stable hierarchical refinement while mitigating optimization complexity. Furthermore, we theoretically prove that each grade and stage in TS-MGDL monotonically reduces the loss function under an appropriate optimization strategy. Numerical experiments on 1D, 2D, and 3D viscous Burgers' equations demonstrate that TS-MGDL significantly outperforms single-grade learning (SGL), reducing predictive errors by up to a factor of 60.