Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CV) 2026-06-11

Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement

Generative AI (GenAI) image editors, such as Nano Banana, produce visually compelling results for retouching tasks, enabling non-experts to edit images through text prompts alone. However, the generative nature of these models often introduces spatial misalignment, texture distortion, and content hallucination, all of which are detrimental to downstream workflows that require pixel-level fidelity. We identify a problem setting we call "structure-preserving GenAI fusion" for black-box GenAI image retouching: retain the perceptual enhancements of a GenAI output while enforcing structural faithfulness to the original input image. To address this problem, we propose a post-processing framework that fuses an input image with its GenAI-enhanced counterpart by first establishing coarse spatial and photometric correspondences, then performing a fusion stage that transfers desired enhancements while suppressing hallucinated content. In the absence of direct prior work in this setting, we evaluate our framework against representative methods from photorealistic style transfer and image fusion. Our experiments demonstrate that our method better preserves aesthetic quality while maintaining pixel-level structural consistency and the input resolution.

02.
arXiv (CS.AI) 2026-06-17

Structural Preservation and the Logical Expressiveness of Graph Neural Networks

arXiv:2606.17882v1 Announce Type: new Abstract: Bridges between graph neural networks (GNNs) and logical formalisms have been established by fixing architectural choices, such as the types of aggregation, combination, and activation functions. These choices define restricted classes of GNNs for which tight correspondences with logical formalisms can be obtained, by showing that logical formulae can be translated into equivalent GNNs and, conversely, that GNNs can be translated into equivalent formulae. In this paper we take a semantic perspective by establishing the logical expressiveness of classes of GNN classifiers that are preserved under structural properties: embeddings (extensions), injective homomorphisms, and homomorphisms. We show that, for each such property, there exists a fragment of graded modal logic characterising the class of GNNs. In particular, preservation under embeddings, injective homomorphisms, and homomorphisms corresponds to existential graded modal logic, its existential-positive fragment, and existential-positive modal logic, respectively. These results characterise the expressiveness of broad classes of GNNs independently of specific architectural choices, but we also show that each of these classes admits a GNN architecture of the same expressiveness. Technically, our approach uses a new well-quasi-order result for trees of bounded height, yielding finite representations of unravelling-invariant classes.
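
As a rough illustration of the logic side of this correspondence, the sketch below evaluates negation-free graded modal formulas on a small labelled graph and shows that such a formula stays true when nodes, edges, and labels are added. The tuple encoding, the `holds` function, and the toy graphs are assumptions made for this digest, not constructions from the paper, and reading "existential-positive" as "atoms, conjunction, disjunction, and graded diamonds only" is likewise an assumption.

```python
from typing import Dict, List, Set, Tuple, Union

Graph = Dict[str, List[str]]    # node -> list of successor nodes
Labels = Dict[str, Set[str]]    # node -> atomic propositions true at that node
# Hypothetical encoding: "p" | ("and", f, g) | ("or", f, g) | ("atleast", k, f)
Formula = Union[str, Tuple]


def holds(graph: Graph, labels: Labels, node: str, formula: Formula) -> bool:
    """Evaluate a negation-free graded modal formula at a node."""
    if isinstance(formula, str):
        return formula in labels.get(node, set())
    op = formula[0]
    if op == "and":
        return (holds(graph, labels, node, formula[1])
                and holds(graph, labels, node, formula[2]))
    if op == "or":
        return (holds(graph, labels, node, formula[1])
                or holds(graph, labels, node, formula[2]))
    if op == "atleast":
        # Graded diamond: at least k distinct successors satisfy the subformula.
        _, k, sub = formula
        successors = set(graph.get(node, []))
        witnesses = sum(1 for s in successors if holds(graph, labels, s, sub))
        return witnesses >= k
    raise ValueError(f"unknown operator: {op!r}")


# "a has at least two red successors"
two_red = ("atleast", 2, "red")

small_graph = {"a": ["b", "c"], "b": [], "c": []}
small_labels = {"b": {"red"}, "c": {"red"}}

# The larger graph contains the small one and adds a node, edges, and labels.
larger_graph = {"a": ["b", "c", "d"], "b": ["d"], "c": [], "d": []}
larger_labels = {"b": {"red"}, "c": {"red", "blue"}, "d": {"blue"}}

print(holds(small_graph, small_labels, "a", two_red))
print(holds(larger_graph, larger_labels, "a", two_red))
```

Because the encoding has no negation, adding structure can only create more witnesses, which is the intuition behind preservation under injective homomorphisms.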

03.
arXiv (CS.CV) 2026-06-16

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability by up to 15.9% across diverse vision-language tasks.

04.
arXiv (CS.AI) 2026-06-11

Grounding Computer Use Agents on Human Demonstrations

arXiv:2511.07332v2 Announce Type: replace-cross Abstract: Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

06.
arXiv (math.PR) 2026-06-16

A Tail-Respecting Splitting Numerical Scheme for Lévy-Driven SDEs With Superlinear Drifts

arXiv:2504.07255v3 Announce Type: replace Abstract: We present an explicit numerical approximation scheme, denoted by $\{X^n\}$, for the effective simulation of solutions $X$ to a multivariate stochastic differential equation (SDE) with a superlinearly growing $\kappa$-dissipative drift, where $\kappa>1$, driven by a multiplicative heavy-tailed Lévy process that has a finite $p$-th moment, with $p>0$. We show that the strong $L^{p_X}$-convergence $\sup_{t\in[0,T]}\mathbf E \|X^n_t-X_t\|^{p_X}=\mathcal O (h_n^{\gamma})$ holds for any $p_X\in (0,p+\kappa-1)$, which is exactly the range where the $p_X$-moment of the solution is known to be finite. Additionally, for any $p_X\in (0,p)$ we establish strong uniform convergence: $\mathbf E\sup_{t\in[0,T]} \|X^n_t-X_t\|^{p_X}=\mathcal{O} ( h_n^{\delta} )$. In both cases we determine the convergence rates $\gamma$ and $\delta$. In the special case of SDEs driven solely by a Brownian motion, our numerical scheme preserves super-exponential moments of the solution. The scheme $\{X^n\}$ is realized as a combination of a well-known Euler method with a Lie-Trotter type splitting technique.

07.
arXiv (CS.CL) 2026-06-16

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. It also yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.
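
A minimal sketch of the masking idea described above, assuming the mask is resampled on every call and implemented by setting dropped logits to negative infinity. The function names, the `drop_prob` parameter, and the rule that at least one token always survives are illustrative assumptions; the abstract does not say at what granularity (per step, per sequence, or per batch) the mask is redrawn.

```python
import math
import random
from typing import List


def vocabulary_dropout(logits: List[float], drop_prob: float,
                       rng: random.Random) -> List[float]:
    """Return a copy of `logits` with a random subset hard-masked to -inf."""
    keep = [rng.random() >= drop_prob for _ in logits]
    if not any(keep):
        # Assumption: never mask the whole vocabulary, so sampling stays defined.
        keep[rng.randrange(len(logits))] = True
    return [x if k else -math.inf for x, k in zip(logits, keep)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; masked (-inf) entries get probability 0."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0, 0.0, 3.0]

# A fresh mask per call makes the constraint non-stationary.
for step in range(3):
    masked = vocabulary_dropout(logits, drop_prob=0.5, rng=rng)
    probs = softmax(masked)
    allowed = [i for i, x in enumerate(masked) if x != -math.inf]
    print(step, allowed, [round(p, 3) for p in probs])
```

Masking with negative infinity rather than a finite penalty is what makes the mask "hard": a dropped token cannot be sampled at all under that draw.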

08.
arXiv (CS.AI) 2026-06-17

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

arXiv:2606.17924v1 Announce Type: cross Abstract: Current Vision-Language-Action (VLA) models face a trade-off between efficient action generation and explicit deliberation. Directly decoding actions from vision-language backbone representations enables low-latency control, whereas explicit reasoning through textual chains, pixel-level subgoals, or action search can improve planning but incurs substantial latency and computational cost. We propose PearlVLA, a VLA framework that moves deliberation into the latent space of a vision-language model (VLM). PearlVLA separates VLM meta-query representations into a fixed visual grounding branch and an iterative latent plan branch. At each refinement round, a plan-conditioned world query probes a lightweight frozen latent world model for an action-free future observation latent, which is fed back to guide plan refinement. A future-guided RefineNet then applies scheduled residual updates to progressively refine a coarse semantic draft into a fine-grained latent action plan. The refined plan after K rounds is then decoded in parallel into an action chunk for low-latency execution. We further introduce Causal Refinement-Grouped Process-Reward RL to optimize the latent refinement process with rewards from longer-horizon imagined futures induced by latent plan edits. Empirical evaluations on the LIBERO benchmark demonstrate that PearlVLA achieves state-of-the-art performance among existing methods.

09.
arXiv (CS.LG) 2026-06-18

Kernel of Partition Paths: A Unified Representation for Tree Ensembles

arXiv:2606.18853v1 Announce Type: cross Abstract: A recent line of work has reframed individual decision trees as linear models on engineered features associated with their splits, opening routes for oracle inequalities and feature-importance reinterpretation, but leaving open the question of what unified geometric object a forest induces when one indexes its feature map by nodes rather than by splits. The present paper studies that object. KPP indexes the feature map by the nodes of the forest, weighted by a path metric that turns each coordinate into a component of a squared-Euclidean path-isometric embedding. KPP unifies four pillars under a single non-diagonal Gram that carries a metric: prediction, exact additive attribution, deterministic Lipschitz robust radius in the KPP metric, and uniform Rademacher risk bounds for regression and classification under fixed, honest, or cross-fit conditioning. All probabilistic guarantees are conditional on the representation and are stated under three explicit conditioning regimes; the robust-radius guarantee is deterministic in the KPP metric rather than in a norm on the raw input. Conjectured fast-rate refinements for both regression and classification are stated as open problems and are not claimed as theorems.
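
To make the "tree as a linear model on node-indexed features" idea concrete, the sketch below rewrites a single toy regression tree as an inner product between node weights and node-membership indicators. It uses the standard telescoping decomposition (each node's weight is its value minus its parent's value) and does not reproduce the path-metric weighting that defines KPP; the tree, the dictionary layout, and all function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy tree: internal nodes carry a split, every node carries
# the mean response ("value") of the training points that reach it.
TREE: Dict[int, dict] = {
    0: {"value": 5.0, "feature": 0, "threshold": 1.0, "left": 1, "right": 2},
    1: {"value": 2.0},
    2: {"value": 8.0, "feature": 1, "threshold": 0.5, "left": 3, "right": 4},
    3: {"value": 6.0},
    4: {"value": 10.0},
}


def path_nodes(tree: Dict[int, dict], x: List[float]) -> List[int]:
    """Nodes visited by x from the root (node 0) down to its leaf."""
    node, path = 0, [0]
    while "feature" in tree[node]:
        split = tree[node]
        node = split["left"] if x[split["feature"]] <= split["threshold"] else split["right"]
        path.append(node)
    return path


def node_weights(tree: Dict[int, dict]) -> Dict[int, float]:
    """Weight of a node = its value minus its parent's value (root keeps its value)."""
    weights = {0: tree[0]["value"]}
    for node in tree.values():
        if "feature" in node:
            for child in (node["left"], node["right"]):
                weights[child] = tree[child]["value"] - node["value"]
    return weights


def predict_linear(tree: Dict[int, dict], x: List[float]) -> float:
    """Prediction as <weights, indicators>, with one indicator per node."""
    on_path = set(path_nodes(tree, x))
    weights = node_weights(tree)
    return sum(weights[n] * (1.0 if n in on_path else 0.0) for n in tree)


x = [2.0, 0.9]
print(path_nodes(TREE, x))      # nodes visited by x
print(predict_linear(TREE, x))  # equals the value stored at x's leaf
```

The per-node terms of the sum are one simple form of exact additive attribution: they add up to the prediction with no remainder.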

10.
medRxiv (Medicine) 2026-06-18

Urinary Creatine Riboside Complements PSA to Improve Disease Detection in the Diagnostic Gray Zone of Prostate Cancer

Circulating prostate-specific antigen (PSA) discriminates poorly in the diagnostic gray zone (3.0-9.99 ng/mL), where ~75% of biopsies yield no clinically significant prostate cancer (PCa). We evaluated whether urinary creatine riboside (CR), a tumor-derived metabolite excreted through the prostatic urethra, complements PSA for gray-zone detection and independently predicts prostate-cancer-specific mortality (PCSM). In the NCI-Maryland PCa Case-Control Study (951 cases, 962 controls; 47.6% African American men; median follow-up 11.5 years), urinary CR was quantified by UPLC-MS/MS. Within the PSA gray zone (n = 668), urinary CR was complementary to PSA, with markedly higher single-marker discrimination than PSA (AUC 0.93, 95% CI 0.88-0.98 vs 0.77, 0.66-0.89) and additive when combined (ΔAUC +0.17, p < 0.001; 91.4% sensitivity at 80% specificity). After adjustment for 11 clinical and sociodemographic covariates, urinary CR independently predicted PCSM complementary to PSA (Fine-Gray SHR 1.72, 1.35-2.19 for CR; 1.35, 1.08-1.68 for PSA; Harrell's C 0.85 for CR + PSA vs 0.77 for PSA alone), with strongest signal in African American men (SHR 2.43, 1.57-3.75 for CR). We conclude that urinary CR is a candidate non-invasive biomarker complementary to PSA - improving gray-zone triage and predicting PCSM; prospective validation in biopsy-referred cohorts is warranted.
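
For readers less familiar with the AUC figures quoted above, the following sketch computes an AUC as the fraction of case-control pairs in which the case has the higher marker value, counting ties as half. The marker values are synthetic numbers invented for illustration and are unrelated to the study data; the function name is likewise an assumption.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a random case scores higher than a random control."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


# Synthetic marker values, for illustration only.
controls = [0.1, 0.2, 0.3, 0.4]
cases = [0.35, 0.5, 0.6, 0.7]

print(auc(cases, controls))  # 15 of 16 pairs are ordered correctly
```

An AUC of 0.5 corresponds to a marker that orders cases and controls no better than chance, and 1.0 to perfect separation.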

11.
arXiv (CS.AI) 2026-06-15

Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack

arXiv:2606.14409v1 Announce Type: cross Abstract: In this report, we present Hy-Embodied-0.5-VLA, abbreviated as HyVLA-0.5, an end-to-end system that spans the full robot learning stack: data collection, model design, continued pre-training and supervised fine-tuning, RL post-training, and real-world deployment. Each component serves a distinct role in this stack.

12.
Nature Biotechnology 2026-06-19

Efficient site-specific gene addition using R2 retrotransposons in tobacco and rice

Precise integration of multikilobase DNA fragments remains a major technical barrier in plants. Here we introduce non-long terminal repeat (non-LTR) R2 retrotransposons as a versatile system for targeted gene integration in plants. We reconstituted R2 activity in Nicotiana benthamiana and benchmarked insertion efficiency and fidelity using a TMV-based episomal reporter system. We demonstrate site-specific integration of GFP (2.2 kb) and recombinase-compatible landing pads (0.6 kb) into 28S rDNA arrays, with intact cassette insertion frequencies up to 75% and 53%, respectively. To temporally constrain donor availability and avoid DNA intermediates, we combined in planta effector expression with recombinant RNA virus-mediated donor delivery. We apply R2 retrotransposons for targeted insertion of resistance cassettes within the rDNA of rice callus, achieving integration efficiencies up to 17%. These results position R2 retrotransposons as a double-strand break-free system for RNA-templated insertion of multikilobase gene cassettes at rDNA loci, for safe-harbor trait stacking in plants with potential applications in crop improvement and synthetic biology. Retrotransposons are applied in plants for safe-harbor transgene integration.

13.
arXiv (CS.CV) 2026-06-16

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.
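
The scale mismatch described above can be shown with a few lines of arithmetic. The sketch below assumes that one token-grid step covers `gsd * patch_px` metres and that offsets are normalised by a reference step before being turned into a rotary angle; these formulas, the function names, and the reference step are illustrative assumptions rather than the paper's definition of GCC.

```python
import math


def ground_step_m(gsd_m_per_px: float, patch_px: int) -> float:
    """Ground distance covered by one token-grid step (assumed: GSD x patch size)."""
    return gsd_m_per_px * patch_px


def geo_calibrated_offset(grid_offset: int, gsd_m_per_px: float,
                          patch_px: int, reference_step_m: float) -> float:
    """Rescale a token-grid offset into units of a hypothetical reference step."""
    return grid_offset * ground_step_m(gsd_m_per_px, patch_px) / reference_step_m


def rotary_angle(offset: float, frequency: float) -> float:
    """Relative rotation angle that a rotary embedding assigns to an offset."""
    return offset * frequency


GRID_OFFSET = 3           # the same offset of three tokens for both sensors
PATCH_PX = 16
REFERENCE_STEP_M = 160.0  # hypothetical normalisation constant
FREQUENCY = 0.1           # one hypothetical rotary frequency

for name, gsd in [("coarse sensor, 10 m/px", 10.0), ("fine sensor, 0.5 m/px", 0.5)]:
    metres = GRID_OFFSET * ground_step_m(gsd, PATCH_PX)
    calibrated = geo_calibrated_offset(GRID_OFFSET, gsd, PATCH_PX, REFERENCE_STEP_M)
    print(name, metres, calibrated, rotary_angle(calibrated, FREQUENCY))
```

With raw grid offsets both sensors would receive the same rotary angle even though the underlying ground distances differ by a factor of 20; after rescaling, the angles differ by that same factor.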

14.
arXiv (CS.AI) 2026-06-11

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

arXiv:2606.12018v1 Announce Type: new Abstract: We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of training data from IntentTrain, we achieve state-of-the-art results. Code is available at https://github.com/eeee-sys/MODF-SIR, a demo at https://huggingface.co/spaces/Harry-1234/MODF-SIR, the LoRA at https://huggingface.co/Harry-1234/MODF-SIR, and the dataset for training the router at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.

15.
arXiv (CS.AI) 2026-06-11

Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions

arXiv:2604.25018v2 Announce Type: replace-cross Abstract: The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) by integrating people, data, processes, and things into a unified intelligent ecosystem. IoE aims to enhance automation, decision-making, and service efficiency across multiple application domains such as smart cities, healthcare, industry, and next-generation wireless networks. This paper provides a structured overview of the IoE concept, its core components, architectural foundations, enabling technologies, and major research challenges. Finally, open research directions toward 6G-enabled intelligent IoE systems are discussed, with emphasis on scalability, security, privacy, and energy efficiency.

16.
medRxiv (Medicine) 2026-06-17

Nickel and Dimed: How a Common Earth Element is Short-Changing Our Health

Nickel has been studied for a long time as an environmental contaminant but less so in its connection to population health. It does not announce itself as loudly as its transition metal brethren like mercury and cadmium, but its chemical properties permit it to be deleterious as a low-dose, chronic exposure, particularly among those with immune systems sensitized to it. There is a growing evidence base and vocabulary to discuss nickel's effect on health. However, in the U.S., there are no recent, reliable estimates of the share of the population with a nickel allergy, let alone how much nickel Americans are exposed to through their diet. This paper seeks to close this evidence gap by creating a new dataset of dietary nickel and other heavy metal exposure and assessing how high levels of dietary nickel exposure shape local demand for health care services. We use soil data from the U.S. Geological Survey and data on agricultural product transport from FoodFlows.org to create a county-level dietary nickel exposure index. We then use a large electronic health record database and double machine learning to estimate how demand for primary care services varies across levels of dietary nickel exposure. We find that counties with high nickel exposure experience an increase in the share of primary care office visits for symptoms highly suggestive of nickel poisoning. This result survives multiple hypothesis test corrections and placebo tests. Our research suggests that nickel has harmful effects on individual health whose exposure can be measured at a population level, and is shaping primary care across the U.S.

17.
medRxiv (Medicine) 2026-06-15

Supporting people to access social security payments through the Special Rules for End of Life: a qualitative study of the perspectives of patients, carers and health care professionals

Background: People living with terminal illness face a double financial burden from additional costs and loss of earnings for themselves and their carers. Social security benefits are intended to help alleviate some of this financial pressure, and in the UK and other countries people are eligible for fast-tracked access to financial support via the Special Rules for End of Life. One in three people who are eligible miss out on this support, yet there is limited evidence on the reasons for this take-up deficit. Objectives: The aim of this study is to understand the barriers and facilitators to claiming benefits for terminally ill people from the perspectives of patients, carers, and health care professionals. Methods: This is a qualitative study combining i) focus groups with healthcare professionals recruited via professional networks and social media, and ii) interviews with patients and carers recruited in hospital and hospice settings. We analysed the data using Practical Thematic Analysis. Results: Fifty-five multidisciplinary healthcare professionals participated in 11 focus groups, and we interviewed 10 patients and carers. We constructed five descriptive themes to summarise the data: navigating priorities and uncertainty; positive impacts alongside a sense of shame and stigma; talking about money, difficulties and dividends; everybody's, yet nobody's, responsibility; and sticking points in the system. Conclusion: The themes reveal several challenges that may contribute to people not taking up this financial support. However, discussions about access to benefits were also seen as a core part of holistic care, a positive way to offer support and a gateway to other discussions about end-of-life care preferences and decisions. Recommendations for policy and practice include evaluating the adoption of a diagnostic rather than a prognostic eligibility criterion, integrating discussions about benefits into existing processes such as advance care planning, and improving education and support for clinicians.

18.
arXiv (CS.AI) 2026-06-17

Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3

arXiv:2606.17127v1 Announce Type: cross Abstract: Antimicrobial resistance causes over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non-natural amino acids and/or chemical modifications, which are essential for real-world peptide drugs. We present AMPGAN v3, a multi-objective conditional GAN that expands the generative vocabulary to D-amino acids and N/C-terminus modifications such as amidation. By separating adversarial and activity-aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram-positive strains, with the best candidate reaching MIC 8 µg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi-agent framework for end-to-end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: https://github.com/marszzibros/AMPGANv3

19.
arXiv (CS.LG) 2026-06-12

Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

arXiv:2606.13605v1 Announce Type: cross Abstract: This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.
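
The "rollout-based upper-tail quantile" check mentioned above can be sketched on a toy problem. The code below simulates a one-dimensional deviation from a nominal trajectory, applies a constant feedback gain, and tests whether the empirical 99% quantile of the terminal error stays under a limit. The dynamics, noise levels, gain values, and limit are all invented for illustration and are not the paper's Earth-Mars or landing models; only sampling access to the uncertainty is assumed, in line with the abstract.

```python
import math
import random
from typing import List


def rollout_terminal_error(gain: float, rng: random.Random, steps: int = 20) -> float:
    """Simulate one noisy rollout of a toy scalar system; return |terminal deviation|."""
    x = rng.gauss(0.0, 1.0)                  # sampled initial deviation from nominal
    for _ in range(steps):
        u = -gain * x                        # feedback correction toward nominal
        x = x + u + rng.gauss(0.0, 0.05)     # sampled process noise
    return abs(x)


def empirical_quantile(samples: List[float], q: float) -> float:
    """Order-statistic estimate of the q-quantile (0 < q <= 1)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def chance_constraint_ok(gain: float, limit: float, q: float = 0.99,
                         n_rollouts: int = 2000, seed: int = 0) -> bool:
    """Empirical check that P(terminal error <= limit) is at least q."""
    rng = random.Random(seed)
    errors = [rollout_terminal_error(gain, rng) for _ in range(n_rollouts)]
    return empirical_quantile(errors, q) <= limit


print(chance_constraint_ok(gain=0.0, limit=0.5))  # open loop
print(chance_constraint_ok(gain=0.5, limit=0.5))  # with feedback correction
```

Because the check only needs samples, swapping the Gaussian draws for any other sampler leaves the rest of the code unchanged, which is the sense in which such a test is distribution-agnostic.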

20.
medRxiv (Medicine) 2026-06-11

PCRAgent: A Multi-Agent Framework for Transforming Noisy Clinical Conversations into Structured Pre-Consultation Medical Records and Reusable Clinical Data Resources

In primary care and outpatient settings, clinically important patient information is often embedded in fragmented, ambiguous, repetitive, and noisy communication between physicians and patients. This limits physicians' ability to obtain a clear preconsultation overview of symptoms, history of present illness, and visit intent, while also preventing real-world clinical dialogues from being reused in hospital information systems and medical artificial intelligence applications. To address this challenge, we developed PCRAgent, a centrally coordinated multi-agent framework for preconsultation clinical information organization. Guided by physician inquiry logic, PCRAgent identifies, extracts, corrects, and standardizes patient-reported information from noisy consultations. Its coordinated modules, including error detection, semantic editing, output control, contextual memory, and intent recognition, enable robust parallel handling of spelling errors, repetitions, grammatical inconsistencies, medical ambiguities, and non-medical interference. A traceable edit list records intermediate corrections and context, allowing iterative refinement without redundant modifications. PCRAgent generates two complementary outputs. One is a PreConsultation Clinical Report for rapid physician review. The other is a Structured Clinical Conversation Dataset for hospital data construction and downstream AI applications. In evaluations using 220,000 strongly perturbed consultations, PCRAgent maintained high robustness, achieving a clinical information accuracy of 4.99 out of 5 and key element completeness of 5 out of 5, outperforming GPT-4o. Expert review of Chinese and English dialogues confirmed high clinical accuracy of 4.85 out of 5 and high safety of 4.79 out of 5. Multicenter validation in real-world outpatient workflows further demonstrated practical utility. These findings indicate that PCRAgent can efficiently transform noisy and unstructured consultations into physician-ready reports and AI-ready structured data, improving outpatient efficiency, reducing cognitive burden, ensuring information completeness, supporting precise decision-making, and enabling high-quality reuse of clinical data.

21.
arXiv (CS.CV) 2026-06-19

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup is expensive in terms of both compute and the dense annotations required. To address these major flaws, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS) and an improvement of over 1% in AP over the state of the art in a limited-annotation scenario.

22.
arXiv (CS.LG) 2026-06-16

AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models

arXiv:2602.00482v2 Announce Type: replace Abstract: Reinforcement learning (RL)-based post-training for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that frequently share long token prefixes. Existing RL frameworks usually process these sequences independently during policy training, i.e., repeatedly recomputing identical prefixes in both the forward and backward passes of policy gradient computation, leading to substantial inefficiencies in computation resources and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, packed tree-mask approaches scale poorly in RL settings. In this paper, we introduce AReaL-DTA, which efficiently exploits prefix sharing in RL training. AReaL-DTA employs a depth-first search (DFS)-based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AReaL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. On $\tau^2$-bench, AReaL-DTA improves training throughput by up to $8.31\times$ over dense training and up to $1.70\times$ over sparse training. Our code is available at https://github.com/areal-project/AReaL/tree/feat/dta.
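
The prefix-sharing idea above can be illustrated without any model at all. The sketch below builds a prefix tree over a few toy token sequences, counts how many token positions would be processed with and without sharing, and walks the tree depth-first while holding only one root-to-leaf path in memory. The nested-dictionary layout and function names are assumptions for illustration and say nothing about how AReaL-DTA is actually implemented; the sketch also assumes no rollout is a strict prefix of another.

```python
from typing import Dict, Iterator, List, Optional

PrefixTree = Dict[int, "PrefixTree"]  # token id -> subtree


def build_prefix_tree(sequences: List[List[int]]) -> PrefixTree:
    """Insert every token sequence into a nested-dict trie."""
    root: PrefixTree = {}
    for sequence in sequences:
        node = root
        for token in sequence:
            node = node.setdefault(token, {})
    return root


def count_nodes(tree: PrefixTree) -> int:
    """Number of distinct (prefix, token) positions, i.e. work with sharing."""
    return sum(1 + count_nodes(child) for child in tree.values())


def dfs_paths(tree: PrefixTree, path: Optional[List[int]] = None) -> Iterator[List[int]]:
    """Yield root-to-leaf paths depth-first, reusing a single path buffer."""
    path = [] if path is None else path
    if not tree:
        yield list(path)  # copy out the one path currently materialised
        return
    for token, child in tree.items():
        path.append(token)
        yield from dfs_paths(child, path)
        path.pop()


rollouts = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6]]
tree = build_prefix_tree(rollouts)

print(sum(len(r) for r in rollouts))  # token positions when processed independently
print(count_nodes(tree))              # token positions when prefixes are shared
print(list(dfs_paths(tree)))
```

In this toy case the shared prefix `[1, 2]` is stored once rather than three times, and the depth-first walk never holds more than one path at a time.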

23.
arXiv (CS.AI) 2026-06-12

Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence

arXiv:2605.03847v2. Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm systems, creates environments in which emergent risk is structurally unavoidable: locally correct decisions by individual agents compose into globally unacceptable behavioral trajectories under uncertainty. Existing approaches such as constrained optimization, safe reinforcement learning, and runtime assurance evaluate acceptability at the level of individual actions rather than across behavioral trajectories, and none addresses the multi-participant, uncertainty-laden nature of DCI deployments. This paper introduces mechanical conscience (MC), a novel concept and simplified mathematical framework that operationalizes trajectory-level normative regulation for both single-agent and distributed intelligent systems. Mechanical conscience is defined as a supervisory filter that minimally corrects a baseline policy's actions to reduce cumulative deviation from a normatively admissible region, while accounting for epistemic uncertainty. We introduce associated constructs (conscience score, mechanical guilt, and resonant dependability) that provide an interpretable vocabulary and computable governance signals for this emerging field. Core theoretical properties are established: admissibility equivalence, existence of optimal regulation, and monotonic deviation reduction. Illustrative results demonstrate that MC-regulated agents maintain trajectory-level normative acceptability where conventional controllers drift outside admissible bounds, and that the framework naturally extends to suppress interaction-induced emergent risk in multi-agent DCI settings.

24.
arXiv (CS.CV) 2026-06-16

Text-Vision Co-Instructed Image Editing

Existing image editing methods can be broadly categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but offer only coarse-grained spatial control over the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To combine the strengths of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework that contextualizes drag- or point-based visual instructions with image-text semantics and lifts them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

25.
arXiv (CS.LG) 2026-06-11

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

arXiv:2606.11382v1. Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors; (2) we fuse these student modalities using a novel Finsler geometry-aware module; and (3) we distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at https://github.com/eemokey/glacier.
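
As a rough illustration of the third stage, distillation via contrastive learning, the sketch below computes an InfoNCE-style loss between student and teacher embeddings. The abstract does not say which contrastive objective GLACIER uses, so InfoNCE, the cosine similarity, the temperature value, the function names, and the toy 2-D embeddings are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(student: List[Vector], teacher: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss (assumed objective, not confirmed by the abstract).

    For molecule i, the teacher embedding i is the positive and the
    teacher embeddings of all other molecules in the batch are negatives.
    """
    n = len(student)
    total = 0.0
    for i in range(n):
        logits = [cosine(student[i], teacher[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings for two molecules (illustrative values only).
student = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(student, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(student, [[0.0, 1.0], [1.0, 0.0]])

print(aligned_loss < shuffled_loss)
```

The loss is small when each student embedding is closest to its own teacher embedding and large when the pairing is mismatched, which is the signal a contrastive distillation step would minimize.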