Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.AI) 2026-06-17

From Paper to Program: Knowledge Externalization for AI-Assisted Quantum Many-Body Code Generation

Authors:

arXiv:2604.04089v3 Announce Type: replace-cross Abstract: Large language models can write scientific code, but direct paper-to-program translation remains fragile when correctness depends on tacit conventions in the literature. We identify this bottleneck as knowledge externalization: converting implicit computational assumptions – index conventions, gauge choices, fermionic signs, contraction order, and memory constraints – into an explicit technical specification before implementation. We evaluate a multi-stage, human-in-the-loop workflow that inserts such a specification, with validation and stop gates, between theory extraction and code generation. The workflow is tested on two algorithmically distinct quantum many-body tasks: variational sweep-based Density-Matrix Renormalization Group (DMRG) from a pedagogical review and constructive Pfaffian conversion of Hartree–Fock–Bogoliubov states to matrix product states from the five-page Letter by Jin et al., Phys. Rev. B 105, L081101 (2022), for which no public code is available. For DMRG, all 16 specification-guided model pairings in a $4\times4$ grid satisfy physics-validation criteria, compared with 6/13 direct attempts. A prose-specification ablation indicates that externalized content, not \LaTeX{} formatting, is the essential ingredient. For Pfaffian-MPS, the workflow succeeds in 11/26 archived attempts, whereas direct prompting yields zero audited passes. Cross-specification transfer is asymmetric: non-GPT specifications implemented by GPT~5.5 pass 4/4, while GPT~5.5 specifications implemented by weaker models fail 4/4, indicating a residual implementation-model bottleneck. The resulting Paper-to-Program Many-Body skill provides an auditable protocol for AI-assisted implementation of many-body algorithms and for diagnosing where externalization succeeds or fails.

02.
arXiv (CS.CV) 2026-06-16

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

03.
arXiv (math.PR) 2026-06-16

Large Deviations for the Nonlinear Schrödinger Equation with Randomized Quasi-Periodic Initial Data in Higher Dimensions: Subcritical Case

arXiv:2604.17253v2 Announce Type: replace Abstract: We study the cubic weakly nonlinear Schrödinger equation with randomized spatially quasi-periodic initial data in higher dimensions. Under a polynomial decay assumption in Fourier space, we establish a Large Deviations Principle for rogue waves in the so-called subcritical time regime. The proof proceeds in two main steps. We first characterize the distribution of the linear solution and establish the corresponding linear large deviations principle. The lower bound is obtained via pointwise estimates, while the upper bound follows from a combination of truncation and probabilistic arguments. {The method used in this step appears to be new; compare with [GGKS23].} We then perform a detailed combinatorial analysis of the Picard iteration, deriving an effective bound for the Duhamel term and thereby establishing the nonlinear large deviations principle.

04.
arXiv (CS.AI) 2026-06-11

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

arXiv:2606.12406v1 Announce Type: cross Abstract: Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at https://jasonjzliu.com/factr2

05.
arXiv (CS.CV) 2026-06-12

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.

06.
arXiv (CS.AI) 2026-06-15

When Sample Selection Bias Precipitates Model Collapse

arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.

07.
arXiv (CS.LG) 2026-06-12

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

arXiv:2606.13637v1 Announce Type: new Abstract: Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

08.
arXiv (CS.CL) 2026-06-15

MedLatentDx: Latent Multi-Agent Communication for Cross-Hospital Rare-Disease Diagnosis

Rare diseases affect over $300$ million patients across more than $7{,}000$ conditions, yet no single hospital encounters enough cases of any one condition for reliable diagnosis. Cross-hospital collaboration could help by allowing a diagnosing institution to use distributed, case-specific diagnostic evidence, but privacy regulations restrict the transmission of identifiable clinical text across institutional boundaries. This setting raises two challenges: existing medical agent systems often rely on textual evidence exchange, while raw latent states such as hidden states and KV caches may still reveal prompt-derived clinical content. We introduce MedLatentDx, a latent multi-agent communication framework in which hospital agents keep private clinical records and retrieved cases local, and send compact latent KV blocks to a host agent for rare-disease diagnosis. MedLatentDx supports two deployment settings: same-backbone hospital agents use latent KV distillation, while hospitals with different LLM backbones use cross-family latent alignment. On CrossRare-Bench, a self-built large-scale rare-disease benchmark with hospital-level partitions, MedLatentDx improves cross-hospital diagnostic performance while reducing reconstructable clinical content relative to raw-latent communication baselines.

09.
arXiv (quant-ph) 2026-06-16

Experimental quantum state learning with pairs of photons

arXiv:2606.16932v1 Announce Type: new Abstract: Tomography allows one to estimate the density matrix describing the state an ensemble of quantum systems are prepared in (for example, polarization tomography determines the polarization state of a beam of identically prepared photons). In general, it is not possible to uniquely decompose the density matrix into its pure state components. Agarwal et al. proposed a protocol which, for a mixture composed of any two pure states of a qubit (with arbitrary probabilities), allows an observer to infer not only the density matrix but the identity of those specific pure states and their weights - the additional requirement being that the qubits arrive in pairs, where both qubits in each pair are in the same state. We experimentally demonstrate this learning-from-pairs concept using photons in the polarization degree of freedom. We use tomography to measure a sequence of single photons and make use of their time-of-arrival information to 'pair up' the photons after the measurement. From here we are able to infer the photons' polarization states and their respective probabilities, and we demonstrate this for various different choices of polarization states and ratios. Finally, we investigate our ability to discriminate between two equal mixtures of distinct pairs of orthogonal polarization states. We find that on the order of approx. 10e4 photons is typically enough to achieve tomography fidelities of approximately 0.9999. This is sufficient to discriminate between two different preparations of the same mixed state, differing by angles of less than 5 degrees between the pure states used in the two preparations.

10.
arXiv (CS.AI) 2026-06-12

Agents-K1: Towards Agent-native Knowledge Orchestration

arXiv:2606.13669v1 Announce Type: new Abstract: Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat \texttt{cites} edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

11.
arXiv (CS.LG) 2026-06-12

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

arXiv:2603.11242v2 Announce Type: replace-cross Abstract: Evaluating and interpreting latent representations, such as variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework – bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs to set the foundation of result aggregation. We also introduce a convenient scalar latent space separation index (LSSI) based on the GAS-aligned outputs of FVH-LT and DBSR-LS to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness FVH-LT, DBSR-LS, and LSSI in on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework achieves more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI makes an effective quantitative summary of latent structural separation.

12.
medRxiv (Medicine) 2026-06-17

Hormonal Contraceptives Drive Genital Lipid Metabolism Reprogramming and Susceptibility to HIV Infection

Heterosexual genital HIV transmission is a major driver of new infections, particularly in women, making them disproportionately vulnerable to HIV acquisition. Previous studies have associated injectable hormonal contraceptives (HC) with increasing susceptibility to HIV. Yet, the underlying molecular mechanism remains incompletely understood. Given the structural and signaling role of lipids in the female genital tract, cervicovaginal lipidomic profiling has the potential to reveal the mechanistic interplay among HC, lipidome, and HIV susceptibility in the female genital tract. We conducted untargeted cervicovaginal lipidomics study in a cohort of high-risk, HIV-negative, Kenyan sex workers who were using injectable depot medroxyprogesterone acetate (DMPA), oral contraceptive pill (OCP), or no hormonal contraception (NH). Genital lipids were quantitatively analyzed using liquid chromatography-mass spectrometry (LC-MS) and bioinformatics platforms. A total of 1045 lipid species were identified in the cervicovaginal lavage samples. Injectable DMPA significantly downregulated major structural and signaling membrane lipids, including phospholipids, ceramides, sphingomyelins, and glycosphingolipids (p

13.
arXiv (CS.AI) 2026-06-16

ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

arXiv:2606.16826v1 Announce Type: cross Abstract: Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.

14.
arXiv (math.PR) 2026-06-18

Metastability for the Curie-Weiss-Potts model with unbounded random interactions

arXiv:2505.11260v2 Announce Type: replace Abstract: We analyse the metastable behaviour of the disordered Curie–Weiss–Potts (DCWP) model subject to a Glauber dynamics. The model is a randomly disordered version of the mean-field $q$-spin Potts model (CWP), where the interaction coefficients between spins are general independent random variables. These random variables are chosen to have fixed mean (for simplicity taken to be $1$) and well defined cumulant generating function, with a fixed distribution not depending on the number of particles. The system evolves as a discrete-time Markov chain with single spin flip Metropolis dynamics at finite inverse temperature $\beta$. We provide a comparison of the metastable behaviour of the CWP and DCWP models, when $N \to \infty$. First, we establish the metastability of the CWP model and, using this result, prove metastability for the DCWP model (with high probability). We then determine the ratio between the metastable transition time for the DCWP model and the corresponding time for the CWP model. Specifically, we derive the asymptotic tail behavior and moments of this ratio. Our proof combines the potential-theoretic approach to metastability with concentration of measure techniques, the latter adapted to our specific context.

15.
medRxiv (Medicine) 2026-06-17

Multi-strain Probiotics Alter Gut Microbiota and Estrobolome Pathways in Primary Dysmenorrhea

Background: Exact cause of primary dysmenorrhoea is unknown but recent evidence uncovers a potential link between gut dysbiosis and benign gynaecological disorder via disruption of estrobolome. Methods: A randomized controlled trial to investigate the effects of multi-strain oral probiotics on primary dysmenorrhoea has been conducted. This is a secondary analysis comparing the stool microbiome in women with primary dysmenorrhoea and those without (control), and the effects of treatment with probiotics versus placebo. Results: Although microbial richness and evenness were comparable between groups (alpha diversity, p > 0.05), gut microbial community composition differed significantly (Bray Curtis PERMANOVA, p = 0.015), characterised by reduced Bifidobacterium adolescentis and Blautia and enrichment of Faecalibacterium in dysmenorrhoea, alongside condition-specific core taxa. Post-intervention analysis revealed significant shifts in microbial community structure between pre- and post-treatment groups (PERMANOVA, F = 2.11, p = 0.005), with probiotic supplementation inducing more consistent and directed microbiome changes than placebo, without altering alpha diversity (p > 0.05). Functional prediction showed no significant difference in overall beta glucuronidase pathway abundance (p > 0.05); however, dysmenorrhoea was associated with higher abundance of beta glucuronidase producing taxa (MaAsLin2, q < 0.05) that were differentially modulated by probiotic treatment. Conclusion: This discovery provides evidence on the microbial disruption in primary dysmenorrhoea as well as the benefit of probiotics to modulate the intestinal microbiota to improve the condition.

16.
arXiv (CS.CV) 2026-06-15

Relational Retrieval: Leveraging Known-Novel Interactions for Generalized Category Discovery

In this study, we tackle Generalized Category Discovery (GCD) via a Relational Retrieval perspective, explicitly coupling labeled and unlabeled data through bidirectional knowledge transfer. While existing methods treat these sources separately, missing valuable interaction opportunities, we propose Relational Pattern Consistency (RPC) that enables mutual enhancement. RPC employs One-vs-All classifiers for soft ID/OOD decomposition, then introduces two mechanisms: (i) for known-class preservation, we transfer semantic behavioral alignment; (ii) for category discovery, we leverage the insight that samples from the same category maintain invariant relationships with known-class prototypes, transforming unreliable pseudo-labeling into well-defined relational pattern matching. This bidirectional design allows labeled data to guide unlabeled learning while discovering novel categories through their collective relational signatures. Extensive experiments demonstrate RPC achieves state-of-the-art performance on both generic and fine-grained benchmarks.

17.
arXiv (quant-ph) 2026-06-12

Stable, bidirectional electro-optic transduction in thin film lithium tantalate

arXiv:2606.12726v1 Announce Type: new Abstract: Efficient and stable microwave-optical transduction is a key enabling technology for distributed superconducting quantum computing and heterogeneous quantum networks. Electro-optic transducers based on thin-film lithium niobate (TFLN) have shown strong promise, but demonstrations to date have been limited by various factors such as low frequency bias drift, low efficiency, fabrication complexity, and scalability. Here we demonstrate the first integrated electro-optic microwave-optical transducers realized in thin-film lithium tantalate (TFLT), a material platform offering Pockels nonlinearity comparable to TFLN together with improved bias stability and high-power handling. We fabricate superconducting microwave resonators coupled to tunable photonic-molecule optical resonators using wafer-scale deep ultraviolet lithography, offering high-throughput production of hundreds of devices per wafer. Across six devices we observe coherent bidirectional conversion between C-band optical photons and 4.9-5.5 GHz microwave photons, with measured on-chip efficiencies and inferred single-photon coupling rates g_0/2{\pi} ~ 1 kHz consistent with theory. Continuous operation over multiple days is achieved using a static bias field with minimal feedback, demonstrating a major operational advantage. We further characterize optical loss statistics, microwave resonator performance, and optically induced added noise under pulsed pumping, finding less than one added photon for 100 microsecond pulses at the highest measured efficiencies. These results establish TFLT as a scalable and robust electro-optic platform for future quantum interconnects and modular quantum processors.

18.
arXiv (CS.AI) 2026-06-17

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

arXiv:2606.17546v1 Announce Type: new Abstract: Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

19.
arXiv (CS.CV) 2026-06-16

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +26.2 percentage points on ScienceQA and +9.1 percentage points on EgoSchema.

20.
arXiv (CS.AI) 2026-06-11

Grounding Computer Use Agents on Human Demonstrations

arXiv:2511.07332v2 Announce Type: replace-cross Abstract: Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data,. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

21.
arXiv (CS.CL) 2026-06-19

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.

22.
arXiv (CS.LG) 2026-06-16

Biarchetype analysis for univariate functional data. An application to macroeconomic financial time series

arXiv:2606.15881v1 Announce Type: cross Abstract: We introduce biarchetype analysis for the first time in the context of univariate functional data. This unsupervised methodology extends archetype analysis by simultaneously identifying archetypal structures across both the cases (countries, in our application) and the temporal argument. Both cases and time points are expressed as mixtures of biarchetypes, yielding a concise and highly interpretable representation of complex functional observations. Although biarchetype analysis is not intended as a clustering technique, it offers superior interpretability compared with biclustering approaches, as it is based on extreme, representative patterns rather than average centroids, thereby enhancing human comprehension. We apply the proposed method to 10-year government bond yields of European countries over the period 2001-2025. The results identify three distinct time regimes (the pre-crisis period, the euro-area sovereign debt crisis, and the post-crisis period), and reveal Germany, Greece, and Hungary as country archetypes.

23.
arXiv (CS.CL) 2026-06-16

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

24.
arXiv (CS.AI) 2026-06-19

RIVET: Robust Idempotent Voice Attribute Editing

arXiv:2606.19629v1 Announce Type: cross Abstract: Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.

25.
arXiv (CS.CL) 2026-06-19

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

Policy-adherent tool-calling agents in customer-service domains must maintain task states across turns while calling tools and obeying domain policies. Task states consist of relevant facts, identifiers, constraints, and conditions observed through user interaction and tool calls. In standard agents, task states are not represented separately. Observations, tool returns, and policy instructions are placed in the prompt, leaving agents to reconstruct the relevant states from the prompt each time they decide what to do next. This design makes state management implicit, creating two common failure modes. An agent may retrieve the right facts but later ground its decision in stale, missing, or incorrect information; and a syntactically valid tool call may still violate a domain policy that depends on the current task state. We introduce \textsc{LedgerAgent}, an inference-time method for tool-calling agents that maintains observed task states in a separate ledger and renders the states into the prompt. The ledger is also used to check state-dependent policy constraints before environment-changing tool calls are executed, blocking policy violations. Across four customer-service domains and a mixed panel of open- and closed-weight models, \textsc{LedgerAgent} improves average pass\textasciicircum{}k over a standard prompt-based tool-calling approach, with the largest gains under stricter multi-trial consistency metrics.