Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.CL) 2026-06-15

Verbatim Chunks Beat Extracted Artifacts: A Controlled Ablation of Memory Representations for Long LLM Conversations

Authors:

A growing class of conversational-memory systems compresses dialogue history into structured artifacts – extracted facts, decisions, or events – on the premise that distilled structure retrieves better than raw text. We test this premise with a controlled ablation: within one fixed retrieval-rerank-reasoning pipeline, we swap only the stored representation – LLM-extracted typed artifacts versus verbatim conversation chunks – holding the model, retriever, reranker, and judge constant. Verbatim chunks win by 15.9 points on LoCoMo (43.9% vs. 28.0%) and 22.0 points on LongMemEval-S (67.4% vs. 45.4%); a 1-hop semantic graph does not recover the gap, and five confound controls reproduce the effect. The mechanism is lossy distillation: extraction discards verbatim detail that chunks retain for free, and the extracted-artifact pipeline never beats naive RAG in overall accuracy. Concurrent positive results with near-verbatim, provenance-preserving units fit the same account: retrieval accuracy tracks how far the representation departs from the source. For the extraction designs we test, structured memory should augment verbatim text rather than replace it: a chunks $\cup$ artifacts union store matches chunks on both benchmarks while artifacts alone forfeit the gap. Code and data: https://github.com/tao-hpu/cog-canvas

02.
arXiv (math.PR) 2026-06-24

Strong duality for the GROW criterion

arXiv:2606.24768v1 Announce Type: cross Abstract: This paper presents general strong duality results when testing hypotheses by betting against them. A bet is an e-variable for a composite null hypothesis $\mathcal{P}$: a nonnegative random variable $X$ whose expected value is at most one under every $\P \in \Pcal$. Following Kelly, Breiman, Cover, Shafer, Grünwald and others, we study a natural minimax log-optimality criterion: given a composite alternative $\Qcal$, we characterize the ``GROW value'' $\sup_{X} \inf_{\Q} \E_{\Q}[\log X]$. This paper generalizes the results of [larsson2025numeraire] from (arbitrary $\Pcal$ and) simple $\Qcal$ to arbitrary $\Qcal$. We identify a weak-$*$ joint information projection pair between arbitrary $\Pcal$ and $\Qcal$ that always exists and show that the GROW value for bounded e-variables always equals the relative entropy of this pair, without any restrictions on $\Pcal$ or $\Qcal$. We also prove a similarly general strong duality for the REGROW criterion with bounded e-variables and arbitrary bounded offsets. Under various assumptions our results extend to unbounded e-variables, and examples show that without any assumptions such extensions fail. Our results are analogous to those in[larsson2026complete], swapping tests for bounded e-variables, minimax risk for the GROW criterion, and total variation for relative entropy.

03.
arXiv (CS.AI) 2026-06-11

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

arXiv:2509.09794v5 Announce Type: replace Abstract: Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting data-driven analysis across building and urban energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves greater visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset, finding that our synthetic data overlaps more than 95% for three of the four selected variables. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, and therefore enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.

04.
arXiv (CS.AI) 2026-06-24

OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

arXiv:2606.24799v1 Announce Type: cross Abstract: Generic text-to-video models can be used as rich open-world scene priors. Despite the high quality of today's generated videos, they do not directly yield reliable 3D assets: camera motion is difficult to control, view coverage is partial, and frames often contain inconsistencies across time. We introduce OrbitForge, an adapter built from frozen video priors and per-prompt Gaussian Splatting reconstruction optimization that converts a single text-generated video into a canonical closed-orbit 3D Gaussian Splatting scene. We use 3D reconstruction as an anchor to improve the 3D consistency of the generated video. We obtain a preliminary 3D reconstruction from a first generated video via Deformable Gaussian Splatting with a robust MedianGS proxy. We render views from a prescribed orbit to detect missing viewpoints. OrbitForge uses the text-to-video model to complete only the missing views, and reconstructs the completed orbit into a final Gaussian Splatting scene. This design requires no task-specific video or multiview fine-tuning, avoids per-prompt score-distillation optimization, and does not progressively generate views one step at a time. We further argue that this setting demands coverage-aware evaluation: local smoothness alone rewards methods that never attempt a full orbit. On a frozen 300-prompt T3Bench-derived audit, OrbitForge reconstruction attains a 359.0-degree measured median span, raises originally unsupported-bin Q10 ImageReward from 8.07 to 16.36 relative to MedianGS-only reconstruction, while remaining competitive with VideoMV on the coverage-quality.

05.
arXiv (CS.CV) 2026-06-12

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

06.
arXiv (CS.CL) 2026-06-15

Detecting undisclosed LLM-generated content in parliamentary texts

In this paper, we evaluate the extent of undisclosed LLM-generated content in texts from the parliaments of the United Kingdom and Sweden. In many areas, such as in journalism or in academic writing, there are often requirements to clearly disclose whether AI tools, such as LLMs, have been used. In the case of parliamentary texts, the guidelines on disclosure of AI use are more vague. However, in order to maintain transparency and retain public trust, it is generally recommended that parliamentarians should state whether or not they have used AI when writing texts, such as parliamentary motions. Here, we train an interpretable (glass-box) text classifier using pre-LLM parliamentary texts and LLM-generated versions of such texts. We then apply the classifier to a test set containing recent parliamentary texts, finding a steady increase in undisclosed LLM use, in both parliaments, from 2022 onwards.

07.
arXiv (CS.CL) 2026-06-18

Learning User Simulators with Turing Rewards

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose {Turing-RL}: a Turing-Test-based reinforcement learning approach for training user simulator models. {Turing-RL} uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains–conversational chat and Reddit forum discussion–we find that {Turing-RL} consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

08.
arXiv (CS.CV) 2026-06-18

Stimulus Motion Perception Studies Imply Specific Neural Computations in Human Visual Stabilization

Even during fixation the human eye is constantly in low amplitude motion, jittering over small angles in random directions at up to 100Hz. This motion results in all features of the image on the retina constantly traversing a number of cones, yet objects which are stable in the world are perceived to be stable, and any object which is moving in the world is perceived to be moving. A series of experiments carried out over a dozen years revealed the psychophysics of visual stabilization to be more nuanced than might be assumed, say, from the mechanics of stabilization of camera images, or what might be assumed to be the simplest solution from an evolutionary perspective. The psychophysics revealed by the experiments strongly implies a specific set of operations on retinal signals resulting in the observed stabilization behavior. The presentation is in two levels. First is a functional description of the action of the mechanism that is very likely responsible for the experimentally observed behavior. Second is a more speculative proposal of circuit-level neural elements that might implement the functional behavior.

09.
arXiv (CS.LG) 2026-06-17

A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian Noise

arXiv:2606.18183v1 Announce Type: cross Abstract: Temporal difference (TD) learning with linear function approximation is a core method for policy evaluation. Its classical continuous-time description is an ordinary differential equation (ODE), which captures the asymptotic mean dynamics but neglects stochastic fluctuations determining the error floor. We introduce a stochastic differential equation (SDE) approximation for linear TD(0) under Markovian noise. The resulting model distinguishes the contraction dynamics governed by the projected Bellman operator from the influence of Markovian sampling. As a consequence, the model explains the constant-stepsize error floor through the interaction between Markovian long-run covariance and the contraction geometry of the projected Bellman operator.

10.
arXiv (CS.LG) 2026-06-18

ToolChain-CRC: Conformal Risk Control for Agentic AI Under Retrieval and Tool-Use Drift

arXiv:2606.18467v1 Announce Type: cross Abstract: Modern AI agents retrieve documents, call tools, check intermediate information, and then produce a final answer or action. This creates a risk-control problem that is not visible from the final answer alone. A final response may look acceptable even when the retrieval was weak, a tool output was wrong, or an earlier step was unsupported. We propose ToolChain-CRC, a conformal risk-control method for retrieval-augmented and tool-using agents under drift. The method treats each agent run as a full trajectory of actions, observations, and final output. It builds step-level risk scores, combines them into a trajectory risk score, calibrates an accept-or-intervene rule, and adds an anytime alarm that can stop risky runs before the final answer. We prove trajectory-level risk control under exchangeable calibration runs, give a drift-aware extension with auditable constants, and prove an anytime escalation rule through a supermartingale construction. Experiments cover synthetic tool-chain drift, RAG/tool-use stress tests, public SQuAD-derived retrieval tasks, an API-free agentic QA case study, ablations, target-risk sensitivity checks, 20-seed robustness checks, a drift-margin audit, and a live RAG/tool-use agent benchmark. Across these settings, final-answer-only calibration can miss retrieval and tool failures, while trajectory-level calibration keeps accepted-trajectory risk below the target.

11.
arXiv (CS.AI) 2026-06-18

Self-Evolving Multi-Agent Systems via Textual Backpropagation

arXiv:2506.09046v3 Announce Type: replace-cross Abstract: Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

12.
arXiv (CS.AI) 2026-06-16

Probing Low Frame Rate Degradation in Neural Audio Codecs

arXiv:2606.16969v1 Announce Type: cross Abstract: Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.

13.
arXiv (CS.CL) 2026-06-19

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

14.
arXiv (math.PR) 2026-06-11

Consensus on Dynamic Stochastic Block Models: Fast Convergence and Phase Transitions

arXiv:2209.03999v2 Announce Type: replace Abstract: We introduce two models of consensus following a majority rule on time-evolving stochastic block models (SBM), in which the network evolution is Markovian or non-Markovian. Under the majority rule, in each round, each agent simultaneously updates their opinion according to the majority of their neighbors. Our network has a community structure and randomly evolves with time. In contrast to the classic setting, the dynamics is not purely deterministic, and reflects the structure of SBM by resampling the connections at each step, making agents with the same opinion more likely to connect than those with different opinions. In the Markovian model, connections between agents are resampled at each step according to the SBM law and each agent updates their opinion via the majority rule. We prove a power-of-one type result, i.e., any initial bias leads to a non-trivial advantage of winning in the end, uniformly in the size of the network. In the non-Markovian model, a connection between two agents is resampled according to the SBM law only when at least one of them changes opinion and is otherwise kept the same. We identify the phase-transition threshold, up to the second-order leading term, between halting and fast convergence to consensus. We also give sufficient initial-lead conditions for consensus to occur within one, two, or three rounds.

15.
arXiv (CS.AI) 2026-06-11

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

arXiv:2602.19502v2 Announce Type: replace Abstract: Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points, multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment that no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

16.
Science (Express) 2026-04-16

Protein-templated synthesis of dinucleotide repeat DNA by an antiphage reverse transcriptase | Science

Authors: Unknown Author

Defense-associated reverse transcriptases (DRTs) are widespread bacterial anti-phage systems that use unconventional mechanisms of polynucleotide synthesis. We show that DRT3, which comprises two distinct RTs (Drt3a and Drt3b) and a noncoding RNA (ncRNA), synthesizes alternating poly(GT/AC) double-stranded DNA. Cryo–electron microscopy structures at 2.6 Å resolution reveal a D3-symmetric 6:6:6 complex of Drt3a, Drt3b, and ncRNA. Drt3a produces the poly(GT) strand using a conserved ACACAC template within the ncRNA. Notably, Drt3b synthesizes a complementary, protein-primed poly(AC) strand in the complete absence of a nucleic acid template, using conserved active site residues specific to Drt3b to enforce precise base alternation. These findings expand the functional landscape of nucleic acid polymerases, revealing a protein-templated mechanism for sequence-specific DNA synthesis.

17.
arXiv (CS.AI) 2026-06-18

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming

arXiv:2606.18293v1 Announce Type: cross Abstract: Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed `vibe coding.' It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

18.
arXiv (CS.AI) 2026-06-16

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arXiv:2606.15862v1 Announce Type: new Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

19.
arXiv (CS.LG) 2026-06-19

Physics-Informed Discovery of Yield Functions in Plasticity via Convex Neural Representations

arXiv:2606.19375v1 Announce Type: new Abstract: Identifying anisotropic yield functions remains challenging since yielding is not directly observed in full-field mechanical measurements, directional calibration can require many loading directions, and selecting an appropriate analytical form is nontrivial. This study proposes a physics-informed framework for discovering yield functions from full-field displacement data and reaction force data, without stress observations, plastic strain measurements, direct yield surface data, or a prescribed parametric yield function. The framework identifies the yield function as a mechanically constrained constitutive component inside elastoplastic stress integration, rather than through direct stress-space supervision. The yield function is represented by a convex neural network that enforces convexity and positive homogeneity of degree one while imposing the assumed tension-compression symmetry, and this neural yield function is trained with a differentiable stress update and a physics-informed force equilibrium loss across multiple loading cases. The proposed framework is validated using finite element (FE) benchmark studies with von Mises, Hill 1948, and Yld2000-2d yield functions, assessing yield contour agreement, displacement-noise sensitivity, identifiability through plastically active stress states, epistemic uncertainty, and polynomial-surrogate deployment. This study provides a mechanics-constrained pathway for discovering anisotropic yield functions from displacement and force data while keeping the identified component within the structure of elastoplastic stress integration.

20.
arXiv (CS.AI) 2026-06-24

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

arXiv:2606.24450v1 Announce Type: cross Abstract: Perceiving physical contact is fundamental to dexterous manipulation. While robots often rely on dedicated hardware tactile sensors, humans exhibit a remarkable ability to infer contact by integrating visual information with an innate sense of their body's pose and movement. Inspired by this embodied perceptual skill, we investigate whether a robot can learn to infer contact from vision, an approach that also offers a scalable alternative to tactile hardware specifically for binary contact estimation, which faces practical challenges in cost, fragility, and integration. We present NoContactNoWorries, a transformer-based multimodal framework that fuses RGB-D vision with the robot's proprioception to infer binary contact states as a pseudo-tactile signal for hand-object interactions. We validate by training a single contact prediction model on multiple objects and show that the inferred contact signal supports downstream reinforcement learning agents for in-hand object reorientation, generalizing to novel objects. Experiments in both simulation and on a real-world robot validate our approach, highlighting the feasibility of inferring contact from vision and proprioception. Project Page: https://soham2560.github.io/no-contact-no-worries/

21.
arXiv (CS.CL) 2026-06-24

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Self-evolving agents improve over time by distilling experience from past executions and reusing it in future tasks. Existing systems represent such experience either as natural-language text injected into the agent context or as code exposed as callable tools. However, the choice between these representations is typically made at design time rather than derived from the characteristics of the experience itself, leaving the trade-offs between them poorly understood. We present the first controlled study that isolates text memory and code memory over an identical set of experiences. Our results show that the two forms exhibit complementary trade-offs in construction cost, execution efficiency, and transferability, such that neither representation alone is sufficient. Guided by these findings, we propose Metis, a self-evolving agent system built on a hierarchical dual-representation memory. Metis organizes textual experience into execution plans, environment facts, and common pitfalls, and selectively crystallizes recurring plans into validated callable tools. This design combines the broad applicability of text memory with the execution efficiency of code memory while incurring tool-generation cost only when justified by repeated reuse. We evaluate Metis on AppWorld, a challenging benchmark for interactive agents. The results show that Metis improves task accuracy by up to 20.6% over ReAct while reducing execution cost by up to 22.8%. Compared with representative self-evolving agent systems, Metis consistently achieves a better balance between accuracy, execution efficiency, and memory-construction cost.

22.
arXiv (CS.LG) 2026-06-17

Expanding SPHERE-JEPA: A Family of Statistical Regularizers for the Hypersphere

arXiv:2606.17603v1 Announce Type: new Abstract: In Self-Supervised Learning (SSL), preventing representation collapse by explicitly enforcing a uniform distribution on the unit hypersphere has proven to be effective. However, current frameworks typically rely on sliced statistical regularizers such as SIGReg (used in LeJEPA) and SUSReg (used in SPHERE-JEPA), which approximate this continuous objective via Monte Carlo sampling along random 1D directions. This stochasticity injects projection variance into the training gradients, destabilizing optimization, and hindering convergence. In this work, we first show that analytically integrating out these random projections natively yields a deterministic Maximum Mean Discrepancy (MMD), bypassing the variance of sliced methods. Motivated by this equivalence, we formulate full-dimensional objectives for MMD, Kernel Stein Discrepancy (KSD), and Kullback-Leibler (KL) divergence directly on the sphere to enforce a uniform distribution. To prevent spatial bias, we equip these tests with rotationally invariant kernels constructed via spectral theory, systematically evaluating two canonical families: smooth exponential decay (Heat) and strict frequency cutoff (Bandlimited) filters. Empirically, removing projection-induced noise results in more stable optimization, faster convergence, and consistent improvements over stochastic sliced regularizers on ImageNet and Galaxy10. Furthermore, we reveal that the choice of the statistical test shapes the geometry of the learned latent space: MMD and KSD favor locally clustered organization suitable for object-centric domains, whereas the continuous KDE-based KL divergence promotes fine-grained instance separation, yielding the strongest results on unclustered procedural texture retrieval.

23.
arXiv (CS.CV) 2026-06-16

XPASS-Vis: A Dataset for Cross-Domain Personalized Image Aesthetic Assessment

Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains – art, fashion, and landscape – rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60\% (Spearman's $\rho$ = .28) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.

24.
arXiv (CS.CV) 2026-06-11

Slots, Transitions, Loops: Learning Composable World Models for ARC

ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

25.
arXiv (CS.AI) 2026-06-11

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

arXiv:2606.11831v1 Announce Type: cross Abstract: Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational reasoning on discrete potential edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, typically nearing uniform distributions, treat edges as independent entities. This systemic misalignment does not match the real-world systems and yields diffuse and indecisive edge posteriors limiting the reliability of structural discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate latent graph distribution rather than generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration that organizes scattered, uncertain edge posteriors into a more reliable overall structure which can be trained by the diffusion model. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding it towards a distribution closer to the underlying structure. The diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validated our framework, and the results indicate that Diff-prior improves the performance of structure inference and generates more decisive edge posteriors across multiple NRI-family architectures. The code is available on https://github.com/Hardy158118/Diffprior.