Academic Intelligence · Curated Daily

Explore the Frontier of Global Academia

AcademicHub aggregates real-time literature from top journals and preprint platforms. Build your personal research radar and let large language models compile cross-disciplinary analysis briefings automatically.

01.
arXiv (CS.LG) 2026-06-17

A Dynamical Systems Perspective on the Analysis of Neural Networks

arXiv:2507.05164v2 Announce Type: replace-cross Abstract: In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.

02.
arXiv (quant-ph) 2026-06-16

Bath memory as a precision resource in quantum transport

arXiv:2606.17026v1 Announce Type: new Abstract: Structured baths can reshape transport fluctuations in mesoscopic quantum devices, yet a predictive criterion for when this enhances precision has been lacking. We propose a route towards such precision advantages by utilizing bath memory in coherent fermionic transport through a noninteracting quantum-dot chain. Using the Landauer-Büttiker formalism, we derive a dual impedance-matching condition that synchronizes the conductor mode splitting, boundary dissipation, and bath bandwidth, and sustains constructive multimode interference across the transmission window. The analytical predictions for the optimal bath bandwidths show excellent agreement with exact nonequilibrium Green's function calculations of the transport for Lorentzian, Gaussian, and Newns spectral densities. The prescription yields an optimal bath bandwidth at which the current Fano factor is minimized and the thermodynamic and kinetic precision coefficients are simultaneously enhanced beyond their Markovian limits. The alignment of the optimal precision regime with the experimentally accessible current Fano factor minimum thus provides a practical strategy for designing precision-enhanced transport in mesoscopic platforms such as semiconductor quantum-dot arrays and ultracold fermionic channels.

03.
arXiv (CS.CV) 2026-06-18

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurpose the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

04.
arXiv (CS.CV) 2026-06-16

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.

05.
arXiv (CS.CV) 2026-06-15

Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner

Authors:

Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.

06.
arXiv (CS.AI) 2026-06-18

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

Authors:

arXiv:2606.00729v2 Announce Type: replace Abstract: Artificial intelligence in France is often discussed through separate dimensions such as investment, compute, regulation, employment, sovereignty, and education. This viewpoint paper proposes a unified interpretation: France can be analyzed as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), we use HCLM not as a validated econometric model, but as a conceptual and diagnostic lens for interpreting national AI development as a balance between information injection, absorptive capacity, and institutional dissipation. Information injection includes compute, data, talent, research, capital, industrial deployment, and policy experimentation. Institutional dissipation refers to avoidable frictions such as administrative overload, coordination failures, energy constraints, regulatory uncertainty, talent mobility pressures, and weak industrial absorption. Regulation is not treated as mere friction: adaptive governance, trusted data spaces, and safety-oriented standards may increase long-term learning capacity by improving legitimacy, interoperability, and social trust. The central claim is not that a country follows neural-network equations, but that AI sovereignty depends on how effectively it converts distributed information into absorbed, coordinated, and socially legitimate capability. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, absorptive capacity, and coordination mechanisms. It offers a formal heuristic, policy indicators, illustrative scenarios, and implications for France. The numerical results are diagnostic scenarios, not econometric estimates or official rankings. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system that should be tested with historical and cross-country data.

07.
arXiv (CS.AI) 2026-06-15

Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

arXiv:2604.14892v3 Announce Type: replace-cross Abstract: Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias, they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.

08.
arXiv (CS.AI) 2026-06-18

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

arXiv:2508.21720v3 Announce Type: replace Abstract: Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

09.
arXiv (CS.AI) 2026-06-16

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

arXiv:2606.16328v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model's reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM successfully breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools, significantly outperforms seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: https://github.com/irisorchid107/AdaSTORM/.

10.
arXiv (CS.AI) 2026-06-16

Inference-time Policy Steering via Vision and Touch

arXiv:2606.14981v1 Announce Type: cross Abstract: Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

11.
arXiv (CS.AI) 2026-06-16

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

arXiv:2605.29874v2 Announce Type: replace-cross Abstract: Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

12.
arXiv (CS.LG) 2026-06-15

Geometric Domain Adaptation via Optimal Transport for Linear Regression in R^2

arXiv:2606.14023v1 Announce Type: cross Abstract: Optimal Transport has become recently a powerful method for domain adaptation by aligning source and target distributions. We study a supervised domain adaptation problem where source and target domains are related by a rotation or a translation or a homothety in $\mathbb{R}^2$. We prove that the optimal transport map recovers the underlying map when using a $p-$norm cost with $p \geq 2$. Based on this insight, we develop a method combining $K-$means and optimal transport to estimate the underlying map, enabling adaptation of linear regression models when target data is scarce. Simulations demonstrate improved performance over baseline methods. Rather than relying on highly expressive deep learning architectures, we focus on classical machine learning models to emphasize interpretability and theoretical insight. This perspective allows us to explicitly characterize the role of optimal transport in recovering geometric transformations such as rotations, translations, and homotheties. Our contributions include a theoretical result linking optimal transport and rotations, translations and homothecies in $\mathbb{R}^2$, and a practical method for adaptation in linear regression offering both conceptual clarity and applied value in domain adaptation tasks in this space.

13.
arXiv (CS.CL) 2026-06-18

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

14.
arXiv (CS.CV) 2026-06-16

MamBOA: State-Space Architecture for Video Recognition

Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.

15.
arXiv (CS.CV) 2026-06-19

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5\%) in the Preliminary Round and was further evaluated in the Final Round, achieving an `Excellent' overall performance with `Outstanding' results in the question-answering (QA) task. This demonstrating the complementary strengths of CLIP and SigLIP2 and confirming the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.

16.
arXiv (CS.CL) 2026-06-15

Persona-Pruner: Sculpting Lightweight Models for Role-Playing

Language Models (LMs) have shown remarkable potential as role-playing chatbots, delivering consistent, stylized interactions when given a specification of a character or user persona. However, applying these capabilities to real-world applications (e.g., ecosystems with numerous NPCs interacting simultaneously) exposes a critical inefficiency due to the excessive computational cost. In this paper, we question the necessity of dedicating a full, generalist model to a single persona, hypothesizing that a specific character identity relies on only a fraction of the model's total capacity. We observe that naively pruning LMs often severely degrades the role-playing performance for a specific persona; it does not distinguish between redundant knowledge and essential character traits. We propose Persona-Pruner, a framework that sculpts a lightweight role-playing model by isolating persona-specific sub-networks from a single description. Our experiments consistently show that Persona-Pruner preserves role-playing performance substantially more effectively than existing state-of-the-art LLM pruning techniques, reducing the performance drop from the dense model by up to 93.8% over the strongest baseline on RoleBench in LLM-as-a-judge score, while still maintaining general LLM capabilities. Code is available at https://github.com/jsu-kim/Persona-Pruner.

17.
arXiv (CS.CL) 2026-06-11

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.

18.
arXiv (CS.LG) 2026-06-16

Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards

arXiv:2606.15146v1 Announce Type: new Abstract: Stimulated word-of-mouth is a strategy that promotes information sharing through prompts or incentives. Optimizing stimulated word-of-mouth through social networks requires identifying and targeting connected users who are most susceptible to spillover, a phenomenon where the influence of recommendations extends beyond the immediate audience to impact their connected users. The probability of spillover varies across individuals, and their connections, leading to heterogeneity. Understanding and accurately estimating the spillover probabilities among users in social networks is crucial for improving the effectiveness of stimulated word-of-mouth. To address this, we present a novel contextual multi-armed bandit framework that learns individual spillover probabilities and ranks connected users to maximize rewards from stimulated word-of-mouth. Experiments on real-world network datasets demonstrate that accounting for spillover heterogeneity enhances the targeting precision of top-$k$ connected users, boosting rewards and outperforming baseline methods that do not learn individual spillover effects.

19.
arXiv (CS.LG) 2026-06-16

Probing Dec-POMDP Reasoning in Cooperative MARL

arXiv:2602.20804v2 Announce Type: replace Abstract: Cooperative multi-agent reinforcement learning (MARL) is typically framed as a decentralised partially observable Markov decision process (Dec-POMDP), a setting whose hardness stems from two key challenges: partial observability and decentralised coordination. Genuinely solving such tasks requires Dec-POMDP reasoning, where agents use history to infer hidden states and coordinate based on local information. Yet it remains unclear whether popular benchmarks actually demand this reasoning or permit success via simpler strategies. We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes to audit the behavioural complexity of baseline policies (IPPO and MAPPO) across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning. Reactive policies match the performance of memory-based agents in over half the scenarios, and emergent coordination frequently relies on brittle, synchronous action coupling rather than robust temporal influence. These findings suggest that some widely used benchmarks may not adequately test core Dec-POMDP assumptions under current training paradigms, potentially leading to over-optimistic assessments of progress. We release our diagnostic tooling to support more rigorous environment design and evaluation in cooperative MARL.

20.
arXiv (CS.CV) 2026-06-19

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

Authors:

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

21.
arXiv (CS.CL) 2026-06-11

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. Mixture combining both at a ratio of 1:1 achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.

22.
medRxiv (Medicine) 2026-06-17

Low-Density Lipoprotein Cholesterol and Dementia Risk: Integrating Mendelian Randomization and Target Trial Emulation Within the Heart-Brain Axis

Background: The heart-brain axis links cardiovascular and neurodegenerative disease through shared vascular and inflammatory mechanisms. Although low-density lipoprotein cholesterol (LDL-C) is an established causal factor in atherosclerotic cardiovascular disease (ASCVD), its relationship with dementia remains uncertain, with midlife elevations associated with increased risk but late-life associations often appearing null or inverse. To address this cholesterol paradox, we integrated mendelian randomization (MR) with an active-comparator new-user target trial emulation. Methods: We applied a triangulated causal inference framework integrating two-sample MR with observational target trial emulation. Genetic variants associated with LDL-C were used as instrumental variables to evaluate Alzheimer disease (AD), dementia with Lewy bodies (DLB), frontotemporal dementia (FTD), and any dementia (AnyDem), with causal estimates derived using inverse-variance weighted models and sensitivity analyses for heterogeneity and pleiotropy. In parallel, an active-comparator new-user design compared statin versus ezetimibe initiation among adults aged 60 years or older using propensity score (PS) overlap weighting and Cox proportional hazards models to evaluate cardiovascular and dementia outcomes. Results: Genetically predicted LDL-C was associated with increased risk of DLB (OR 1.65, 95% CI 1.30-2.10; p

23.
arXiv (CS.AI) 2026-06-15

FlexMS: A Unified Public Benchmark for Molecule Tandem Mass Spectrum Prediction

arXiv:2602.22822v3 Announce Type: replace Abstract: Tandem mass spectrometry (MS/MS) is central to small molecule identification, but current deep learning systems for spectrum prediction still remain difficult to evaluate and deploy in practice. While novel architectures constantly claim state-of-the-art performance, inconsistent metadata conditioning and entangled preprocessing pipelines hinder fair architectural comparisons. Besides, existing evaluations are often restricted to curated datasets, failing to capture the heterogeneity and cross-domain shifts of real-world metabolomics. Furthermore, current benchmarks lack difficulty-aware diagnostics and leave blind to how models behave under specific compute or data constraints. To address this, we present FlexMS, a modular public-data benchmark framework that standardizes MS/MS prediction across public resources while keeping molecular encoders, metadata conditioning, predictor heads, and downstream retrieval under one protocol. FlexMS establishes a fair evaluation playground which significantly lowers the barrier for integrating new predictive tools. Rather than solely optimizing for average scores, FlexMS augments aggregate accuracy with difficulty-aware diagnostics, providing actionable guidance on model selection across different compute constraints, data scales, and downstream retrieval objectives. Ultimately, FlexMS provides the community with a reproducible standard to identify which algorithmic conclusions are stable and which operating points are most viable in practice.

24.
arXiv (math.PR) 2026-06-16

Exponential Convengence of DLRA for SDEs

arXiv:2606.15843v1 Announce Type: new Abstract: We study dynamical orthogonal (DO) approximations of stochastic differential equations and investigate their long-time behaviour. The DO formulation represents the solution by a low-rank decomposition and leads to a coupled system consisting of an evolution equation on the Stiefel manifold and a reduced stochastic process. We establish the well-posedness of the strong DO system and derive quantitative error estimates between the original stochastic differential equation and its low-rank approximation in the Wasserstein distance. Our main contribution is the analysis of invariant probability measures for the DO dynamics. Under suitable dissipativity, Lipschitz continuity, and non-degeneracy assumptions on the coefficients, we prove the existence of an invariant probability measure for the strong DO system. The proof combines uniform moment estimates, a Krylov–Bogoliubov argument for an associated frozen system, and a Kakutani-Fan-Glicksberg fixed-point theorem to recover the self-consistent dynamics. We further show that the induced low-rank process admits an invariant probability measure and discuss the structure of invariant measures through several illustrative examples. These results provide a rigorous foundation for the use of dynamical low-rank approximations in the approximation of long-time statistical properties of stochastic dynamical systems.

25.
arXiv (CS.CL) 2026-06-19

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility – Semantic Metrics and Convergence Analysis

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use BGE-large cosine similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption; LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.