Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
medRxiv (Medicine) 2026-06-18

Looked but didn't see: inattentional blindness and yes-bias confabulation in vision-language models

Previous work showed that many participants fail to notice a gorilla in a video of people playing basketball. Another study found that 83% of trained radiologists failed to report a gorilla figure inserted into a chest CT nodule-search task, even though eye-tracking revealed that most observers had foveated the figure. We ask whether a similar phenomenon exists in contemporary vision-language models (VLMs). We find that (i) VLMs are capable of spotting the gorilla in both still-frame images and videos of lung CT scans; (ii) models display inattentional blindness, which varies according to model generation and type of stimulus presented; (iii) Gemini-3.1-Pro outperforms most other flagship and open-weight VLMs at identifying the presence or absence of the gorilla. We additionally ran a segmentation experiment utilizing two different model classes: a generalist (SAM 3), which found the gorilla but produced little to no results for anatomy-based prompts; a medical specialist (BiomedParse), which produced more promising anatomy-based results but flagged "gorilla" on gorilla-free control videos on 82% of frames. The behavioral signature of inattentional blindness reproduces in VLMs, but a unique confabulation failure mode means that any "did the model see X" claim requires signal-detection analysis with a matched-control false-alarm baseline.

02.
arXiv (CS.LG) 2026-06-19

Statistical Properties of Training & Generalization

arXiv:2606.20299v1 Announce Type: cross Abstract: Deep learning has managed to evade numerous intuitions from classical statistics to achieve unprecedented performance on a number of real-world tasks. In this article, we investigate the key features and surprises of deep learning from a physics-informed perspective, taking care to point out and justify where possible the many choices inherent in constructing a deep learning model. In particular, we review the phenomenon of neural scaling laws and discuss their interplay with the constraints and inductive biases which may be present when applying machine learning to problems in physics.

03.
arXiv (CS.LG) 2026-06-11

Robustness of Mixtures of Experts to Feature Noise

arXiv:2601.14792v2 Announce Type: replace Abstract: Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.

04.
arXiv (CS.CL) 2026-06-16

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.

05.
medRxiv (Medicine) 2026-06-18

Cardiac rhythm development: A wearable device index of risk for physical and mental illness in adolescence

Objective. The autonomic nervous system, which regulates cardiac rhythm, undergoes pronounced maturation across adolescence. How cardiac rhythm develops over this period, however, and whether individual differences in its development forecast mental and physical illness, remain open questions. We used three waves of Fitbit data from the Adolescent Brain Cognitive Development (ABCD) Study to characterize the developmental trajectory of the cardiac rhythm and to test whether variation in that trajectory predicts onset of psychopathology and cardiometabolic disease. Methods. 8,301 adolescents contributed 242,811 valid Fitbit wear days across Waves 2 (Mage=12), 4 (Mage=14), and 6 (Mage=16). Cosinor mixed-effects models yielded three rhythm parameters per session: mesor (24-hour mean), amplitude (diurnal swing), and acrophase (peak timing). We first characterized age- and sex-specific trajectories, cross-wave stability, and factors shaping the rhythm. We then used parallel-process latent growth models to test whether within-person changes in rhythm tracked symptom trajectories, and hierarchical logistic models to test whether rhythm parameters predicted the first clinical onset of psychopathology and of obesity and hypertension. Results. The cardiac rhythm changed substantially across adolescence: mesor decreased, amplitude flattened, and acrophase shifted later. Within-person change in the rhythm tracked change in blood pressure, BMI, and trajectories of depression and ADHD symptoms. Higher mesor predicted incident onset of all five outcomes controlling for demographics, baseline symptoms, and behavior (ORs 1.36-1.54); amplitude, acrophase, and rhythm instability conferred additional risk. Conclusions. The 24-hour cardiac rhythm is a passively measurable substrate of adolescent autonomic development that indexes transdiagnostic risk for psychiatric and cardiometabolic illness.

06.
arXiv (CS.CL) 2026-06-19

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

07.
arXiv (CS.CL) 2026-06-11

The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents

作者:

Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian administered to two frontier models from different developers, ChatGPT 5.2 and Claude Opus 4.5. Despite identical source material and parallel query structures, both models diverged along the same axis: Russian-language outputs leaned toward delegitimizing framings, characterizing civil society actors as externally funded elites constraining a democratic mandate, while Ukrainian-language outputs treated the same actors as legitimate stakeholders in democratic contestation. The magnitude of this divergence, however, was model-dependent. ChatGPT's Russian output reproduced vocabulary characteristic of Russian state discourse; Claude Opus's stayed in a mainstream critical idiom and hedged its judgments in both languages. These findings demonstrate that prompt language alone can systematically shift the ideological orientation of an unchanged model analyzing identical content. The shift is a general property of multilingual LLMs whose severity, and whose alignment with propaganda narratives, varies across systems. The implications reach AI deployment in polarized information environments, cross-lingual research, and AI governance in multilingual societies.

08.
arXiv (quant-ph) 2026-06-12

Quantum Error Correction Codes for Truncated SU(2) Lattice Gauge Theories

作者:

arXiv:2511.13721v2 Announce Type: replace Abstract: We construct two quantum error correction codes for pure SU(2) lattice gauge theory in the electric basis truncated at the electric flux $j_max=1/2$, which are applicable on quasi-1D plaquette chains, 2D honeycomb and 3D triamond and hyperhoneycomb lattices. The first code converts Gauss's law at each vertex into a stabilizer while the second only uses half of the vertices and is locally the carbon code. Both codes are able to correct single-qubit errors. The electric and magnetic terms in the SU(2) Hamiltonian are expressed in terms of logical gates in both codes. The logical-gate Hamiltonian in the first code exactly matches the spin Hamiltonian for gauge singlet states found in previous work.

09.
arXiv (CS.CL) 2026-06-19

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

10.
arXiv (CS.CL) 2026-06-16

QK-Normed MLA: QK normalization without full key caching

Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.

11.
arXiv (quant-ph) 2026-06-11

Compressed minimum-purity time evolution for late-time quantum dynamics

arXiv:2606.11392v1 Announce Type: cross Abstract: Unitary time evolution of initially simple quantum many-body states rapidly generates entanglement and complex correlations, which limits direct numerical simulations. The late-time dynamics of physical observables, however, typically exhibits an effective simplicity in the form of hydrodynamics or kinetic theory. This leads to the question whether microscopic equations of motion can remain accurate and tractable up to long time scales by discarding irrelevant information in a controlled manner. Here, we introduce compressed minimum-purity time evolution (CoMPuTE) as an approach to keep track of a consistent set of reduced local density matrices, closing the hierarchical equations of motion using a minimum-purity principle. In benchmark applications we demonstrate (i) accurate description of energy diffusion in the one-dimensional mixed-field Ising model, (ii) the applicability to genuinely out-of-equilibrium Floquet dynamics starting from a pure state, and (iii) the limitations of the local reduced density matrix approximation when describing transport in the XXZ chain at $\Delta=1$ that is governed by increasingly non-local integrals of motion. The CoMPuTE method enhances computational efficiency in comparison to the closely related local-information time evolution algorithm, opening a possible route towards an extension to systems in higher spatial dimensions.

12.
medRxiv (Medicine) 2026-06-16

Ranking-optimized survival models can underperform fixed-horizon clinical prediction: a SUPPORT2 reanalysis of machine learning, attending-physician judgment, and the original SUPPORT model at 60- and 180-day mortality

Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are almost always selected and reported using the concordance index, a ranking metric averaged over follow-up. Yet most bedside decisions hinge on a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models remain competitive at fixed clinical horizons against two reference points clinicians actually rely on: unaided attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults from five United States centers, 1989-1994) under a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physician's recorded prognosis, and the 1995 model at 60 and 180 days, alongside several alternative learners. The survival model achieved competitive ranking concordance (0.705) yet underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, against 0.808 for physicians on the matched sample and 0.827 for the 1995 model, a gap that held across eight independent data splits and remained statistically reliable after multiplicity correction. The shortfall was not miscalibration, since post-hoc recalibration left discrimination unchanged, nor limited capacity, since neural networks, a deep ranking model, and two timepoint-aware discrete-time models also failed to close it; replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap, pointing to an objective-horizon mismatch. Discrimination was equitable across sex, race, and age, but leave-one-disease-out validation exposed severe failure for disease groups absent from training, and the physician advantage was conditional on a physician electing to provide an estimate. We recommend reporting timepoint-specific discrimination alongside concordance, timepoint-matched training when fixed-horizon predictions drive care, leave-one-subgroup validation, and distribution-free prediction intervals to support selective deployment.

13.
arXiv (CS.AI) 2026-06-12

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

arXiv:2606.13602v1 Announce Type: new Abstract: We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT\&Tag/CUT\&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0\% (143/318 attempts; 95\% confidence interval (CI), 36.3–53.7), followed by GPT-5.5 / OpenAI Codex at 39.9\% (127/318 attempts; 95\% CI, 31.6–48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0\% (124/318 attempts; 95\% CI, 30.2–47.8 and 31.0–47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.

14.
arXiv (CS.CL) 2026-06-16

From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.

15.
arXiv (CS.AI) 2026-06-15

Position: Align AI to Our Aspirations, Not Our Flaws

arXiv:2606.13755v1 Announce Type: cross Abstract: We argue that aligning AI to aggregated human preferences is the wrong target. With current technology, one can train AIs to share the values of a Silicon Valley techno-optimist, a degrowth environmentalist, a national-conservative culture warrior, a single-party state cadre, or a devout religious traditionalist. We should not. Human values produce societies that thrive or fail on the merits of those values - from failed states and extreme inequality to declining happiness, political polarization, and government dysfunction in the world's wealthiest democracies. The pluralistic-alignment program correctly diagnoses that there is no single "humanity" to align with, but is dangerous if taken as the main directive. We argue that AI should be trained to a non-negotiable floor of objective alignment goals - competence, bounded by the constraints of factual accuracy, honesty, and lawfulness and that pluralism belongs at the surface (language, register, conventions, missing-context defaults) and across the wide band of legitimate value tradeoffs that respect the floor, but not at the level of values that violate it. We highlight the empirical reality of unfiltered pluralistic values, propose four commitments as a constructive alternative, and engage six credible objections: commercial pressure and practical feasibility, democratic legitimacy, regulatory compliance, over-reliance on institutionalist explanations, the charge that the floor itself is culturally laden, and the limits of Coherent Extrapolated Volition.

16.
arXiv (CS.CL) 2026-06-11

Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection

Stance detection requires identifying an author's position toward a target, often from short-form texts where stance is implicit, indirect, or rhetorically framed. Although large language models (LLMs) achieve strong performance on this task, single-pass prompting can be brittle when multiple interpretations are plausible. Existing aggregation strategies, such as majority voting or self-consistency, improve robustness by combining labels, but they discard the intermediate reasoning needed to resolve conflicting interpretations. We introduce a multi-agent reasoning framework with adaptive worker allocation for stance detection that shifts aggregation from label-level voting to reasoning-level synthesis. The framework employs a Manager-Worker architecture in which a Manager adaptively allocates a variable number of Worker agents based on input complexity. Each Worker analyzes the input from a distinct perspective and produces a reasoning-only explanation without emitting a stance label; the Manager then synthesizes these explanations to produce the final prediction. We evaluate the proposed framework on SemEval-2016, P-Stance, and COVID-19 Stance using Llama, Mistral, and Gemini. Results show that the framework yields the largest gains on implicit and context-dependent stance cases, achieving 86.07 Macro-F1 on COVID-19 and 82.90 on SemEval-2016, while remaining competitive on more explicit stance datasets such as P-Stance. These findings suggest that adaptive reasoning-level aggregation is most beneficial when stance cannot be reliably inferred from surface cues alone.

17.
arXiv (CS.CV) 2026-06-19

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1–100x, pseudo-labels across thresholds 0.90–0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

18.
arXiv (CS.AI) 2026-06-17

Temporal Preference Optimization for Unsupervised Retrieval

arXiv:2606.17664v1 Announce Type: cross Abstract: Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture the temporal relevance, retrieving semantically related but temporally misaligned documents-an important aspect when a document collection spans multiple time periods (e.g., retrieving documents from 2018-2025 for "Who is the president in 2019?" introduces temporal ambiguity). Existing methods rely on supervised training with explicit timestamps, which are not always feasible. We propose TPOUR (Temporal Preference Optimization for Unsupervised Retriever), which uses our novel training method Temporal Retrieval Preference Optimization (TRPO). TRPO reinterprets preference learning in the temporal dimension, guiding the retriever to favor temporally aligned documents. TPOUR further generalizes to unseen time periods via interpolation in a learned time embedding, enabling continuous temporal alignment. Experiments on temporal information retrieval (T-IR), TPOUR outperforms both unsupervised and supervised baselines. Compared to Qwen-Embedding-8B, despite being about 72.7x smaller, TPOUR Contriever improves average nDCG@5 by +4.04 (+12.15%) on explicit and +4.98 (+15.21%) on implicit queries. We provide our code at https://github.com/agwaBom/TPOUR.

19.
arXiv (CS.CV) 2026-06-11

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

作者:

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.

20.
arXiv (CS.CV) 2026-06-18

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

21.
arXiv (CS.CV) 2026-06-18

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

22.
arXiv (CS.AI) 2026-06-11

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

arXiv:2604.20348v2 Announce Type: replace-cross Abstract: Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.

23.
arXiv (CS.AI) 2026-06-15

When Sample Selection Bias Precipitates Model Collapse

arXiv:2606.13732v1 Announce Type: new Abstract: The proliferation of recursive training on synthetic data can alleviate data scarcity but risks model collapse, where repeated training erodes distributional tails and homogenizes outputs. Data selection is widely viewed as a remedy, yet its reliability depends critically on the reference distribution used by the verifier. We show that in low-resource verification regimes, where each verifier observes only a small, fragmented, and biased slice of the target manifold, selection itself becomes biased. This situation naturally arises in low-resource data silos such as healthcare consortia or proprietary financial institutions, where raw data cannot be pooled and local references are inherently incomplete. As a result, selection preferentially retains samples aligned with the local manifold while pruning globally relevant tail modes, turning from a safeguard against collapse into a mechanism that precipitates it. We theoretically prove that such siloed selection accelerates collapse and induces power-law diversity decay. As an initial mitigation, we construct Wasserstein proxy references from multiple silos without sharing raw data. Empirical results confirm that local-reference selection fails on skewed distributions, whereas collaborative proxy references mitigate diversity degradation, suggesting that recursive synthetic-data pipelines require particular caution when real-data coverage is fragmented or scarce.

24.
bioRxiv (Bioinfo) 2026-06-18

Calculation of sequence space coverage in a mutagenesis library

Directed evolution requires screening of large mutagenesis libraries, but accurate calculation of library sizes needed to discover functional variants remains challenging. Existing models provide baseline estimates, yet current computational approaches for finding the best variants scale poorly with library complexity. Here, we introduce a scalable algorithmic framework to compute exact discovery probabilities in saturation mutagenesis libraries with no requirement for explicit sequence enumeration. By aggregating variants into a composition log–sum distribution and applying log-space convolution across randomisation blocks, it is possible to extend this to massive sequence spaces and mixed codon schemes. By inverting these calculations, absolute mathematical ceilings for experimental design are established. Ultimately, this framework provides a rapid, quantitative tool to balance the statistical coverage-diversity trade-off within the limitations of laboratory screening. Finally, this is implemented as an open-source web application (SSCC) that allows researchers to construct heterogeneous library designs and compute required sampling depths, coverage probabilities, and absolute randomisation limits.

25.
bioRxiv (Bioinfo) 2026-06-11

GermRL: Alleviating The Germline Bias In Autoregressive Antibody Language Models Through Reinforcement Learning

Antibodies are powerful therapeutics whose antigen specificity arises from sequence diversity shaped during development. Recently, language models trained on large antibody repertoire datasets have enabled the generation and screening of novel candidates, but these models retain a strong germline bias. As AI adoption increases in therapeutic workflows, it is crucial to develop models that harness the diversity of antibodies necessary for the discovery of mutations that encode desirable properties. Previous work explored the germline bias in masked antibody language models, yet the bias in generative autoregressive language models has not yet been addressed. Here, we present GermRL, a lightweight and modular reinforcement learning (RL) framework capable of alleviating the germline bias in pre-trained antibody autoregressive language models through group relative policy optimization (GRPO). GermRL achieves consistent one-shot generation of antibodies that satisfy specified mutation thresholds from germline while maintaining structural plausibility. Under the lowest and highest mutation thresholds tested (5 and 35 mutations from germline), GermRL scores 0.992 and 0.950 pass@1, respectively, compared to 0.398 and 0.034 for the pre-trained language model. Within GermRL, we introduce a key pair of modifications to GRPO that increase training efficiency by discouraging reward hacking under our antibody application. Furthermore, comparison of RL generated and natural antibody sequences reveals how RL based optimization can explore alternative evolutionary mutational patterns and residue compositional strategies while preserving key global properties of natural antibodies, including identifiable germline assignments, embedding-level similarity and comparable developability profiles. Thus, RL-trained generative models optimized to promote antibody mutations through diversity from germline provide a promising framework for navigating the antibody sequence landscape, enabling exploration of novel yet biologically plausible candidates for therapeutic design.