Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
bioRxiv (Bioinfo) 2026-06-11

A high-quality chromosome-scale reference genome assembly for Asparagus racemosus var. CIM-Shakti (Shatavari), a medicinal plant of Ayurvedic importance

Asparagus racemosus Wild., commonly known as Shatavari, is an important medicinal plant in Ayurveda and is valued for its steroidal saponins, particularly shatavarin compounds, which contribute to its adaptogenic, galactagogue, immunomodulatory, and therapeutic properties. Despite its medicinal and economic importance, genomic resources for this species have remained limited, restricting molecular breeding, pathway discovery, and comparative evolutionary studies within Asparagaceae. Here, we report a high quality chromosome scale reference genome assembly of A. racemosus var. CIM Shakti generated using PacBio HiFi long read sequencing and Omni C chromatin conformation scaffolding. The pseudo haploid assembly spans 817 Mb across 53 scaffolds, with a scaffold N50 of 98.50 Mb, L50 of 5, and a largest scaffold of 113.80 Mb. Ten major chromosome scale pseudomolecules were resolved, corresponding to the haploid chromosome complement of A. racemosus. The assembly showed high gene space completeness, with BUSCO completeness of 99.8% against the Eukaryota dataset and 98.0% against the Embryophyta dataset. BlobToolKit profiling further supported assembly quality, with GC content of approximately 39 to 40% and no major evidence of contamination. EDTA based repeat annotation identified 580.93 Mb of interspersed repetitive elements, accounting for 71.06% of the 817.57 Mb genome assembly. The repeat landscape was dominated by LTR retrotransposons, particularly Gypsy elements, which accounted for 25.01% of the assembly, followed by unclassified LTR elements at 26.58% and Copia elements at 4.84%. Structural and functional annotation identified 29,199 protein coding genes represented by 29,199 transcript models, 138,433 exons, and 125,201 CDS features. The annotation was structurally robust, with an average gene length of 4,605.1 bp, 4.74 exons per transcript, and 97.80% of transcripts containing multiple exons. The CIM Shakti reference genome provides a foundational genomic resource for investigating steroidal saponin biosynthesis, sex chromosome evolution, repeat driven genome expansion, and comparative genomics in Asparagaceae. This assembly will support future studies on medicinal trait improvement, conservation genomics, and genomics assisted breeding of climate resilient Shatavari cultivars.

02.
medRxiv (Medicine) 2026-06-12

Home-based binocular serious games in virtual reality to treat visual acuity and stereovision in residual amblyopia: AMBER study

Objectives: Amblyopia is a pediatric visual disorder traditionally treated by patching the fellow eye, though many patients retain residual amblyopia post-treatment. Increasing evidence suggests that visual plasticity allows treat-ment beyond the classical therapeutic window. AMBER evaluated the efficacy of binocular serious games in virtual reality (VR) in residual amblyopia. Methods and Analysis: The monocentric, prospective, randomized, crossover trial (reported as case series) includ-ed 14 anisometropic, strabismic, or mixed residual amblyopia patients (6-35 years; 5 children, 9 adults). Participants underwent two 2-month intervention phases: optical correction (standard care) and standard care plus VR games (2.5 h/week), each with a 2-month follow-up. Best-corrected visual acuity (BCVA), stereoacuity, and reading speed were assessed (5 timepoints) using the Sloan and Landolt charts, the Titmus, TNO, Lang II, Asteroid, and Mnread tests. Compliance and adverse events (AE) were recorded. Results: VR training improved BCVA in 10 amblyopic eyes (Landolt and Sloan), with more pronounced effects in anisometropic patients. Six patients showed improved stereoacuity (Titmus; 4x mixed, 1x anisometropic, 1x stra-bismic amblyopia), persistent only in children (1x strabismic, 1x mixed amblyopia). Four improvements were ob-served with TNO (1x), Lang II (1x), Asteroid (0x), and MNread (1x). Despite positive trends, when comparing re-sults of individual patients, between both eyes, and with standard treatment, consistency of improvements cannot be conclusively demonstrated. One non-severe AE (dizziness) was reported. Conclusions: Following individual cases, VR training improved BCVA and stereoacuity, particularly in children and patients with high compliance. However, considering the cohort as a whole, consistency of effects has to be confirmed in larger groups. Thus, the methodologically sophisticated AMBER study revealed differences in VR treatment efficacy between amblyopia types, children/adults, endpoints and tests, offering precious data for the design of meaningful future studies. It shows that neurovisual plasticity gauged by VR-games offers safe, engaging treatment options for residual amblyopia.

03.
PLOS Medicine 2026-06-04

Beyond associations: Navigating the safety of non-steroidal anti-inflammatory drugs (NSAIDs) in early pregnancy

by Andrew S. C. Yuen, Kenneth K. C. Man Pain and fever in pregnancy require treatment, but fetal safety concerns complicate analgesic choice. A recent PLOS Medicine study presents new evidence on the safety of first-trimester NSAID use and congenital malformation risk, but interpreting findings across studies is challenging. In this Perspective, Kenneth Man and Andrew Yuen highlight a recent PLOS Medicine study that presents new evidence on the safety of first-trimester NSAID use and congenital malformation risk, but discuss why interpreting findings across studies is challenging.

04.
arXiv (quant-ph) 2026-06-12

Testing the problem of time with cold atoms

arXiv:2509.07745v3 Announce Type: replace-cross Abstract: We realize a cold-atom system to quantitatively test relational constructions of time. A well-isolated atomic Bose-Einstein condensate evolves in a conservative trap that is partitioned by a thin optical barrier into an observed and unobserved sector, with negligible dissipation on the experimental timescale. Motivated by relational-time approaches discussed in the Wheeler-DeWitt framework, we ask whether the dynamics of the observed sector can be ordered using only internal degrees of freedom. To this end, we construct an entropic time from an experimentally defined coarse-grained entropy, and demonstrate that it can robustly order the events in the observed sector across repeated cycles of expansion and recollapse. We finally derive an effective Schroedinger equation parameterized by this internal time and show that it is able to reproduce the measured evolution. These results establish a controlled experimental setting in which relational-time constructions can be quantitatively tested.

05.
arXiv (quant-ph) 2026-06-16

High-dimensional coherence to entanglement transduction under canonical noise

arXiv:2606.16695v1 Announce Type: new Abstract: We develop an analytical framework for coherence-to-entanglement conversion in bipartite high-dimensional quantum systems, so-called qunits. An arbitrary coherent input qunit is coupled to an incoherent ancilla through a generalized controlled-shift operation, producing a maximally correlated bipartite state. By analyzing the partial transpose of the output state, we establish an exact dimension-independent connection between the input coherence and the generated entanglement. We then study how this conversion is affected by three standard noise processes applied after the conversion step: phase damping, global depolarizing noise, and independent amplitude damping. The resulting expressions show that these channels degrade entanglement in qualitatively different ways. Phase damping leads to a uniform attenuation of the entanglement generated from coherence, depolarizing noise introduces pairwise thresholds associated with entanglement sudden death, and amplitude damping produces an asymmetric decay governed by relaxation toward the ground state. For maximally coherent inputs, the general results reduce to simple closed-form behavior, allowing direct comparison of the three noise mechanisms as the system dimension increases. In particular, global depolarizing noise exhibits a dimension-dependent sudden-death threshold, while amplitude damping leads to a smooth suppression in the maximally coherent case. These results provide useful analytical benchmarks for high-dimensional resource conversion and for assessing noisy entanglement generation in qudit-based quantum-information settings.

06.
arXiv (CS.CV) 2026-06-17

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method positively improves zero-shot action recognition performance over the baseline, highlighting the influence of a heterogeneous deliberation step, showing that the gain stems from decorrelated model priors rather than from additional compute.

07.
arXiv (CS.CV) 2026-06-18

Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I$^2$SB has potential for real-time or clinical deployment.

08.
arXiv (CS.CV) 2026-06-16

Interpolation between Convolution and Attention via K-Nearest Neighbors

作者:

The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.

09.
arXiv (CS.CV) 2026-06-19

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.

10.
arXiv (CS.AI) 2026-06-16

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

arXiv:2605.29874v2 Announce Type: replace-cross Abstract: Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

11.
arXiv (CS.CL) 2026-06-24

Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step-by-step decision semantics needed to keep the rationale causally connected to the planned motion. We introduce Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. Our key observation is that rule-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step. Each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post-hoc alignment. On our simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and 10.13% to 5.99% under eight-camera perception. Neuro-Symbolic Drive thus converts neuro-symbolic planning logic into structured supervision. Code base: https://github.com/XiangboGaoBarry/Neural-Symbolic-Drive.

12.
arXiv (CS.AI) 2026-06-16

Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

arXiv:2606.16010v1 Announce Type: cross Abstract: Large language models have achieved impressive performance on reasoning tasks spanning mathematics, science, programming, and commonsense inference. Despite these advances, their reasoning processes remain largely latent, making them difficult to interpret, verify, replay, debug, and transfer across domains. Existing approaches such as chain-of-thought, tree-of-thoughts, graph-of-thoughts, and tool-augmented reasoning expose intermediate reasoning artifacts but typically lack explicit execution semantics, formal state representations, and verifiable reasoning structures. We introduce Theorem-Grounded Execution Ontologies (TGEO), a framework that models reasoning as an executable state-transition process rather than a sequence of generated tokens. Given an input problem, TGEO identifies relevant theorem families, binds the problem to a domain ontology, discovers semantic objects, instantiates states and operators, constructs predicates and contracts, and synthesizes an executable reasoning graph. The resulting graph provides an interpretable, replayable, and auditable representation of reasoning in which every state transition, operator application, and validation step is explicitly represented. TGEO integrates five architectural components: (1) theorem-grounded reasoning priors, (2) executable ontologies, (3) operator-mediated state transitions, (4) predicate and contract-based execution validation, and (5) architectural auditing and failure localization. We evaluate TGEO on theorem-intensive reasoning tasks derived from mathematical benchmark domains and a curated Golden Execution Suite. Our findings demonstrate the value of executable reasoning representations for interpretable, verifiable, and reproducible AI reasoning systems.

13.
bioRxiv (Bioinfo) 2026-06-14

FENNEC: Fine-Tuned Ensemble Neural Networks Accelerate Chemically Modified siRNA Design and Screening

Small interfering RNAs (siRNAs) are a clinically validated therapeutic modality, yet designing potent chemically modified siRNAs remains a costly and iterative process, limited by scarce public data. Computational prediction of siRNA efficacy is therefore essential for rational design and accelerated preclinical development. However, despite the critical role of chemical modifications in therapeutic performance, current state-of-the-art machine learning methods either are not designed to model the chemical diversity of therapeutic siRNAs, or exhibit poor generalization performance. Here, we present FENNEC (Fine-Tuned Ensemble of Neural Networks for siRNA Efficiency Characterization), a machine-learning framework for predicting siRNA activity across chemically diverse design spaces. To support this effort, we curated the largest patent-derived dataset to date of chemically modified siRNAs from 42 patents using OCR-based table extraction and stringent filtering. FENNEC combines temporal convolutional networks with thermodynamic descriptors, experimental covariates, and embeddings from RNA foundation models to capture both local chemical determinants and broader target-context information. Importantly, we show that language-model-derived embeddings provide meaningful higher-order representations of target transcripts, particularly in data-scarce settings. FENNEC achieved robust predictive performance across both gene-level and scaffold-level validation settings, with additional experimental validation on a novel AHSA1-targeting dataset further supporting its generalizability across chemically modified siRNAs. In benchmarking, FENNEC outperformed classical machine-learning and state-of-the-art deep learning models, demonstrating generalization to unseen chemistry. Model interpretation recovered established design principles, including position-specific effects of glycol nucleic acid, 2'-fluoro modifications, and phosphorothioate backbones. Furthermore, in silico perturbation analyses suggest that FENNEC can serve not only as a predictive model, but also as an oracle for the design and optimization of chemically modified siRNAs. Together, our work addresses a key gap in the field by enabling chemically aware deep learning for siRNA design, supported by a large and diverse collection of chemically modified siRNA measurements.

14.
arXiv (CS.LG) 2026-06-24

Verifiable Foundation Models for Robot Safety

arXiv:2606.23754v1 Announce Type: cross Abstract: Deploying foundation models for robot control raises a central challenge: the expressive power that enables rich, multimodal perception also makes these models opaque and difficult to analyze formally, rendering them intractable for existing verification tools. In this paper, we present FEARL (Foundation-Enabled Assured Robot Learning), a framework that addresses this tension through a modular architectural decomposition. FEARL separates the policy into a large Controller (C) responsible for high-dimensional perception and task reasoning, and a small Safety module (S) that receives low-dimensional observations from dedicated safety sensors together with a bounded context embedding from C and produces the final action. Since many robot safety requirements, such as collision avoidance and workspace boundary constraints, can be expressed over these safety sensor observations, formal verification can be applied to S rather than to the full foundation-model backbone. This makes formal analysis tractable with existing tools while preserving the Controller's expressive power for task reasoning. To show that the decomposed policy remains capable of solving diverse tasks, we evaluate FEARL on three simulated robotic domains using multiple Controller backbones and training procedures, including pretrained off-the-shelf vision-language-action models. We further transfer the learned policy from one of our simulated tasks to a physical robot, suggesting that the low-dimensional safety interface supports practical sim-to-real transfer.

15.
arXiv (CS.LG) 2026-06-19

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

arXiv:2606.19883v1 Announce Type: new Abstract: We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human centric decision making model. To capture human preferences, we use cumulative prospect theory (CPT) that weighs the actions of the agent in a nonlinear fashion using a ($\alpha$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight distorted rewards and obtain a player optimal regret of $\mathcal{O}(K\log T \left(\frac{1}{\Delta}\right)^{2/\alpha})$, where $K$ denotes the number of arms, $T$ is the learning horizon, and $\Delta$ represents (suitably defined) players' minimum preference gap. Noticing the dependence on $\Delta$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and achieves an improved (optimal) regret guarantees in the setting where the number of arms $K$ is significantly larger than the number of players $N$. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as risk sensitive measure in both settings where the total corruption budget is known and where it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.

16.
arXiv (CS.CL) 2026-06-11

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.

17.
arXiv (CS.AI) 2026-06-11

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

arXiv:2606.11440v1 Announce Type: new Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.

18.
arXiv (CS.AI) 2026-06-15

Safety-Contract Graph Multi-Agent Reinforcement Learning for Autonomous Network Security Response

arXiv:2606.13832v1 Announce Type: cross Abstract: Autonomous network-security response systems promise to reduce Security Operations Centre (SOC) reaction latency, but reward-only multi-agent reinforcement learning (MARL) can improve security reward while remaining non-deployable. We present a safety-contract graph MARL framework and instantiate it as ACD$^3$-GAT (Adaptive Constrained Counterfactual Decisioning with a Graph Attention Network encoder), an architecture that separates simulator observations from reusable operational budgets, constrained optimization, graph state encoding, and counterfactual action screening. We evaluate the method in CAGE Challenge 4, where agents operate under budgets for Mean Time to Recover (MTTR), false-positive response, and firewall change-management disruption. Across the benchmark, every unconstrained method violates the SOC downtime budget in 100% of evaluated episodes, with mean downtime proxy costs of 311-430 against a budget of 50. This complements prior CAGE Challenge 4 findings by showing that reward-only learning lacks operational discipline. Constrained MAPPO-GAT (C-MAPPO-GAT) isolates Lagrangian operational-cost control and budget-aware screening, while ACD$^3$-GAT adds budget context, CVaR tail-risk estimation, opponent-belief state, and Graph Counterfactual Risk Propagation (G-CRP). The replicated comparison includes three 200-episode seeds for IPPO, MAPPO-GAT, C-MAPPO-GAT, and ACD$^3$-GAT. C-MAPPO-GAT reduces downtime violation from 100% to 0.3% and mean downtime cost from 355.4 to 15.5 relative to MAPPO-GAT. ACD$^3$-GAT reduces mean downtime cost to 48.2 with a 13.8% violation rate, placing it on the safety-contract frontier rather than at the most conservative compliance point. Topology-seed and coupled adaptive Red-process stress tests preserve this contrast and show lower worst adaptive degradation for safety-constrained policies than reward-only MAPPO-GAT.

19.
medRxiv (Medicine) 2026-06-18

Looked but didn't see: inattentional blindness and yes-bias confabulation in vision-language models

Previous work showed that many participants fail to notice a gorilla in a video of people playing basketball. Another study found that 83% of trained radiologists failed to report a gorilla figure inserted into a chest CT nodule-search task, even though eye-tracking revealed that most observers had foveated the figure. We ask whether a similar phenomenon exists in contemporary vision-language models (VLMs). We find that (i) VLMs are capable of spotting the gorilla in both still-frame images and videos of lung CT scans; (ii) models display inattentional blindness, which varies according to model generation and type of stimulus presented; (iii) Gemini-3.1-Pro outperforms most other flagship and open-weight VLMs at identifying the presence or absence of the gorilla. We additionally ran a segmentation experiment utilizing two different model classes: a generalist (SAM 3), which found the gorilla but produced little to no results for anatomy-based prompts; a medical specialist (BiomedParse), which produced more promising anatomy-based results but flagged "gorilla" on gorilla-free control videos on 82% of frames. The behavioral signature of inattentional blindness reproduces in VLMs, but a unique confabulation failure mode means that any "did the model see X" claim requires signal-detection analysis with a matched-control false-alarm baseline.

21.
arXiv (CS.CV) 2026-06-19

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

The systemic, metabolic, lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. To determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. Performance of DL ranged from AUROC= 0.5654-0.9480 for categorical and R2=-0.0291-0.7620 for continuous factors, outperforming most of the morphometry-machine learning models. Saliency-based score consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. It also aligned with present morphometric variations. Several saliency-based scores differed significantly between incident AD and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful risk-related structural changes mirroring the potential AD vulnerability.

22.
arXiv (CS.AI) 2026-06-16

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

arXiv:2605.22664v3 Announce Type: replace Abstract: LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

23.
medRxiv (Medicine) 2026-06-22

Burden of Cardiovascular Disease in Brazil, 1996-2023: A Retrospective Descriptive Study of the Epidemiology and Impact on Public Healthcare with Emphasis on Acute Myocardial Infarction

Background Cardiovascular diseases (CVD) are the leading cause of death worldwide, and their epidemiology is correlated with genetic predisposition, exposure to risk factors, sex, age, access to medical care, and other sociodemographic characteristics. Brazil is a developing country with a vast territory, which leads to structural inequalities. Estimates of CVD in Brazil, in its regions, and in its population are poorly evaluated and analysed. Methods We obtained CVD-related data from the Brazilian Unified Health System (SUS) and analysed mortality and morbidity from 1996 to 2023 by sex, race/ethnicity, age, and region. We calculated the risk of death from the most prevalent diseases, the average length of hospital stay, and the costs associated with heart transplantation. Findings In Brazil, acute myocardial infarction was the pathology that led to the highest number of deaths across all variables analysed during the evaluated period. Other CVD were also related to causes of death and morbidity, such as hypertensive diseases and heart failure. Interpretation Brazil presents a serious challenge to the public health system due to the high number of deaths and the progressive mortality rate. This study represents a fundamental contribution to the basis for formulating public health policies aimed at reducing the growing impact associated with these diseases. Funding CNPq, CAPES, FAPEMIG, INCT

24.
arXiv (CS.CV) 2026-06-16

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions, it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM by rewarding not only the final answer like standard RL but additionally for generating intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.

25.
arXiv (CS.CL) 2026-06-15

Spatio-Temporal Audio Language Modeling for Dynamic Sound Sources

Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.