Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CL) 2026-06-11

Can AI Agents Synthesize Scientific Conclusions?

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.

02.
medRxiv (Medicine) 2026-06-22

Sex-specific multimorbidity clusters and all-cause mortality in relatively healthy older adults: findings from the ASPREE cohort

Background: Multimorbidity is common in older adults, but sex differences in chronic condition clustering remain unclear. This study explored multimorbidity clusters and their associations with all-cause mortality among community-dwelling adults aged 70 years and over. Methods: This was a secondary analysis of data from 16,095 Australian ASPREE participants aged at least 70 years without prior dementia or cardiovascular disease. Fifteen baseline chronic conditions were grouped using latent class analysis (LCA). Observed-to-expected (O/E) ratios characterised conditions over-represented within clusters, and Cox proportional hazards models assessed associations with all-cause mortality. Results: Among 16,095 participants (mean age 74 years), 88.3% had multimorbidity at baseline; 4,217 deaths occurred over a median follow-up of 10.85 years. Five clusters were identified overall: hypertension and dyslipidemia (52.1%), gout and metabolic (14.4%), depressive symptoms, osteoporosis and frailty (10.0%), anaemia and kidney disease (10.2%), and hypotension, thyroid disorder and past cancer (13.3%). Sex-stratified analyses revealed three clusters in males and four in females. The frailty, depressive symptoms and osteoporosis cluster was associated with higher mortality in both sexes (aHR 1.56 [95% CI 1.40-1.73] in males; 1.68 [1.49-1.89] in females). Higher mortality was also observed for the metabolic, gout and kidney disease cluster in males (aHR 1.63 [1.47-1.81]) and the gout, anaemia and kidney disease cluster in females (aHR 1.96 [1.74-2.21]). Conclusions: Distinct multimorbidity clusters differed by sex and were associated with increased all-cause mortality. These findings may support risk stratification, targeted screening, and more person-centred management of older adults with multimorbidity.

03.
arXiv (CS.AI) 2026-06-11

LaQual: An Automated Framework for LLM App Quality Evaluation

arXiv:2508.18636v2 Announce Type: replace-cross Abstract: Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

04.
arXiv (CS.CV) 2026-06-16

TurboGS: Accelerating 3D Gaussian Splatting via Error-Guided Sparse Pixel Sampling and Optimization

Consumer-level applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds on a single RTX 5090 GPU card (up to 10x training speedup over vanilla 3DGS).

05.
arXiv (quant-ph) 2026-06-12

Quantum Error Correction Codes for Truncated SU(2) Lattice Gauge Theories

作者:

arXiv:2511.13721v2 Announce Type: replace Abstract: We construct two quantum error correction codes for pure SU(2) lattice gauge theory in the electric basis truncated at the electric flux $j_max=1/2$, which are applicable on quasi-1D plaquette chains, 2D honeycomb and 3D triamond and hyperhoneycomb lattices. The first code converts Gauss's law at each vertex into a stabilizer while the second only uses half of the vertices and is locally the carbon code. Both codes are able to correct single-qubit errors. The electric and magnetic terms in the SU(2) Hamiltonian are expressed in terms of logical gates in both codes. The logical-gate Hamiltonian in the first code exactly matches the spin Hamiltonian for gauge singlet states found in previous work.

06.
arXiv (CS.LG) 2026-06-11

Interpretable Neural Marked Statistics for Cosmological Inference

arXiv:2606.11295v1 Announce Type: cross Abstract: Recovering cosmological information beyond the power spectrum is a central goal for upcoming cosmological surveys, since late-time non-Gaussian signal in the matter density cannot be accessed through two-point statistics alone. Marked statistics fold part of this information back into the two-point level by reweighting the field with non-linear functions. We propose a neural marking scheme to generalize this process through a set of interpretable, physically motivated transformations that directly allow to interpret the gain in cosmological information at the morphological level. We employ a contrastive learning objective to align learnable marked summaries with the underlying cosmological parameters. At $k_{\max}=0.2\,h\mathrm{Mpc}^{-1}$, our neural mark tightens the marginalized constraint on $\sigma_8$ by $2.9\times$ and on $\Omega_m$ by $1.8\times$ compared to classical marks, breaking the $\Omega_m-\sigma_8$ degeneracy at the Fisher information level. It further reduces the parameter MSE across our cosmological parameter prior by $1.45\times$ over the best classical mark. The learned latent geometry aligns with the $\Omega_m$ and $\sigma_8$ directions in parameter space, indicating that the contrastive objective recovers the dominant axes of cosmological information. Our approach opens the door to more powerful, interpretable summary statistics for cosmological inference.

07.
arXiv (CS.AI) 2026-06-12

From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation

arXiv:2606.12603v1 Announce Type: cross Abstract: Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electronic wheelchairs. Unlike autonomous driving on the road, long-horizon sidewalk navigation requires precise maneuvering through unpredictable sidewalk terrains and pedestrians, with a lightweight perception stack as minimal as a single monocular RGB camera. While imitation learning (IL) from demonstrations offers a practical solution, the resulting autopilot policy often suffers from compounding errors, a lack of social compliance on sidewalks, and deficiencies in counterfactual reasoning to handle complex situations. To address these challenges, we introduce FlowPilot, a mapless navigation policy that achieves robust and efficient long-horizon navigation performance using only a monocular RGB camera. We first propose to use anchored flow matching as an action representation for policy pre-training on large-scale robot fleet data and to capture the diverse, complex, multimodal distribution of sidewalk navigation behaviors. To bridge the gap between imitation and alignment, we further design a human-in-the-loop preference learning scheme to tune the policy on a small amount of human intervention data. It strengthens the model's counterfactual reasoning and social compliance on sidewalks. We evaluate FlowPilot through extensive simulation and real-world experiments in diverse sidewalk environments. FlowPilot achieves 42% success rate and 66% route completion in simulation, while FlowPilot-HP further improves real-world robustness and social compliance, reducing IR by 40.0% and NIR by 52.1% relative to the base model.

08.
arXiv (CS.AI) 2026-06-16

Graphical-Probabilistic Modeling of Generative Flows in LLM-Native Software Systems

arXiv:2606.15943v1 Announce Type: cross Abstract: Engineering LLM-native software remains a challenging and immature field. Current practice is largely exploratory, relying on experimentation and heuristic techniques such as prompting and context engineering. These, however, are low-level and lack the principled structure needed to support design-level reasoning or analysis. In contrast, traditional software engineering leverages modularity and abstraction to communicate and analyze system behavior. To bring similar rigor to LLM-native development, we propose methods for documenting generative flows and for stating properties of LLM-based software designs. Such methods must account for the stochastic, prompt-dependent behavior of large language models while remaining expressive enough to capture emergent phenomena. Our initial approach is based on graphical probabilistic models, tailored to capture phenomena characteristic of LLM-native systems. This framework – what we term Generation Networks – aims to provide a foundation for principled reasoning about generative interactions and system-level properties in LLM-centric software architectures.

09.
arXiv (CS.AI) 2026-06-18

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

arXiv:2606.19245v1 Announce Type: new Abstract: Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3\% of endpoint attempts (178/300; 95\% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3\% (166/300; 47.0-63.6).

10.
arXiv (CS.AI) 2026-06-17

From Noise to Order: Learning to Rank via Denoising Diffusion

arXiv:2602.11453v3 Announce Type: replace-cross Abstract: Learning-to-rank (LTR) methods have traditionally been limited to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. We propose an alternative denoising diffusion-based generative approach to LTR that instead models the full joint distribution over features and relevance labels. While in discriminative LTR, an over-parameterized ranking model may find different ways to fit the training data, we posit that candidate solutions that can explain the full data distribution under the generative setting maybe better at estimating relevance. Thus, we propose DiffusionRank that extends TabDiff, an existing diffusion model for tabular datasets, to create generative alternatives to classical discriminative pointwise and pairwise LTR objectives. Our work demonstrates improvements from DiffusionRank over discriminative counterparts on four standard LTR datasets and points to a rich space for future exploration to leverage ongoing advancements in deep generative models for LTR. Our code is publicly available at https://github.com/sadjadeb/DiffusionRank.

11.
arXiv (CS.LG) 2026-06-16

Next-Latent Prediction Transformers Learn Compact World Models

arXiv:2511.05963v4 Announce Type: replace Abstract: Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat effectively encourages transformers to form compact internal world models with coherent belief states and transition dynamics – crucial properties not guaranteed by standard next-token prediction alone. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next-token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable-length self-speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/JaydenTeoh/NextLat.

12.
arXiv (CS.CV) 2026-06-17

Advances in 4D Representation: Geometry, Motion, and Interaction

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page:https://mingrui-zhao.github.io/4DRep-GMI/

13.
arXiv (CS.AI) 2026-06-11

JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

arXiv:2606.11425v1 Announce Type: cross Abstract: Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.

14.
arXiv (CS.LG) 2026-06-15

Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals

arXiv:2606.14650v1 Announce Type: new Abstract: The identification of optimal structures within vast arrays of interconnected data necessitates significant sampling- and computational effort. Learning and leveraging underlying signal dependencies can improve efficiency and predictive capabilities considerably, but the ubiquity of nonlinear statistical relations amplifies the complexity of such undertakings. In this paper, we develop novel generic and adaptive strategies equipped with routines for graph-based causal reward modeling, analytic reproducing kernel methods, and Taylor approximation of functional processes. We establish theoretical performance guarantees sublinear in time and linear in data volume over time. Our analyses cover robustness to a multitude of uncertainties arising from noise interference, gradual model convergence, and solution space mismatch. The framework's general appeal is substantiated by a minimalistic set of conditions or reliance on prior estimates, while various outlined modifications address specific or extended settings. To demonstrate practical effectiveness, we conduct numerical experiments using both benchmarked synthetic and real-world transportation datasets.

15.
arXiv (CS.LG) 2026-06-19

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

arXiv:2606.19823v1 Announce Type: cross Abstract: Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot (31.62%), Clone FT achieved a competitive 26.00% WER, nearly matching the 24.44% and 25.12% seen with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative) in cross-corpus evaluation on the SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.

16.
arXiv (CS.CV) 2026-06-12

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

17.
arXiv (CS.CV) 2026-06-19

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.

18.
arXiv (CS.CV) 2026-06-16

Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

Visual reasoning matters for many computer vision tasks that go beyond surface-level object detection and classification. Despite progress in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys typically cover only one part of the problem, such as visual question answering, scene-graph generation, neuro-symbolic AI, or multimodal chain-of-thought, and rarely analyze reasoning types, methodologies, and evaluation protocols together. This survey addresses that gap. Following a structured literature review, we group visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and examine how each is implemented across methods that range from graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems to reasoning with vision-language models (VLMs) and multimodal large language models (MLLMs), including visual chain-of-thought, visual programming, and tool-augmented and test-time reasoning. We then review evaluation protocols for functional correctness, structural consistency, and causal validity, and we analyze their limits in generalizability, reproducibility, faithfulness, and explanatory power. We also identify open challenges: scaling to complex scenes, integrating symbolic and neural paradigms more deeply, the shortage of comprehensive benchmarks, language-prior shortcuts and hallucination in foundation models, and reasoning under weak supervision. Finally, we set out a research agenda for vision systems and argue that connecting perception and reasoning is necessary for transparent, trustworthy, and cross-domain models, especially in high-stakes settings such as autonomous driving and medical diagnostics.

19.
arXiv (CS.AI) 2026-06-16

Variance Reduction for Non-Log-Concave Sampling with Applications to Inverse Problems

arXiv:2606.16257v1 Announce Type: cross Abstract: Sampling from high-dimensional, non-log-concave distributions with unnormalized densities is a fundamental challenge in machine learning, particularly when the exact gradient of the potential is unavailable and must be approximated via stochastic gradients that exhibit high variance under a fixed budget of gradient computations per iteration. Although variance reduction techniques such as SGD with momentum, STORM, and PAGE have demonstrated improved convergence properties in non-convex optimization, their implications for sampling from non-log-concave distributions remain largely unexplored. In this work, we develop the first unified analysis of these estimators for sampling from non-log-concave distributions. We establish improved non-asymptotic convergence rates in $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, in squared total variation distance, and further prove weak convergence to the target distribution. We extend our analysis to solving inverse problems with score-based generative priors. We empirically validate our theory and demonstrate that, under a fixed gradient computations per iteration, variance-reduction techniques consistently improve sample quality in two standard imaging applications.

20.
arXiv (CS.LG) 2026-06-12

Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

arXiv:2606.13605v1 Announce Type: cross Abstract: This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.

21.
arXiv (CS.LG) 2026-06-19

Beyond Averaging in John Ellipsoid Approximation: High-Accuracy Algorithms in the Leverage-Score Model

arXiv:2606.20082v1 Announce Type: cross Abstract: The John ellipsoid of a symmetric polytope $P=\{\mathbf{x}\in\mathbb{R}^d:\|\mathbf{A}\mathbf{x}\|_\infty\le1\}$, $\mathbf{A}\in\mathbb{R}^{n\times d}$, is computed by a long line of leverage-score algorithms, from Cohen, Cousins, Lee and Yang (COLT 2019) to its successors [WY24, CLS+25], all reaching a $(1+\varepsilon)$-approximation in $\Theta(\varepsilon^{-1}\log(n/d))$ iterations. We separate this complexity into three costs the modern line conflates (certification, identification, and accuracy) and locate the historical $\varepsilon^{-1}$ in the first alone. In the equivalent D-optimal-design form $\min_{\mathbf{p}\in\Delta_n}-\log\det(\sum_i p_i\mathbf{a}_i\mathbf{a}_i^\top)$, the leverage-score oracle is exactly the first-order oracle and the $(1+\varepsilon)$-John guarantee the Frank-Wolfe gap $g(\mathbf{p})\le\varepsilon d$; through this dictionary the costs come apart. The $\varepsilon^{-1}$ is a certification artifact: the uniform average of the iterates, the certificate used throughout the line, has gap exactly $\Theta(1/T)$, however cheap each iteration is made. Pointed instead at the last iterate the same oracle is fast: a warm-started accelerated method reaches the guarantee in $C(\mathbf{A})+O(\sqrt{\kappa}\log(1/\varepsilon))$ queries after an $\varepsilon$-independent setup $C(\mathbf{A})$, and once the optimal face is identified the facial problem is an unconstrained self-concordant minimization whose Hessian the oracle recovers exactly, so damped Newton needs only $O(\log\log(1/\varepsilon))$ steps, for a total of $C(\mathbf{A})+O(d^2\log\log(1/\varepsilon))$ queries. The accuracy dependence is thus doubly logarithmic after an $\varepsilon$-independent, condition-dependent setup; the open problem is the remaining identification cost (a condition-free bound on reaching the optimal face) and lower bounds. Accuracy is not the obstruction.

22.
arXiv (CS.CV) 2026-06-16

Training-Free Adversarial Robustness in Computational MRI

Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

23.
arXiv (CS.AI) 2026-06-11

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

arXiv:2602.22638v2 Announce Type: replace Abstract: Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.

24.
medRxiv (Medicine) 2026-06-15

Natural Language Processing Based Solution for Labeling Brain Metastasis Identified in Radiology Reports

Abstract Purpose: Brain metastases (BM) far exceed primary CNS tumours and constitute the majority workload for neuro-oncology care providers. Currently, the cancer registries only capture synchronous BMs, which is only a small proportion of all BMs. We aim to develop and validate a natural language processing (NLP) algorithm that identifies brain metastases in radiology reports, enabling scalable surveillance of asynchronous BMs. Methods: Using population-based cancer registry data in Alberta, Canada, we identified a cancer cohort diagnosed between 2012–2019 with follow-up to 2022. All brain/head radiology reports at and post-cancer diagnosis were identified. Reports were sampled through a multi-phase approach and manually labeled for BM presence. We trained two Bio_ClinicalBERT models on the "Findings" and "Impressions" sections, respectively, and took the maximum predicted probability as the report-level prediction. Internal and external validation used reports from the Canadian provinces of Alberta, Ontario, and British Columbia. Results: The models were trained on 1,879 samples. For internal validation, 1,833 reports from 357 patients were tested. At a probability threshold of 0.4, the model achieved a sensitivity of 0.888 and precision of 0.499. The ensemble substantially outperformed single-section models, which achieved sensitivities of only 67.8% (Findings) and 74.2% (Impressions). On external validation, sensitivity was 0.918 in Ontario and 0.726 in British Columbia, demonstrating robustness across diverse data distributions. Conclusions: An NLP-based pipeline processing both Findings and Impressions sections has been developed and validated in three Canadian provinces. It meets cancer registry operational requirements and to be implemented into the surveillance workflow in Alberta and British Columbia, providing a foundation for population-level BM surveillance.

25.
arXiv (CS.AI) 2026-06-11

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

arXiv:2606.11440v1 Announce Type: new Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.