论文广场 - AcademicHub

01.

arXiv (CS.CL) 2026-06-17 DOI: arXiv:2509.26476

Regression Language Models for Code

作者:

Yash Akhauri ↗Xingyou Song ↗Arissa Wongpanich ↗Bryan Lewandowski ↗Mohamed S. Abdelfattah ↗

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) using a frozen LLM encoder can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM based on T5Gemma, obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 24 different programming languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.

阅读与讨论 → 访问原文 →

02.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2601.04304

Chiral Lattice Gauge Theories from Symmetry Disentanglers

作者:

Ryan Thorngren ↗John Preskill ↗Lukasz Fidkowski ↗

arXiv:2601.04304v2 Announce Type: replace-cross Abstract: We propose a Hamiltonian framework for constructing chiral gauge theories on the lattice based on symmetry disentanglers: constant-depth circuits of local unitaries that transform not-on-site symmetries into on-site ones. When chiral symmetry can be realized not-on-site and such a disentangler exists, the symmetry can be implemented in a strictly local Hamiltonian and gauged by standard lattice methods. Using lattice rotor models, we realize this idea in 1+1 and 3+1 spacetime dimensions for $U(1)$ symmetries with mixed 't Hooft anomalies, and show that symmetry disentanglers can be constructed when anomalies cancel. As an example, we present an exactly solvable Hamiltonian lattice model of the (1+1)-dimensional "3450" chiral gauge theory, and we argue that a related construction applies to the $U(1)$ hypercharge symmetry of the Standard Model fermions in 3+1 dimensions. Our results open a new route toward fully local, nonperturbative formulations of chiral gauge theories.

阅读与讨论 → 访问原文 →

03.

arXiv (CS.AI) 2026-06-24 DOI: arXiv:2509.04827

VoltanaLLM: Energy-Efficient and SLO-Aware Disaggregated LLM Serving via Adaptive Frequency Control and State-Space Routing

作者:

Jiahuan Yu ↗Aryan Taneja ↗Junfeng Lin ↗Minjia Zhang ↗

arXiv:2509.04827v3 Announce Type: replace-cross Abstract: The energy cost of Large Language Model (LLM) inference is rapidly becoming a barrier to sustainable and scalable deployment. Although modern serving architectures expose distinct prefill and decode behaviors, existing systems fail to exploit these phase differences for energy-efficient serving under strict latency SLOs. This paper introduces VoltanaLLM, the first system that explicitly targets and reduces the energy bloat in modern prefill-decode (P/D) disaggregated LLM serving. Guided by a control-theory perspective, VoltanaLLM separates two levers: per-instance operating-point selection (GPU frequency per iteration) and system-level state-space routing of requests. We empirically observe that LLM inference exhibits a U-shaped energy-frequency curve creating "sweet spots" that depend on phase behavior and load. VoltanaLLM exploits this by combining phase-specific, iteration-level frequency selection driven by a lightweight, online-adaptive latency predictor, with a decode state-space guided router that avoids architectural granularity-induced inefficiencies, all while meeting desired SLOs. We implement VoltanaLLM using SGLang and evaluate it across multiple models and real-world workloads. Our results show VoltanaLLM reduces end-to-end energy by up to 36.3% versus a static max-frequency baseline while maintaining high SLO attainment, and generalizes to newer GPUs. These results point to sustainable LLM serving via phase-aware, iteration-level frequency selection coupled with architecture-aware routing. Source code is available in https://github.com/Supercomputing-System-AI-Lab/VoltanaLLM.

阅读与讨论 → 访问原文 →

04.

arXiv (CS.AI) 2026-06-18 DOI: arXiv:2605.10840

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

作者:

Yixuan Yang ↗Mehak Arora ↗Ryan Zhang ↗Baraa Abed ↗Junseob Kim ↗Tilendra Choudhary ↗Md Hassanuzzaman ↗Kevin Zhu ↗Ayman Ali ↗Chengkun Yang ↗Alasdair Edward Gent ↗Victor Moas ↗…

arXiv:2605.10840v3 Announce Type: replace-cross Abstract: We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data – to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning – remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum – predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization – addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).

阅读与讨论 → 访问原文 →

05.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.14957

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

作者:

Haoxu Huang ↗Long Chen ↗Jingyun Chen ↗Jinu Hyun ↗James Ryan Loftus ↗Kara Melmed ↗Daniel Orringer ↗Jennifer Frontera ↗Seena Dehkharghani ↗Arjun Masurkar ↗Narges Razavian ↗

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighed imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

阅读与讨论 → 访问原文 →

06.

arXiv (CS.AI) 2026-06-19 DOI: arXiv:2606.19568

Exploring Feature Extraction Technique Parameters for Acoustic Gunshot Classification

作者:

Sinclair Gurny ↗Ryan Quinn ↗

arXiv:2606.19568v1 Announce Type: cross Abstract: Acoustic gunshot detection is a problem with applications across civilian public safety, military operations, and wildlife conservation, yet the field lacks a rigorous exploration of feature extraction techniques with a focus on generalization to realistic data. The mixed effectiveness of commercial gunshot detection and classification systems indicates an open problem that is not adequately addressed by the current literature. In this paper, we present a systematic investigation of common feature extraction techniques using a dataset of 23,000 gunshot recordings across 85 firearms and 21 calibers. We benchmark three feature extraction techniques with 12 total unique parameter sets using ResNet-18. Our results demonstrate that using the correct feature extraction technique can improve top-1 accuracy by up to 20%, and utilizing the correct parameters for a given feature extraction technique can improve that value by up to 4.7%.

阅读与讨论 → 访问原文 →

07.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.18135

Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)

作者:

Sinclair Gurny ↗Ryan Quinn ↗

arXiv:2606.18135v1 Announce Type: cross Abstract: In this work, we introduce the Certus Caliber Classification Gunshot Dataset (C3GD), a publicly accessible data set developed for the analysis of firearm muzzle blast sounds. The dataset aims to provide a wide variety of firearms, calibers, cartridges, microphones, and microphone locations with metadata detailed beyond what is currently otherwise available. It comprises more than 8000 field-collected data points from 28 firearms across 16 calibers. Because data collection in the field is costly, much of the existing research has been done using gunshot audio collected from the internet, which increases the risk of low-quality data and label noise. This dataset is primarily focused on caliber classification, but can also be used for gunshot detection, audio separation, and audio signal processing, providing a diversified and real-world reference. The dataset aims to provide enough diversity to be able to generalize to more real-world applications while also providing enough metadata for detailed academic analysis.

阅读与讨论 → 访问原文 →

08.

arXiv (CS.LG) 2026-06-11 DOI: arXiv:2606.12360

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

作者:

Leon Bergen ↗Usha Bhalla ↗Sidharth Baskaran ↗Max Loeffler ↗Raphael Sarfati ↗Dhruvil Gala ↗Ryan Panwar ↗Santiago Aranguri ↗Thomas Fel ↗Atticus Geiger ↗Matthew Kowal ↗Siddharth Boppana ↗…

arXiv:2606.12360v1 Announce Type: new Abstract: Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

阅读与讨论 → 访问原文 →

09.

arXiv (math.PR) 2026-06-15 DOI: arXiv:2605.00375

Stability of the $k$-Plane Transform on Measures and Hölder-Type Comparisons of Wasserstein Metrics

作者:

Fatma Terzioglu ↗Ryan Murray ↗

arXiv:2605.00375v2 Announce Type: replace-cross Abstract: We establish stability estimates for the $k$-plane transform on finite positive Radon measures, with emphasis on Fourier and Wasserstein metrics. We first introduce a metric on $k$-plane transform data and prove a bi-Lipschitz stability estimate showing that this metric is equivalent to a generalized Fourier metric obtained by augmenting the Fourier distance between centered normalized measures with separate barycenter and total mass difference terms. Building on a Hölder-type comparison between Fourier and Wasserstein metrics due to Carrillo and Toscani, we extend this comparison to positive Radon measures under uniform bounds on centered moments of order slightly larger than $2$. This yields Hölder-type stability for the $k$-plane transform in a generalized $2$-Wasserstein metric and, in particular, a $W_2$-stability estimate for centered probability measures. We also compare the $2$-Wasserstein distance with its max-sliced analogue. For centered probability measures with uniformly bounded moments of order slightly larger than $2$, we prove a two-sided Hölder-type comparison between these distances. We then extend the result to positive Radon measures by applying it to centered normalized measures and adding separate barycenter and mass terms. Finally, for absolutely continuous compactly supported probability measures with bounded densities, we prove a strong equivalence between the $2$-Wasserstein distance of the measures and the $(k/2-1)$-order Sobolev norm of the $k$-plane transform data of the difference of their densities.

阅读与讨论 → 访问原文 →

10.

arXiv (CS.AI) 2026-06-17 DOI: arXiv:2606.18023

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

作者:

Jian Yang ↗Shawn Guo ↗Wei Zhang ↗Tianyu Zheng ↗Yaxin Du ↗Haau-Sing Li ↗Jiajun Wu ↗Yue Song ↗Yan Xing ↗Qingsong Cai ↗Zelong Huang ↗Chuan Hao ↗…

arXiv:2606.18023v1 Announce Type: cross Abstract: Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain–cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain–cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

阅读与讨论 → 访问原文 →

11.

arXiv (CS.LG) 2026-06-19 DOI: arXiv:2606.19712

Efficient Neural Network Model Selection for Few-Class Application Datasets

作者:

Bryan Bo Cao ↗Abhinav Sharma ↗Lawrence O'Gorman ↗Michael Coss ↗Shubham Jain ↗

arXiv:2606.19712v1 Announce Type: new Abstract: While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon "few-class distinctiveness". Our metric allows comparison of models and datasets 6 to 29$\times$ faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.

阅读与讨论 → 访问原文 →

12.

arXiv (quant-ph) 2026-06-19 DOI: arXiv:2606.19730

Topological Quantum Interferometry

作者:

Tianyou Ying ↗Yufeng Zhou ↗Chengwei Pan ↗Ryan Hogan ↗Ruoyang Zhang ↗Hui Liu ↗Shining Zhu ↗Xiaoqin Gao ↗

arXiv:2606.19730v1 Announce Type: new Abstract: Structured light provides high-dimensional Hilbert spaces holding tremendous potential for fundamental quantum optics and quantum technologies. However, existing characterization methods, like Hong-Ou-Mandel (HOM) interference, typically assume perfectly tuned conditions, overlooking the geometric physics governing spatial mode evolution. Here, we establish topological quantum interferometry driven by an interaction-based geometric phase, the exchange Berry phase (BPX). Our formalism generalizes $q$-plate state generation and characterization to arbitrary topological charges and (de)tuning conditions, demonstrating that BPX acts as a geometric marker governing spatial interference. We show BPX serves as a deterministic control parameter, decomposing two-photon spatial patterns into geometry-dictated fundamental modes. This mapping reveals topological invariants and phase singularities that function as a non-tomographic witness for state dimensionality estimation, circumventing full-state reconstruction. Being device-independent and highly scalable, this approach enables scalable high-dimensional characterization and topologically protected state selection, with direct applicability to quantum metrology and high-capacity quantum networks.

阅读与讨论 → 访问原文 →

13.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2605.04221

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

作者:

Yao-Shun Chuang ↗Tushti Mody ↗Uday Pratap Singh ↗Shirindokht Shiraz ↗Chun-Teh Lee ↗Ryan Brandon ↗Muhammad F Walji ↗Xiaoqian Jiang ↗Bunmi Tokede ↗

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.

阅读与讨论 → 访问原文 →

14.

arXiv (CS.LG) 2026-06-17 DOI: arXiv:2602.06276

Statistical Learning from Attribution Sets

作者:

Lorne Applebaum ↗Robert Busa-Fekete ↗August Y. Chen ↗Claudio Gentile ↗Tomer Koren ↗Aryan Mokhtari ↗

arXiv:2602.06276v2 Announce Type: replace Abstract: We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

阅读与讨论 → 访问原文 →

15.

arXiv (CS.CL) 2026-06-24 DOI: arXiv:2606.24083

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

作者:

Morayo Danielle Adeyemi ↗Ryan A. Rossi ↗Franck Dernoncourt ↗

"Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user's prompt or the model's response) is being compressed. We present Cavewoman, a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and reference-text agreement against the model's unconstrained reference. We evaluate eight models on five datasets at five reduction levels, with both channels measured on the same items. Output compression cuts realized cost on most API models (1.4-2.4x per model, up to 3x in the best case) and on all four open-weight models under public-tier pricing. Input compression has the opposite effect, a strict lose-lose: it raises net cost rather than lowering it (~1.15x on the five-benchmark mean, up to 1.8x on the worst dataset and 2.7x under stronger compression), because models compensate with longer responses even as accuracy collapses. Under the same setting, surface text diverges from the unconstrained reference: on the non-reasoning models, roughly half of all generations are correct yet their surface text no longer entails the model's own unconstrained baseline generation. The divergence survives length-controlled re-scoring, multiple-comparisons correction, and replication under complementary semantic measures. Code and data are available at https://github.com/danielle34/cavewoman.

阅读与讨论 → 访问原文 →

16.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.16037

Adiabatically-induced Kawaguchi geometry and jerk in quantum-classical systems

作者:

Athanasios Chatzistavrakidis ↗Larisa Jonke ↗Ryan Requist ↗

arXiv:2606.16037v1 Announce Type: new Abstract: Adiabatically eliminating the quantum degrees of freedom in a mixed quantum-classical system produces an effective force in the classical equation of motion. The elimination can be made to any order in the adiabatic parameter, generating a series of higher order forces. By applying a sequence of near-identity unitary transformations to the quantum state, we derive a hierarchy of increasingly accurate effective actions for the classical variables. The third order Euler-Lagrange equation is non-Newtonian as the force depends on the jerk, the third order time derivative of position. We find that the third order terms induce a special kind of Kawaguchi geometry on the space of classical variables. This geometry is characterized by an almost symplectic structure and a differential line element that depends on the acceleration in addition to the velocity. Our results can be used to efficiently capture higher order nonadiabatic effects in molecular dynamics simulations.

阅读与讨论 → 访问原文 →

17.

arXiv (CS.AI) 2026-06-12 DOI: arXiv:2606.12429

Muse Spark Safety & Preparedness Report

作者:

Cristina Menghini ↗Peter Ney ↗Hamza Kwisaba ↗Zifan ↗Wang ↗Miles Turpin ↗Felix Binder ↗Jean-Christophe Testud ↗Aidan Boyd ↗Nathaniel Li ↗Ivan Evtimov ↗Klaudia Krawiecka ↗…

arXiv:2606.12429v1 Announce Type: cross Abstract: Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains under Meta's Advanced AI Scaling Framework, along with the evidence that informed our launch decision. We then discuss additional considerations, such as Muse Spark's broader content safety and behavioral profile, that are relevant to overall safety but fall outside the catastrophic risk domains governed by the Framework. Our preparedness results covering Chemical and Biological, Cybersecurity, and Loss of Control risks assess Muse Spark's deployment within Meta AI as presenting acceptable levels of residual risks under our Advanced AI Scaling Framework. We conducted a broad set of evaluations targeting dual-use and high-risk capabilities across these catastrophic risk domains. Those evaluations identified elevated risks prior to mitigations, with Chemical and Biological capabilities assessed as likely reaching the "high risk" category under the Advanced AI Scaling Framework before safeguards were applied. We have implemented a multi-layered set of mitigations that address the identified risks, and Muse Spark demonstrates state-of-the-art refusal across a range of benchmarks related to hazardous workflows in chemistry and biology. We therefore release Muse Spark as the underlying model of Meta AI.

阅读与讨论 → 访问原文 →

18.

arXiv (CS.CL) 2026-06-12 DOI: arXiv:2606.13681

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

作者:

Jundong Xu ↗Qingchuan Li ↗Jiaying Wu ↗Yihuai Lan ↗Shuyue Stella Li ↗Huichi Zhou ↗Bowen Jiang ↗Lei Wang ↗Jun Wang ↗Anh Tuan Luu ↗Caiming Xiong ↗Hae Won Park ↗…

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

阅读与讨论 → 访问原文 →

19.

Nature (Science) 2026-06-15 DOI: HASH:e291a00d66518bb44155133040c20bcb

How AI is revealing the secret lives of animals from hummingbirds to pumas

作者:

Ryan Huling ↗

Advances in machine learning and other technologies are helping researchers to trace the movements, landmarks and social practices of wildlife. Advances in machine learning and other technologies are helping researchers to trace the movements, landmarks and social practices of wildlife.

阅读与讨论 → 访问原文 →

20.

arXiv (CS.LG) 2026-06-17 DOI: arXiv:2606.18166

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

作者:

Ahmed Ryan ↗Saad Sakib Noor ↗Md Erfan ↗Shaswata Mitra ↗Sudip Mittal ↗Md Rayhanur Rahman ↗

arXiv:2606.18166v1 Announce Type: cross Abstract: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving \k{appa} = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.

阅读与讨论 → 访问原文 →

21.

bioRxiv (Bioinfo) 2026-06-11 DOI: HASH:51f0d02ac861598f27251dc5f980a497

Combinatorial docking and molecular generation to navigate over 100-billion molecules for prospective ligand discovery

作者:

Zhang ↗Yang ↗Chen ↗Lam ↗Bryant ↗Pidathala ↗Wang ↗Moroz ↗Radchenko ↗Alon ↗Lee ↗C.-H ↗…

Commercially available make-on-demand libraries now exceed 100 billion compounds, requiring over 50 years to screen on 2,000 CPU cores using conventional docking. We present two complementary approaches to address this challenge. CombiDOCK, a combinatorial docking framework, enables exhaustive screening at the 100-billion scale within 40 days. MINT-Dock, a generative framework, accelerates navigation of this space by integrating CombiDOCK with Monte Carlo Tree Search. Benchmarked on 46 diverse targets, CombiDOCK matched full-molecule docking accuracy, and MINT-Dock achieved a 4,800-fold enrichment over random selection. Compared with prior billion-scale brute-force campaigns against {sigma}2, VMAT2, and VAChT, prospective CombiDOCK screens of the 100-billion-molecule library yielded higher hit rates and more potent ligands, while MINT-Dock achieved comparable outcomes across single- and multi-target objectives with >20-fold computational cost reductions. Docking-predicted poses of the best VAChT-binding compounds were confirmed by cryo-EM structures. These methods provide exhaustive and generative paths for navigating the trillion-molecule frontier of drug discovery.

阅读与讨论 → 访问原文 →

22.

arXiv (CS.CL) 2026-06-11 DOI: arXiv:2606.12087

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

作者:

Jia Deng ↗Yimeng Chen ↗Xiaoqing Xiang ↗Ziyang Zeng ↗Shuo Tang ↗Wayne Xin Zhao ↗Feng Chang ↗Chuan Hao ↗Yuan Wei ↗Ran Tao ↗Bryan Dai ↗Ji-Rong Wen ↗…

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase apparent difficulty by enriching graph structures, but structural complexity alone does not guarantee realized search difficulty: the intended search process can collapse through a cheaper identifying route. We formalize this gap with a shortcut-aware difficulty framework and identify four actionable shortcut risks: evidence co-coverage, single-clue selectivity, exposed constants, and prior-knowledge binding. To diagnose their realized effects, we use trajectory signatures including solving cost, answer hit time, and prior-shortcut rate. Guided by this framework, we introduce FORT, a Framework of Shortcut-Resistant Training-Data Synthesis. FORT constructs shortcut-resistant training data by controlling shortcut risks across entity selection, evidence graph construction, question formulation, and adversarial refinement. Experiments show that FORT induces longer pre-answer search and fewer shortcut patterns than existing open-source deep search datasets. Using the resulting trajectories, we train FORT-Searcher with supervised fine-tuning (SFT) only, and it achieves the best overall performance among comparable-size open-source search agents on challenging deep search benchmarks. Relevant resources will be made available at https://github.com/RUCAIBox/FORT-Searcher.

阅读与讨论 → 访问原文 →

23.

arXiv (quant-ph) 2026-06-16 DOI: arXiv:2606.15586

Quantum Illumination with Symmetry-Constrained Random Unitaries

作者:

Sooryansh Asthana ↗Sai Vinjanampathy ↗

arXiv:2606.15586v1 Announce Type: new Abstract: Quantum illumination provides a quantum advantage in detecting weakly reflecting objects embedded in a noisy environment, even when environmental noise destroys most of the initial entanglement. We investigate this advantage using Haar-random probe states constrained to symmetry-resolved subspaces. Employing tools from quantum channel discrimination and asymptotic hypothesis testing, we derive the discrimination exponents associated with Haar-random probe ensembles and identify the role of symmetry in determining their performance. We show that typical states drawn from fixed-charge sectors achieve the same asymptotic quantum-illumination advantage as maximally entangled probes. In particular, we show that the effective thermal-noise suppression and the corresponding Chernoff exponent are governed by the dimension of the accessible symmetry sector. Our results reveal that the operational resource underlying quantum illumination can be generalized from fine-tuned structure of a specific probe state to the existence of a large symmetry-protected correlation subspace. These findings establish a direct connection between quantum illumination, symmetry-resolved typicality, and quantum channel discrimination, and demonstrate that near-optimal quantum hypothesis testing resources can emerge naturally from generic many-body quantum states constrained by conservation laws.

阅读与讨论 → 访问原文 →

24.

arXiv (CS.CL) 2026-06-16 DOI: arXiv:2606.16118

Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

作者:

Olivia Peiyu Wang ↗Sanna Wong-Toropainen ↗Daneshvar Amrollahi ↗Ryan Bai ↗Tashvi Bansal ↗Arush Garg ↗Leilani H. Gilpin ↗

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

阅读与讨论 → 访问原文 →

25.

arXiv (CS.CV) 2026-06-16 DOI: arXiv:2606.15427

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

作者:

Nicholas A. Welsh ↗Lennon J. Shikhman ↗Monty Nehru Attazs ↗Seemanthini K. Putane ↗Van Minh Nguyen ↗Ryan T. White ↗

Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision–language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of $129$ images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves $0.385$ mAP@$0.5$ and $0.267$ mAP@$0.5{:}0.95$. Performance is strongly scale-dependent: large structural elements like spacecraft bodies ($0.639$ AP@$0.50$) and solar arrays ($0.598$ AP@$0.5$) localize reliably, while relatively small appendages like antennas ($0.221$ AP@$0.5$) and thrusters ($0.081$ AP@$0.5$) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to $82%$ improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.

阅读与讨论 → 访问原文 →

探索全球前沿学术脉络