Academic Intelligence · Curated Daily

Explore the frontiers of global academic research

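AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.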

01.
arXiv (CS.CL) 2026-06-12

Occupational Prompting Reveals Cultural Bias in Large Language Models

Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational identities with broader cultural value patterns. Prior work used nationality-based cultural prompting to study how LLM responses to value-survey questions align with human cultural benchmarks. In this paper, we extend that framework by replacing cultural prompting with occupational prompting to examine how professional-role cues influence value-survey responses in open-weight LLMs. Using a survey-grounded evaluation pipeline based on questions from the Integrated Values Surveys, we project model responses into the two-dimensional Inglehart–Welzel cultural space. We prompt open-weight LLMs to answer questions under occupational identities such as accountant, teacher, engineer, and nurse, and then analyze how these occupation-conditioned responses are positioned on the cultural map. Our results show that when open-weight LLMs are prompted with occupations rather than national identities, their responses remain within a broadly Western-leaning region of the cultural map. However, different occupations introduce shifts within this region, producing distinct occupational skews. This indicates that occupational prompts are not treated as neutral role labels, but instead elicit structured value patterns. These findings extend survey-based evaluation of cultural bias beyond nationality-based prompting and provide a framework for studying how occupational personas shape value expression in LLMs.

02.
arXiv (CS.CV) 2026-06-17

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.

03.
arXiv (math.PR) 2026-06-16

Free energy of non-convex multi-species spin glasses with centered Ising spins

arXiv:2606.16636v1 Announce Type: new Abstract: We identify the limit free energy of all multi-species spin glasses with centered $\pm 1$ spins. The result was previously known only under a convexity assumption on the covariance function of the Hamiltonian. We also obtain a one-species reduction of the formula for balanced multi-species models.

04.
arXiv (CS.CV) 2026-06-18

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
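
The quantile transfer idea described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the rule of placing the calibration threshold at the highest-scoring incorrect match, and the rank-based quantile mapping are all assumptions made for the example.

```python
from bisect import bisect_right


def max_precision_threshold(scores, is_correct):
    """Assumed calibration rule: the highest score of any incorrect match.

    Accepting only scores strictly above this value gives 100% precision
    on the calibration traversal.
    """
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    if not wrong:
        return min(scores)  # no incorrect matches were observed
    return max(wrong)


def transfer_threshold(calib_scores, calib_threshold, deploy_scores):
    """Map a threshold to the same quantile of the deployment scores."""
    calib_sorted = sorted(calib_scores)
    deploy_sorted = sorted(deploy_scores)
    # Number of calibration scores at or below the threshold.
    rank = bisect_right(calib_sorted, calib_threshold)
    # Same relative rank in the deployment set (integer ceiling division).
    n_c, n_d = len(calib_sorted), len(deploy_sorted)
    idx = -(-rank * n_d // n_c) - 1
    idx = max(0, min(n_d - 1, idx))  # clamp to a valid index
    return deploy_sorted[idx]


# Hypothetical similarity scores, for illustration only.
calib = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
flags = [False, False, False, False, True, False, False, False, True, True]
deploy = [2 * s for s in calib]

t_calib = max_precision_threshold(calib, flags)
t_deploy = transfer_threshold(calib, t_calib, deploy)
```

Working with integer ranks rather than floating-point quantiles keeps the mapping exact when the two score sets differ only by a monotone rescaling, as in the toy data here.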

05.
arXiv (quant-ph) 2026-06-16

High-dimensional coherence to entanglement transduction under canonical noise

arXiv:2606.16695v1 Announce Type: new Abstract: We develop an analytical framework for coherence-to-entanglement conversion in bipartite high-dimensional quantum systems, so-called qunits. An arbitrary coherent input qunit is coupled to an incoherent ancilla through a generalized controlled-shift operation, producing a maximally correlated bipartite state. By analyzing the partial transpose of the output state, we establish an exact dimension-independent connection between the input coherence and the generated entanglement. We then study how this conversion is affected by three standard noise processes applied after the conversion step: phase damping, global depolarizing noise, and independent amplitude damping. The resulting expressions show that these channels degrade entanglement in qualitatively different ways. Phase damping leads to a uniform attenuation of the entanglement generated from coherence, depolarizing noise introduces pairwise thresholds associated with entanglement sudden death, and amplitude damping produces an asymmetric decay governed by relaxation toward the ground state. For maximally coherent inputs, the general results reduce to simple closed-form behavior, allowing direct comparison of the three noise mechanisms as the system dimension increases. In particular, global depolarizing noise exhibits a dimension-dependent sudden-death threshold, while amplitude damping leads to a smooth suppression in the maximally coherent case. These results provide useful analytical benchmarks for high-dimensional resource conversion and for assessing noisy entanglement generation in qudit-based quantum-information settings.

06.
arXiv (CS.AI) 2026-06-15

Generalized Discrete Diffusion with Self-Correction

arXiv:2603.02230v2 Announce Type: replace-cross Abstract: Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.

07.
arXiv (CS.CL) 2026-06-12

LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

08.
arXiv (CS.LG) 2026-06-16

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

arXiv:2606.16384v1 Announce Type: new Abstract: Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300Mbps, matching the wall-clock convergence speed of centralized models on 100Gbps interconnects.

09.
arXiv (CS.AI) 2026-06-15

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

arXiv:2606.13782v1 Announce Type: new Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
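
The abstract reports Pass@8 but does not say how it is computed. One widely used convention is the unbiased pass@k estimator from code-generation evaluation, sketched below. Whether MA-ProofBench uses this estimator or simply counts a problem as solved when any of 8 attempts succeeds is an assumption here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: number of sampled proof attempts
    c: number of those attempts that were verified correct
    k: attempt budget being scored (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct attempt.
        return 1.0
    # 1 minus the probability that k attempts drawn without replacement
    # from the n samples are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 8 this reduces to the simple rule: the score is 1 if any attempt succeeds and 0 otherwise. The benchmark-level figure is then the mean over problems.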

10.
arXiv (CS.AI) 2026-06-15

Minim: Privacy-Aware Minimal View for Agents via Trusted Local Sanitization

arXiv:2606.13949v1 Announce Type: new Abstract: Modern LLM-powered autonomous agents increasingly rely on rich user interface (UI) state observations to achieve reliable action grounding in complex digital environments. However, many deployments transmit the full UI state to remote inference servers even when most elements are irrelevant to the current task, which can leak sensitive but unnecessary context such as authentication codes, private notifications, and background application states. We propose MINIM, a trusted local broker that performs privacy-aware minimization on the client side before any observation leaves the device. Grounded in Contextual Integrity (CI), MINIM learns a dual-score representation for each UI element by predicting an inherent sensitivity score (s) and a task-conditioned necessity score (n). These scores drive a ternary disclosure policy that keeps essential elements, abstracts sensitive attributes when needed, and removes task-irrelevant content. We optimize a CI-aware objective that penalizes necessity errors more strongly on high-risk content, enabling aggressive pruning while preserving task-critical information. Experiments on real-world UI observations derived from WebArena show that MINIM substantially reduces task-irrelevant sensitive leakage while preserving task-critical semantic context and the interactive affordances required for reliable agent actions.
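
A ternary disclosure policy driven by the two scores might look like the sketch below. The abstract does not give the decision rule, so the thresholds `n_min` and `s_max`, the `UIElement` type, and the placeholder string are all hypothetical; in MINIM the scores come from a learned model, while here they are supplied by hand.

```python
from dataclasses import dataclass

KEEP, ABSTRACT, REMOVE = "keep", "abstract", "remove"


@dataclass
class UIElement:
    text: str
    sensitivity: float  # s: inherent sensitivity score in [0, 1]
    necessity: float    # n: task-conditioned necessity score in [0, 1]


def disclosure_action(el, n_min=0.5, s_max=0.5):
    """Pick one of three actions (thresholds are illustrative)."""
    if el.necessity < n_min:
        return REMOVE    # task-irrelevant content never leaves the device
    if el.sensitivity > s_max:
        return ABSTRACT  # needed for the task, but too sensitive to send raw
    return KEEP


def sanitize(elements, placeholder="[REDACTED]"):
    """Build the minimized view that would be sent to the remote model."""
    view = []
    for el in elements:
        action = disclosure_action(el)
        if action == KEEP:
            view.append(el.text)
        elif action == ABSTRACT:
            view.append(placeholder)
    return view


# Hypothetical UI state with hand-assigned scores.
ui_state = [
    UIElement("Submit button", sensitivity=0.1, necessity=0.9),
    UIElement("Card number field: 4111...", sensitivity=0.9, necessity=0.8),
    UIElement("Notification: your code is 123456", sensitivity=0.95, necessity=0.05),
]
minimal_view = sanitize(ui_state)
```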

11.
arXiv (CS.LG) 2026-06-16

Incentives and Evidence in Learned Service Orchestration

arXiv:2606.16555v1 Announce Type: cross Abstract: Reinforcement learning for service orchestration has been the subject of sustained research for over a decade, yet it is not used in production at scale. The usual explanation is that learned controllers degrade under delayed and noisy telemetry, workload shifts, and uncontrolled tenants. We test whether existing evidence supports that explanation. We evaluate three highly influential RL-based orchestration systems spanning resource allocation, DAG scheduling, and autoscaling, using pre-registered predictions about comparative degradation under production-relevant perturbations and paired inference with family-wise error correction. Across the tests, most predicted performance reversals do not occur. Diagnostic analyses show that these outcomes often reflect comparator collapse, artefact limitations, or evaluation choices rather than evidence that learned controllers tolerate the perturbations. One apparent advantage under observation lag is roughly fortyfold compared to a Kubernetes HPA-equivalent controller. Another widely cited result cannot be reconstructed from its released artefact, and the strongest reproducible margin is far smaller than the published results. Conclusions also reverse under changes in perturbation magnitude and evaluation mode. Based on these results and broader patterns in the literature, we identify an institutional problem. Publication and review incentives favour benchmark gains against convenient comparators, even when those gains provide little evidence of deployment performance. We argue that the problem is not solely technical. Rather, it is institutional, so learned orchestration needs production-grade comparators, registered perturbation models, separate operational metrics, and publication criteria that reward reproducible operational evidence. Without these changes, the literature can grow without establishing whether learning improves orchestration.
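
The abstract mentions family-wise error correction without naming the procedure. As one common choice, the Holm step-down method is sketched below. Its use here is an assumption, and the p-values in the usage line are made up for illustration.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure controlling the family-wise error rate.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected at level alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha / m, the next with
        # alpha / (m - 1), and so on.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


decisions = holm_reject([0.01, 0.04, 0.03])
```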

12.
arXiv (CS.AI) 2026-06-16

Towards End-to-End Automation of AI Research

arXiv:2606.15497v1 Announce Type: new Abstract: The automation of science is a long-standing ambition in the field of AI. While the community has made significant progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle – from conception to publication – has remained out of reach. Here, we present the strongest demonstration to date toward automating the entire process end-to-end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyzes data, writes the entire scientific manuscript and performs its own peer review. Its ideas, execution, and presentation are of sufficient quality to produce a manuscript generated by an AI system that passes the first round of peer review at a major machine learning conference workshop. The workshop has an acceptance rate of 70 percent. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold to conduct research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on, and evaluate them. This achievement demonstrates AI's growing capacity for scientific contribution and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be significant risks, including taxing overwhelmed review systems and adding noise to scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

13.
arXiv (CS.LG) 2026-06-11

A Riemannian Approach to Low-Rank Optimal Transport

arXiv:2606.12120v1 Announce Type: new Abstract: Low-rank optimal transport (OT) mitigates the quadratic scaling of classical solvers, yet existing approaches rely heavily on first-order mirror-descent updates that require careful hyperparameter tuning and ignore the optimization landscape's curvature. To address these limitations, we propose a unified Riemannian geometric framework for low-rank OT, modeling balanced and unbalanced rank-$r$ positive factored couplings as novel smooth embedded submanifolds of the positive orthant. By equipping these manifolds with the Fisher-Rao product metric, we derive tractable formulations for Riemannian projectors, retractions, and Hessian-vector products. Our cost-agnostic framework seamlessly extends to linear OT, Gromov-Wasserstein (GW), fused GW, and their unbalanced counterparts. For balanced OT, our geometric ingredients are computed via efficient conjugate-gradient and iterative Bregman updates. For the unbalanced OT, our operations elegantly reduce to closed-form scalings, completely eliminating inner iterative loops. In both regimes, per-iteration complexity scales linearly with dataset size, and we provide a rank-sufficiency certificate for global optimality verification. Extensive experiments across a range of problem sizes demonstrate that our regularization-free first- and second-order solvers achieve faster convergence and superior performance over existing state-of-the-art low-rank OT solvers.

14.
arXiv (CS.CV) 2026-06-12

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence – assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B$^\dagger$ on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones, through a single mechanism – an embedded caption competes with the image for attention and pulls the model's evidence onto its own text – whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

15.
arXiv (CS.LG) 2026-06-18

On the Residual Scaling of Looped Transformers: Stability and Transferability

arXiv:2606.18524v1 Announce Type: new Abstract: Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
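
To make the scaling argument concrete, the sketch below implements the factored rule $\varepsilon = \lambda/(N\sqrt{L})$ and a toy loop in which the shared block is the identity map, $f(h) = h$. That toy is an assumption chosen as the most extreme case of correlated updates and is not the paper's analysis. It gives $h_N = (1+\varepsilon)^N h_0$, which stays below $e$ for $\varepsilon = 1/N$ but grows roughly like $e^{\sqrt{N}}$ for $\varepsilon = 1/\sqrt{N}$.

```python
import math


def residual_scale(num_loops, num_layers, lam=1.0):
    """Factored residual scale: eps = lam / (N * sqrt(L))."""
    return lam / (num_loops * math.sqrt(num_layers))


def looped_growth(num_loops, eps):
    """Apply h <- h + eps * f(h) with the toy shared block f(h) = h.

    Starts from h = 1.0 and returns the value after num_loops iterations.
    """
    h = 1.0
    for _ in range(num_loops):
        h = h + eps * h
    return h


for n in (4, 64, 1024):
    stable = looped_growth(n, 1.0 / n)               # bounded by e
    unstable = looped_growth(n, 1.0 / math.sqrt(n))  # grows with n
    print(n, round(stable, 4), f"{unstable:.3e}")
```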

16.
arXiv (CS.LG) 2026-06-15

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions

arXiv:2605.05983v2 Announce Type: replace Abstract: Recently, steering vectors (SVs) have emerged as an effective and lightweight approach to steer behaviors of large language models (LLMs), among which fine-tuned SVs are more effective than optimization-free ones. However, current approaches to fine-tuned SVs suffer from two limitations. First, they require careful selection of steering factors on a per-SV basis to balance steering effectiveness and generation quality at inference time. Second, they operate as full-sequence SVs (FSSVs), which can sacrifice generation quality regardless of factor selection due to excessive intervention on the model generation process. To address the first limitation, we propose joint training of steering factors and directions, such that post-hoc factor selection is no longer required. Using neural network scaling theory, we find that moderately large initialization sizes and learning rates for steering factors are essential for stability and efficiency of joint training. To tackle the second limitation, we draw inspiration from representation fine-tuning and introduce Prompt-only SV (PrOSV), an SV that intervenes only on a few prompt tokens. Our empirical results show that PrOSV outperforms traditional FSSVs on AxBench when using our joint training scheme. We also find that PrOSV achieves a better tradeoff between general model utility and adversarial robustness than FSSV.

17.
arXiv (CS.CV) 2026-06-15

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

Spatial transcriptomics (ST) links gene expression with tissue morphology but remains expensive and low-throughput, motivating surrogates that infer expression from routine histology. Whole-slide H&E-to-ST inference pairs a gigapixel image with gene measurements at a sparse, irregular set of locations, making multiscale modeling challenging without incurring dense-grid overhead or quadratic token mixing. We propose HiST, a hierarchical sparse transformer that treats measured locations as a lattice-indexed sparse field and builds a dyadic encoder–decoder directly on the active tissue footprint. HiST combines sparse window attention for local geometric correspondence with resolution-changing operators for rapid multiscale context integration. For a fixed window size, the dominant runtime and memory scale with the number of observed locations rather than the dense slide area. To mitigate slide-specific acquisition variation, HiST adds a bottlenecked global conditioning pathway via a slide calibration token that summarizes slide-level context and conditions local representations. On a multi-organ benchmark spanning diverse tissues and acquisition sources, HiST improves predictive performance over recent baselines while reducing runtime and peak memory.

18.
arXiv (CS.LG) 2026-06-12

Physics-Aware Auxiliary Losses Improve Out-of-Distribution Generalization of a GNN Synthesizability Filter

arXiv:2606.12651v1 Announce Type: new Abstract: Machine-learning drug-discovery pipelines increasingly rely on generative models that propose molecules far from the data used to train downstream synthesizability filters. Existing filters (SAScore, SCScore, RAscore, DeepSA) are purely statistical and degrade in exactly this out-of-distribution (OOD) regime. We ask whether cheap, closed-form physical priors, used as auxiliary supervision on a graph neural network (GNN), improve OOD generalization. We add two auxiliary losses to a GINE backbone: a topological complexity regression supervised by the Bertz index, and a strain-energy soft penalty supervised by MMFF94 force-field energy. On a 65,177-molecule corpus (HIV, Tox21, COCONUT) labeled by SAScore thresholds we reproduce a strong in-distribution baseline, then evaluate a 4-way ablation (baseline / +complexity / +strain / +both) on a single-source OOD split (train on drug-like HIV+Tox21, test on COCONUT natural products), repeated over 5 seeds with paired bootstrap confidence intervals. All three physics-aware variants give a small but statistically significant OOD improvement over the baseline (mean OOD AUC 0.9774): +complexity Delta = +0.0060 (95% CI [+0.0023, +0.0102]), +strain Delta = +0.0032 ([+0.0008, +0.0052]), +both Delta = +0.0066 ([+0.0038, +0.0093]); every interval excludes zero, and the combination is best. The variants are indistinguishable in-distribution, so the effect is visible only under OOD evaluation. We are explicit that the effects are modest, and we report a cautionary methodological finding: a single-seed version of this experiment produced a qualitatively different (non-monotone) story that did not survive multi-seed evaluation.
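
A paired bootstrap confidence interval of the kind mentioned above can be sketched as follows. The abstract does not state what is resampled, so this version resamples paired per-seed AUC differences using a percentile interval. That choice, the function name, and the AUC values in the usage lines are all illustrative assumptions and not the paper's data.

```python
import random


def paired_bootstrap_ci(baseline, variant, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (variant - baseline)."""
    if len(baseline) != len(variant) or not baseline:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    diffs = [v - b for b, v in zip(baseline, variant)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Made-up per-seed OOD AUCs for a baseline and one variant.
base_auc = [0.975, 0.978, 0.977, 0.979, 0.976]
var_auc = [0.981, 0.983, 0.984, 0.982, 0.985]
mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(base_auc, var_auc)
```

Pairing matters because both models are trained with the same seeds and splits: resampling the differences removes seed-to-seed variance that the two runs share.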

19.
arXiv (CS.LG) 2026-06-12

Epistemic Uncertainty Is Not the Reducible Kind

arXiv:2606.12646v1 Announce Type: cross Abstract: The standard taxonomy of predictive uncertainty defines epistemic uncertainty as the part removable by collecting more data, while the standard measure identifies it with a mutual-information term. We prove the definition and the measure are extensionally inconsistent. On an explicit construction, the measure assigns all uncertainty to the epistemic class, yet no quantity of training data reduces it. Reducibility is instead a property of the pair (uncertainty, acquisition class), and the dichotomy resolves into three parts: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic uncertainty. An exact identity for the value of an observation shows that in-distribution data never reduces mechanism-irreducible uncertainty and generically increases it. Ensemble disagreement, the deployed epistemic estimate, tracks the training procedure rather than the epistemic term. It collapses to zero beneath a positive truth under consistent training, and equals hyperparameter-scaled initialization noise under interpolation. A finite-sample falsification test and seed-swept experiments confirm the theory.
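
For reference, the standard mutual-information measure that the abstract critiques can be computed from an ensemble as total predictive entropy minus the mean per-member entropy. The sketch below shows only that conventional decomposition, not the paper's three-part taxonomy. The function names and the toy ensembles are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def uncertainty_decomposition(member_probs):
    """Return (total, aleatoric, epistemic) for one input.

    member_probs: one predictive distribution per ensemble member.
    epistemic is the mutual-information term: the entropy of the averaged
    prediction minus the average entropy of the members.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(p[j] for p in member_probs) / m for j in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in member_probs) / m
    return total, aleatoric, total - aleatoric


# Members agree on a coin flip: all uncertainty is labeled aleatoric.
agree = uncertainty_decomposition([[0.5, 0.5], [0.5, 0.5]])
# Members are confident but contradict each other: all labeled epistemic.
disagree = uncertainty_decomposition([[1.0, 0.0], [0.0, 1.0]])
```

The first case is the one relevant to the abstract's claim about ensemble disagreement: when every member converges to the same prediction, the measured epistemic term is zero regardless of whether that shared prediction is right.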

20.
arXiv (CS.CV) 2026-06-15

RT-VLA: Real-Time Vision-Language-Action Models via Knowledge Distillation

Vision-Language-Action (VLA) models have shown strong potential for end-to-end autonomous driving by jointly modeling visual perception, language reasoning, explainability and action prediction. However, their large vision-language backbones and reasoning modules introduce substantial inference latency and thereby prevent their deployment in the unforgiving reality of the road networks. We propose RT-VLA, a lightweight, distilled VLA model that transfers the driving and reasoning capabilities of the state-of-the-art SimLingo model into a compact student through multi-level supervised distillation. RT-VLA preserves language-based reasoning and supports post-hoc explanation through offline language analysis of safety-critical driving moments without adding latency to real-time control. Compared to the SimLingo teacher, RT-VLA maintains competitive closed-loop driving and language reasoning performance while reducing inference time by 44.8X in vision-only mode and 7.9X in vision+language mode. These results suggest that supervised distillation is a practical approach for building real-time, explainable VLA-style autonomous driving models.

21.
arXiv (CS.CL) 2026-06-12

SupraBench: A Benchmark for Supramolecular Chemistry

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of Supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source codes and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

22.
arXiv (CS.CL) 2026-06-17

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English–Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three data sets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English–Chinese performance gap. Back-translation from standard Chinese into English often improves performance on the General aligned set, but produces mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.

23.
arXiv (CS.CV) 2026-06-12

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to an external CAD benchmark.

24.
arXiv (CS.CL) 2026-06-15

SANA: What Matters for QA Agents over Massive Data Lakes?

Exploratory question answering (EQA) over data lakes requires an LLM agent to discover relevant sources, analyze retrieved data, and adapt its actions based on intermediate results. End-to-end accuracy alone cannot distinguish failures in search, planning, data analysis, or the agent's Action Policy: its decisions about what to do next and when to submit an answer. We present SANA (Search Agent Navigation Ablation framework), a diagnostic ablation framework that transforms EQA tasks into runtime profiles containing gold source sequences, sanitized subquestions, and execution records. SANA uses these profiles to construct idealized search, planning, and data-analysis tools, allowing each component to be ablated; the residual gap is diagnostic evidence for policy failures. To illustrate SANA as a reusable evaluation framework, we adapt two recent EQA benchmarks, LakeQA and KramaBench, and evaluate lightweight and mid-sized agents under fixed prompts, budgets, data lakes, and runtimes. Across both benchmarks, data analysis is a consistent bottleneck while planning is less so. Search is a major limitation in LakeQA's large data-lake setting, but less so for the smaller-scale KramaBench. SANA thus decomposes end-to-end task accuracy into a diagnosis of where data-lake agents fail, and allows systematic comparison of progress in search, planning, data analysis, and agent design.

25.
PLOS Computational Biology 2026-06-22

Cell-type resolved transcriptional network analysis of in vivo cellular senescence following injury

Authors: Alda Sabalic, Victoria Moiseeva, Andres Cisneros, Oleg Deryagin, Eusebio Perdiguero, Pura Muñoz-Cánoves, Jordi Garcia-Ojalvo

Identifying the genetic correlates of complex phenotypes is a challenging task. Methods from the field of complex networks can help find such molecular patterns by revealing statistical associations among groups of genes that correlate with the phenotype. Here we study cellular senescence, a complex cell state whose molecular underpinnings are still under active investigation. We analyze cell type–resolved RNA sequencing data obtained from injured muscle tissue in mice, with a network-based approach that merges eigenvector centrality feature selection and community detection. Our analysis identifies genetic markers that had not previously been associated with senescence, which are validated with existing single-cell RNA sequencing data from a different type of tissue. The identified key genes belong to transcriptional pathways associated with established hallmarks of senescence, and thus can be interpreted as molecular correlates of such hallmarks. The method proposed here could be applied to any complex cellular phenotype even when only bulk RNA sequencing is available, provided the data is resolved by cell type.