Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.LG) 2026-06-18

Beyond Algorithms: Conceptual Innovation in Medical Imaging AI

arXiv:2606.19270v1 Announce Type: cross Abstract: Artificial intelligence has driven rapid progress in medical imaging research, producing increasingly sophisticated algorithms and steady improvements on benchmark tasks. However, this algorithm-centric trajectory has also revealed a growing imbalance: while computational methods advance rapidly, the conceptual foundations that define imaging tasks, evaluation metrics, and clinical meaning sometimes remain underexamined. In this Perspective, we distinguish algorithmic innovation, which focuses on improving computational implementations and performance within a fixed problem definition, from conceptual innovation, which reframes what problems are posed, how success is measured, and why an approach is clinically relevant. We argue that prevailing incentive structures, training pathways, and publication norms disproportionately reward algorithmic novelty, particularly for early-career researchers, while at times undervaluing conceptual contributions that are essential for scientific maturation and clinical translation. Through representative examples from medical imaging AI, we show how insufficient conceptual grounding can lead to misaligned objectives, fragile generalization, and limited real-world impact. We conclude with actionable recommendations for researchers, mentors, reviewers, and journals to better recognize, support, and integrate conceptual innovation alongside algorithmic advances.

02.
arXiv (math.PR) 2026-06-15

Stationary measures for higher spin vertex models on a strip

作者:

arXiv:2309.04897v2 Announce Type: replace-cross Abstract: We introduce a higher spin vertex model on a strip with fused vertex weights. This model can be regarded as a generalization of both the unfused six-vertex model on a strip arXiv:2212.09111 and an 'integrable two-step Floquet dynamics' model introduced in arXiv:1711.08884. We solve for the stationary measure using a fused version of the matrix product ansatz and then characterize it in terms of the Askey-Wilson process. Using this characterization, we obtain the limits of the mean density along an arbitrary down-right path. It turns out that all these models share a common phase diagram, which, after an appropriate mapping, matches the phase diagram of open ASEP. This provides evidence for the universality of this phase diagram.

03.
arXiv (CS.AI) 2026-06-17

From Noise to Order: Learning to Rank via Denoising Diffusion

arXiv:2602.11453v3 Announce Type: replace-cross Abstract: Learning-to-rank (LTR) methods have traditionally been limited to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. We propose an alternative denoising diffusion-based generative approach to LTR that instead models the full joint distribution over features and relevance labels. While in discriminative LTR, an over-parameterized ranking model may find different ways to fit the training data, we posit that candidate solutions that can explain the full data distribution under the generative setting maybe better at estimating relevance. Thus, we propose DiffusionRank that extends TabDiff, an existing diffusion model for tabular datasets, to create generative alternatives to classical discriminative pointwise and pairwise LTR objectives. Our work demonstrates improvements from DiffusionRank over discriminative counterparts on four standard LTR datasets and points to a rich space for future exploration to leverage ongoing advancements in deep generative models for LTR. Our code is publicly available at https://github.com/sadjadeb/DiffusionRank.

04.
arXiv (CS.CV) 2026-06-16

A biological vision inspired framework for machine perception of abutting grating illusory contours

Higher levels of machine intelligence demand alignment with human perception and cognition. Deep neural networks (DNN) dominated machine intelligence have demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy that misaligns with human perception patterns. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed by this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to make a step towards human-level intelligence for DNN-based models.

05.
arXiv (CS.CL) 2026-06-15

Hybrid Classical-Quantum Variational Autoencoder for Neural Topic Modeling

作者:

Neural topic models enable scalable semantic discovery, but their integration with quantum hardware remains largely unexplored. We present a proof-of-concept hybrid classical-quantum variational autoencoder (VAE) for topic modeling, embedding parameterized quantum circuits within the VAE inference network while retaining a classical topic-word decoder. To address the resource constraints of quantum hardware, we propose a modified Gaussian Softmax posterior that decouples latent space dimensionality from the number of topics to be extracted, enabling the model to operate with a low-resource 10-qubit quantum device. On the AgNews dataset, the hybrid VAE outperforms state-of-the-art neural topic models (NTMs), reaching a $C_v$ coherence score of 0.71 and an NPMI score of 0.20 while preserving high topic diversity. For comparison, we also construct a fully classical variant, which also outperforms state-of-the-art models on AgNews and exhibits clear class separation in the latent space. These results demonstrate that hybrid VAEs are computationally viable even on NISQ-era devices and represent a promising direction for quantum-enhanced topic modeling.

06.
Nature (Science) 2026-06-17

These ‘master’ proteins protect us from deadly mutations — and could inspire new drugs

作者:

Biology has clever ways to mask the effects of potentially harmful gene mutations. Scientists are investigating how this ‘buffering’ works — and how to exploit it. Biology has clever ways to mask the effects of potentially harmful gene mutations. Scientists are investigating how this ‘buffering’ works — and how to exploit it.

07.
arXiv (CS.AI) 2026-06-17

Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains

arXiv:2606.17269v1 Announce Type: new Abstract: In skill-constrained production-inventory systems, the qualified human capacity available tomorrow depends on training decisions made today: production requires certified workers, certifications decay unless maintained, and training consumes the same scarce worker hours that production needs now. We study a closed-loop skill-constrained model predictive controller that, at every shift, solves a finite-horizon mixed-integer program over production, inventory, backlog, and training, with binary predicted certification, hard production eligibility, and an interpretable terminal value that prices certified-capacity gaps at the horizon boundary; only the first-period action is applied before replanning. On synthetic, seed-controlled SkillChain-Gym scenarios - announced and surprise new-skill shocks, demand shocks, absenteeism, forecast- and availability-quality modes, capacity-boundary and training-rate sweeps, and negative controls - we evaluate the controller against production-only and maintenance-only ablations, static cross-training insurance plans, and a strong reactive heuristic, under an ex-ante locked configuration and paired statistics. The result is regime dependence, not superiority: no policy class dominates. Predictive control helps when skill or labor bottlenecks are forecastable early enough for training to complete; lean static insurance remains hard to beat under surprise shocks, near the demand-capacity boundary, and wherever pre-shock slack makes insurance cheap. Attribution ablations separate certification maintenance, re-acquisition of lapsed certifications, and greenfield skill acquisition. Forecastability, not adaptivity per se, decides when predictive control pays.

08.
medRxiv (Medicine) 2026-06-11

The impact of pre-stroke statin use on baseline corrected infarct volume and collateral perfusion

Stroke is a leading cause of disability and mortality worldwide, with ischaemic stroke the most prevalent type. Statins, used for cholesterol management, have demonstrated benefits in reducing stroke risk and improving outcomes in preclinical studies. However, the impact of pre-stroke statin use on stroke outcomes remain inconsistent. In this study, we aim to evaluate whether pre-stroke statin use is associated with greater volume of salvaged tissue and improved cerebral collateral perfusion. A retrospective analysis was conducted using data from 281 patients presenting with acute ischemic stroke to the John Hunter Hospital between May 2015 and May 2020. Patients were grouped based on pre-stroke statin use, and clinical variables, including infarct volume and collateral perfusion, were assessed. The primary outcome was salvage volume derived from baseline perfusion lesion volume minus infarct volume at follow-up. Collateral perfusion was measured by the hypoperfusion volume defined by delay time (DT)>6 seconds divided by the hypoperfusion volume defined by DT >2 seconds. Patients on statins at admission were significantly older and had more comorbidities. No significant association was found between pre-stroke statin use and salvage volume or collateral perfusion after adjusting for covariates. Larger initial infarct core was a significant predictor of salvage volume due to larger salvageable tissue volume at baseline. These findings indicate that pre-morbid statin use is not associated with larger salvage volume or improved cerebral collateral perfusion.

09.
arXiv (CS.LG) 2026-06-16

Learning the Geometry of Data: A Mathematical Review of Shape Space Analysis

arXiv:2606.17022v1 Announce Type: cross Abstract: A central objective of machine learning is to identify structure and patterns in data. Advances in data acquisition have increasingly produced datasets whose observations possess rich geometric form, giving rise to shape spaces that encode variability in object geometry. Such datasets arise across a wide range of disciplines, including biology, medicine, anthropology, and computer vision, where subtle geometric differences often carry important scientific information. Traditional machine learning methods, however, are frequently ill-equipped to account for the nonlinear geometric structure underlying these data. This survey synthesizes a rapidly growing body of work on shape space analysis, which provides a mathematical and computational framework for the study of geometric data. Drawing on ideas from differential geometry, statistics, and machine learning, we organize the literature around a common analytical pipeline: shape representation and parameterization, the rigorous construction of robust geodesic metrics, statistical analysis on shape spaces, and geometry-aware learning methods. We discuss how these tools enable the characterization of shape variability, the comparison of geometric objects, and the analysis of structural trajectories across populations and time. To illustrate the breadth of the field, we highlight applications spanning multiple scales of biological organization, including studies of subcellular morphology and primate tooth evolution. Across these and many other domains, researchers face common challenges arising from complex, nonlinear, and often unaligned geometric variation. The review concludes by identifying key theoretical and computational challenges, as well as emerging opportunities driven by increasingly large and diverse geometric datasets.

10.
medRxiv (Medicine) 2026-06-22

Multisite Real-World Validation of an Electronic Health Record-Integrated Generative Artificial Intelligence Tool for Venous Thromboembolism Risk Stratification

Background: Guiding risk-appropriate inpatient thromboprophylaxis requires venous thromboembolism (VTE) risk stratification; however, reliable risk determination remains inconsistent in routine care. Health systems increasingly pilot artificial intelligence (AI) tools, yet few studies demonstrate rigorous evaluation in the context of a learning health system (LHS). We evaluated the performance of a pilot electronic health record (EHR)-integrated generative AI (GenAI) system, inHealth General Reasoner (iHGR), for VTE risk stratification versus clinician order set classifications and physician-adjudicated chart review. Methods: This multisite retrospective validation study included adult inpatient admissions at Johns Hopkins Medicine between June 21, 2025, and Dec 18, 2025 (checklist-based order set from June 21, 2025 - November 19, 2025, and clinician judgement-based order set from November 29 - December 18, 2025). From 758 eligible admissions, we randomly sampled 500 balanced by site and order set periods. iHGR and clinician-selected order set classifications were compared with the reference standard (RS). Primary outcomes were iHGR sensitivity and specificity. Secondary analyses compared the order sets with the same RS to evaluate workflow comparators and error patterns. Results: iHGR achieved 81.8% sensitivity (95% CI 77.3-85.6) and 70.9% specificity (63.6-77.3). The checklist-based order set had 61.3% sensitivity (53.7-68.5) and 86.2% specificity (77.4-91.9). The clinician judgement-based order set had 78.1% sensitivity (71.3-83.7) and 65.4% specificity (54.3-75.0). False-negative iHGR classifications were associated with missed narrative risk factors. Conclusion: iHGR showed higher sensitivity for VTE risk than checklist-based order sets and clinician judgement without introducing systematic bias. In silico evaluation of pilot AI systems within LHSs can identify clinically important performance trade-offs and implementation targets before operational scale-up. Narrative clinical data abstraction remained a key limitation, supporting the use of GenAI to support rather than supplant clinician judgement.

11.
arXiv (CS.CV) 2026-06-12

Radar-Guided Polynomial Fitting for Metric Depth Estimation

We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.

12.
arXiv (math.PR) 2026-06-16

Stochastic control with dividend payments and capital injections for Markov additive processes

作者:

arXiv:2604.00190v4 Announce Type: replace Abstract: Motivated by de Finetti's optimal dividend problem with capital injections, we study a stochastic control problem for the additive component of a Markov additive process (MAP). In contrast to previous studies, the modulating component is allowed to be a general right process on a Radon space, so the model is not restricted to finite-state regime switching and cannot in general be reduced to a finite collection of Lévy process control problems. Capital injections are allowed at arbitrary times. We first consider the case in which dividend payments are allowed only at prescribed discrete times and establish necessary and sufficient conditions for the optimality of a strategy. These conditions then yield the optimality of a class of Markov-modulated periodic–classical barrier strategies. Combining this optimality result with an approximation argument, we obtain insight into the possible form of optimal strategies in the case where dividend payments, like capital injections, may be made at arbitrary times. Because of the generality of the MAPs considered here, the proof techniques used in previous studies of similar problems are not directly applicable. We therefore develop an alternative argument based on the additive structure of MAPs and dynamic programming between dividend opportunities. The argument also suggests a possible approach to other stochastic control problems involving general MAPs.

13.
bioRxiv (Bioinfo) 2026-06-11

Pillbox: A Leakage-Aware Foundation-Model Predictor and Lineage-Ceiling Diagnostic for Cancer Drug Response

We present Pillbox, a predictor whose pipeline is audited against the six Asiaee leakage modes with the one residual pathway shown by per-fold ablation to be non-load-bearing on hard splits. Our model combines CpGPT methylation embeddings, CLAMP drug embeddings, and per-fold-fit gene-expression principal components which are fused by Feature-wise Linear Modulation (FiLM)-conditioned graph attention on the STRING v12 protein-protein interaction graph. Then we alpha-ensemble the model against a histogram-based gradient boosting regressor baseline. On GDSC GSE68379 (987 cell lines, 375 drugs) across seeds 42, 7, and 123, the ensemble reaches test R-Squared of 0.78, 0.77, and 0.76 on random, histology-blind, and site-blind splits respectively, with cell-aware lifts above the drug-mean floor of +0.054, +0.060, and +0.037. As a quantitative diagnostic for feature-stack saturation we propose the cross-architecture residual correlation, calibrated against a same-architecture-different-initialization control. On histology-blind splits the cross-architecture value of 0.939 falls short of the same-architecture ceiling of 0.974 by approximately 0.03 in residual correlation, a gap we interpret as the headroom available to architecture choice on top of the current foundation-model representation and consistent with the long-established observation that tissue lineage dominates cell-line drug response. We integrated curated mutation, methylation, and drug-target-expression channels, but these do not improve prediction once foundation-model embeddings are in place. Cross-screen validation against PRISM matches the GDSC-to-PRISM measurement reproducibility ceiling within 0.01 Spearman.

14.
arXiv (quant-ph) 2026-06-16

Analytical solution of the Schr\"{o}dinger equation with $1/r^3$ and attractive $1/r^2$ potentials: Universal three-body parameter of mixed-dimensional Efimov states

arXiv:2601.19517v2 Announce Type: replace-cross Abstract: We study the Schr\"{o}dinger equation with $1/r^3$ and attractive $1/r^2$ potentials. Using the quantum defect theory, we obtain analytical solutions for both repulsive and attractive $1/r^3$ interactions. The obtained discrete-scale-invariant energies and wave functions, validated by excellent agreement with numerical results, provide a natural framework for describing the universality of Efimov states in mixed dimension. Specifically, we consider a three-body system consisting of two heavy particles with large dipole moments confined to a quasi-one-dimensional geometry and resonantly interacting with an unconfined light particle. With the Born-Oppenheimer approximation, this system is effectively reduced to the Schr\"{o}dinger equation with $1/r^3$ and $1/r^2$ potentials, and manifests the Efimov effect. Our analytical solution suggests that, for repulsive dipole interactions, the three-body parameter of the mixed-dimensional Efimov states is universally set by the dipolar length scale, whereas for attractive interactions it explicitly depends on the short-range phase. We also investigate the effects of finite transverse confinement and find that our analytical results are useful for describing the Efimov states composed of two polar molecules and a light atom.

15.
arXiv (CS.CL) 2026-06-16

Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.

16.
arXiv (quant-ph) 2026-06-15

Conditional squeezing induced by a two-level system: arbitrary-time Magnus coefficients in the quantum Rabi model

arXiv:2508.03506v5 Announce Type: replace Abstract: We present a systematic Magnus expansion treatment of the quantum Rabi model beyond the Rotating Wave Approximation. We show that at the second order of Magnus series, the second-order evolution operator contains a term that induces conditional squeezing of the field mode depending on the state of the atom, in addition to the energy shifts. We analyze the scaling behavior of the conditional squeezing coefficient for $^{87}\mathrm{Rb}$ $5^2S_{1/2}\rightarrow5^2P_{1/2}$ transition line and show that the slow envelope of the squeezing coefficient is maximized at half-detuning cycles, and that it scales with $\frac{4g^2}{\omega_0|\Delta|}$. We also show that the quadrature squeezing angle suggests a possible route towards quantum non-demolition readouts, while further investigation is required for a full first-order suppression. We then connect our work to the well-studied AC-Stark shift and Bloch-Siegert shift using the effective Hamiltonian theory. Finally, we show how the energy shifts and the conditional squeezing arise, as a whole $\mathrm{SU}(1,1)$ algebra, and how they can be disentangled as individual unitary evolutions.

17.
arXiv (CS.AI) 2026-06-19

RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning

arXiv:2606.20142v1 Announce Type: new Abstract: This paper introduces RACL, a Reasoning-Agent Control Layer for metaheuristics. RACL places a reasoning agent above an existing optimizer. The agent does not replace the optimizer and does not modify business constraints. Instead, it controls the optimizer's internal search behavior by observing operational memory, reasoning over past behavior, formulating bounded hypotheses, testing interventions, evaluating outcomes, applying guardrails, consolidating useful policies and explaining its decisions. The experiment uses vehicle routing as a testbed, but the contribution is not a new routing solver, a particular ALNS configuration or a specific set of routing rules. The contribution is the RACL method: a way for a reasoning agent to discover, validate, consolidate and explain algorithmic control rules for a metaheuristic. In the current experimental setting, RACL improves or ties the Operational Memory Policy in 21 of 21 feasible cases and improves or ties a non-reasoning Stagnation-Triggered Policy in 18 of 21 feasible cases, with an average RACL vs STP cost delta of -0.641%. In the Sevilla-9/10 runtime sample, RACL improves average cost by -8.337% versus Fixed and -1.605% versus STP without showing material computational overhead. During the proof-of-concept, Codex was used as an in-the-loop reasoning agent observing executions, interpreting logs and proposing live bounded interventions. The policy proxy was later used only to make quantitative evaluation reproducible.

18.
arXiv (CS.LG) 2026-06-15

MOSIC: Model-Agnostic Optimal Subgroup Identification with Multi-Constraint for Improved Reliability

arXiv:2504.20908v3 Announce Type: replace Abstract: Current subgroup identification methods typically follow a two-step approach: first estimate conditional average treatment effects and then apply thresholding or rule-based procedures to define subgroups. While intuitive, this decoupled approach fails to incorporate key constraints essential for real-world clinical decision-making, such as subgroup size and propensity overlap. These constraints operate on fundamentally different axes than CATE estimation and are not naturally accommodated within existing frameworks, thereby limiting the practical applicability of these methods. We propose a unified optimization framework that directly solves the primal constrained optimization problem to identify optimal subgroups. Our key innovation is a reformulation of the constrained primal problem as an unconstrained differentiable min-max objective, solved via a gradient descent-ascent algorithm. We theoretically establish that our solution converges to a feasible and locally optimal solution. Unlike threshold-based CATE methods that apply constraints as post-hoc filters, our approach enforces them directly during optimization. The framework is model-agnostic, compatible with a wide range of CATE estimators, and extensible to additional constraints like cost limits or fairness criteria. Extensive experiments on synthetic and real-world datasets demonstrate its effectiveness in identifying high-benefit subgroups while maintaining better satisfaction of constraints.

19.
arXiv (CS.LG) 2026-06-15

Detecting Lookahead Bias in LLM Forecasts

arXiv:2512.23847v2 Announce Type: replace-cross Abstract: We develop a statistical procedure to detect lookahead bias in economic forecasts generated by large language models (LLMs). Using a date-only recall query for a firm-date pair, we estimate the probability that the LLM has internalized information about the realized outcome, a statistic we term Lookahead Propensity (LAP). LAP is materially positive throughout the in-sample period and collapses essentially to zero right after the training-data cutoff. We show that a positive interaction between LAP and the LLM forecast in an accuracy regression indicates lookahead-bias contamination, and apply the test to two forecasting tasks: news headlines predicting stock returns and earnings call transcripts predicting capital expenditures. In both applications, the LLM forecast's predictive power is amplified on high-LAP firm-date pairs, and the interaction loses significance on post-training-cutoff samples. Our test provides a cost-efficient, diagnostic tool for assessing the validity and reliability of LLM-generated forecasts.

20.
arXiv (CS.AI) 2026-06-16

Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

arXiv:2606.16330v1 Announce Type: new Abstract: Disruption recovery in industrial assembly lines requires timely decisions under machine faults, worker absence, and emergency orders. Existing methods either rely on rigid handcrafted recovery logic or learn adaptive policies that do not readily exploit heterogeneous external recovery knowledge at decision time to reduce abnormal recovery time (ART) and preserve on-time delivery (OTD). To address this gap, we propose a phase-aware guidance injection framework that augments a trained recurrent MAPPO (RMAPPO) scheduling policy through logit-level action bias during evaluation. The framework provides a unified decision-time interface for rule-based, replay-based, and online LLM-based guidance, while activating intervention only during abnormal and recovery phases. Experiments on a custom AssemblyLineEnv show that high-quality rule guidance yields the strongest gains, replay-based guidance degrades smoothly under imperfect availability, and online LLM guidance still provides useful intermediate improvements. These results show that decision-time guidance injection can exploit heterogeneous recovery hints without redesigning the actor.

21.
arXiv (CS.AI) 2026-06-19

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

arXiv:2606.20227v1 Announce Type: new Abstract: Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.

22.
arXiv (CS.CV) 2026-06-18

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals. That is, projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD); a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations of the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

23.
arXiv (CS.AI) 2026-06-16

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

arXiv:2606.00288v2 Announce Type: replace Abstract: Large language models are undergoing a transition from model technology to system technology. Engineering challenges like cache reuse, context capacity, agent scheduling, and permission control resemble classical computer systems problems. This raises a question: if we treat the LLM as a CPU, KV cache as processor cache, context window as main memory, and agent framework as an operating system, can decades of computer architecture wisdom guide next generation model native systems? This paper pursues this analogy as a visionary survey. We map computer architecture concepts onto the emerging model native stack, survey literature across LLM as OS, memory management, agent frameworks, tool protocols, multi agent coordination, cognitive architectures, and safety governance, finding that each addresses a different layer without a unifying model. We propose the Intelligent Computing Architecture (ICA): six functional layers with interface contracts and design axioms. We resolve the tension over whether the LLM resembles a CPU or OS via a dual plane architecture a probabilistic execution plane (what can be computed) and a deterministic control plane (what should be computed), with every layer passing through as a graded crossover. We propose three Amdahl style design heuristics Semantic Locality, Context Budget, and Agent Speedup as organizing back of envelope models, illustrate their parameter ranges with published data, and identify predictive validation as the principal open task. We articulate analogy boundaries, note differences between silicon and model era architectures, and propose a research roadmap. This is a conceptual and survey contribution with no new experimental results.

24.
arXiv (CS.CL) 2026-06-17

Rethinking Groups in Critic-Free RLVR

Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the ``group'' and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.

25.
arXiv (CS.CL) 2026-06-12

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.