Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.CV) 2026-06-24

Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition

The incorporation of additional modalities into action recognition models increases their performance across a wide range of settings. However, how this additional information can contribute to making the models more robust remains underexplored, particularly for the case of multi-modal out-of-distribution (OOD) detection. While methods exist that regularize the multi-modal training process with OOD detection in mind, they still apply off-the-shelf OOD detectors designed for the uni-modal case during inference, discarding important information. Based on an interesting relationship we find between the multi-modal and uni-modal predictions, we propose to use this signal to build a post-hoc detector explicitly designed for the multi-modal scenario. We combine this new source of information with a feature-space score, which detects off-manifold samples in the multi-modal space, and normalize them by the multi-modal logits. In doing so, the proposed hybrid detector is compatible with existing training-time approaches and consistently improves performance. Experiments on a wide range of established datasets from the MultiOOD benchmark show that, on average, our approach outperforms the state of the art. Our results show the importance of explicitly considering the different modalities at inference time for multi-modal OOD detection.

02.
arXiv (CS.CV) 2026-06-15

Morphology-Aware Sample Assignment: Overcoming IoU Insensitivity for Surface Defect Detection

Intersection-over-Union (IoU), as a pivotal metric for evaluating the spatial alignment between candidate proposals and ground-truth annotations, directly determines the quality of positive sample sets and the training efficacy of visual detection models. Through theoretical modeling and analysis, we uncover a non-sensitive region on the IoU response curve, within which samples yield nearly identical IoU scores despite distinct geometric overlaps. To overcome this limitation, we introduce a set of morphological similarity metrics covering area, shape, and aspect ratio, to refine the positive sample assignment process, thereby ensuring more discriminative and reliable matching. A supplementary matching score is derived via mean-based aggregation of these multidimensional similarities, compensating for the intrinsic limitation of IoU in representing structural correspondence. Theoretically, incorporating morphological similarity reshapes the response distribution of the matching function, yielding both effective directional gradients and polygon-like iso-response contours, which tightly confine high-response regions around each ground-truth instance and substantially enhance the precision of positive sample selection. Experiments based on the YOLOv9 framework demonstrate consistent performance gains on both NEUDET and GC10- DET datasets. Notably, the proposed approach is fully plug-and-play and incurs zero additional inference overhead, thereby ensuring deployment efficiency for industrial visual inspection.

03.
arXiv (math.PR) 2026-06-16

High-Order Talagrand and Eldan–Gross Inequalities via Besov-Type Variance Functionals

arXiv:2606.14876v1 Announce Type: new Abstract: By introducing high-order Besov-type variance functionals that generalize the canonical variance, we develop a unified framework for proving high-order Talagrand-type inequalities that relate high-order energies to Fourier weights. Applying this machinery, we establish high-order Poincaré-type, $L^p$–$L^q$, isoperimetric-type, Falik–Samorodnitsky and Eldan–Gross inequalities, all with explicit constants, in both the Boolean and Gaussian settings. Fundamentally, our semigroup-based framework relies primarily on hypercontractivity and high-order Bismut-type derivative estimates, and is broadly applicable.

04.
arXiv (CS.CL) 2026-06-11

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness, that require no learned parameters. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$–$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, Platonic validity: the spectral signal tracks logical coherence rather than compiler acceptance, proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, architectural determinism: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$–$6.6$\%, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.

05.
arXiv (CS.AI) 2026-06-17

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

arXiv:2606.17546v1 Announce Type: new Abstract: Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

06.
arXiv (quant-ph) 2026-06-19

Stalls and Spequlation: Pipelined Execution for Fault Tolerant Quantum Computation

arXiv:2606.19593v1 Announce Type: new Abstract: Fault-tolerant quantum computation requires the coordinated action of three distinct systems: classical control logic, quantum hardware, and classical error decoders. Current scheduling models treat logical operations as atomic, hiding the fact that these subsystems operate sequentially and spend significant time idle. We present a pipelined execution framework that decomposes each logical operation into its component stages i.e. Control, Execute, and Decode. Building on this, we discuss some speculation strategies that allow successor operations to begin processing before their predecessors have completed decoding. We evaluate our framework on several common benchmarks and show that pipelining with speculation reduces total pipeline steps by 20-40% compared to a no-speculation baseline. The most aggressive strategy consistently outperforms conservative alternatives, even though partial rollback is needed at times, because the per-rollback penalty is small relative to the parallelism gained. We further show that speculation facilitates load balancing by distributing work more evenly across the heterogeneous subsystems of a fault-tolerant quantum computer, converting idle time into useful computation while also saving on execution time.

07.
arXiv (CS.CL) 2026-06-16

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.

08.
Nature (Science) 2026-06-09

A unicellular relative links aggregative multicellularity to animal origins

作者:

How animals evolved complex multicellularity from their unicellular ancestors remains unanswered. Unicellular relatives of animals exhibit simple multicellularity through clonal division, formation of multinucleate coenocytes, or aggregation. 1 Therefore, animal multicellularity may have evolved from one (or a combination) of these behaviours. Aggregation has classically been dismissed as a means to complex multicellularity. 2 However, aggregation occurs in many extant animal cells and has also been recently described in three close unicellular relatives of animals (the choanoflagellates Salpingoeca rosetta and Choanoeca flexa, and the filasterean Capsaspora owczarzaki). 3-5 It is unclear whether aggregation in these species is derived or ancestral, and its relevance for animal origins remains unknown. To fill this gap, we investigated whether an additional close unicellular relative of animals can undergo aggregation. We discovered that the marine free-living bacterivorous filasterean Ministeria vibrans 6 forms homogeneous aggregates with reproducible kinetics that have long-term stability, and that improved feeding and mating may be evolutionary drivers of this aggregation. Notably, we found that homologs of many animal multicellularity genes involved in cell adhesion, signalling, and transcriptional regulation were deployed during the aggregation process, indicating that they may have been used for aggregation in the unicellular ancestors of animals before being co-opted into animal multicellular development. Thus, our results imply that aggregative multicellularity was key to the development of the multicellular animal genetic toolkit.

09.
arXiv (CS.CV) 2026-06-11

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

Temporal grounding–returning the interval $[t_s, t_e]$ for a natural-language query over a video–is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but–given a natural-language query–by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM–mirroring retrieve-then-read in open-domain QA.

11.
arXiv (CS.LG) 2026-06-16

HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity

arXiv:2606.16863v1 Announce Type: new Abstract: Evaluation of spatiotemporal point process (STPP) models relies heavily on opaque real-world datasets, where latent generative structure is unknown and model failures are difficult to attribute. We introduce HawkesNest, a generator-aligned benchmark for controlled spatiotemporal pattern complexity built on a multivariate Hawkes backbone. HawkesNest defines four complexity axes: space–time entanglement, background heterogeneity, cross-type interaction, and domain topology. Each axis is associated with a deterministic index computed from the latent data-generating mechanism. By varying these axes while holding global rate, stability, and simulation budget fixed, HawkesNest enables diagnostic stress tests of STPP models under known structural difficulty. We verify that the indices are monotone and nearly orthogonal under controlled sweeps. We illustrate its use by showing that Hawkes-family baselines degrade under joint heterogeneity–entanglement complexity, even though they are structurally aligned with the Hawkes data-generating backbone. We further show that HawkesNest exposes neural-model sensitivity: AutoSTPP remains vulnerable under isolated increases in space–time entanglement. Code. Available at https://github.com/YahyaAalaila/HawkesNest

12.
arXiv (CS.AI) 2026-06-25

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

arXiv:2606.26080v1 Announce Type: cross Abstract: Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term progress advantage – log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three different applications: test-time scaling, uncertainty quantification, and failure attribution on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses on characteristics of progress advantage, offering practical guidance for adoption in real-world agentic systems.

13.
medRxiv (Medicine) 2026-06-22

Referral pathways, ETAT triage acuity, and inpatient outcomes among children presenting to a national tertiary paediatric emergency unit in Ghana: a prospective cohort study

Emergency referral systems in sub-Saharan Africa are fragmented, and children reaching tertiary facilities through different referral pathways often arrive in advanced clinical states. Prospective data simultaneously characterising referral patterns, triage acuity at presentation, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit are lacking from West Africa. This prospective cohort study enrolled 675 consecutively presenting children aged one month to 12 years at the Paediatric Emergency Unit of Korle Bu Teaching Hospital, Accra, Ghana, from February to December 2019. The primary outcome was all-cause inpatient mortality. Key variables collected included referral status and facility tier, Emergency Triage Assessment and Treatment (ETAT) triage category, ICD-10 diagnostic classification, Oyedeji socioeconomic classification, and time from symptom onset to PEU registration. Crude odds ratios were computed for all candidate predictors. Multivariable logistic regression was conducted using complete case analysis (n = 613). Of 675 children, 63.0% (n = 425) were referred from another health facility; referred children had higher ETAT emergency triage category rates than self-presenting children (32.7% vs 27.6%, p < 0.001). Overall inpatient mortality was 9.9% (67/675). Mortality varied by referral source: 16.7% among secondary/regional hospital referrals, 11.0% among lower-tier facility referrals (district, municipal, CHAG, polyclinic, private, health centre, and maternity home facilities combined, n = 356), 7.6% among self-presenting children, and 7.4% among tertiary referrals. Overall, 30.8% of children were classified as ETAT emergencies on arrival, with case fatility rate of 21.6%. The three most common diagnostic domains were respiratory conditions (17.2%), blood and haematological disorders (17.0%), and digestive presentations (16.4%). Inpatient mortality was highest in neoplastic disease (33.3%, n = 30) and circulatory presentations (31.0%, n = 29). In the primary multivariable analysis (n = 613, 51 events; events-per-variable ratio 4.2), no referral tier was independently associated with inpatient mortality after adjustment. Referral from secondary/regional hospitals showed a borderline non-significant association (adjusted odds ratio 3.09, 95% CI 0.96 to 9.90, p = 0.058). School going children (60-119 months) had higher odds of inpatient death than infants (adjusted odds ratio 5.56, 95% CI 1.16 to 26.53, p = 0.032), as did adolescents (adjusted odds ratio 10.01, 95% CI 2.15 to 46.69, p = 0.003). ETAT emergency category and lower socioeconomic status were not independently significant in this model. A pre-specified sensitivity analysis using the full analytic cohort (n = 674, events-per-variable ratio 6.7) with collapsed referral categories did not confirm any referral tier association; ETAT emergency category and lower SES were independently associated in the sensitivity model. All multivariable estimates should be regarded as exploratory. This prospective cohort provides simultaneous characterisation of referral patterns, ETAT triage acuity, diagnostic case mix, and inpatient mortality at a national tertiary paediatric emergency unit in West Africa. The referral-mortality gradient and high ETAT emergency category proportion document the severity of illness arriving through different referral pathways at this facility. The association between secondary/regional hospital referral and inpatient mortality is hypothesis-generating and requires replication in an adequately powered multicentre study before any service-level conclusions can be drawn.

14.
arXiv (quant-ph) 2026-06-16

3D Ising criticality with Platonic lattice superconducting qubits

arXiv:2606.16854v1 Announce Type: new Abstract: The three-dimensional (3D) Ising model is a foundational model in statistical physics and critical phenomena, yet its analytical intractability has long impeded the precise determination of universal critical exponents. While high-precision estimates have been obtained through classical numerical methods and conformal bootstrap techniques, a direct quantum simulation of the 3D Ising criticality remains challenging, requiring nontrivial connectivity, sufficient system size, and high spectral resolution. In this work, assisted by the state-operator correspondence of conformal field theory, we perform a digital quantum simulation of the 3D Ising critical exponents using a multiply-connected 9-qubit superconducting quantum processor with a Platonic lattice geometry. Employing an extended variational quantum eigensolver equipped with a phase-based loss function, we variationally prepare the low-energy eigenstates of the transverse-field Ising model on a cubic Platonic lattice encoded in an 8-qubit register. The four lowest eigenenergies are extracted via Fourier-transform analysis and high-precision numerical fitting, agreeing with the exact diagonalization values up to +/- 0.001. The resulting scaling dimension Delta_epsilon = 1.5850 and critical exponent nu = 0.7067 match well with theory.

15.
medRxiv (Medicine) 2026-06-24

Topical fresh Taraxacum mongolicum wet dressing as an adjunct to ceftriaxone for localized skin and soft tissue infections: A single-center assessor-blinded randomized controlled trial

Background: Localized skin and soft tissue infections may need systemic antibacterials, but local inflammation can delay symptom recovery. We evaluated whether topical fresh Taraxacum mongolicum wet dressing added to ceftriaxone was associated with short-term benefit in selected clinically stable adults. Methods: In this single-center, assessor-blinded, three-arm randomized trial, 180 adults aged 18-74 years were randomized 1:1:1 to topical T. mongolicum plus intravenous ceftriaxone, topical T. mongolicum alone, or ceftriaxone alone for 7 days. The primary outcome was day-7 clinical response assessed by blinded independent assessors using prespecified global clinical improvement criteria. Analyses followed the intention-to-treat principle; sensitivity analyses assessed robustness. Results: Day-7 clinical response rates were 91.67% (55/60), 76.67% (46/60), and 68.33% (41/60) in the combined, T. mongolicum, and ceftriaxone groups, respectively (overall P = 0.006). Compared with ceftriaxone alone, combined therapy had a higher response rate (risk difference, 23.3 percentage points; 95% CI, 9.6 to 37.0; risk ratio, 1.34; 95% CI, 1.11 to 1.62). Sensitivity analyses were directionally consistent. Secondary outcomes and bacterial clearance favored the combined group. No serious adverse events were reported. Conclusions: In selected clinically stable adults with localized skin and soft tissue infections, adjunctive topical fresh T. mongolicum plus ceftriaxone was associated with improved short-term outcomes compared with ceftriaxone alone. Findings require cautious interpretation because this was a single-center, partially blinded trial without a placebo dressing control. The dressing should not replace antibiotics, drainage, or urgent care when indicated. Trial registration: International Traditional Medicine Clinical Trial Registry, ITMCTR2026000549.

16.
arXiv (CS.LG) 2026-06-25

Minimax PAC Bounds for Learning in Exogenous Contextual MDPs

arXiv:2606.25170v1 Announce Type: cross Abstract: We study PAC learning in tabular discounted Markov decision processes with exogenous i.i.d. contexts, with discount factor $\gamma$, finite state space $\mathcal X$, action space $\mathcal A$, and context space $\mathcal Z$. At each time step, a context is drawn independently from an unknown distribution $\mu$ and revealed before the agent acts. This context may affect both rewards and transitions, while remaining uncontrolled by the agent. Depending on the regime, the learner has access either to a sampling oracle for $\mu$, to a sampling oracle for the transition kernel conditioned on state-context-action tuples, or to both. Oracles can be accessed before and during policy execution. The sample complexity is measured by a couple $(n,m)$, where $n$ is the number of calls to the sampling oracles before execution and $m$ is the number of calls to the sampling oracles during execution. When rewards and transitions are known and only the context distribution $\mu$ is sampled, we give a variance-reduced algorithm that solves policy evaluation (PE), best-value estimation (BVE), and best-policy extraction (BPE) with $\left(\widetilde O\left(1/((1-\gamma)^3\varepsilon^2)\right), 0 \right) $ sample complexity. The rate is independent of $|\mathcal Z|$ and minimax optimal up to logarithmic factors. As a corollary, we also obtain tight rates in the case of one-step perfect look-ahead, improving upon the existing guarantees. In the fully unknown regime, where both $\mu$ and P must be learned, we show that PE remains $|\mathcal Z|$-free, with matching upper and lower bounds $\bigl(\widetilde O(|\mathcal X|/((1-\gamma)^3\varepsilon^2)),\, \widetilde O(1/((1-\gamma)^2\varepsilon^2))\bigr)$.

17.
arXiv (CS.LG) 2026-06-16

Bridging data-driven priors via the score function for posterior sampling – Comparative review and experimental study

arXiv:2606.14800v1 Announce Type: cross Abstract: This paper reviews how a diverse set of popular data-driven priors commonly used in Bayesian inverse problems can be unified through their respective score functions. By framing these priors under this common perspective, we show that they can benefit from their straightfoward and effective integration into a recently proposed sampling algorithm. The applicability of this common framework is illustrated by considering several data-driven priors, namely regularization-by-denoising, normalizing flow-based priors, score-based generative models, and convex-ridge regularizers. For these four particular priors, the performance of the method is evaluated when conducting image inpainting and single image super-resolution. These results, as well as those obtained when restoring real images acquired in a geological context, demonstrate the efficiency of the method. This unified framework proves versatile enough to handle any posterior distribution defined by a broad class of score function-based priors, beyond the specific cases considered in this paper.

18.
arXiv (CS.CL) 2026-06-25

Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

Building trustworthy AI systems for mental health support is a shared priority across stakeholders from multiple disciplines. However, "trustworthy" remains loosely defined and inconsistently operationalized. AI research often focuses on technical criteria (e.g., robustness, explainability, and safety), while therapeutic practitioners emphasize therapeutic fidelity (e.g., appropriateness, empathy, and long-term user outcomes). To bridge the fragmented landscape, we propose a three-layer trust framework, covering human-oriented, AI-oriented, and interaction-oriented trust, integrating the viewpoints of key stakeholders (e.g., practitioners, researchers, regulators). Using this framework, we systematically review existing AI-driven research in mental health domain and examine evaluation practices for ``trustworthy'' ranging from automatic metrics to clinically validated approaches. We highlight critical gaps between what NLP currently measures and what real-world mental health contexts require, and outline a research agenda for building socio-technically aligned and genuinely trustworthy AI for mental health support.

19.
arXiv (CS.CV) 2026-06-17

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLMbased evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

20.
arXiv (CS.CV) 2026-06-15

LiAuto-GeoX: Efficient Grounded Driving Transformer

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These all demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

21.
arXiv (CS.CL) 2026-06-17

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

22.
arXiv (math.PR) 2026-06-17

Convergence rate of Euler–Maruyama scheme to the invariant probability measure under total variation distance for the SDEs

arXiv:2505.04218v3 Announce Type: replace Abstract: This article shows the geometric decay rate of Euler-Maruyama scheme for one-dimensional stochastic differential equation towards its invariant probability measure under total variation distance. Firstly, the existence and uniqueness of invariant probability measure and the uniform geometric ergodicity of the chain are studied through introduction of non-atomic Markov chains. Secondly, the equivalent conditions for uniform geometric ergodicity of the chain are discovered, by constructing a split Markov chain based on the original Euler-Maruyama scheme.

23.
arXiv (CS.LG) 2026-06-11

Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

arXiv:2512.11081v2 Announce Type: replace-cross Abstract: Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are often required, rather than global scores summarizing overall feature importance. Random Forests (RFs) are widely used in these settings, and existing interpretability methods typically exploit tree structures and split statistics to provide model-specific insights. However, theoretical understanding of local FII methods for RF remains limited, making it unclear how to interpret high importance scores for individual predictions. We propose a novel, local, model-specific FII method that identifies frequent co-occurrences of features along decision paths, combining global patterns with those observed on paths specific to a given test point. We prove that our method consistently recovers the true local signal features and their interactions under a Locally Spike Sparse (LSS) model and also identifies whether large or small feature values drive a prediction. We illustrate the usefulness of our method and theoretical results through simulation studies and a real-world data example.

24.
arXiv (CS.CV) 2026-06-17

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage~1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage~2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7–3.7\% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.

25.
arXiv (CS.AI) 2026-06-25

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

arXiv:2606.25178v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has been extended from single-domain training to multi-domain reasoning suites spanning mathematics, programming, and science. However, the training curriculum (how often each domain is sampled) is typically fixed or hand-tuned, even though reasoning skills transfer unevenly across domains. Existing learnability-based curricula adapt to where the policy is currently improving, but are blind to whether a gradient step on the selected domain benefits the remaining domains. In this paper, we propose Transfer-Aware Curriculum (TAC), a bandit-style online curriculum that prioritizes domains whose updates broadly benefit the rest of the training suite. TAC repurposes signals already produced by RL training: per-domain advantages capture local learnability, and projected gradients, taken from the GRPO step being computed, estimate cross-domain transferability via gradient-geometry alignment, at negligible cost (