Academic Intelligence · Curated Daily

探索全球前沿学术脉络

AcademicHub 汇聚顶级期刊与预印本平台的实时文献。定制您的专属科研雷达,利用大语言模型自动生成交叉领域文献分析简报。

01.
arXiv (CS.AI) 2026-06-12

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

arXiv:2606.12500v1 Announce Type: cross Abstract: Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

02.
arXiv (CS.LG) 2026-06-16

Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

arXiv:2606.17043v1 Announce Type: cross Abstract: When pretrained VLA policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which conflates distinct forms of transition-level feedback and provides limited guidance once basic task success becomes achievable. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for these two objectives on different data subsets and combines their outputs with a state-adaptive balance. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high, and converts the result into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.

03.
Nature Biotechnology 2026-06-19

Efficient site-specific gene addition using R2 retrotransposons in tobacco and rice

作者:

Precise integration of multikilobase DNA fragments remains a major technical barrier in plants. Here we introduce non-long terminal repeat (non-LTR) R2 retrotransposons as a versatile system for targeted gene integration in plants. We reconstituted R2 activity in Nicotiana benthamiana and benchmarked insertion efficiency and fidelity using a TMV-based episomal reporter system. We demonstrate site-specific integration of GFP (2.2 kb) and recombinase-compatible landing pads (0.6 kb) into 28S rDNA arrays, with intact cassette insertion frequencies up to 75% and 53%, respectively. To temporally constrain donor availability and avoid DNA intermediates, we combined in planta effector expression with recombinant RNA virus-mediated donor delivery. We apply R2 retrotransposons for targeted insertion of resistance cassettes within the rDNA of rice callus, achieving integration efficiencies up to 17%. These results position R2 retrotransposons as a double-strand break-free system for RNA-templated insertion of multikilobase gene cassettes at rDNA loci, for safe-harbor trait stacking in plants with potential applications in crop improvement and synthetic biology. Retrotransposons are applied in plants for safe-harbor transgene integration.

04.
Science (Express) 2026-05-21

Nodeless superconducting gap and electron-boson coupling in (La,Pr,Sm)3Ni2O7 films | Science

作者: 未知作者

The discovery of superconductivity in Ruddlesden-Popper (RP) bilayer nickelate films under ambient pressure provides an opportunity to directly investigate electronic energy scales of the superconducting state and the pairing mechanism. We report angle-resolved photoemission spectroscopy measurements of superconducting (La,Pr,Sm) 3 Ni 2 O 7 thin films by developing an ultra-high vacuum cryogenic sample quenching and transfer technique. A superconducting gap of ~18 meV with coherence peaks is observed along the Brillouin zone diagonal. The finite gap persists across the entire Brillouin zone, revealing the absence of gap nodes. A kink is observed in the energy-momentum dispersion at ~70 meV below Fermi level, indicating an electron-boson coupling. The simultaneous observation of a nodeless superconducting gap and electron-boson coupling provides insight into the pairing symmetry and gluing mechanism in RP bilayer nickelates.

05.
arXiv (quant-ph) 2026-06-17

Post-Selection Probability and Fidelity of Bidirectional Teleportation

arXiv:2606.17251v1 Announce Type: new Abstract: Understanding the scrambling of quantum information is central to many areas of quantum physics, including quantum thermalization, entanglement growth, and quantum information processing. Insights from these studies have, in turn, inspired the development of novel quantum protocols and algorithms. Recently, a bidirectional teleportation protocol was proposed to implement a digital SWAP operation between qubits by leveraging chaotic Hamiltonian evolution combined with measurement and post-selection. In this work, we provide a comprehensive study of two central quantities that characterize the protocol, the post-selection probability and the fidelity, taking into account possible errors in time-reversed dynamics. We show that these quantities can be expressed in terms of standard diagnostics in quantum dynamics, including the Loschmidt echo and its subsystem variant. The results unveil (1) the initial-state dependence of the fidelity and (2) the stability of the post-selection probability in integrable models. Our findings offer practical guidance for the implementation of the protocol on realistic quantum devices.

06.
arXiv (CS.LG) 2026-06-19

The Representational Limit of Scalar Interactions: An Interventional Decomposition

arXiv:2606.19410v1 Announce Type: cross Abstract: Signed pairwise interaction scores fundamentally conflate uniqueness (U), redundancy (R), and synergy (S). We prove this on a minimal 3-way XOR structural causal model: faithful indices such as Shapley-Taylor return zero per pair, whereas projective indices such as Shapley Interaction spread the third-order effect into pair scalars that conflate the three mechanisms. We introduce Stochastic Hi-Fi, a post-hoc, retraining-free predictability decomposition that estimates per-feature U/R/S profiles by interventional masked inference. The estimator provides exact interventional semantics, finite-sample Monte Carlo bounds, strict variance reduction from coupled diamond sampling, and uniform finite-vocabulary convergence. Across tabular SCMs, Stochastic Hi-Fi recovers structure missed by scalar baselines (up to 411x larger interaction-magnitude recovery ratios). It also separates redundant and synergistic heads in the GPT-2 IOI circuit. On NIH ChestX-ray14, Stochastic Hi-Fi matches GradCAM on Pointing Game and improves substantially on Deletion AUC.

07.
arXiv (CS.CL) 2026-06-19

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Automatic Speech Recognition (ASR) has become a key technology for human–AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

08.
arXiv (CS.AI) 2026-06-19

Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining

arXiv:2606.20363v1 Announce Type: new Abstract: Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5\% to 20.5\%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.

09.
arXiv (CS.CL) 2026-06-11

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

A social highlighter's most useful signal – which passages a crowd of readers marks – exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies – it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.

10.
arXiv (quant-ph) 2026-06-16

Non-Gaussian Phase Transition and Cascade of Instabilities in the Dissipative Quantum Rabi Model

arXiv:2507.07092v3 Announce Type: replace Abstract: The open quantum Rabi model describes a two-level system coupled to a harmonic oscillator. A Gaussian phase transition for the nonequilibrium steady states has been predicted when the bosonic mode is soft and subject to damping. We show that oscillator dephasing is a relevant perturbation, which leads to a non-Gaussian phase transition and an intriguing cascade of instabilities for $k$-th order bosonic operators, as well as a jump in the steady-state qubit polarization. For the soft-mode limit, the equations of motion form a closed hierarchy and spectral properties can be efficiently studied. To this purpose, we establish a fruitful connection to non-Hermitian Hamiltonians. The results for the phase diagram, stability boundaries, and relevant observables are based on mean-field analysis, exact diagonalization, perturbation theory, and Keldysh field theory.

11.
arXiv (CS.LG) 2026-06-12

A solvable model for unsupervised federated learning

arXiv:2606.13045v1 Announce Type: cross Abstract: We introduce a theoretical framework for analyzing federated learning in a generative setting through a teacher-multiple interacting students scenario, in which each student receives a distinct realization of the data, either through a different noise corruption or by accessing a different subset, possibly of varying size. Using theoretical tools in equilibrium disordered system, we analytically show that interactions among students systematically enhance learning performance: highly noisy students require fewer samples to recover the underlying pattern, while low-noise students achieve a larger overlap with the ground-truth signal. We derive the optimal Bayesian conditions for teacher recovery as functions of the sample complexity, noise level, and interaction strength, and validate these predictions through numerical simulations. The resulting dynamics can be mapped onto equilibrium sampling in a Restricted Boltzmann Machine with a structured hidden layer, providing a principled theoretical understanding of how interactions improve distributed generative modeling.

12.
arXiv (CS.CL) 2026-06-12

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

13.
arXiv (CS.CL) 2026-06-12

Localizing Anchoring Pathways in Language Models

Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B–8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.

14.
arXiv (CS.AI) 2026-06-11

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

arXiv:2606.11473v1 Announce Type: cross Abstract: Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.

15.
arXiv (CS.LG) 2026-06-17

Geodesic Calculus on Implicitly Defined Latent Manifolds

arXiv:2510.09468v3 Announce Type: replace Abstract: Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

16.
arXiv (CS.AI) 2026-06-17

Optimism Stabilizes Thompson Sampling for Adaptive Inference

arXiv:2602.06014v2 Announce Type: replace-cross Abstract: Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study adaptive inference for Thompson sampling with Gaussian randomized indices in $K$-armed stochastic bandits with independent sub-Gaussian reward noises, and identify optimism as a key mechanism for restoring stability, meaning that each arm's pull count concentrates around a deterministic scale. This stability yields asymptotically valid Wald inference despite adaptive sampling. First, we prove that variance-inflated TS is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal, with asymptotically uniform allocation over optimal arms and sharp logarithmic pull-count asymptotics for suboptimal arms. This resolves the $K$-armed extension question raised by \citet{halder2025stable}, using new winner-map and Lyapunov-drift techniques to control allocation among multiple optimal arms. Second, we analyze an alternative optimistic modification that keeps the Gaussian index variance unchanged but adds an explicit mean bonus to the index center, and establish a similar stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid Wald inference in multi-armed bandits, while incurring only a mild additional regret cost.

17.
arXiv (CS.LG) 2026-06-11

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

arXiv:2606.11255v1 Announce Type: new Abstract: Bernstein–Schur kernels are products of a finite-feature kernel (one with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter's one-dimensional Bernstein–Widder scale and then applying Gaussian random Fourier features (whose frequency is still $d$-dimensional). The feature dimension is then $Dm$, set by the sketch size $m$ and the radial-draw count $D$, free of the $O(d^2)$ size of the exact modulation feature. Keeping the modulation \emph{exact is the analyzable limit ($m\to\infty$): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crude $N\max_{ij}$ entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus a single additive sketch term, tunable by $m$ independently of $D$. The motivating instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, $b\ge0$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance-optimal at a fixed radial-feature budget.

18.
bioRxiv (Bioinfo) 2026-06-23

Early Tracheal and Salivary miRNAs in Extremely Preterm Infants Predict BPD-related Pulmonary Hypertension

Pulmonary hypertension (BPD-PH) associated with bronchopulmonary dysplasia (BPD) in preterm infants associates with high morbidity and mortality within the first two years of life. In a previous unbiased study, we identified a panel miRNAs in tracheal aspirates (TA) that were differentially expressed in extremely low gestational age newborns (ELGANs) with BPD-PH compared to those with BPD but no PH. To explore the predictive potential of these miRNAs, we studied TA exosomes from 7 days old ELGANs and analysed a curated panel of 16 miRNAs through logistic regression and calculated the predictive AUROC to diagnose BPD-PH at 36 weeks PMA. AUROC of TA miRNAs was 0.76 with sensitivity and specificity of 53% and 93%, respectively. Adding sex and gestational age to the variables improved the AUROC to 0.78 with sensitivity and specificity of 61 and 87% respectively. Due to challenges of obtaining TA in non-invasively ventilated infants, we collected saliva samples from ELGANs at 7 days of age and compared the log expression of these 16 miRNAs in both biofluids and found significant correlation in their expression (pearson r=0.92, p

19.
arXiv (CS.CL) 2026-06-24

Escaping the Self-Confirmation Trap: An Execute-Distill-Verify Paradigm for Agentic Experience Learning

Experience-driven self-evolution is critical for large language model (LLM) agents to improve through open-world interaction. However, existing experience learning methods mostly rely on single-agent loops, where the same agent executes tasks, summarizes outcomes, and determines memory content. This setup makes agents vulnerable to the Self-Confirmation Trap: wrong-but-self-consistent trajectories are misidentified as successful experience, leading to cumulative errors during retrieval and reuse. To address this issue, we propose EDV, an Execute-Distill-Verify framework for reliable experience learning. In the Execute stage, multiple heterogeneous agents explore the same task space in parallel to generate diverse candidate trajectories. In the Distill stage, a dedicated third-party agent comparatively analyzes these trajectories to produce candidate experiences, reducing executor-centric summarization bias. In the Verify stage, the execution group validates candidates via a consensus mechanism, and only approved experiences are written into shared or private memory. By decoupling the three stages, EDV transforms experience learning from isolated self-reflection into collaborative construction, filtering erroneous and noisy content before memory insertion. We evaluate EDV on three challenging long-horizon benchmarks: tau2-bench, Mind2Web and MMTB. Results show EDV consistently outperforms strong baselines, validating that reliable experience construction is essential for robust agent self-evolution. Our code is available at https://github.com/shidingz/EDV.

20.
arXiv (quant-ph) 2026-06-12

Quantum metrology via partial quantum error correction

arXiv:2605.08341v2 Announce Type: replace Abstract: We introduce a method for error-corrected quantum metrology where only partial quantum error correction (QEC) is needed to suppress local noise and maintain the probe states' super-standard-quantum-limit (super-SQL) sensing performance. This stands in contrast to the existing QEC-assisted sensing schemes in Phys. Rev. Lett. 112, 080801 (2014) and Phys. Rev. Lett. 112, 150802 (2014), where a probe state is encoded into the logical subspace of a quantum code and error correction involves measurements on all checks of the code. Here, we encode the probe states into superpositions of energetically different states of the underlying quantum code. For our probe states, error correction using a subset of checks is enough to suppress noise both before and after phase imprinting. We analyze the tradeoff in noise suppression. For noise parallel to our phase imprinter of weight $l$, we achieve a suppression of $p^\delta$ where $p$ is the noise strength and $\delta = \lfloor (l+1)/2 \rfloor$. We propose an adaptive imprinter weight increasing strategy to maintain super-SQL performance as we scale up the system. In all our examples, checks and phase imprinters are chosen to be local operators avoiding non-local connectivity.

21.
arXiv (CS.CV) 2026-06-11

Spatially Selective Self-Training for Unsupervised Building Change Detection

Unsupervised building change detection aims to learn building-change masks from unlabeled bi-temporal remote sensing images. Existing label-free methods often follow a discrepancy-to-mask paradigm, directly using temporal differences, frozen foundation-model responses, prompt-based outputs, or post-processing results as final change maps. Although these strategies provide annotation-free cues, they do not learn a task-specific building-change detector and remain vulnerable to the gap between generic temporal discrepancies and building-defined structural changes. In practice, such discrepancies are often noisy and task-irrelevant, as appearance shifts, registration errors, and non-building modifications can produce strong but misleading responses. To address this problem, we propose SST-CD, a spatially selective self-training framework that reformulates fully label-free building change detection as end-to-end detector learning under noisy pseudo supervision. SST-CD uses temporal discrepancies as candidate pseudo labels and trains the detector only on spatially reliable pixels, whose reliability is estimated by a local consistency criterion that filters inconsistent regions from supervision. To further stabilize noisy self-training, a lightweight feature adapter recalibrates bi-temporal features, while a prototype-based decoder produces compact change and no-change representations. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show that SST-CD achieves F1 scores of 83.08%, 91.69%, and 86.60%, respectively, outperforming existing unsupervised and label-free baselines.

22.
arXiv (CS.CL) 2026-06-25

Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably identify harmful LLM outputs in user-model conversations without substantial performance loss relative to LLM-based judges. We benchmark these encoder classifiers against rule-based prefix matching, fine-tuned LLM classifiers, and LLM judges using a range of judge-prompting strategies across open-source adversarial datasets. The LLM judges include evaluation methodologies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, and a Claude-as-a-judge setup, as well as fine-tuned safety classifiers such as LlamaGuard 3 and LlamaGuard 4. The encoder classifiers are fine-tuned on judge-labeled data using a majority-voting label strategy and are then evaluated on a gold-standard holdout dataset to assess their performance relative to LLM judges. We report absolute performance using F1 score, false negative rate, and precision-recall metrics. We also break down results by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation, to identify where encoder classifiers align with or diverge from LLM-based judges. Our findings provide guidance on when encoder classifiers can serve as cost- and latency-efficient alternatives to LLM-based safety evaluation.

23.
arXiv (CS.LG) 2026-06-16

Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models

arXiv:2602.16793v2 Announce Type: replace Abstract: In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline's internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.

24.
arXiv (CS.AI) 2026-06-11

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

arXiv:2606.12040v1 Announce Type: new Abstract: The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: https://github.com/MXY820/barrier-design. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

25.
arXiv (CS.CL) 2026-06-17

LLM Features Can Hurt GNNs: Concatenation Interference on Homophilous Graph Benchmarks

Adding LLM-generated node features to graph neural networks (GNNs) is widely reported to improve accuracy on standard benchmarks. We document a contrasting observation: when LLM features are introduced through pure input concatenation (rather than joint training, distillation, or prompt-conditioning), they can systematically degrade accuracy on the same homophilous benchmarks where end-to-end LLM pipelines succeed. With an MLP backbone on the Planetoid public split and bag-of-words original features, concatenating SBERT-encoded GPT-4o-mini TAPE features reduces PubMed test accuracy by -17.0 +/- 0.3 pp and Cora by -4.3 +/- 0.6 pp (CiteSeer -0.6 +/- 0.8 pp, within seed noise). The drop attenuates as we relax each condition (GCN / GCNII / GAT backbones, random splits, smaller encoders) and reverses on medium-homophily WikiCS (+4.4 pp) and ogbn-arxiv (+11.7 pp). To predict when concatenation helps versus hurts, we report a simple measure of LLM-alone discriminability, Delta_sig. Across 9 datasets Delta_sig correlates with the concatenation cost more strongly than homophily at point estimate (r^2 = 0.38 vs. 0.06; N=9, bootstrap CIs overlap). The bootstrap-best change-point is tau = 13.8 pp, and the rule "Delta_sig