Academic Intelligence · Curated Daily

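Explore the Frontiers of Global Academic Research

AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefings.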

01.
arXiv (math.PR) 2026-06-15

On the Poisson Follower Model

arXiv:2309.04864v5. We introduce a stochastic geometry dynamics inspired by opinion dynamics that captures the essence of modern asymmetric social networks with leaders and followers. Points in the Euclidean space represent opinions, and the leader of an agent is the one with the closest opinion. In this dynamics, each follower updates its opinion by halving the distance to its leader. We demonstrate that this simple dynamics and its iterations exhibit several interesting purely geometric phenomena related to the evolution of leadership and opinion clusters, which resemble those observed in social networks. We also show that when the initial opinions are randomly distributed as a stationary Poisson point process, the spatial frequency of each of these phenomena can be expressed through an integral geometry formula involving semi-algebraic domains. Finally, we analyze numerically the limiting behavior of this follower dynamics. In the Poisson case, the agents fall into two categories: ultimate followers, who continue updating their opinions indefinitely, and ultimate leaders, who adopt a fixed opinion after a finite time. Spatial discrete event simulations support all our findings.
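
The update rule described in this abstract can be illustrated with a minimal Python sketch. It assumes a finite set of points drawn uniformly in the unit square (a stand-in for a Poisson process on a bounded window), that every agent treats its nearest neighbour as its leader, and that all agents update at the same time. The function name `follower_step` and these choices are illustrative assumptions, not the authors' simulation code.

```python
import math
import random


def follower_step(points):
    """One synchronous update: each agent moves halfway toward its nearest neighbour."""
    updated = []
    for i, p in enumerate(points):
        # The leader is the other agent with the closest opinion.
        leader = min(
            (q for j, q in enumerate(points) if j != i),
            key=lambda q: math.dist(p, q),
        )
        # Halve the distance to the leader by moving to the midpoint.
        updated.append(tuple((a + b) / 2 for a, b in zip(p, leader)))
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical initial opinions: 50 uniform points in the unit square.
    opinions = [(rng.random(), rng.random()) for _ in range(50)]
    for _ in range(10):
        opinions = follower_step(opinions)
    print(len(set(opinions)), "distinct opinions after 10 steps")
```

Whether updates are synchronous, and how mutually nearest pairs are handled, is not stated in the abstract, so the clustering this sketch produces may differ from the paper's results.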

02.
arXiv (CS.CV) 2026-06-16

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

03.
arXiv (CS.CV) 2026-06-16

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: https://github.com/Rong2026/work-1.

04.
medRxiv (Medicine) 2026-06-17

Non-Medical COVID-19 Impacts and Hearing Status: A Global Study of Differential Health Impact Among Deaf, Hard of Hearing, and Hearing Populations

Background: Deaf and hard of hearing (HoH) people experienced complex challenges during the COVID-19 pandemic, including obscured visual communication from mask mandates, inaccessible public health messaging, and inadequate interpreter availability. We examined whether hearing status predicted non-medical COVID-19 impact on a global level. Methods: We conducted a nested cross-sectional analysis within a global study collecting data across two waves (April to May 2020 and July to August 2022) from 184 countries. Participants (N=7,998) were categorized as Deaf (n=304), Hard of Hearing (HoH; n=951), or Hearing (n=6,743). The primary outcome was a composite COVID-related non-medical Personal Impact T-score derived from 14 items across employment, resource access, and healthcare domains. Multinomial logistic regression models progressively adjusted for demographic, structural, and psychosocial variables. Results: Deaf participants reported substantially higher rates of pandemic-related job loss (28.9% vs. 9.6% hearing), healthcare cancellations (39.9% vs. 24.6%), and inability to obtain basic supplies. Over half (55.9%) of Deaf participants scored above the median composite impact index, compared to 39.2% of hearing participants. In the fully adjusted model, Deaf status remained an independent predictor of high non-medical impact (aOR=1.6, 95% CI: 1.1 to 2.4). HoH status showed no statistically significant difference from hearing participants in any model. Conclusions: People identifying as Deaf experienced significant disparities during COVID-19 when compared with HoH or hearing people, driven by language access barriers and institutional exclusion rather than hearing loss per se. These experiences underscore the importance of systemic interventions centering on accessible communication, Deaf-centered needs, and reducing audism in Deaf-hearing interaction.
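
As a worked illustration of the reported proportions, the sketch below computes a crude (unadjusted) odds ratio from the two "above the median" percentages quoted in the abstract. It assumes that this split is comparable to the "high non-medical impact" outcome behind the adjusted estimate, which the abstract does not confirm; the helper names are hypothetical.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1.0 - p)


def odds_ratio(p_group, p_reference):
    """Crude odds ratio of a group relative to a reference group."""
    return odds(p_group) / odds(p_reference)


# Proportions scoring above the median composite impact index (from the abstract).
crude_or = odds_ratio(0.559, 0.392)
print(round(crude_or, 2))  # 1.97
```

Under that assumption, the crude value of roughly 2.0 sits above the fully adjusted aOR of 1.6, which is the direction one would expect if the covariates account for part of the gap.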

05.
arXiv (CS.CL) 2026-06-19

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a Korean-specific loss channel, where even a minimal transcription difference can change the intended question and degrade downstream QA performance. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM cascade with an approximately matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

06.
arXiv (CS.CL) 2026-06-19

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features through the scientific literature. These contain valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that combines a domain-adapted Small Language Model to answer Mars terraforming-related questions and convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.

07.
arXiv (CS.CL) 2026-06-12

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

08.
arXiv (CS.LG) 2026-06-11

Annealed Entropic Allocation for Ranking and Selection

arXiv:2606.11347v1. We propose Annealed Entropic Allocation, an annealed weighted soft-min framework for sequential budget allocation in ranking and selection. The central idea is to replace the non-smooth maximin large-deviation rate objective with a weighted log-sum-exp surrogate that aggregates challenger-specific pairwise scores through soft-min weights, mitigating hard switching when several challengers are nearly active. To improve finite-budget discrimination, we incorporate the saddlepoint approximation, a sub-exponential correction derived from refined pairwise tail asymptotics. Because these corrections are sub-exponential and the smoothing parameter is annealed to zero, the surrogate preserves the same first-order large-deviation target as the classical maximin formulation. We show that the surrogate converges uniformly to the hard minimum, that the soft-min weights concentrate on the active challengers, and that, under fixed weights, the induced target allocation map is continuous on the simplex interior. Numerical experiments on Gaussian and exponential instances demonstrate competitive performance, especially when multiple challengers are nearly tied.
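
The soft-min idea in this abstract can be sketched with a generic weighted log-sum-exp surrogate. The form below, m_eps(x) = -eps * log(sum_i w_i * exp(-x_i / eps)) with weights summing to one, is a standard construction assumed here for illustration; the paper's exact surrogate, its saddlepoint correction, and its annealing schedule are not reproduced, and the function names are hypothetical.

```python
import math


def soft_min(scores, weights, eps):
    """Weighted log-sum-exp soft minimum, shifted by min(scores) for numerical stability."""
    m = min(scores)
    total = sum(w * math.exp(-(s - m) / eps) for s, w in zip(scores, weights))
    return m - eps * math.log(total)


def soft_min_weights(scores, weights, eps):
    """Normalised soft-min weights; they concentrate on the smallest scores as eps shrinks."""
    m = min(scores)
    raw = [w * math.exp(-(s - m) / eps) for s, w in zip(scores, weights)]
    z = sum(raw)
    return [r / z for r in raw]


if __name__ == "__main__":
    # Hypothetical pairwise scores for three challengers, two of them nearly tied.
    scores = [1.0, 1.2, 3.0]
    weights = [1 / 3, 1 / 3, 1 / 3]
    for eps in (1.0, 0.1, 0.01):
        print(eps, soft_min(scores, weights, eps), soft_min_weights(scores, weights, eps))
```

For weights that sum to one, this surrogate lies between min(x) and min(x) - eps * log(w*), where w* is the weight on a minimiser, so the gap to the hard minimum shrinks linearly in eps regardless of the scores.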

09.
bioRxiv (Bioinfo) 2026-06-18

Benchmarking gene expression reconstruction from single-cell latent representations

Single-cell transcriptomics is typically modeled in low-dimensional latent representations that improve the signal-to-noise ratio of the data. Such representations underpin data integration, cell state discovery, and perturbation prediction, with applications ranging from large-scale organ atlases to latent trajectory modeling. Recent virtual cell approaches further leverage these representations to predict cellular responses as distributional shifts in latent space. Each of these applications ultimately requires faithful gene expression reconstruction from latent spaces for biological interpretation, enabling gene-level analysis of predicted perturbed or batch-corrected cells. Yet representation choice is typically treated as an implementation detail rather than a primary modeling decision, with no systematic evaluation of how well latent representations support gene expression reconstruction. Here, we introduce ReconEval, a benchmark for evaluating gene expression reconstruction from single-cell latent spaces. We benchmark two classes of latent representations: end-to-end trained models such as PCA, autoencoders, and variational autoencoders, and pretrained single-cell foundation model embeddings coupled to newly trained decoders. Reconstruction is evaluated both directly and after latent-space perturbation prediction. Across perturbational and observational datasets totaling over 100 million cells, our metric suite quantifies statistical fidelity; biological signal preservation, including differential expression, coexpression, cell-cycle structure, cytokine response and pathway activity; and perturbation-specific effects. We find that autoencoders achieve the strongest stand-alone reconstruction at low dimensionality, while variational regularization does not improve generalization in reconstruction. 
Frozen foundation model embeddings retain recoverable gene-level information, with reconstruction quality depending strongly on decoder architecture and pretraining objective. In latent perturbation modeling, high-dimensional PCA matches foundation model embeddings, while low-dimensional AE embeddings are optimal for flow-based generative models. Overall, reconstruction depends critically on the interplay between representation and downstream model, and simpler representations can outperform complex alternatives given appropriate capacity. Our benchmark establishes reconstruction as a critical evaluation axis for single-cell foundation models. We envision it improving the biological interpretability of latent-space modeling, a prerequisite for future virtual cell models to be validated by domain experts and grounded in biology.

10.
arXiv (quant-ph) 2026-06-24

Concatenating Algebraic Codes over High-Rate Quantum LDPC Codes

arXiv:2605.21898v2. Different quantum error correction schemes trade off overhead, error suppression, and hardware connectivity. Code concatenation can relax these tradeoffs by using an outer code whose non-local connectivity is supplied by logical operations of an inner code rather than directly by hardware. Prior works showed that this can reduce memory overhead for local low-rate inner codes such as the surface code. Here, we study concatenation over non-local, high-rate inner codes. Such inner codes experience correlated errors among the many logical qubits in a single codeblock. We handle this by treating each block as a single logical Galois qudit, enabling concatenation with algebraic outer codes with excellent parameters and, crucially, list decoders. In particular, we consider a memory system formed by concatenating quantum Reed-Solomon outer codes over the gross code. For fault-tolerant syndrome extraction, we develop a Galois qudit Shor scheme using "time-like" Reed-Solomon protection against measurement errors. Interestingly, a lightweight fault tolerance scheme that would fail for qubits works well for large-alphabet qudits, suggesting a very different theory of fault tolerance for such qudits. The whole protocol is optimised via improved bicycle instruction logical error rates, novel compilation strategies, and recent decoder post-selection rules. At uniform $10^{-3}$ physical noise, the concatenated gross code reaches the teraquop regime, which it previously could not access, with a lower space overhead than the $288$-qubit two-gross code, while offering several advantages from the engineering standpoint. Beyond our main case study, we believe the core ideas of Galois qudits, quantum Reed-Solomon outer codes, and list decoding will prove generically powerful and highly transferable across high-rate quantum architectures.

11.
medRxiv (Medicine) 2026-06-22

Effect of Lowering the Drink-Driving Blood Alcohol Limit in Scotland on Road Traffic Crashes: a Synthetic Difference-in-Differences Study

Objective: To evaluate the road safety impact arising from Scotland's 2014 reduction in the legal blood alcohol concentration (BAC) limit for drivers, and to assess whether the effect of the reform varied across different spatial contexts. Design: A quasi-experimental statistical longitudinal study using a Synthetic Difference-in-Differences (SDID) approach. Setting: Small-area panel data for Great Britain, with areas (Middle-layer Super Output Areas, MSOAs, in England and Wales and Intermediate Zones, IZs, in Scotland) classed into control and treatment groups according to whether they were exposed to Scotland's BAC reform. The control and treatment groups comprise 7088 spatial units in England and Wales and 852 spatial units in Scotland, respectively, observed over the period 2008-2019. Participants: The study primarily analyses police-reported road traffic collision data from the UK Department for Transport's STATS19 system. Data were analysed at the MSOA/IZ level. This is a secondary dataset, and we therefore did not involve patients or the public in formulating the research question, determining outcome measures, or designing and conducting the study. Main Outcome Measures: The main outcome measures were log-transformed rates of total road traffic crashes, and (weekend) night-time crashes (22:00-04:00) per 100,000 population. The latter is used as a proxy measure for drunk driving. Results: Our results indicate that the reduction in the legal BAC limit led to statistically significant declines in road traffic crash rates. Aggregate estimates suggest reductions of 12.0% (95% confidence interval (CI): [-13.7%, -10.3%]) in total crashes, 15.6% (95% CI: [-20.7%, -10.2%]) in night-time crashes, and 12.4% (95% CI: [-16.7%, -7.9%]) in weekend night-time crashes. We also find substantial heterogeneity in treatment effects across spatial contexts. 
Effects were strongest in rural and less densely populated areas, where reductions exceeded 16% (95% CI: [-18.7%, -13.9%]) for total crashes and reached up to 29.6% (95% CI: [-35.8%, -22.8%]) for night-time and 21.4% (95% CI: [-28.3%, -13.9%]) for weekend night-time crashes. Moderate but statistically significant effects were also observed in dense urban areas, whereas effects in suburban and transitional areas were smaller and not statistically significant. Conclusions: Our analysis suggests that lowering the legal BAC limit in Scotland led to meaningful reductions in road traffic crashes, particularly during higher-risk periods and in rural areas. The findings further suggest that the effectiveness of BAC regulation may vary across local contexts, highlighting the importance of accounting for spatial heterogeneity when evaluating road safety policies.
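
To make the estimand concrete, here is a minimal sketch of a plain two-group, two-period difference-in-differences on log rates, converted to a percentage change. This is a simplification: the Synthetic Difference-in-Differences estimator used in the study additionally reweights control units and pre-treatment periods, which is not implemented here. The input rates are hypothetical toy values, not figures from the study.

```python
import math


def did_percent_change(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on log rates, returned as a proportional change."""
    log_effect = (math.log(treated_post) - math.log(treated_pre)) - (
        math.log(control_post) - math.log(control_pre)
    )
    # A log-scale effect b corresponds to a proportional change of exp(b) - 1.
    return math.exp(log_effect) - 1.0


# Hypothetical crash rates per 100,000 population (illustrative only).
change = did_percent_change(100.0, 80.0, 100.0, 95.0)
print(f"{change:.1%}")  # -15.8%
```

Working on the log scale makes the effect multiplicative, so the treated group's 20% fall is judged against the control group's 5% fall as a ratio (0.80 / 0.95) instead of a raw difference.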

12.
arXiv (CS.AI) 2026-06-12

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

arXiv:2606.12841v1. Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward pass, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

13.
arXiv (CS.CV) 2026-06-16

FrameOracle: Learning What to See and How Much to See in Videos

Vision-language models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task complexity. To address this, we present FrameOracle, a lightweight, plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained via a curriculum that progresses from weak proxy signals, such as cross-modal similarity, to stronger supervision with FrameOracle-41K, the first large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames per question. Extensive experiments across five VLMs and six benchmarks show that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without accuracy loss. When starting from 64-frame candidates, it reduces inputs to 13.9 frames on average while improving accuracy by 1.5%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.

14.
arXiv (CS.AI) 2026-06-18

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

arXiv:2606.18599v1. The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency, ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.

15.
arXiv (CS.AI) 2026-06-19

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

arXiv:2606.19791v1. The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

16.
arXiv (CS.CV) 2026-06-11

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there remains a dearth of a unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined as Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages compared with single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, which yields superior performance to the specific model.

17.
arXiv (CS.LG) 2026-06-24

You Don't Need to Run Every Eval

arXiv:2606.24020v1. A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to run every eval? We compile a public score matrix of 84 frontier models on 133 benchmarks (2,604 cells, 23.3% filled) and find it is approximately rank-2: a model's scores across all 133 benchmarks are largely determined by just two numbers. We confirm this in two ways: scores hidden from the matrix are best recovered using two factors, and two factors already explain over 90% of the variation among models on the benchmarks they share. Building on this, we design BenchPress: a logit-space rank-2 matrix completion method that recovers held-out scores to within 4.6 points, and a confidence layer that says when each prediction can be trusted. Using BenchPress, we find a subset of five benchmarks {GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1} that can recover the rest of a model's public scorecard to within 3.93 points. For a tighter inference budget, a cheaper set {GPQA-D, MMLU-Pro, Aider Polyglot, MATH-500, AIME 2026} can predict a model's evals to within 4.55. We release the score matrix, the BenchPress code, and an interactive tool that predicts any model's score on any benchmark.
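
A rough sketch of what rank-2 matrix completion in logit space can look like is given below. It assumes scores are rescaled to lie strictly between 0 and 1 and fits two factors per model and per benchmark with ridge-regularised alternating least squares. The abstract does not say how BenchPress is actually fitted, so the solver, the regularisation strength, and all function names here are assumptions, not the released code.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_ridge_2d(vectors, targets, lam):
    """Minimise sum_k (u . v_k - t_k)^2 + lam * |u|^2 over a 2-vector u (closed form)."""
    a11 = lam + sum(v[0] * v[0] for v in vectors)
    a12 = sum(v[0] * v[1] for v in vectors)
    a22 = lam + sum(v[1] * v[1] for v in vectors)
    b1 = sum(v[0] * t for v, t in zip(vectors, targets))
    b2 = sum(v[1] * t for v, t in zip(vectors, targets))
    det = a11 * a22 - a12 * a12  # positive because lam > 0
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def objective(observed, U, V, lam):
    """Squared error on observed logits plus the ridge penalty on both factor sets."""
    fit = sum(
        (U[i][0] * V[j][0] + U[i][1] * V[j][1] - z) ** 2
        for (i, j), z in observed.items()
    )
    reg = lam * sum(a * a + b * b for a, b in U + V)
    return fit + reg


def complete_rank2(observed, n_models, n_benchmarks, lam=0.1, sweeps=50, seed=0):
    """Alternating least squares on observed logit scores; returns the two factor lists."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_models)]
    V = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_benchmarks)]
    for _ in range(sweeps):
        for i in range(n_models):
            cells = [(j, z) for (r, j), z in observed.items() if r == i]
            U[i] = solve_ridge_2d([V[j] for j, _ in cells], [z for _, z in cells], lam)
        for j in range(n_benchmarks):
            cells = [(i, z) for (i, c), z in observed.items() if c == j]
            V[j] = solve_ridge_2d([U[i] for i, _ in cells], [z for _, z in cells], lam)
    return U, V


def predict(U, V, i, j):
    """Predicted score in (0, 1) for model i on benchmark j."""
    return sigmoid(U[i][0] * V[j][0] + U[i][1] * V[j][1])
```

To use it, build `observed` as a dictionary mapping `(model_index, benchmark_index)` to `logit(score)` for the filled cells, call `complete_rank2`, and read missing cells with `predict`. Scores of exactly 0 or 1 would need clipping before the logit transform.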

18.
arXiv (CS.CV) 2026-06-12

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at https://github.com/viniciusorru/vcr-synthetic

19.
arXiv (CS.AI) 2026-06-16

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

arXiv:2605.13909v2. Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

20.
arXiv (math.PR) 2026-06-24

Uniform-in-time Gaussian fluctuations for multiscale nonlinear stochastic systems via Malliavin Calculus

arXiv:2606.23865v1 Announce Type: new Abstract: We establish a uniform-in-time quantitative central limit theorem (QCLT) for a nonlinear slow-fast stochastic system. We identify significantly weaker sufficient conditions that enable us to obtain time-independent bounds for the Wasserstein distance between the fluctuation process and a centered Gaussian random variable. To prove our main result, we utilize tools from Malliavin calculus, specifically the second-order Poincaré inequality. In this context, applying the Poincaré inequality requires demonstrating uniform bounds over time for both the first- and second-order Malliavin derivatives.

21.
arXiv (CS.CV) 2026-06-24

Training-Time Optical Priors for Wireless Capsule Endoscopy Classification: Hemoglobin-Aware Input Fusion with Cross-Vendor Evaluation

Background. RGB-trained classifiers for wireless capsule endoscopy (WCE) conflate hemoglobin contrast with bile staining and illumination falloff, limiting sensitivity to small-vessel vascular findings such as lymphangiectasia. We introduce a physics-informed framework that injects an analytic, Monte-Carlo-inspired hemoglobin prior into a standard classifier purely at training time; to our knowledge, this is the first use of an explicit optical light-transport prior in WCE classification. Methods. On Kvasir-Capsule (47,238 frames, 43 patients, 11 evaluable classes; patient-disjoint split) we test, across 6 seeds against an RGB-only EfficientNet-B0 baseline: (i) a 5-channel input-fusion variant feeding the prior P_blood alongside RGB; (ii) a distillation variant that runs on plain 3-channel RGB at inference; and (iii) a three-stream extension adding a temporal Transformer and an autoencoder-residual stream. We replicate across ResNet-18 and ConvNeXt-Tiny and report cross-vendor zero-shot transfer on the public Galar cohort. Results. Input fusion lifts cross-seed macro-AUC 0.760 -> 0.783 (5/6 seeds positive); distillation reaches 0.773; the three-stream model reaches 0.804 (+0.044 over baseline, paired DeLong p < 1e-4). Lymphangiectasia AUC rises 0.238 -> 0.337, sign-consistent across all 6 seeds. A four-variant ablation reveals a parameterization-mechanism boundary: only the spatial-channel form yields a lift. Cross-vendor zero-shot on Galar retains ~60% of the ConvNeXt-Tiny lift.

22.
arXiv (CS.CV) 2026-06-18

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

Despite recent advances in deep learning, accurately classifying brain tumors from MRI images remains challenging. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, a customized baseline model and four pre-trained models, for multi-class brain tumor classification using a clinically sourced dataset of approximately 10,000 MRI images. The four pre-trained architectures are VGG16, VGG19, DenseNet121, and EfficientNetB0; all five models were trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to capture the clinically relevant performance of each architecture. EfficientNetB0 achieved the best overall classification accuracy at 95%, ahead of VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding was the considerable improvement in detecting meningiomas: while simple CNNs detected meningiomas with a recall of approximately 20%, EfficientNetB0 reached a recall of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, the deeper VGG19 performed worse than the shallower VGG16, which suggests that, in many cases, the architectural efficiency of a CNN may matter more than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, parameter count, and clinically meaningful performance.

23.
arXiv (math.PR) 2026-06-16

Cluster sizes in subcritical soft Boolean models

arXiv:2404.13730v2 Announce Type: replace Abstract: We consider the soft Boolean model, a model that interpolates between the Boolean model and long-range percolation, where vertices are given via a stationary Poisson point process. Each vertex carries an independent Pareto-distributed radius and each pair of vertices is assigned another independent Pareto weight with a potentially different tail exponent. Two vertices are connected if their distance is at most the larger of the two radii multiplied by the edge weight. We determine the tail behaviour of the Euclidean diameter and the number of points of a typical maximally connected component in a subcritical percolation phase. For this, we present a sharp criterion in terms of the tail exponents of the edge-weight and radius distributions that distinguishes a regime where the tail behaviour is controlled only by the edge exponent from a regime in which both exponents are relevant. Our proofs rely on fine path-counting arguments identifying the precise order of decay of the probability that far-away vertices are connected.
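
The connection rule described in this abstract can be sketched as a small simulation. This is only an illustration under several assumptions that are not from the paper: the Poisson process is restricted to a finite square window (so boundary effects are ignored), the parameters `radius_tail`, `weight_tail`, and `scale` are hypothetical names, and the paper's exact Pareto parameterization may differ from Python's `paretovariate`.

```python
import math
import random


def sample_poisson(mean, rng):
    """Count rate-1 exponential arrivals in [0, mean]; the count is Poisson(mean)."""
    n, t = 0, rng.expovariate(1.0)
    while t < mean:
        n += 1
        t += rng.expovariate(1.0)
    return n


def soft_boolean_components(side, intensity, radius_tail, weight_tail, scale, seed=0):
    """Return (number of points, component sizes in decreasing order)."""
    rng = random.Random(seed)
    # Poisson number of points, placed uniformly in the square [0, side]^2.
    n = sample_poisson(intensity * side * side, rng)
    points = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    # Independent Pareto radii (assumed scaling: scale * Pareto with minimum 1).
    radii = [scale * rng.paretovariate(radius_tail) for _ in range(n)]

    parent = list(range(n))  # union-find forest over the vertices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            weight = rng.paretovariate(weight_tail)  # independent edge weight
            reach = max(radii[i], radii[j]) * weight
            if math.dist(points[i], points[j]) <= reach:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return n, sorted(sizes.values(), reverse=True)


n_points, sizes = soft_boolean_components(
    side=10.0, intensity=1.0, radius_tail=3.0, weight_tail=3.0, scale=0.2, seed=1
)
print(n_points, sizes[:5])
```

The pairwise loop is quadratic in the number of points, which is acceptable for a small window; whether a given parameter choice is actually subcritical is not checked here.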

24.
arXiv (CS.CV) 2026-06-15

Clay-CNN Hybrids: Leveraging Geo-Foundational Models as Auxiliary Context for Landslide Detection

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geo-Foundational Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
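
To see why F1 rather than pixel accuracy is the informative metric at roughly 2% positive pixels, consider the following minimal sketch. The pixel labels are made up for illustration, and how the benchmark aggregates F1 across chips is not stated in the abstract, so this computes a single F1 over one flat list of pixels.

```python
def accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if nothing is positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up chip with 100 pixels, 2 of which are landslide (2% positive).
truth = [1, 1] + [0] * 98
all_negative = [0] * 100             # predicts "no landslide" everywhere
detector = [1, 0] + [1] + [0] * 97   # one hit, one miss, one false alarm

print(accuracy(truth, all_negative), f1_score(truth, all_negative))
print(accuracy(truth, detector), f1_score(truth, detector))
```

Both predictions reach the same pixel accuracy, yet only the second finds any landslide pixels, and F1 is the metric that separates them.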

25.
medRxiv (Medicine) 2026-06-12

Heterogeneity of Treatment Effect of Aspirin and Clinically Significant Bleeding in Older Adults

Aim: The global population of older adults is growing, and older age is linked to higher bleeding risk. Although guidelines discourage aspirin for primary prevention in healthy older adults due to bleeding harms outweighing benefits, many continue taking it without a clear indication. It remains unclear whether all older adults face uniform aspirin-related bleeding risk or if certain subgroups are more vulnerable. Methods: We analyzed data from 19,114 ASPREE trial participants to develop machine learning models using 116 baseline variables. Random forest (RF) and random survival forest (RSF) models predicted 5-year bleeding risk, and participants were stratified into low, intermediate, and high-risk groups based on the 20th and 80th percentiles of predicted risk. We assessed heterogeneity of treatment effect (HTE) by testing treatment-by-risk group interactions on the relative scale using Fine-Gray models, and on the absolute scale using observed 5-year cumulative incidence rates. Results: Over a median follow-up of 4.7 years, 626 major bleeding events occurred. The RF model had moderate discrimination (AUC = 0.65, 95% CI: 0.63-0.67) and good calibration (Brier = 0.032, 95% CI: 0.029-0.034). Statistically significant HTE was observed on the relative scale, with the greatest relative increase in bleeding risk seen in the low-risk group (subdistribution hazard ratio = 2.26, 95% CI: 1.27-4.01). On the absolute scale, low-risk participants experienced higher bleeding with aspirin (absolute risk difference (ARD) = 1.17%, 95% CI: 0.37-1.95), but heterogeneity in ARDs was not statistically significant (Cochran's Q p > 0.45). Similar findings were observed when using the RSF model. Conclusion: Participants at lowest baseline bleeding risk experienced the greatest relative increase in bleeding risk with aspirin therapy. We found statistically significant heterogeneity in treatment effects on the relative but not absolute scale. These findings support an individualized, risk-based approach to aspirin therapy decision-making in older adults.
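
The stratification step described in the Methods (cut points at the 20th and 80th percentiles of predicted risk) and an absolute risk difference within one stratum can be sketched as follows. Everything here is an illustrative assumption: the data are made up, the function names are hypothetical, the percentile uses simple linear interpolation, and the risk difference uses raw event proportions rather than the 5-year cumulative incidence used in the study.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def stratify(risks):
    """Label each predicted risk as low, intermediate, or high."""
    ordered = sorted(risks)
    p20, p80 = percentile(ordered, 20), percentile(ordered, 80)
    return ["low" if r < p20 else "high" if r > p80 else "intermediate" for r in risks]


def absolute_risk_difference(groups, treated, events, group):
    """Event proportion under treatment minus under control, within one group.

    Simplification: raw proportions, not cumulative incidence with competing risks.
    """
    def rate(arm):
        rows = [e for g, t, e in zip(groups, treated, events) if g == group and t == arm]
        return sum(rows) / len(rows)

    return rate(True) - rate(False)


# Made-up cohort of 20 participants with predicted risks 0.01 .. 0.20.
risks = [k / 100 for k in range(1, 21)]
treated = [i % 2 == 0 for i in range(20)]  # alternate treatment and control
events = [0] * 20
events[0] = 1                              # one event in a treated, low-risk participant

groups = stratify(risks)
ard_low = absolute_risk_difference(groups, treated, events, "low")
print(groups.count("low"), groups.count("intermediate"), groups.count("high"), ard_low)
```

With 20th and 80th percentile cut points, the low and high strata each hold about a fifth of participants, so per-stratum estimates rest on far fewer events than the overall analysis.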