Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

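AcademicHub aggregates real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate cross-disciplinary literature analysis briefs.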

01.
arXiv (CS.CV) 2026-06-16

Navigating Distribution Shifts in Medical Image Analysis: A Survey

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on others from varying hospitals, or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints – such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols – with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.

02.
arXiv (CS.AI) 2026-06-12

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

arXiv:2606.12500v1 Announce Type: cross Abstract: Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

03.
arXiv (CS.LG) 2026-06-12

Retrieval-Augmented Foundation Models for Water Level Prediction in the Everglades

arXiv:2508.04888v2 Announce Type: replace Abstract: Accurate water level forecasting in the Everglades is essential for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent time-series foundation models have shown strong performance on generic tasks (represented in their pre-training), their effectiveness in domain-specific applications remains insufficiently understood. In this work, we curate a domain-specific dataset for water-level forecasting in the Everglades and observe that the performance of current state-of-the-art models remains limited. To address this gap, we leverage a retrieval-augmented mechanism that retrieves analogous multivariate hydrological episodes from an external archive of historical observations to enrich the input context of those pre-trained models. We study two retrieval strategies, statistical similarity-based retrieval and mutual information-based retrieval, and analyze how incorporating retrieved historical contexts affects predictive performance. Extensive experiments show that retrieval augmentation consistently improves long-horizon water level forecasts and yields disproportionately larger gains during extreme events, which is particularly critical for environmental decision-making. Our study provides empirical evidence that analog-based retrieval can benefit pretrained time-series foundation models in environmental science, offering practical insights into their strengths, limitations, and failure modes when applied to hydrological forecasting in the Everglades. Although evaluated in the Everglades, the proposed framework is general and can be applied to other hydrological systems given time series data. The code and data have been made publicly available at https://github.com/rahuul2992000/WaterRAF.
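
The similarity-based retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`znorm`, `retrieve_analogs`, `augment_context`), the univariate series, and the z-normalized Euclidean distance are all assumptions chosen for illustration, whereas the paper retrieves multivariate episodes and also studies a mutual information-based strategy.

```python
import math


def znorm(window):
    """Z-normalize a window so that matching depends on shape, not level."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var) or 1.0  # constant windows: avoid division by zero
    return [(x - mean) / std for x in window]


def retrieve_analogs(query, archive, k=1):
    """Return start indices of the k archive windows most similar to `query`.

    Similarity is z-normalized Euclidean distance (an assumed stand-in for
    the paper's statistical similarity measure).
    """
    q = znorm(query)
    n = len(query)
    scored = []
    for start in range(len(archive) - n + 1):
        cand = znorm(archive[start:start + n])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, cand)))
        scored.append((dist, start))
    scored.sort()
    return [start for _, start in scored[:k]]


def augment_context(query, archive, k=1):
    """Prepend the retrieved historical episodes to the query window."""
    n = len(query)
    context = []
    for start in retrieve_analogs(query, archive, k):
        context.extend(archive[start:start + n])
    return context + list(query)


# Hypothetical water-level archive and a rising query window.
archive = [5, 5, 5, 1, 2, 3, 3, 3, 9, 7, 5]
query = [10, 20, 30]
enriched = augment_context(query, archive, k=1)
```

In this toy setup the enriched sequence would be what gets passed to a pre-trained forecasting model in place of the raw query window; how the real system formats that context is not specified in the abstract.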

04.
arXiv (CS.CL) 2026-06-15

Large Language Model Agents Are Not Always Faithful Self-Evolvers

Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 13 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

05.
arXiv (CS.LG) 2026-06-17

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

arXiv:2606.17414v1 Announce Type: new Abstract: Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that construct a forward-invariant safe set. Previous work has shown that learning class-$\mathcal{K}$ functions defining the ICCBF recursion via meta reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework further by investigating the performance of three recurrent network architectures (Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU), Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor Critic (SAC)) to identify the best setup for tuning ICCBF class-K functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba when used with PPO achieve superior task completion, safety, and fuel-savings compared to other architectures, across all cooperative and uncooperative scenarios tested.

06.
medRxiv (Medicine) 2026-06-11

Malaria Risk among Internally Mobile Individuals and Heterogeneous Mobility Patterns in Two Hypoendemic Communities: Implications for Malaria Elimination in the Peruvian Amazon.

Background: Human mobility is increasingly recognized as a key factor influencing malaria transmission dynamics, particularly in low-transmission settings approaching elimination. This study aimed to assess mobility patterns and their association with malaria risk in two hypoendemic communities in the Peruvian Amazon. Methods: A longitudinal study was conducted in the communities of Libertad and Urcomirano (Mazan River basin). Monthly population screenings were combined with weekly active and passive case detection. A total of 678 individuals were enrolled. Mobility patterns were assessed through structured questionnaires, and social network analysis was used to characterize travel connections. Log-binomial regression analysis was applied to identify risk factors associated with malaria infection. Results: Internally mobile individuals in Libertad showed a higher malaria incidence (>32.47 cases per 1,000 person-months) than those in Urcomirano (

07.
arXiv (CS.CL) 2026-06-16

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.

08.
PLOS Medicine 2026-06-01

Prenatal exposure to asthma medications and risk of neurodevelopmental disorders and educational difficulties: A systematic review and meta-analysis

by Lama A. Shakhshir, Alexia Karain, Jill P. Pell, Claire E. Hastie, Scott M. Nelson, Michael Fleming. Background: Since asthma exacerbations during pregnancy risk maternal and fetal health, continued medication is important. However, some studies have reported adverse neurodevelopmental outcomes following prenatal exposure to asthma medication. Therefore, this systematic review aimed to collate the existing evidence on the associations between prenatal exposure to asthma medication and neurodevelopmental and educational outcomes. Methods and findings: A systematic review was conducted in accordance with PRISMA guidelines and the PECO framework. PubMed, Medline and Embase databases were searched for studies investigating prenatal exposure to one or more asthma medications and neurodevelopmental or educational outcomes published, in English, between January 2003 and September 2024, and updated in November 2025. Studies of asthma medication used for other indications were excluded. Study quality was assessed using the Newcastle-Ottawa scale. Random-effects meta-analyses were conducted where appropriate and heterogeneity was evaluated using Cochran’s Q and I² tests. Of 16,824 studies identified by the initial search, seven were eligible for inclusion. All investigated beta-2-adrenergic agonists (B2AA), with one including B2AA as mono- and polytherapy, and one study also investigated inhaled corticosteroids (ICS) exposure. Two reported associations with autism spectrum disorder (ASD) and one with attention-deficit hyperactivity disorder (ADHD). An updated search identified one additional eligible study, which examined both ADHD and ASD, as well as other neurodevelopmental disorders. The included eight studies (n = 3,867,170 participants) comprised cohort (n = 5) and case-control (n = 3) designs and reported inconsistent results. Meta-analysis of three studies (n = 1,380,871) indicated significant associations with ASD for exposure to B2AA both preconception (aOR 1.34, 95% CI [1.19, 1.52]) and during pregnancy (aOR 1.29, 95% CI [1.16, 1.42]). Heterogeneity was low, with no evidence of significant publication bias. Limitations of the included studies comprised residual confounding and exposure misclassification. Additionally, studies included in the meta-analysis were few in number and did not adequately distinguish between medication effects and underlying maternal asthma. Conclusion: Meta-analysis suggested an association between prenatal exposure to B2AA and ASD. An association with ADHD, reported in a single study, requires corroboration. To date, based on our search strategy, no association has been reported with communication skills, motor skills, problem-solving and personal-social skills, or cerebral palsy.

09.
arXiv (CS.LG) 2026-06-17

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

arXiv:2606.18066v1 Announce Type: new Abstract: We introduce the Noise-Tilted Reverse Kernel (NTRK), a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the pretrained reverse kernel unchanged and requiring only a single sample per step. Reward-guided sampling at inference time has greatly expanded the versatility of pretrained diffusion models. Yet existing methods face a trade-off. Gradient-based guidance shifts the reverse mean, steering generation but pushing intermediate states outside the region that the model was trained on and degrading quality. Search-based methods preserve quality but gain no gradient signal. No prior method achieves both. NTRK resolves this by keeping the reverse mean fixed and biasing the noise term toward high reward. We introduce a whitening operator, the central mechanism behind NTRK, that makes the reward gradient safe to inject as noise without losing its guiding signal. Across various reward alignment tasks, NTRK outperforms recent state-of-the-art baselines without losing sample quality. Remarkably, on aesthetic generation, NTRK surpasses the reward of the best baseline at 500 NFEs using only 25 NFEs, a 20$\times$ reduction in compute.

10.
arXiv (quant-ph) 2026-06-16

Quantum Nonlocal Games on Graph Ensembles

arXiv:2606.16784v1 Announce Type: new Abstract: Quantum entanglement is one of the most striking discoveries in all of science. This effect allows, for instance, two spatially separated agents to coordinate their actions, without communication, to an extent that is both counter-intuitive, and provably impossible by any other physical means. A recently discovered example is that of mobile agents (players) performing spatial coordination tasks such as rendezvous, where the agents aim to meet on a network without communication. Until now, demonstrations of this advantage have relied on highly idealized conditions: agents are assumed to have complete knowledge of the topography, and experiments have been restricted to simulations using data generated by qubits within a single quantum processor. Here we address both limitations by developing a theory for graph ensembles that capture topographical uncertainty and by experimentally demonstrating the advantage in rendezvous scenarios between physically separated ion-trap systems with access to remote entanglement. Moreover, we simulate a broader set of problems on superconducting hardware. Surprisingly, when players are given the ability to gather more local information the quantum advantage increases – a feat impossible by classical means. Our findings establish a concrete route toward practical quantum advantages in motion coordination problems. More broadly, they point to a new way of using portable quantum devices to enhance collective decision-making in uncertain environments.

11.
arXiv (CS.LG) 2026-06-19

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

arXiv:2606.20437v1 Announce Type: cross Abstract: Charged-particle tracking – reconstructing trajectories from sparse detector measurements – is a fundamental high-energy-physics inference problem and a canonical example of learning under extreme combinatorial ambiguity. At the High-Luminosity Large Hadron Collider (HL-LHC), tracking must remain accurate and efficient despite unprecedented collision densities. Graph neural networks perform strongly, but incur substantial costs from graph construction and processing, while transformer-based approaches rely on auxiliary stages that prevent end-to-end optimization. To address this, we present HEPTv2, an end-to-end point-transformer architecture that reconstructs tracks from detector hits in one trainable pipeline. HEPTv2 combines a locality-aware point encoder with a track decoder that predicts complete trajectories without graph-building, clustering, or filtering. The encoder uses locality-sensitive hashing in detector coordinate space to preserve tracking-relevant geometry while enabling efficient local attention. The decoder resolves ambiguities through sectorized decoding and direct hit-to-track prediction under joint encoder-decoder supervision, allowing the full pipeline to be optimized end-to-end. On TrackML, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, while requiring only $\sim$15 ms inference time and 0.4 GB peak memory per event on an NVIDIA A100 GPU. Latency and memory scale approximately linearly for events with up to $5\times10^5$ hits. HEPTv2 establishes a new state of the art in the accuracy-latency trade-off, improving efficiency by 4.5% over the strongest prior transformer and by 1.1–2.2% over optimized graph-based pipelines, while reducing latency by factors of 7 and 38–52, respectively. These results show end-to-end transformers can deliver the accuracy and efficiency required for real-time particle reconstruction at the HL-LHC.

12.
arXiv (CS.CV) 2026-06-11

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

Vision-language models like CLIP can provide rich semantic priors for open-vocabulary object detection. However, jointly integrating both textual and visual knowledge into detection architectures remains challenging. In this paper, we propose VL-DINO, an open-vocabulary detector that enhances DINO through more effective exploitation of CLIP's vision-language knowledge. Specifically, a Query-guided Positive Sample Construction (QPSC) module is first developed to construct additional high-quality positive samples, enabling the vanilla DINO framework to better accommodate mixed training across heterogeneous data sources while providing more vision-language alignment signals, thereby incorporating richer textual knowledge during training. A Visual Semantic Encoder (VSE) module is then introduced to distill CLIP visual knowledge into backbone-extracted features, producing fused features for subsequent encoder refinement. Based on the fused features, an Object-Region Semantic Alignment (ORSA) module extracts object-centric region features and aligns them with the corresponding textual embeddings, further incorporating textual cues. In the zero-shot setting, VL-DINO-T and VL-DINO-L achieve 36.3 and 38.1 AP on the LVIS benchmark, respectively, consistently outperforming prior advanced approaches. Extensive experiments demonstrate the effectiveness and competitive performance of the proposed design.

13.
arXiv (quant-ph) 2026-06-12

Invariant Measures and Weak-Magic-Injection Asymptotics in Random Monitored Quantum Circuits

arXiv:2606.13470v1 Announce Type: new Abstract: Monitored quantum circuits provide a natural setting in which scrambling, measurements, and measurement-conditioned updates compete within a stochastic many-body dynamics. From the viewpoint of nonstabilizer resource theory, this competition is especially relevant because Clifford-compatible operations preserve the stabilizer structure, while weak non-Clifford perturbations inject magic resource. Most of the existing understanding of monitored quantum circuits has been shaped by numerical simulations and phenomenological descriptions, while a rigorous dynamics theory remains less developed. In this paper, we address this gap by developing an analytical framework which lays a rigorous mathematical foundation for the study of random monitored quantum dynamics. Specifically, we study a class of monitored quantum circuits driven by random Clifford. We prove the existence and uniqueness of the stationary law, which gives an ergodic description of the long-time dynamics. We then resolve the leading asymptotics of steady magic in the weak-magic-injection limit. This tangent description makes the contrast between resource measures transparent: in odd-prime local dimension, the steady Gross–Wigner mana has a linear leading asymptotic, whereas in qubit systems the steady 2-stabilizer Rényi entropy has a quadratic leading asymptotic. These different powers reflect the distinct local geometries of the two resource measures near the stabilizer layer. In this way, this work develops an analytical framework that first establishes the stationary ergodic dynamics of random monitored quantum circuits.

14.
arXiv (math.PR) 2026-06-16

Free energy of non-convex multi-species spin glasses with centered Ising spins

arXiv:2606.16636v1 Announce Type: new Abstract: We identify the limit free energy of all multi-species spin glasses with centered $\pm 1$ spins. The result was previously known only under a convexity assumption on the covariance function of the Hamiltonian. We also obtain a one-species reduction of the formula for balanced multi-species models.

15.
arXiv (CS.AI) 2026-06-12

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

arXiv:2606.13311v1 Announce Type: cross Abstract: Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can be critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
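
The rarity-gated modulation idea in this abstract can be sketched roughly as follows. This is an illustrative reading, not the RGFiLM module itself: the rarity score (one minus empirical frequency), the sigmoid gate with its `sharpness` and `midpoint` parameters, and the interpolation between the unmodulated features and the FiLM output are all assumptions, as are the function names and the interpretation that the gate opens more strongly for rare contexts.

```python
import math
from collections import Counter


def rarity_scores(contexts):
    """Rarity of each discrete context = 1 - its empirical frequency (assumed form)."""
    counts = Counter(contexts)
    total = len(contexts)
    return {c: 1.0 - n / total for c, n in counts.items()}


def gate(rarity, sharpness=10.0, midpoint=0.5):
    """Sigmoid gate in (0, 1); larger for rarer contexts (hypothetical parameters)."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - midpoint)))


def gated_film(hidden, gamma, beta, g):
    """Blend identity and FiLM: h + g * ((gamma * h + beta) - h), feature-wise."""
    return [h + g * (gm * h + b - h) for h, gm, b in zip(hidden, gamma, beta)]


# Hypothetical context log: nine "calm" observations and one "storm".
rarity = rarity_scores(["calm"] * 9 + ["storm"])
hidden = [1.0, -2.0]
gamma, beta = [2.0, 0.5], [0.5, 0.0]  # in practice predicted from the context

frequent = gated_film(hidden, gamma, beta, gate(rarity["calm"]))
rare = gated_film(hidden, gamma, beta, gate(rarity["storm"]))
```

With this gate, features observed under the frequent context stay close to their unmodulated values, while the rare context receives nearly the full scale-and-shift.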

16.
arXiv (CS.AI) 2026-06-12

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

arXiv:2606.12945v1 Announce Type: new Abstract: Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency – both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function $V(m)=\sum_i w_i f_i(m)$ over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at $\approx 0.98$ – this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains $0.770 \pm 0.011$ of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable – reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
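
The linear value function in this abstract, a weighted sum of seven factor scores used to decide what to keep under a fixed budget, can be sketched as below. The `Memory` class, the factor identifiers, the `consolidate` helper, and the hand-set weights are hypothetical; in the paper the weights are learned by a gradient-free optimiser, which is not reproduced here.

```python
from dataclasses import dataclass, field

# The seven factors named in the abstract (identifier spellings are assumed).
FACTORS = (
    "emotional_intensity", "goal_relevance", "value_alignment",
    "self_user_relevance", "task_utility", "reliability", "usage_history",
)


@dataclass
class Memory:
    text: str
    factors: dict = field(default_factory=dict)  # factor name -> score


def memory_value(memory, weights):
    """V(m) = sum_i w_i * f_i(m); missing factor scores count as 0."""
    return sum(weights[name] * memory.factors.get(name, 0.0) for name in FACTORS)


def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(memories, key=lambda m: memory_value(m, weights), reverse=True)
    return ranked[:budget]


# Placeholder weights that favour reliability (not the paper's learned values).
weights = {name: 0.0 for name in FACTORS}
weights["reliability"] = 1.0

store = [
    Memory("unverified rumour", {"reliability": 0.2, "goal_relevance": 1.0}),
    Memory("confirmed user preference", {"reliability": 0.9}),
]
kept = consolidate(store, weights, budget=1)
```

Because the same scalar ranks every memory, the ordering produced by `memory_value` could in principle also drive encoding depth and retrieval rank, as the abstract describes; only the forgetting decision is shown here.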

17.
Nature (Science) 2026-06-17

A blastoporal organizer in a ctenophore

In an iconic experiment in 1924, Hilde Mangold and Hans Spemann established that the dorsal blastopore lip of amphibian embryos functions as an organizer and induces a secondary body axis when transplanted into a host embryo [1]. This discovery demonstrated that specific embryonic regions can regulate embryonic patterning and lead to the establishment of an entire body axis. Subsequent studies have revealed that cnidarians, the sister group to Bilateria, also possess a blastoporal embryonic organizer [2,3]. However, the evolutionary origin of the organizer remains unclear. Here we report that the blastopore lip of the ctenophore Mnemiopsis leidyi, a member of the evolutionary sister group to all other metazoans [4,5], exhibits organizer activity. We show that transplanted fragments of blastopore lip tissue from M. leidyi gastrula induce secondary pharynx and mouth formation. Moreover, transphyletic transplantation experiments show that the blastopore lip of M. leidyi leads to the generation of a secondary body axis in embryos of the cnidarian Nematostella vectensis. Organizer function in M. leidyi requires both β-catenin and TGFβ signalling, and the TGFβ-family ligands probably provide this inductive capacity. These findings reveal the deep homology of the blastoporal organizer in ctenophores, cnidarians and vertebrates, implying the ancestral organizer role of the blastopore lip. We propose that the emergence of the organizer was an essential innovation that facilitated the change from the temporal cell differentiation of unicellular relatives to the spatial cell differentiation of the first multicellular embryo. Experiments using the comb jelly Mnemiopsis leidyi and the sea anemone Nematostella vectensis reveal that the emergence of a core signalling pathway may have been a key innovation enabling the transition to multicellularity in animals.

18.
medRxiv (Medicine) 2026-06-18

Can Vision-Language Models See the Vital Signs? Benchmarking and Fine-Tuning for Intraoperative Monitor Reading

Background: Vital-sign deterioration is a leading contributor to preventable perioperative death, yet manual monitor reading is intermittent, error-prone, and subject to alarm fatigue. Automating this perceptual step could enable continuous surveillance, but existing solutions depend on device-specific hardware integration or cloud-hosted vision-language models (VLMs), which raise privacy, cost, and connectivity barriers in resource-limited healthcare facilities. Methods: We constructed a benchmark of 200 in-the-wild intraoperative monitor photographs (spanning multiple vendors, angles, and illumination conditions) annotated for eight vital-sign parameters: heart rate, SpO2, ETCO2, respiratory rate, systolic/diastolic/mean blood pressure, and temperature. We evaluated an optical character recognition (OCR)-based pipeline, nine instruction-tuned VLMs (four commercial, five open-weight ranging from ≤4B to 31B parameters) under two prompting regimes, and a compact open model (Qwen3.5-9B) adapted via low-rank fine-tuning (LoRA, 0.46% of parameters updated). Results: Under a domain-aware prompt, frontier VLMs reached 0.98-0.997 exact-match accuracy zero-shot, whereas the OCR pipeline and ≤4B model scored approximately 0.20 lower, defining a 9B-class usable floor. LoRA fine-tuning Qwen3.5-9B on 80-120 images raised accuracy from 0.953 to 0.994 (statistically indistinguishable from the best commercial model) and reduced the critical-error rate fivefold (0.0313 → 0.0063). Ablations showed that performance saturated at 80 training images and rank-8 adapters. Conclusion: Monitor reading is a solved perception problem for VLMs above the 9B scale. A lightweight fine-tuned open model achieves frontier accuracy while running entirely on local hardware, preserving data privacy, offline capability, and near-zero marginal cost. Residual errors stem from blood-pressure source ambiguity and are addressable with explicit disambiguation logic.

19.
arXiv (CS.AI) 2026-06-12

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

arXiv:2606.12425v1 Announce Type: cross Abstract: Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

20.
arXiv (CS.AI) 2026-06-19

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

arXiv:2506.14990v3 (replacement). Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as CPU-bound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks in a few hours on a single GPU. We find that long task sequences reveal failure modes that do not appear at smaller scales.

21.
arXiv (math.PR) 2026-06-19

The t-Split Two-Periodic Aztec Diamond Model

arXiv:2606.19507v1 (new). In this work we consider an Aztec diamond model split into two unequal regions which are asymptotically fixed in size. Each region is weighted with a distinct two-periodic weighting. We refer to this model as the t-split two-periodic Aztec diamond, to distinguish it from the previous work titled "Split Two-Periodic Aztec Diamond", where the model was split into two equal regions. We derive an integral expression for the correlation kernel of the model and give a partial description of the scaling limit behavior, along with a conjecture for the remainder. We refer to the larger and smaller sides of the model as the dominant and non-dominant sides, and to the location of the weight change as the interface. The dominant side exhibits a limit shape that depends only on its own weighting and is identical to that of the two-periodic Aztec diamond, while the non-dominant side appears to have a novel limit shape that depends on both weightings and the location of the interface. Lastly, we consider the complete limit shape in the case where the dominant side two-periodic parameter goes to 0.

22.
arXiv (CS.AI) 2026-06-19

ScaffoldAgent: Utility-Guided Dynamic Outline Optimization for Open-Ended Deep Research

arXiv:2606.20122v1 (new). Open-ended deep research (OEDR) requires systems to acquire knowledge through multi-round retrieval and generate coherent long-form reports. The outline plays a central role as a structural scaffold that coordinates retrieval, evidence organization, and generation. However, existing methods either fix the outline before writing or refine it with local heuristics, leading to scaffold drift under continuous information accumulation and delayed feedback for evaluating outline modifications. We propose ScaffoldAgent, a utility-guided dynamic outline optimization framework for OEDR. ScaffoldAgent models outline evolution as a structured decision process with three operations: Expansion, Contraction, and Revision, enabling controlled updates to the report scaffold. It further introduces a utility-guided feedback mechanism that estimates the downstream value of each outline operation from retrieval gain, structural coherence, and trial-generation quality. The resulting utility signal guides node selection, operation scheduling, and termination during inference. Experiments on DeepResearch Bench and DeepResearch Gym show that ScaffoldAgent consistently improves long-form report generation and factual grounding over existing deep research agents.

23.
arXiv (CS.LG) 2026-06-19

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

arXiv:2606.20537v1 (new). Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

24.
arXiv (CS.CL) 2026-06-17

FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We release our code and curated datasets at: https://github.com/BBeeChu/FeedEval.git.

25.
arXiv (CS.CV) 2026-06-15

FEMOT: Multi-Object Tracking using Frame and Event Cameras

Conventional RGB cameras have been widely used in multi-object tracking due to their ability to capture rich appearance and semantic information. However, their performance is often degraded under complex real-world challenges, such as motion blur, low illumination, and overexposure. Bio-inspired event cameras offer high temporal resolution and high dynamic range, providing complementary cues under extreme scenarios. Nevertheless, RGB-event multi-object tracking remains underexplored due to the lack of large-scale and well-annotated datasets. To address this issue, we propose FEMOT, a large-scale RGB-event multi-object tracking dataset that covers diverse real-world scenarios and 14 challenging attributes. With both RGB and event data as well as high-quality annotations, FEMOT provides a reliable platform for systematically evaluating RGB-event multi-object tracking methods. Based on FEMOT, we retrain and evaluate over ten strong trackers, thereby establishing a comprehensive benchmark for future research. Furthermore, we propose FEMOTR, a multimodal tracking framework that decouples RGB and event features and fuses them in the frequency domain, thereby effectively exploiting their complementary characteristics for robust object localization and identity association. Extensive experiments on FEMOT and DSEC-MOT datasets demonstrate the effectiveness of the proposed method. The source code and benchmark dataset have been released on https://github.com/Event-AHU/FEMOT.