Academic Intelligence · Curated Daily

Explore the Frontiers of Global Research

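AcademicHub brings together real-time literature from top journals and preprint platforms. Customize your own research radar and use large language models to automatically generate literature analysis briefs across interdisciplinary fields.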

01.
medRxiv (Medicine) 2026-06-12

The Acceptability of Three Co-Created Peer Support Interventions for People Living with Leprosy Reactions in Indonesia: A Mixed-Methods Pilot Study

Background: Leprosy reactions (LR) are immune-mediated complications associated with disability, emotional distress, and social isolation. We identified a gap in affected-individual-informed interventions that aim to improve the management of LR in healthcare settings. To address this gap, we assessed the acceptability of three peer-support interventions co-created with people affected by LR in Indonesia. Methods: Using an interactive learning and action approach, we co-created peer counselling, telesupport groups, and participatory video interventions which were piloted in an urban hospital and 13 rural community clinics. A mixed-methods design was applied with interviews, focus group discussions, and pre-post assessments involving four participant groups. Data were analyzed thematically using an acceptability framework. Results: One hundred participants were enrolled, and 92 completed the pilot intervention between November 2022 and July 2023. Qualitative findings showed that all interventions were acceptable. Peer counselling provided emotional reassurance through shared experiences and was perceived as trustworthy and supportive. Perceived burdens differed by setting, with time constraints in urban facilities and geographical barriers in rural clinics. Knowledge improved significantly among participants of peer counselling and telesupport groups in rural settings. Telesupport groups facilitated connection, information exchange, and continuity of care. Digital access and literacy limited participation for some, particularly in rural areas. The participatory video was perceived as reassuring and informative. Improvements in knowledge, attitude, practices, and mental well-being domain scores were observed among urban participants, but responses in rural settings showed less change. 
Participants and co-implementers reported increased self-efficacy, that is, participants' confidence to perform the required behaviors within the peer-support interventions, with effects shaped by intervention and setting. Conclusions: The three co-created peer-support interventions were acceptable for individuals with LR in diverse healthcare settings. These outcomes highlight the importance and effectiveness of selective, context-sensitive implementation of one or more peer-support modalities.

02.
arXiv (CS.CV) 2026-06-18

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.
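The off-grid target radii mentioned above can be illustrated with a generic three-point parabolic interpolation. This is only a minimal sketch: the function name `parabolic_minimum`, the radius grid, and the toy error curve are assumptions for illustration, not the authors' implementation or data.

```python
import numpy as np

def parabolic_minimum(radii, errors):
    """Refine the minimum of a sampled error curve off the grid.

    Fits a parabola through the best grid point and its two neighbours
    and returns the abscissa of the parabola's vertex.
    (Hypothetical helper; the paper's cached UDF error curves are not reproduced here.)
    """
    radii = np.asarray(radii, dtype=float)
    errors = np.asarray(errors, dtype=float)
    i = int(np.argmin(errors))
    # At either end of the grid there is no neighbour on one side,
    # so fall back to the on-grid radius.
    if i == 0 or i == len(radii) - 1:
        return float(radii[i])
    x0, x1, x2 = radii[i - 1], radii[i], radii[i + 1]
    y0, y1, y2 = errors[i - 1], errors[i], errors[i + 1]
    # Standard three-point formula for the vertex of the interpolating parabola.
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0.0:
        # The three points are collinear, so no vertex exists; keep the grid value.
        return float(x1)
    return float(x1 - 0.5 * num / den)

# Toy error curve (an assumption) whose true minimum lies between grid points.
grid = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
toy_errors = (grid - 0.37) ** 2
refined_radius = parabolic_minimum(grid, toy_errors)
```

Because the toy curve is itself quadratic, the refined radius recovers the off-grid minimum up to floating-point error; on a real error curve the result is only a local approximation.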

03.
arXiv (CS.CL) 2026-06-18

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS (the Local Ordinance Corpus for the United States), a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

04.
bioRxiv (Bioinfo) 2026-06-11

Machine Learning-Guided Discovery of Bacterial-Selective Membrane-Active Compounds Reveals Mechanistic Bias in Antibiotic Training Datasets

The rise of antibiotic resistance necessitates the discovery of antibacterial compounds with novel mechanisms of action (MoAs). Recent machine learning approaches have shown promise in antibacterial compound discovery, but often identify derivatives of known antibiotic classes rather than mechanistically novel compounds. Previous approaches applied Tanimoto similarity filters at the end of screening pipelines, but this method has substantial drawbacks: Tanimoto similarity can be misleading in chemical space, and post-hoc filtering does not influence what activity models learn to prioritize. Here, we present a machine learning pipeline that addresses chemical novelty upfront by employing an XGBoost-based MoA classifier to explicitly prioritize compounds predicted to have mechanisms distinct from known antibiotic classes, combined with graph neural networks for antibacterial activity and toxicity prediction. Applied to the Zinc20 database, our approach successfully identified non-toxic antibacterial compounds structurally distinct from known antibiotics. Notably, the majority of these hits exhibited membrane-targeting activity with selectivity for bacterial cells over mammalian cells, suggesting potential for next-generation membrane-active antibiotics. However, we did not identify compounds with novel protein targets. Systematic analysis revealed that this limitation stems from mechanistic bias in training data rather than model architecture. Specifically, our activity model learned to preferentially score compounds similar to specific groups in the training data, thus overrepresenting certain MoA classes including membrane-active compounds. Even substantial model architecture and training data enhancements did not overcome this constraint. Our findings demonstrate that the primary bottleneck for discovering mechanistically novel antibiotics is the scarcity of diverse, mechanistically-annotated training data. 
This work provides both a methodological framework for mechanism-aware screening and critical insights into data requirements for genuinely novel antibiotic discovery.

05.
arXiv (math.PR) 2026-06-15

Stationary measures for higher spin vertex models on a strip

arXiv:2309.04897v2. We introduce a higher spin vertex model on a strip with fused vertex weights. This model can be regarded as a generalization of both the unfused six-vertex model on a strip arXiv:2212.09111 and an 'integrable two-step Floquet dynamics' model introduced in arXiv:1711.08884. We solve for the stationary measure using a fused version of the matrix product ansatz and then characterize it in terms of the Askey-Wilson process. Using this characterization, we obtain the limits of the mean density along an arbitrary down-right path. It turns out that all these models share a common phase diagram, which, after an appropriate mapping, matches the phase diagram of open ASEP. This provides evidence for the universality of this phase diagram.

06.
arXiv (CS.CL) 2026-06-18

REVES: REvision and VErification–Augmented Training for Test-Time Scaling

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

07.
arXiv (math.PR) 2026-06-19

The systole of random hyperbolic 3-manifolds

arXiv:2406.11783v2. We study the systole of a model of random hyperbolic 3-manifolds introduced by Petri and Raimbault, answering a question posed in that same article. These are compact manifolds with boundary constructed by randomly gluing truncated tetrahedra along their faces. We prove that the limit, as the volume tends to infinity, of the expected value of their systole exists and we give a closed formula for it. Moreover, we compute a numerical approximation of this value.

08.
arXiv (CS.AI) 2026-06-16

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

arXiv:2606.15231v1. Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details and dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

09.
arXiv (CS.LG) 2026-06-15

Classification of Astronomical Spectra Using PCA-Compressed Flux and Inverse-Variance Features

arXiv:2606.13978v1. This paper evaluates a signal-processing and supervised-learning pipeline for classifying SDSS DR17 astronomical spectra into stars, galaxies, and quasars. Each spectrum is represented by its measured flux and inverse-variance information, combining spectral shape with a wavelength-dependent reliability profile. After resampling onto a common logarithmic wavelength grid, the flux and inverse-variance vectors are standardized and separately compressed using principal component analysis. The resulting components are concatenated and used to train several classifiers. The best performance was obtained with the LightGBM gradient-boosting classifier, reaching $94.6\%$ accuracy and $92.1\%$ balanced accuracy on the test set.
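The feature-construction step described above (standardize, compress flux and inverse variance separately with PCA, then concatenate) might look roughly like the following NumPy sketch. The function names, component counts, and random stand-in arrays are assumptions for illustration; the paper's actual settings and data loading are not shown, and the classifier stage is omitted.

```python
import numpy as np

def fit_standardized_pca(X, n_components):
    """Fit per-wavelength standardization plus PCA (via SVD) on X of shape (n_spectra, n_bins)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant wavelength bins
    Z = (X - mean) / std
    # Z is centred, so the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_components]

def apply_standardized_pca(X, params):
    """Project X onto the principal axes fitted by fit_standardized_pca."""
    mean, std, components = params
    return ((X - mean) / std) @ components.T

def build_features(flux, ivar, k_flux=8, k_ivar=4):
    """Compress flux and inverse variance separately, then concatenate the components."""
    flux_params = fit_standardized_pca(flux, k_flux)
    ivar_params = fit_standardized_pca(ivar, k_ivar)
    return np.hstack([
        apply_standardized_pca(flux, flux_params),
        apply_standardized_pca(ivar, ivar_params),
    ])

# Random stand-in data (an assumption): 50 spectra already resampled to 200 bins.
rng = np.random.default_rng(0)
toy_flux = rng.normal(size=(50, 200))
toy_ivar = rng.uniform(0.1, 1.0, size=(50, 200))
features = build_features(toy_flux, toy_ivar)
```

In practice the standardization and PCA parameters would be fitted on the training split only and reused for the test split; the sketch fits and transforms the same array purely for brevity.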

10.
arXiv (CS.AI) 2026-06-19

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

arXiv:2606.05833v2. Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

11.
arXiv (math.PR) 2026-06-16

Interplay of insurance and financial risks in a non Levy-Renewal environment

arXiv:2606.15596v1. In this paper we consider a multivariate risk model, with common counting process and common process of logarithmic returns for the investment portfolio. We assume that the claim-vectors, the counting process and the logarithmic returns of the investment portfolio satisfy a weak dependence structure. Further, we consider that the counting process represents an inhomogeneous renewal process, and the logarithmic returns represent a càdlàg process with independent but not necessarily stationary increments. Under these conditions we provide an asymptotic expression for the infinite-time entrance probability of the discounted aggregate claims into some rare set xA, where A denotes a set from a general set family, crucial for actuarial practice, when the common distribution of the claim vectors belongs to a multivariate heavy-tailed distribution class. This result is derived under a moment condition for the financial risks, and underlines the multivariate linear single big jump principle. When we restrict the distribution class of the claim-vectors to multivariate regular variation, we find more explicit asymptotic expressions, weakening the moment conditions on the financial risks. The asymptotic formulas, derived through double dependence solution, become more direct and practical in applications. With respect to the technical part, due to the non Levy-Renewal framework, neither the classical Kesten-Goldie theorems nor their extensions are applicable. The way we discretize the process of the discounted aggregate claims permits deriving uniform asymptotics with respect to the number of summands, which facilitates the approximation of the infinite sums in the main results.

12.
arXiv (quant-ph) 2026-06-16

Black Hole–Entropy Container or Creator

arXiv:2603.18374v3. Do black holes possess entropy or do they create it? The dominant assumption is that they possess entropy, and as they evaporate that entropy is emitted and decreases. In this paper I use a model of a linear amplifier, in which I argue that the amplifier has no entropy and yet emits entropy in the process of its operation. This model is closely related to the behaviour of black holes, leading to the answer to the question in the title: black holes do not have entropy, but they nevertheless create and emit entropy, with the total entropy emitted being the same as the usual expression proportional to the square of the mass of the black hole.

13.
arXiv (CS.CV) 2026-06-11

EventRadar: Long-Range Visual UAV Discovery through Spatiotemporal Event Sensing

Unauthorized unmanned aerial vehicle (UAV) activity around airports, public venues, and other sensitive sites has made protected-airspace monitoring increasingly important. A practical sensing system must search a wide angular region, find small long-range targets, and return both bearing support and UAV-specific evidence before a restricted perimeter is breached. Existing UAV detection paths often rely on spatially organized evidence, such as body extent, silhouette, or track continuity. At long range, however, these cues become difficult to preserve and verify as the target footprint weakens and its image-plane support shrinks. EventRadar follows a complementary cue: propeller-induced temporal periodicity, which recent event-camera sensing studies have shown can reveal UAV-specific motion after appearance becomes weak. We extend this cue to kilometer-scale active sensing with an event-camera prototype. Scene-Anchored Geometry Evidence (SAGE) fuses scanning events with IMU pose to maintain a bearing-indexed scene memory, separating transient candidate support from persistent background clutter. Comb-guided Harmonic-Group Learned Iterative Shrinkage and Thresholding Algorithm (CHG) then treats each candidate as a weak high-rate timing signal and recovers phase-insensitive harmonic evidence with fixed compute. Compared with related event-camera baselines on 700-1500 m UAV event recordings, EventRadar achieves 0.990 mAP$_{.3}$ and 0.949 F1$_{.3}$, reduces FN$_{.3}$ to 0.009, and shows real-time feasibility in prototype profiling.

14.
arXiv (CS.CV) 2026-06-16

Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

15.
arXiv (CS.AI) 2026-06-12

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

arXiv:2606.13211v1. AI systems are being deployed across medical imaging faster than their failure modes are understood. At present, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

16.
arXiv (CS.CL) 2026-06-16

JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.

17.
medRxiv (Medicine) 2026-06-17

A multistate model of frailty progression after severe infections in adults >=65 years in England: a matched-cohort study

Background Evidence on frailty progression following severe infections is limited. We compared rates of transition to greater frailty or death between adults with and without severe infection in England. Methods We conducted a matched-cohort study among adults aged ≥65 years (1,452,117: median age 76 years, 45% male) in Clinical Practice Research Datalink Aurum (2006-2019). Adults with severe infection (hospitalised primarily due to infection) were matched on calendar time to individuals without severe infection on age, sex, and primary care practice. The admission date was used as the index date, and the same date was assigned to matched unexposed adults. We measured frailty using the Electronic Frailty Index, a proportion of 36 health deficits in validated categories (Fit 0-0.12, Mild >0.12-0.24, Moderate >0.24-0.36, Severe >0.36). In a time-varying Markov multistate model, we focused on forward transitions from baseline or intermediate frailty states to higher states or death. For each transition, we used Cox regression to estimate cause-specific transition hazard ratios (HR) with 95% confidence intervals (CIs), comparing adults with and without severe infection. We adjusted for baseline frailty score, age, sex, deprivation, harmful alcohol use, smoking, and primary care infection history 5 years before index date. We estimated state occupancy probabilities and expected length of stay (ELOS) in each state at year five among adults with and without severe infection. We explored effect modification by infection type. Results Across all transitions, severe infection was associated with higher adjusted hazards of transitioning to worsening frailty or death, HR, 95% CI: (fit to: mild[1.56, 1.54-1.58], moderate[2.51, 1.79-3.51], death[4.57, 4.50-4.65]; mild to: moderate[1.52, 1.50-1.53], severe[1.90, 1.43-2.52], death[2.67, 2.64-2.70]; moderate to: severe[1.40, 1.38-1.42], death[1.87, 1.85-1.90]; severe to death[1.48, 1.46-1.50]). 
Transition hazard ratios were strongest for lower respiratory tract infections, followed by sepsis, urinary tract infections, meningitis/encephalitis, gastroenteritis, and skin and soft tissue infections. At five years, adults with severe infection had higher probabilities of transitioning to greater frailty or death across all transitions and lower ELOS in each frailty state than those without severe infection. Interpretation Severe infections may accelerate frailty deterioration in older age. Prevention through vaccination, early detection, and prompt management may help mitigate this decline.
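The frailty categories quoted in the Methods can be written down as a small lookup. The sketch below assumes the score is simply the count of deficits present divided by 36, as the abstract's wording suggests; the function name `efi_category` is hypothetical and this is not the study's code.

```python
def efi_category(deficits_present, total_deficits=36):
    """Map a deficit count to the frailty category thresholds quoted in the abstract.

    Thresholds: Fit 0-0.12, Mild >0.12-0.24, Moderate >0.24-0.36, Severe >0.36.
    """
    if not 0 <= deficits_present <= total_deficits:
        raise ValueError("deficit count must lie between 0 and total_deficits")
    score = deficits_present / total_deficits
    if score <= 0.12:
        return "Fit"
    if score <= 0.24:
        return "Mild"
    if score <= 0.36:
        return "Moderate"
    return "Severe"

# Example counts (illustrative only): 4/36 ≈ 0.11, 5/36 ≈ 0.14, 9/36 = 0.25, 13/36 ≈ 0.36.
example_categories = [efi_category(n) for n in (4, 5, 9, 13)]
```

With 36 deficits the category boundaries fall between whole counts (4 and 5, 8 and 9, 12 and 13), so integer deficit counts never sit exactly on a threshold.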

18.
arXiv (quant-ph) 2026-06-17

Unclonable Encryption in the Haar Random Oracle Model

arXiv:2603.11437v2. We construct unclonable encryption (UE) in the Haar random oracle model, where all parties have query access to $U,U^\dagger,U^*,U^T$ for a Haar random unitary $U$. Our scheme satisfies the standard notion of unclonable indistinguishability security, supports reuse of the secret key, and can encrypt arbitrary-length messages. That is, we give the first evidence that (reusable) UE, which requires computational assumptions, exists in "microcrypt", a world where one-way functions may not exist. As one of our central technical contributions, we build on the recently introduced path recording framework to prove a natural "unitary reprogramming lemma", which may be of independent interest.

19.
arXiv (CS.AI) 2026-06-16

Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

arXiv:2509.07605v2. Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers "as-is", without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.

20.
arXiv (CS.AI) 2026-06-16

UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

arXiv:2606.15890v1. Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, posing significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO$_2$, NO$_2$, PM$_{2.5}$, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 state-of-the-art representative MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators spanning environment and subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our codes and datasets are accessible via https://github.com/axin1301/UrbanWell-Benchmark.

21.
arXiv (CS.CV) 2026-06-15

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

22.
arXiv (CS.AI) 2026-06-18

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

arXiv:2604.13082v2. Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
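To make the task concrete, here is a minimal sketch of how a one-step Collatz example might be encoded as digit sequences in a chosen base. It assumes the standard map (n/2 for even n, 3n+1 for odd n) and most-significant-first digits; the abstract specifies neither, and the helper names `collatz_step`, `to_digits`, and `make_example` are hypothetical. One arithmetic fact that may relate to the base-24 result: because 24 is divisible by both 2 and 3, the last base-24 digit alone determines n mod 2 and n mod 3.

```python
def collatz_step(n: int) -> int:
    """One step of the standard Collatz map (assumed variant)."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_digits(n: int, base: int) -> list[int]:
    """Digits of a non-negative integer, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def make_example(n: int, base: int) -> tuple[list[int], list[int]]:
    """(input digits, target digits) pair for a sequence-to-sequence model."""
    return to_digits(n, base), to_digits(collatz_step(n), base)


if __name__ == "__main__":
    # The same underlying example, written in two different bases.
    print(make_example(27, 24))
    print(make_example(27, 2))
```

In this encoding the input-target pair for a given n changes length and digit structure with the base, which is the sense in which the base can act as an inductive bias even though the task itself is unchanged.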

24.
arXiv (CS.CV) 2026-06-12

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 calls required by direct action prediction, which reaches 32.9% SR. The same trend holds on RxR-CE. Project page: https://baobao0926.github.io/Goal2Pixel/.
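The pixel interface described above can be illustrated with a small sketch. It assumes a pinhole camera with known intrinsics and an available depth value at the predicted pixel; the abstract states neither assumption, and the canvas layout (a fixed-width band on the left, right, and bottom of the image) as well as the names `Intrinsics`, `back_project`, and `decode_prediction` are hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class Intrinsics:
    """Pinhole camera parameters (assumed known), in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def back_project(u: float, v: float, depth: float, K: Intrinsics) -> Point3D:
    """Lift an image pixel with metric depth to camera-frame coordinates."""
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)


def decode_prediction(
    u: int,
    v: int,
    width: int,
    height: int,
    margin: int,
    depth_at: Callable[[int, int], float],
    K: Intrinsics,
) -> Tuple[str, Optional[Point3D]]:
    """Map a pixel on the padded canvas to an action.

    Hypothetical layout: the image occupies columns [margin, margin + width)
    and rows [0, height); bands of `margin` pixels sit to the left, to the
    right, and below it.
    """
    if v >= height:
        return ("stop", None)
    if u < margin:
        return ("turn_left", None)
    if u >= margin + width:
        return ("turn_right", None)
    image_u = u - margin  # shift back into image coordinates
    waypoint = back_project(image_u, v, depth_at(image_u, v), K)
    return ("forward", waypoint)


if __name__ == "__main__":
    K = Intrinsics(fx=100.0, fy=100.0, cx=50.0, cy=50.0)
    flat_depth = lambda u, v: 1.0  # stand-in for a real depth lookup
    print(decode_prediction(70, 50, 100, 100, 20, flat_depth, K))
    print(decode_prediction(5, 10, 100, 100, 20, flat_depth, K))
```

A single predicted pixel can encode a waypoint several metres ahead, which is consistent with the reported drop in VLM calls per episode (46.62 / 7.75 ≈ 6).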

25.
arXiv (CS.AI) 2026-06-12

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

arXiv:2606.13311v1. Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions in which rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
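The abstract does not give RGFiLM's equations, so the following is only one possible reading of "feature-wise modulation with a rarity-controlled gate", sketched in plain Python. The rarity score (negative log empirical frequency of a discrete context), the sigmoid gate with hypothetical `slope` and `midpoint` parameters, and the interpolation between identity and full modulation are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def rarity_scores(contexts: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Assumed rarity score: -log of each context's empirical frequency."""
    counts = Counter(contexts)
    total = len(contexts)
    return {c: -math.log(n / total) for c, n in counts.items()}


def gate(rarity: float, slope: float = 1.0, midpoint: float = 1.0) -> float:
    """Sigmoid gate in (0, 1); larger for rarer contexts (assumed form)."""
    return 1.0 / (1.0 + math.exp(-slope * (rarity - midpoint)))


def gated_film(
    h: Sequence[float],
    gamma: Sequence[float],
    beta: Sequence[float],
    g: float,
) -> List[float]:
    """Interpolate between identity (g = 0) and gamma * h + beta (g = 1)."""
    return [x * (1.0 + g * (ga - 1.0)) + g * b
            for x, ga, b in zip(h, gamma, beta)]


if __name__ == "__main__":
    # Toy, made-up context labels: nine frequent and one rare.
    contexts = ["calm"] * 9 + ["storm"]
    scores = rarity_scores(contexts)
    h = [1.0, 2.0]
    gamma, beta = [2.0, 2.0], [0.5, 0.5]
    for name in ("calm", "storm"):
        g = gate(scores[name])
        print(name, round(g, 3), gated_film(h, gamma, beta, g))
```

In this sketch a frequent context leaves the hidden features close to unmodulated, while a rare context applies nearly the full scale and shift; in a trained model `gamma` and `beta` would be produced from the context by learned layers rather than fixed by hand.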