arXiv (CS.AI)
2026-06-18 12:00
arXiv:2605.21528
A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction
Abstract
arXiv:2605.21528v2 Announce Type: replace-cross
Accurate disease risk prediction is challenged by heterogeneous features, limited data, and class imbalance. This study presents yvsoucom-iterkit, a deterministic AutoML framework that models pipeline optimization as a configuration-level system with full reproducibility and traceable execution logs, enabling systematic analysis of component attribution, interactions, similarity, and cross-seed robustness.
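As a rough illustration of what a deterministic, log-driven configuration search could look like, the sketch below enumerates a small search space in a fixed order and writes one log record per run. It is not the yvsoucom-iterkit API: the component names in `SEARCH_SPACE`, the helpers `enumerate_configs`, `config_id`, and `run_all`, and the placeholder scorer `dummy_evaluate` are all assumptions made for this example.

```python
import hashlib
import itertools
import json
import random

# Hypothetical search space; these component names are placeholders,
# not the options exposed by the framework described in the abstract.
SEARCH_SPACE = {
    "scaler": ["none", "standard"],
    "sampler": ["none", "random_over"],
    "model": ["logreg", "forest"],
}


def enumerate_configs(space):
    """Yield every configuration in a fixed order (keys sorted)."""
    keys = sorted(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def config_id(config, seed):
    """Stable identifier derived from the configuration and the seed."""
    payload = json.dumps({"config": config, "seed": seed}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_all(space, seed, evaluate):
    """Evaluate each configuration and return one log record per run."""
    log = []
    for config in enumerate_configs(space):
        run_id = config_id(config, seed)
        # Each run gets its own generator, seeded from the run identifier,
        # so results do not depend on execution order.
        rng = random.Random(f"{seed}:{run_id}")
        log.append({
            "id": run_id,
            "seed": seed,
            "config": config,
            "score": evaluate(config, rng),
        })
    return log


def dummy_evaluate(config, rng):
    """Placeholder scorer standing in for model training and validation."""
    return round(rng.random(), 4)


log_a = run_all(SEARCH_SPACE, seed=0, evaluate=dummy_evaluate)
log_b = run_all(SEARCH_SPACE, seed=0, evaluate=dummy_evaluate)
print(len(log_a), log_a == log_b)
```

Because every record carries its configuration and seed, logs like these can be grouped by component after the fact, which is the kind of attribution and cross-seed analysis the abstract refers to.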
Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured yet partially redundant search space, where performance is dominated by a small subset of interacting components.
Ensemble models achieve stable performance, reaching a Weighted-F1 of 0.89 on Pima and 0.94 on Stroke. Macro-F1 reaches approximately 0.88 on Pima but drops to 0.6560 on Stroke because of severe class imbalance. Cross-seed experiments show that ensembles have lower variance than single models. A Friedman test ($p < 0.05$) indicates statistically significant ranking differences across configurations.
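The gap between Weighted-F1 and Macro-F1 on an imbalanced dataset can be reproduced on toy data. The sketch below uses plain Python and invented labels (95 negatives, 5 positives), not the Stroke data; the helper names `f1_per_class`, `macro_f1`, and `weighted_f1` are assumptions for this example.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return the F1 score of each label that appears in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data: the classifier handles the majority class well
# and finds only 1 of the 5 minority samples.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 93 + [1] * 2 + [0] * 4 + [1] * 1

print(f1_per_class(y_true, y_pred))  # {0: 0.96875, 1: 0.25}
print(macro_f1(y_true, y_pred))      # 0.609375
print(weighted_f1(y_true, y_pred))   # 0.9328125
```

The weighted score stays high because the majority class dominates the average, while the macro score exposes the weak minority-class F1.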
Analysis of component attribution, interaction, and similarity shows that the best configuration design is dataset-dependent. On the Pima dataset, removing redundant components simplifies the search space and improves computational efficiency, with the split ratio playing a key role. The Stroke dataset, in contrast, requires stronger imbalance-aware strategies: RandomOverSampler improves Macro-F1 from 0.6560 to 0.6766.
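Random oversampling, the idea behind RandomOverSampler, duplicates minority-class samples at random until the classes are balanced. The sketch below is a plain-Python stand-in written for illustration, not the imbalanced-learn implementation and not the paper's pipeline; the function name `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in sorted(counts.items()):
        indices = [i for i, value in enumerate(y) if value == label]
        # Draw with replacement from the original samples of this class.
        for _ in range(target - count):
            j = rng.choice(indices)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Toy training split: 8 majority samples and 2 minority samples.
X_train = [[float(i)] for i in range(10)]
y_train = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_train, y_train, seed=42)
print(Counter(y_res))
```

Resampling is normally applied to the training split only, so that duplicated samples do not leak into the data used for evaluation.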
These findings suggest that effective AutoML optimization depends on configuration design: carefully constraining the search space to high-impact components can improve performance, stability, and interpretability while reducing unnecessary search complexity.