MODis: Generating Skyline Datasets for Data Science Models

Mar 25, 2025 · Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu
Abstract
Preparing high-quality datasets for data science models is a critical yet challenging task. Traditional data discovery approaches often focus on a single optimization criterion, which can introduce bias and limit downstream performance. This paper introduces MODis, a multi-objective data discovery framework that generates skyline datasets tailored to user-defined model performance measures (e.g., accuracy, training cost). MODis formalizes the skyline dataset generation problem using a finite state transducer (FST), and provides three algorithms: (1) a reduce-from-universal strategy that prunes unpromising data while approximating Pareto-optimal sets, (2) a bi-directional search that interleaves augmentation and reduction with correlation-based pruning, and (3) a diversification algorithm to mitigate dataset bias. Experiments with benchmark datasets and scientific applications show that MODis can improve model accuracy by up to 2× while reducing training cost, outperforming baseline data integration and feature selection methods.
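To make the notion of a skyline (Pareto-optimal) set concrete, the following is a minimal sketch of plain Pareto dominance over two of the measures named in the abstract, accuracy and training cost. It is not one of the three MODis algorithms; the `Candidate` type, the function names, and the numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A candidate dataset with two measured outcomes (hypothetical values)."""
    name: str
    accuracy: float       # higher is better
    training_cost: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both measures and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.training_cost <= b.training_cost
    strictly_better = a.accuracy > b.accuracy or a.training_cost < b.training_cost
    return no_worse and strictly_better


def skyline(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (brute force, O(n^2))."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up measurements, for illustration only.
candidates = [
    Candidate("D1", accuracy=0.90, training_cost=120.0),
    Candidate("D2", accuracy=0.85, training_cost=60.0),
    Candidate("D3", accuracy=0.80, training_cost=90.0),   # dominated by D2
    Candidate("D4", accuracy=0.90, training_cost=150.0),  # dominated by D1
]

for c in skyline(candidates):
    print(c.name, c.accuracy, c.training_cost)
```

In this toy input, D1 and D2 remain because each is better than the other on one measure, while D3 and D4 are each dominated. The brute-force comparison is only practical for small candidate sets; the abstract's pruning and bi-directional search strategies address the case where the space of candidate datasets is too large to enumerate.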
Type
Publication
In Proceedings of the 28th International Conference on Extending Database Technology (EDBT)