Query-centric regression
Regression Models (RMs) and Machine Learning models (ML) in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, overfitting a model, with respect to the current DB state, may be best suited to offer e...
Saved in:
Published in | Information systems (Oxford) Vol. 104; p. 101736 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published |
Oxford
Elsevier Ltd
01.02.2022
Elsevier Science Ltd |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Regression Models (RMs) and Machine Learning models (ML) in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, overfitting a model, with respect to the current DB state, may be best suited to offer excellent accuracy. This overfit-generalize divide bears many practical implications faced by a data analyst. The paper will reveal, shed light, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, which results in different RMs being better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail 8 real-life data sets and from the TPC-DS benchmark and experiment with various dimensionalities therein. It will employ new appropriate metrics that will reveal the performance differences of RMs and will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art, ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric, model that addresses this problem, improving per-query accuracy, while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.
•Studies the impedance mismatch problem between regression models and DBMS analytics.•A query-centric view and solution ensuring near-optimal performance for each query.•New metrics for evaluating query-centric performance of regression models for DBMS. |
---|---|
ISSN: | 0306-4379 1873-6076 |
DOI: | 10.1016/j.is.2021.101736 |