Input selection for fast feature engineering
The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw d...
Saved in:
Published in | 2016 IEEE 32nd International Conference on Data Engineering (ICDE) pp. 577 - 588 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.05.2016
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering-the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm-is a tedious, time-consuming experience. Because "big data" inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the "inner loop" of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others. |
---|---|
DOI: | 10.1109/ICDE.2016.7498272 |