Input selection for fast feature engineering

The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw d...

Full description

Saved in:
Bibliographic Details
Published in2016 IEEE 32nd International Conference on Data Engineering (ICDE) pp. 577 - 588
Main Authors Anderson, Michael R., Cafarella, Michael
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.05.2016
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering-the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm-is a tedious, time-consuming experience. Because "big data" inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the "inner loop" of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.
DOI:10.1109/ICDE.2016.7498272