Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Bibliographic Details
Published in	arXiv.org
Main Authors	Fan, Grace; Wang, Jin; Li, Yuliang; Zhang, Dan; Miller, Renée
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 15.01.2023
Subjects	Coders; Datasets; Lakes; Learning; Query processing; Searching; Semantics

Summary:	Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We use the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices for computing the unionability score between two tables. Empirical evaluation results on real table benchmark datasets show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
ISSN:	2331-8422
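
Illustration:	The summary describes scoring column unionability by the cosine similarity of column embedding vectors and then deriving a table-level score from those column scores. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the toy two-dimensional embeddings, and the greedy one-to-one pairing used for aggregation are all assumptions made for illustration; the summary only says that several design choices for the table-level score are possible.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_table_score(query_cols, candidate_cols):
    """Hypothetical aggregation: pair columns one-to-one, best score first, and sum the paired scores.

    This greedy pairing is an assumption chosen for brevity; it is only one
    of many possible ways to turn column scores into a table score.
    """
    scored_pairs = sorted(
        (
            (cosine_similarity(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(candidate_cols)
        ),
        reverse=True,
    )
    used_query, used_candidate, total = set(), set(), 0.0
    for score, i, j in scored_pairs:
        if score <= 0.0:
            break  # remaining pairs cannot add to the total
        if i not in used_query and j not in used_candidate:
            used_query.add(i)
            used_candidate.add(j)
            total += score
    return total


# Toy column embeddings (made up); a real system would obtain them from a trained column encoder.
query_table = [[1.0, 0.0], [0.0, 1.0]]
candidate_table = [[0.0, 2.0], [3.0, 0.0]]
table_score = greedy_table_score(query_table, candidate_table)
```

In this toy case each query column has one candidate column pointing in the same direction, so both pairs are matched. At data-lake scale, computing every pairwise score like this corresponds to the linear scan that the summary says an HNSW index is used to avoid.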