Exploiting Formal Concept Analysis for Data Modeling in Data Lakes

Bibliographic Details
Published in	arXiv.org
Main Authors	Bendimerad, Anes; Mathonat, Romain; Remil, Youcef; Kaytoue, Mehdi
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 11.08.2024
Subjects	Computer Science - Artificial Intelligence; Computer Science - Databases; Data analysis; Data structures; Qualitative analysis; Scientific visualization; Unstructured data

Summary:	Data lakes are widely used to store large, heterogeneous datasets for advanced analytics. However, the unstructured nature of the data in these repositories makes it difficult to exploit them and extract meaningful insights. This motivates the need to explore efficient approaches for consolidating data lakes and deriving a common, unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies, top-down and bottom-up, to unify these structures and establish a common schema. Our methodology enables the identification of common concepts in the data structures, such as resources along with their underlying shared fields (timestamp, type, usedRatio, etc.). Moreover, the number of distinct data structure field names is reduced by 54 percent (from 190 to 88) in the studied subset of our data lake. We fully cover 80 percent of the data structures with only 34 distinct field names, down from the 121 field names initially needed to reach the same coverage. The paper describes the Infologic ecosystem, the problem formulation, and the exploration strategies, and presents both qualitative and quantitative results.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2408.13265
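
Illustration:	The summary describes representing data structures as FCA objects and analyzing the resulting concept lattice. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that field names play the role of FCA attributes, which the summary implies but does not state outright. The structure names (cpu, disk, http_log) and the fields path and status are hypothetical; timestamp, type, and usedRatio are the shared fields named in the summary. Concepts are enumerated by brute force over subsets of objects, which is only practical for very small contexts.

```python
from itertools import combinations

# Hypothetical formal context: data structure (object) -> field names (attributes).
context = {
    "cpu": {"timestamp", "type", "usedRatio"},
    "disk": {"timestamp", "type", "usedRatio", "path"},
    "http_log": {"timestamp", "status"},
}


def common_fields(objects, context):
    """Return the fields shared by every object in `objects`.

    For an empty set of objects this is the set of all fields.
    """
    shared = set().union(*context.values())
    for name in objects:
        shared &= context[name]
    return shared


def objects_having(fields, context):
    """Return the objects whose field set contains all of `fields`."""
    return {name for name, own in context.items() if fields <= own}


def formal_concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Brute force: close every subset of objects. Exponential in the
    number of objects, so suitable for toy examples only.
    """
    concepts = set()
    names = list(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_fields(subset, context)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


if __name__ == "__main__":
    # Print concepts from the most general (largest extent) to the most specific.
    for extent, intent in sorted(formal_concepts(context), key=lambda c: -len(c[0])):
        print(sorted(extent), sorted(intent))
```

In this toy context, the concept whose extent is {cpu, disk} carries the intent {timestamp, type, usedRatio}, which is the kind of shared "resource" pattern the summary mentions. A real data lake would call for a dedicated lattice-construction algorithm rather than subset enumeration.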