Rough Set Theory as a Data Mining Technique: A Case Study in Epidemiology and Cancer Incidence Prediction

Bibliographic Details
Published in	Machine Learning and Knowledge Discovery in Databases Vol. 11053; pp. 440-455
Main Authors	Chelly Dagdia, Zaineb; Zarges, Christine; Schannes, Benjamin; Micalef, Martin; Galiana, Lino; Rolland, Benoît; de Fresnoye, Olivier; Benchoufi, Mehdi
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2019
Series	Lecture Notes in Computer Science
Subjects	Application; Big data; Cancer incidence prediction; Epidemiology; Feature selection; Rough set theory

More Information
Summary:	A big challenge in epidemiology is to perform data pre-processing, specifically feature selection, on large-scale data sets with a high-dimensional feature set. In this paper, this challenge is tackled by using a recently established distributed and scalable version of Rough Set Theory (RST). It considers epidemiological data that has been collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set used aggregates about 5 495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real-world big data applications as it can offer insights into the selected risk factors, speed up the learning process, ensure the performance of the cancer incidence prediction model without huge information loss, and simplify the learned model for epidemiologists. Code related to this paper is available at: https://github.com/zeinebchelly/Sp-RST.
Bibliography:	This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 702527. This work was based on a first version of a database provided by the OpenCancer organization, part of Épidemium, a data challenge-oriented and community-based open science program. Additional thanks go to the Épidemium group, Roche, La Paillasse and to the Supercomputing Wales project, which is part-funded by the European Regional Development Fund via the Welsh Government.
ISBN:	9783030109967 3030109968
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-10997-4_27