Rough Set Theory as a Data Mining Technique: A Case Study in Epidemiology and Cancer Incidence Prediction
Published in | Machine Learning and Knowledge Discovery in Databases Vol. 11053; pp. 440 - 455 |
---|---|
Main Authors | , , , , , , , |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer International Publishing AG, 2019 |
Series | Lecture Notes in Computer Science |
Subjects | |
Summary: | A big challenge in epidemiology is to perform data pre-processing, specifically feature selection, on large-scale data sets with a high-dimensional feature set. In this paper, this challenge is tackled by using a recently established distributed and scalable version of Rough Set Theory (RST). It considers epidemiological data that has been collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set used aggregates about 5 495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real-world big data applications as it can offer insights into the selected risk factors, speed up the learning process, ensure the performance of the cancer incidence prediction model without huge information loss, and simplify the learned model for epidemiologists. Code related to this paper is available at: https://github.com/zeinebchelly/Sp-RST. |
---|---|
Bibliography: | This work is part of a project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 702527. This work was based on a first version of a database provided by the OpenCancer organization, part of Épidemium, a data-challenge-oriented and community-based open science program. Additional thanks go to the Épidemium group, Roche, La Paillasse and to the Supercomputing Wales project, which is part-funded by the European Regional Development Fund via the Welsh Government. |
ISBN: | 9783030109967 3030109968 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-030-10997-4_27 |