Randomization methods for assessing data analysis results on real-valued matrices

Bibliographic Details
Published in Statistical analysis and data mining Vol. 2; no. 4; pp. 209 - 230
Main Authors Ojala, Markus; Vuokko, Niko; Kallio, Aleksi; Haiminen, Niina; Mannila, Heikki
Format Journal Article
Language English
Published Hoboken Wiley Subscription Services, Inc., A Wiley Company 01.11.2009
Summary: Randomization is an important technique for assessing the significance of data analysis results. Given an input dataset, a randomization method samples at random from some class of datasets that share certain characteristics with the original data. The measure of interest on the original data is then compared to the measure on the samples to assess its significance. For certain types of data, e.g., gene expression matrices, it is useful to be able to sample datasets that have the same row and column distributions of values as the original dataset. Testing whether the results of a data mining algorithm on such randomized datasets differ from the results on the true dataset tells us whether the results on the true data were an artifact of the row and column statistics, or due to some more interesting phenomena in the data. We study the problem of generating such randomized datasets. We describe methods based on local transformations and Metropolis sampling, and show that the methods are efficient and usable in practice. We evaluate the performance of the methods both on real and generated data. We also show how our methods can be applied to a real data analysis scenario on DNA microarray data. The results indicate that the methods work efficiently and are usable in significance testing of data mining results on real‐valued matrices. Copyright © 2009 Wiley Periodicals, Inc., A Wiley Company
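
The comparison step described in the summary (measure on the original data versus the same measure on randomized samples) can be illustrated with a minimal Python sketch. It is not the method of the article: the randomization used here is a plain within-column shuffle, which preserves only the column value distributions and not the row distributions, and it stands in for the local-transformation and Metropolis samplers the summary mentions. The function names (`permute_within_columns`, `empirical_p_value`, `max_abs_column_correlation`), the test statistic, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def permute_within_columns(data, rng):
    """Stand-in randomization (assumption): shuffle each column independently.

    Keeps every column's value distribution exactly, but does NOT preserve
    the row distributions that the article's methods aim to maintain.
    """
    randomized = np.empty_like(data)
    for j in range(data.shape[1]):
        randomized[:, j] = rng.permutation(data[:, j])
    return randomized


def max_abs_column_correlation(data):
    """Illustrative measure of interest: largest absolute correlation between two columns."""
    corr = np.corrcoef(data, rowvar=False)
    np.fill_diagonal(corr, 0.0)  # ignore each column's correlation with itself
    return float(np.max(np.abs(corr)))


def empirical_p_value(data, statistic, randomize, n_samples=999, seed=0):
    """Compare the statistic on the original data with its values on randomized samples."""
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    sample_stats = np.array(
        [statistic(randomize(data, rng)) for _ in range(n_samples)]
    )
    # Fraction of datasets (randomized samples plus the original itself)
    # whose statistic is at least as large as the observed one.
    p_value = (1 + int(np.sum(sample_stats >= observed))) / (n_samples + 1)
    return observed, p_value


# Synthetic 50 x 4 matrix in which column 1 closely follows column 0.
data_rng = np.random.default_rng(42)
data = data_rng.normal(size=(50, 4))
data[:, 1] = data[:, 0] + 0.1 * data_rng.normal(size=50)

observed, p_value = empirical_p_value(
    data, max_abs_column_correlation, permute_within_columns
)
print(f"observed statistic: {observed:.3f}, empirical p-value: {p_value:.3f}")
```

A small empirical p-value here suggests that the strong column correlation is not explained by the column value distributions alone. Swapping `permute_within_columns` for a sampler that also maintains row distributions would change which null hypothesis is being tested.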
Bibliography:ark:/67375/WNG-C57QN41V-R
ArticleID:SAM10042
istex:BF0523EAA47E5662C6DE429C033B2BF1064D8DE2
ISSN:1932-1864, 1932-1872
DOI:10.1002/sam.10042