An improved random forest based on the classification accuracy and correlation measurement of decision trees


Bibliographic Details
Published in Expert Systems with Applications, Vol. 237, p. 121549
Main Authors Sun, Zhigang; Wang, Guotao; Li, Pengfei; Wang, Hui; Zhang, Min; Liang, Xiaowen
Format Journal Article
Language English
Published Elsevier Ltd 01.03.2024

Summary:
•Propose an improved random forest based on the improvement of decision trees.
•Improve the evaluation mechanism for the classification effect of decision trees.
•Propose a method for quantifying the diversity between decision trees.
•Multiple tests verify the superiority of the proposed improved random forest.

Random forest is one of the most widely used machine learning algorithms. The decision trees used to construct a random forest may have low classification accuracy or high mutual correlation, which degrades the overall performance of the forest. To address these problems, the authors propose an improved random forest based on the classification accuracy and correlation measurement of decision trees. The main idea has two parts: retaining the classification and regression trees (CARTs) with better classification performance, and reducing the correlations between the CARTs. In the classification-effect evaluation part, each CART makes predictions on three reserved data sets and its average classification accuracy is computed; all CARTs are then sorted in descending order of average accuracy. In the correlation-measurement part, an improved dot product method is proposed to calculate the cosine similarity, i.e., the correlation, between CARTs in the feature space. Using the achieved average classification accuracy as a reference, grid search is applied to find an inner product threshold. For CART pairs whose inner product exceeds this threshold, the CART with the lower average classification accuracy is marked as deletable. The average classification accuracies and correlations of the CARTs are then considered jointly: trees with high correlation and weak classification performance are deleted, and the better ones are retained to construct the random forest. Multiple experiments show that the proposed improved random forest achieves higher average classification accuracy than the five random forests used for comparison, and the lead is stable. The G-means and out-of-bag data (OBD) scores obtained by the proposed improved random forest are also higher than those of the five comparison forests, with an even clearer margin. In addition, the results of three non-parametric tests show significant differences between the proposed improved random forest and the other five random forests, which supports the superiority and practicability of the proposed method.
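The following minimal Python/scikit-learn sketch illustrates the selection idea summarized above: grow CARTs on bootstrap samples, score each on reserved data sets, measure pairwise cosine similarity, and drop the weaker tree of each highly correlated pair. The dataset, the use of feature-importance vectors as the tree representation, and the fixed similarity threshold are illustrative assumptions, not the authors' exact procedure (the paper uses an improved dot product in the feature space and selects the threshold by grid search).

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)

# Three reserved evaluation sets, as described in the abstract.
folds = np.array_split(rng.permutation(len(X_rest)), 3)
reserved = [(X_rest[f], y_rest[f]) for f in folds]

n_trees = 50
trees, acc, vecs = [], [], []
for i in range(n_trees):
    boot = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_train[boot], y_train[boot])
    trees.append(t)
    # Average classification accuracy over the three reserved sets.
    acc.append(np.mean([t.score(Xr, yr) for Xr, yr in reserved]))
    # Represent each tree by its feature-importance vector for the
    # cosine-similarity comparison (an assumption made for this sketch).
    vecs.append(t.feature_importances_)

acc = np.array(acc)
V = np.array(vecs)
V = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)
sim = V @ V.T  # pairwise cosine similarity between trees

threshold = 0.9  # fixed here; chosen by grid search in the paper
deletable = set()
for i in range(n_trees):
    for j in range(i + 1, n_trees):
        if sim[i, j] > threshold:
            # In a highly correlated pair, mark the less accurate tree.
            deletable.add(i if acc[i] < acc[j] else j)

kept = [t for k, t in enumerate(trees) if k not in deletable]
print(f"kept {len(kept)} of {n_trees} trees")

def forest_predict(trees, X):
    """Majority vote of the retained trees (binary labels 0/1)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.round(votes.mean(axis=0)).astype(int)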
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2023.121549