Principles of Data Mining

This book explains the principal techniques of data mining, for classification, association rule mining and clustering. Each topic is clearly explained and illustrated by detailed examples, with a focus on algorithms rather than mathematical formalism.

Bibliographic Details
Main Author: Bramer, Max
Format: eBook / Book
Language: English
Published: London: Springer Nature, 2007
Edition: 2
Series: Undergraduate Topics in Computer Science

Table of Contents:
  • Intro -- Principles of Data Mining -- About This Book -- Contents -- 1. Introduction to Data Mining -- 1.1 The Data Explosion -- 1.2 Knowledge Discovery -- 1.3 Applications of Data Mining -- 1.4 Labelled and Unlabelled Data -- 1.5 Supervised Learning: Classification -- 1.6 Supervised Learning: Numerical Prediction -- 1.7 Unsupervised Learning: Association Rules -- 1.8 Unsupervised Learning: Clustering -- 2. Data for Data Mining -- 2.1 Standard Formulation -- 2.2 Types of Variable -- 2.2.1 Categorical and Continuous Attributes -- 2.3 Data Preparation -- 2.3.1 Data Cleaning -- 2.4 Missing Values -- 2.4.1 Discard Instances -- 2.4.2 Replace by Most Frequent/Average Value -- 2.5 Reducing the Number of Attributes -- 2.6 The UCI Repository of Datasets -- 2.7 Chapter Summary -- 2.8 Self-assessment Exercises for Chapter 2 -- Reference -- 3. Introduction to Classification: Naïve Bayes and Nearest Neighbour -- 3.1 What Is Classification? -- 3.2 Naïve Bayes Classifiers -- 3.3 Nearest Neighbour Classification -- 3.3.1 Distance Measures -- 3.3.2 Normalisation -- 3.3.3 Dealing with Categorical Attributes -- 3.4 Eager and Lazy Learning -- 3.5 Chapter Summary -- 3.6 Self-assessment Exercises for Chapter 3 -- 4. Using Decision Trees for Classification -- 4.1 Decision Rules and Decision Trees -- 4.1.1 Decision Trees: The Golf Example -- 4.1.2 Terminology -- 4.1.3 The degrees Dataset -- 4.2 The TDIDT Algorithm -- 4.3 Types of Reasoning -- 4.4 Chapter Summary -- 4.5 Self-assessment Exercises for Chapter 4 -- References -- 5. Decision Tree Induction: Using Entropy for Attribute Selection -- 5.1 Attribute Selection: An Experiment -- 5.2 Alternative Decision Trees -- 5.2.1 The Football/Netball Example -- 5.2.2 The anonymous Dataset -- 5.3 Choosing Attributes to Split On: Using Entropy -- 5.3.1 The lens24 Dataset -- 5.3.2 Entropy
  • 5.3.3 Using Entropy for Attribute Selection -- 5.3.4 Maximising Information Gain -- 5.4 Chapter Summary -- 5.5 Self-assessment Exercises for Chapter 5 -- 6. Decision Tree Induction: Using Frequency Tables for Attribute Selection -- 6.1 Calculating Entropy in Practice -- 6.1.1 Proof of Equivalence -- 6.1.2 A Note on Zeros -- 6.2 Other Attribute Selection Criteria: Gini Index of Diversity -- 6.3 The χ² Attribute Selection Criterion -- 6.4 Inductive Bias -- 6.5 Using Gain Ratio for Attribute Selection -- 6.5.1 Properties of Split Information -- 6.5.2 Summary -- 6.6 Number of Rules Generated by Different Attribute Selection Criteria -- 6.7 Missing Branches -- 6.8 Chapter Summary -- 6.9 Self-assessment Exercises for Chapter 6 -- References -- 7. Estimating the Predictive Accuracy of a Classifier -- 7.1 Introduction -- 7.2 Method 1: Separate Training and Test Sets -- 7.2.1 Standard Error -- 7.2.2 Repeated Train and Test -- 7.3 Method 2: k-fold Cross-validation -- 7.4 Method 3: N-fold Cross-validation -- 7.5 Experimental Results I -- 7.6 Experimental Results II: Datasets with Missing Values -- 7.6.1 Strategy 1: Discard Instances -- 7.6.2 Strategy 2: Replace by Most Frequent/Average Value -- 7.6.3 Missing Classifications -- 7.7 Confusion Matrix -- 7.7.1 True and False Positives -- 7.8 Chapter Summary -- 7.9 Self-assessment Exercises for Chapter 7 -- Reference -- 8. Continuous Attributes -- 8.1 Introduction -- 8.2 Local versus Global Discretisation -- 8.3 Adding Local Discretisation to TDIDT -- 8.3.1 Calculating the Information Gain of a Set of Pseudo-attributes -- 8.3.2 Computational Efficiency -- 8.4 Using the ChiMerge Algorithm for Global Discretisation -- 8.4.1 Calculating the Expected Values and χ² -- 8.4.2 Finding the Threshold Value -- 8.4.3 Setting minIntervals and maxIntervals -- 8.4.4 The ChiMerge Algorithm: Summary
  • 8.4.5 The ChiMerge Algorithm: Comments -- 8.5 Comparing Global and Local Discretisation for Tree Induction -- 8.6 Chapter Summary -- 8.7 Self-assessment Exercises for Chapter 8 -- Reference -- 9. Avoiding Overfitting of Decision Trees -- 9.1 Dealing with Clashes in a Training Set -- 9.1.1 Adapting TDIDT to Deal with Clashes -- 9.2 More About Overfitting Rules to Data -- 9.3 Pre-pruning Decision Trees -- 9.4 Post-pruning Decision Trees -- 9.5 Chapter Summary -- 9.6 Self-assessment Exercise for Chapter 9 -- References -- 10. More About Entropy -- 10.1 Introduction -- 10.2 Coding Information Using Bits -- 10.3 Discriminating Amongst M Values (M Not a Power of 2) -- 10.4 Encoding Values That Are Not Equally Likely -- 10.5 Entropy of a Training Set -- 10.6 Information Gain Must Be Positive or Zero -- 10.7 Using Information Gain for Feature Reduction for Classification Tasks -- 10.7.1 Example 1: The genetics Dataset -- 10.7.2 Example 2: The bcst96 Dataset -- 10.8 Chapter Summary -- 10.9 Self-assessment Exercises for Chapter 10 -- References -- 11. Inducing Modular Rules for Classification -- 11.1 Rule Post-pruning -- 11.2 Conflict Resolution -- 11.3 Problems with Decision Trees -- 11.4 The Prism Algorithm -- 11.4.1 Changes to the Basic Prism Algorithm -- 11.4.2 Comparing Prism with TDIDT -- 11.5 Chapter Summary -- 11.6 Self-assessment Exercise for Chapter 11 -- References -- 12. Measuring the Performance of a Classifier -- 12.1 True and False Positives and Negatives -- 12.2 Performance Measures -- 12.3 True and False Positive Rates versus Predictive Accuracy -- 12.4 ROC Graphs -- 12.5 ROC Curves -- 12.6 Finding the Best Classifier -- 12.7 Chapter Summary -- 12.8 Self-assessment Exercise for Chapter 12 -- 13. Dealing with Large Volumes of Data -- 13.1 Introduction -- 13.2 Distributing Data onto Multiple Processors -- 13.3 Case Study: PMCRI
  • 13.4 Evaluating the Effectiveness of a Distributed System: PMCRI -- 13.5 Revising a Classifier Incrementally -- 13.6 Chapter Summary -- 13.7 Self-assessment Exercises for Chapter 13 -- References -- 14. Ensemble Classification -- 14.1 Introduction -- 14.2 Estimating the Performance of a Classifier -- 14.3 Selecting a Different Training Set for Each Classifier -- 14.4 Selecting a Different Set of Attributes for Each Classifier -- 14.5 Combining Classifications: Alternative Voting Systems -- 14.6 Parallel Ensemble Classifiers -- 14.7 Chapter Summary -- 14.8 Self-assessment Exercises for Chapter 14 -- References -- 15. Comparing Classifiers -- 15.1 Introduction -- 15.2 The Paired t-Test -- 15.3 Choosing Datasets for Comparative Evaluation -- 15.3.1 Confidence Intervals -- 15.4 Sampling -- 15.5 How Bad Is a 'No Significant Difference' Result? -- 15.6 Chapter Summary -- 15.7 Self-assessment Exercises for Chapter 15 -- References -- 16. Association Rule Mining I -- 16.1 Introduction -- 16.2 Measures of Rule Interestingness -- 16.2.1 The Piatetsky-Shapiro Criteria and the RI Measure -- 16.2.2 Rule Interestingness Measures Applied to the chess Dataset -- 16.2.3 Using Rule Interestingness Measures for Conflict Resolution -- 16.3 Association Rule Mining Tasks -- 16.4 Finding the Best N Rules -- 16.4.1 The J-Measure: Measuring the Information Content of a Rule -- 16.4.2 Search Strategy -- 16.5 Chapter Summary -- 16.6 Self-assessment Exercises for Chapter 16 -- References -- 17. Association Rule Mining II -- 17.1 Introduction -- 17.2 Transactions and Itemsets -- 17.3 Support for an Itemset -- 17.4 Association Rules -- 17.5 Generating Association Rules -- 17.6 Apriori -- 17.7 Generating Supported Itemsets: An Example -- 17.8 Generating Rules for a Supported Itemset -- 17.9 Rule Interestingness Measures: Lift and Leverage -- 17.10 Chapter Summary
  • 17.11 Self-assessment Exercises for Chapter 17 -- Reference -- 18. Association Rule Mining III: Frequent Pattern Trees -- 18.1 Introduction: FP-Growth -- 18.2 Constructing the FP-tree -- 18.2.1 Pre-processing the Transaction Database -- 18.2.2 Initialisation -- 18.2.3 Processing Transaction 1: f, c, a, m, p -- 18.2.4 Processing Transaction 2: f, c, a, b, m -- 18.2.5 Processing Transaction 3: f, b -- 18.2.6 Processing Transaction 4: c, b, p -- 18.2.7 Processing Transaction 5: f, c, a, m, p -- 18.3 Finding the Frequent Itemsets from the FP-tree -- 18.3.1 Itemsets Ending with Item p -- 18.3.2 Itemsets Ending with Item m -- 18.4 Chapter Summary -- 18.5 Self-assessment Exercises for Chapter 18 -- Reference -- 19. Clustering -- 19.1 Introduction -- 19.2 k-Means Clustering -- 19.2.1 Example -- 19.2.2 Finding the Best Set of Clusters -- 19.3 Agglomerative Hierarchical Clustering -- 19.3.1 Recording the Distance Between Clusters -- 19.3.2 Terminating the Clustering Process -- 19.4 Chapter Summary -- 19.5 Self-assessment Exercises for Chapter 19 -- 20. Text Mining -- 20.1 Multiple Classifications -- 20.2 Representing Text Documents for Data Mining -- 20.3 Stop Words and Stemming -- 20.4 Using Information Gain for Feature Reduction -- 20.5 Representing Text Documents: Constructing a Vector Space Model -- 20.6 Normalising the Weights -- 20.7 Measuring the Distance Between Two Vectors -- 20.8 Measuring the Performance of a Text Classifier -- 20.9 Hypertext Categorisation -- 20.9.1 Classifying Web Pages -- 20.9.2 Hypertext Classification versus Text Classification -- 20.10 Chapter Summary -- 20.11 Self-assessment Exercises for Chapter 20 -- A. Essential Mathematics -- A.1 Subscript Notation -- A.1.1 Sigma Notation for Summation -- A.1.2 Double Subscript Notation -- A.1.3 Other Uses of Subscripts -- A.2 Trees -- A.2.1 Terminology -- A.2.2 Interpretation
  • A.2.3 Subtrees