Machine Learning Using R: With Time Series and Industry-Based Use Cases in R (2nd Edition)

Examine the latest technological advancements in building scalable machine learning models with Big Data using R. This book shows you how to work with machine learning algorithms and use them to build ML models from raw data. All practical demonstrations are carried out in R, a powerful programming language.


Bibliographic Details
Main Authors: Ramasubramanian, Karthik; Singh, Abhishek
Format: eBook
Language: English
Published: Berkeley, CA: Apress, an imprint of Springer Nature, 2019
Edition: 2nd

Table of Contents:
  • Title Page -- Introduction -- Table of Contents -- 1. Introduction to Machine Learning and R -- 2. Data Preparation and Exploration -- 3. Sampling and Resampling Techniques -- 4. Data Visualization in R -- 5. Feature Engineering -- 6. Machine Learning Theory and Practice -- 7. Machine Learning Model Evaluation -- 8. Model Performance Improvement -- 9. Time Series Modeling -- 10. Scalable Machine Learning and Related Technologies -- 11. Deep Learning Using Keras and TensorFlow -- Index
  • Intro -- Table of Contents -- About the Authors -- About the Technical Reviewer -- Acknowledgments -- Introduction -- Chapter 1: Introduction to Machine Learning and R -- 1.1 Understanding the Evolution -- 1.1.1 Statistical Learning -- 1.1.2 Machine Learning (ML) -- 1.1.3 Artificial Intelligence (AI) -- 1.1.4 Data Mining -- 1.1.5 Data Science -- 1.2 Probability and Statistics -- 1.2.1 Counting and Probability Definition -- 1.2.2 Events and Relationships -- 1.2.2.1 Independent Events -- 1.2.2.2 Conditional Independence -- 1.2.2.3 Bayes Theorem -- 1.2.3 Randomness, Probability, and Distributions -- 1.2.4 Confidence Interval and Hypothesis Testing -- 1.2.4.1 Confidence Interval -- 1.2.4.2 Hypothesis Testing -- 1.3 Getting Started with R -- 1.3.1 Basic Building Blocks -- 1.3.1.1 Calculations -- 1.3.1.2 Statistics with R -- 1.3.1.3 Packages -- 1.3.2 Data Structures in R -- 1.3.2.1 Vectors -- 1.3.2.2 Lists -- 1.3.2.3 Matrixes -- 1.3.2.4 Data Frames -- 1.3.3 Subsetting -- 1.3.3.1 Vectors -- 1.3.3.2 Lists -- 1.3.3.3 Matrixes -- 1.3.3.4 Data Frames -- 1.3.4 Functions and the Apply Family -- 1.4 Machine Learning Process Flow -- 1.4.1 Plan -- 1.4.2 Explore -- 1.4.3 Build -- 1.4.4 Evaluate -- 1.5 Other Technologies -- 1.6 Summary -- Chapter 2: Data Preparation and Exploration -- 2.1 Planning the Gathering of Data -- 2.1.1 Variable Types -- 2.1.1.1 Categorical Variables -- 2.1.1.2 Continuous Variables -- 2.1.2 Data Formats -- 2.1.2.1 Comma-Separated Values -- 2.1.2.2 XLS Files -- 2.1.2.3 Extensible Markup Language: XML -- 2.1.2.4 Hypertext Markup Language: HTML -- 2.1.2.5 JSON -- 2.1.3 Types of Data Sources -- 2.1.3.1 Structured Data -- 2.1.3.2 Semi-Structured Data -- 2.1.3.3 Unstructured Data -- 2.2 Initial Data Analysis (IDA) -- 2.2.1 Discerning a First Look -- 2.2.1.1 Function str() -- 2.2.1.2 Naming Convention: make.names()
  • 2.2.1.3 Table(): Pattern or Trend -- 2.2.2 Organizing Multiple Sources of Data into One -- 2.2.2.1 Merge and dplyr Joins -- 2.2.2.1.1 Using merge -- 2.2.2.1.2 dplyr -- 2.2.3 Cleaning the Data -- 2.2.3.1 Correcting Factor Variables -- 2.2.3.2 Dealing with NAs -- 2.2.3.3 Dealing with Dates and Times -- 2.2.3.3.1 Time Zone -- 2.2.3.3.2 Daylight Savings Time -- 2.2.4 Supplementing with More Information -- 2.2.4.1 Derived Variables -- 2.2.4.2 n-Day Averages -- 2.2.5 Reshaping -- 2.3 Exploratory Data Analysis -- 2.3.1 Summary Statistics -- 2.3.1.1 Quantile -- 2.3.1.2 Mean -- 2.3.1.3 Frequency Plot -- 2.3.1.4 Boxplot -- 2.3.2 Moment -- 2.3.2.1 Variance -- 2.3.2.2 Skewness -- 2.3.2.3 Kurtosis -- 2.4 Case Study: Credit Card Fraud -- 2.4.1 Data Import -- 2.4.2 Data Transformation -- 2.4.3 Data Exploration -- 2.5 Summary -- Chapter 3: Sampling and Resampling Techniques -- 3.1 Introduction to Sampling -- 3.2 Sampling Terminology -- 3.2.1 Sample -- 3.2.2 Sampling Distribution -- 3.2.3 Population Mean and Variance -- 3.2.4 Sample Mean and Variance -- 3.2.5 Pooled Mean and Variance -- 3.2.6 Sample Point -- 3.2.7 Sampling Error -- 3.2.8 Sampling Fraction -- 3.2.9 Sampling Bias -- 3.2.10 Sampling Without Replacement (SWOR) -- 3.2.11 Sampling with Replacement (SWR) -- 3.3 Credit Card Fraud: Population Statistics -- 3.4 Data Description -- 3.5 Population Mean -- 3.6 Population Variance -- 3.7 Pooled Mean and Variance -- 3.8 Business Implications of Sampling -- 3.8.1 Shortcomings of Sampling -- 3.9 Probability and Non-Probability Sampling -- 3.9.1 Types of Non-Probability Sampling -- 3.9.1.1 Convenience Sampling -- 3.9.1.2 Purposive Sampling -- 3.9.1.3 Quota Sampling -- 3.10 Statistical Theory on Sampling Distributions -- 3.10.1 Law of Large Numbers: LLN -- 3.10.1.1 Weak Law of Large Numbers -- 3.10.1.2 Strong Law of Large Numbers
  • 3.10.1.3 Steps in Simulation with R Code -- 3.10.2 Central Limit Theorem -- 3.10.2.1 Steps in Simulation with R Code -- 3.11 Probability Sampling Techniques -- 3.11.1 Population Statistics -- 3.11.2 Simple Random Sampling -- 3.11.3 Systematic Random Sampling -- 3.11.4 Stratified Random Sampling -- 3.11.5 Cluster Sampling -- 3.11.6 Bootstrap Sampling -- 3.12 Monte Carlo Method: Acceptance-Rejection Method -- 3.13 Summary -- Chapter 4: Data Visualization in R -- 4.1 Introduction to the ggplot2 Package -- 4.2 World Development Indicators -- 4.3 Line Chart -- 4.4 Stacked Column Charts -- 4.5 Scatterplots -- 4.6 Boxplots -- 4.7 Histograms and Density Plots -- 4.8 Pie Charts -- 4.9 Correlation Plots -- 4.10 Heatmaps -- 4.11 Bubble Charts -- 4.12 Waterfall Charts -- 4.13 Dendrogram -- 4.14 Wordclouds -- 4.15 Sankey Plots -- 4.16 Time Series Graphs -- 4.17 Cohort Diagrams -- 4.18 Spatial Maps -- 4.19 Summary -- Chapter 5: Feature Engineering -- 5.1 Introduction to Feature Engineering -- 5.2 Understanding the Data -- 5.2.1 Data Summary -- 5.2.2 Properties of Dependent Variable -- 5.2.3 Features Availability: Continuous or Categorical -- 5.2.4 Setting Up Data Assumptions -- 5.3 Feature Ranking -- 5.4 Variable Subset Selection -- 5.4.1 Filter Method -- 5.4.2 Wrapper Methods -- 5.4.3 Embedded Methods -- 5.5 Principal Component Analysis -- 5.6 Summary -- Chapter 6: Machine Learning Theory and Practice -- 6.1 Machine Learning Types -- 6.1.1 Supervised Learning -- 6.1.2 Unsupervised Learning -- 6.1.3 Semi-Supervised Learning -- 6.1.4 Reinforcement Learning -- 6.2 Groups of Machine Learning Algorithms -- 6.3 Real-World Datasets -- 6.3.1 House Sale Prices -- 6.3.2 Purchase Preference -- 6.3.3 Twitter Feeds and Article -- 6.3.4 Breast Cancer -- 6.3.5 Market Basket -- 6.3.6 Amazon Food Reviews -- 6.4 Regression Analysis -- 6.5 Correlation Analysis
  • 6.5.1 Linear Regression -- 6.5.2 Simple Linear Regression -- 6.5.3 Multiple Linear Regression -- 6.5.4 Model Diagnostics: Linear Regression -- 6.5.4.1 Influential Point Analysis -- 6.5.4.2 Normality of Residuals -- 6.5.4.3 Multicollinearity -- 6.5.4.4 Residual Auto-Correlation -- 6.5.4.5 Homoscedasticity -- 6.5.5 Polynomial Regression -- 6.5.6 Logistic Regression -- 6.5.7 Logit Transformation -- 6.5.8 Odds Ratio -- 6.5.8.1 Binomial Logistic Model -- 6.5.9 Model Diagnostics: Logistic Regression -- 6.5.9.1 Wald Test -- 6.5.9.2 Deviance -- 6.5.9.3 Pseudo R-Square -- 6.5.9.4 Bivariate Plots -- 6.5.9.5 Cumulative Gains and Lift Charts -- 6.5.9.6 Concordance and Discordant Ratios -- 6.5.10 Multinomial Logistic Regression -- 6.5.11 Generalized Linear Models -- 6.5.12 Conclusion -- 6.6 Support Vector Machine (SVM) -- 6.6.1 Linear SVM -- 6.6.1.1 Hard Margins -- 6.6.1.2 Soft Margins -- 6.6.2 Binary SVM Classifier -- 6.6.3 Multi-Class SVM -- 6.6.4 Conclusion -- 6.7 Decision Trees -- 6.7.1 Types of Decision Trees -- 6.7.1.1 Regression Trees -- 6.7.1.2 Classification Trees -- 6.7.2 Decision Measures -- 6.7.2.1 Gini Index -- 6.7.2.2 Entropy -- 6.7.2.3 Information Gain -- 6.7.3 Decision Tree Learning Methods -- 6.7.3.1 Iterative Dichotomizer 3 -- 6.7.3.2 C5.0 Algorithm -- 6.7.3.3 Classification and Regression Tree: CART -- 6.7.3.4 Chi-Square Automated Interaction Detection: CHAID -- 6.7.4 Ensemble Trees -- 6.7.4.1 Boosting -- 6.7.4.2 Bagging -- Bagging CART -- Random Forest -- 6.7.5 Conclusion -- 6.8 The Naive Bayes Method -- 6.8.1 Conditional Probability -- 6.8.2 Bayes Theorem -- 6.8.3 Prior Probability -- 6.8.4 Posterior Probability -- 6.8.5 Likelihood and Marginal Likelihood -- 6.8.6 Naïve Bayes Methods -- 6.8.7 Conclusion -- 6.9 Cluster Analysis -- 6.9.1 Introduction to Clustering -- 6.9.2 Clustering Algorithms -- 6.9.2.1 Hierarchical Clustering
  • 6.9.2.2 Centroid-Based Clustering -- 6.9.2.3 Distribution-Based Clustering -- 6.9.2.4 Density-Based Clustering -- 6.9.3 Internal Evaluation -- 6.9.3.1 Dunn Index -- 6.9.3.2 Silhouette Coefficient -- 6.9.4 External Evaluation -- 6.9.4.1 Rand Measure -- 6.9.4.2 Jaccard Index -- 6.9.5 Conclusion -- 6.10 Association Rule Mining -- 6.10.1 Introduction to Association Concepts -- 6.10.1.1 Support -- 6.10.1.2 Confidence -- 6.10.1.3 Lift -- 6.10.2 Rule-Mining Algorithms -- 6.10.2.1 Apriori -- 6.10.2.2 Eclat -- 6.10.3 Recommendation Algorithms -- 6.10.3.1 User-Based Collaborative Filtering (UBCF) -- 6.10.3.2 Item-Based Collaborative Filtering (IBCF) -- 6.10.4 Conclusion -- 6.11 Artificial Neural Networks -- 6.11.1 Human Cognitive Learning -- 6.11.2 Perceptron -- 6.11.3 Sigmoid Neuron -- 6.11.4 Neural Network Architecture -- 6.11.5 Supervised versus Unsupervised Neural Nets -- 6.11.6 Neural Network Learning Algorithms -- 6.11.6.1 Evolutionary Methods -- 6.11.6.2 Gene Expression Programming -- 6.11.6.3 Simulated Annealing -- 6.11.6.4 Expectation Maximization -- 6.11.6.5 Non-Parametric Methods -- 6.11.6.6 Particle Swarm Optimization -- 6.11.7 Feed-Forward Back-Propagation -- 6.11.7.1 Purchase Prediction: Neural Network-Based Classification -- 6.11.8 Conclusion -- 6.12 Text-Mining Approaches -- 6.12.1 Introduction to Text Mining -- 6.12.2 Text Summarization -- 6.12.3 TF-IDF -- 6.12.4 Part-of-Speech (POS) Tagging -- 6.12.5 Word Cloud -- 6.12.6 Text Analysis: Microsoft Cognitive Services -- 6.12.7 Conclusion -- 6.13 Online Machine Learning Algorithms -- 6.13.1 Fuzzy C-Means Clustering -- 6.13.2 Conclusion -- 6.14 Model Building Checklist -- 6.15 Summary -- Chapter 7: Machine Learning Model Evaluation -- 7.1 Dataset -- 7.1.1 House Sale Prices -- 7.1.2 Purchase Preference -- 7.2 Introduction to Model Performance and Evaluation
  • 7.3 Objectives of Model Performance Evaluation