Machine Learning with PySpark - With Natural Language Processing and Recommender Systems (2nd Edition)

Master the new features in PySpark 3.1 to develop data-driven, intelligent applications. This updated edition covers topics ranging from building scalable machine learning models, to natural language processing, to recommender systems. This book begins with the fundamentals of Apache Spark, includin...

Bibliographic Details
Main Author Singh, Pramod
Format eBook
Language English
Published Berkeley, CA: Apress, an imprint of Springer Nature, 2021
Apress
Apress L. P
Edition 2
ISBN 9781484277768
1484277767
9781484277775
1484277775
DOI 10.1007/978-1-4842-7777-5


Table of Contents:
  • Title Page -- Introduction -- Table of Contents -- 1. Introduction to Spark -- 2. Manage Data with PySpark -- 3. Introduction to Machine Learning -- 4. Linear Regression -- 5. Logistic Regression -- 6. Random Forests Using PySpark -- 7. Clustering in PySpark -- 8. Recommender Systems -- 9. Natural Language Processing -- Index
  • Intro -- Table of Contents -- About the Author -- About the Technical Reviewer -- Acknowledgments -- Foreword -- Introduction
  • Chapter 1: Introduction to Spark -- Data Generation -- Before the 1990s -- The Internet and Social Media Era -- The Machine Data Era -- Spark -- Setting Up the Environment -- Downloading Spark -- Installing Spark -- Docker -- Databricks -- Spin a New Cluster -- Create a Notebook -- Conclusion
  • Chapter 2: Manage Data with PySpark -- Load and Read Data -- Data Filtering Using filter -- Data Filtering Using where -- Pandas UDF -- Drop Duplicate Values -- Writing Data -- CSV -- Parquet -- Data Handling Using Koalas -- Conclusion
  • Chapter 3: Introduction to Machine Learning -- Rise in Data -- Increased Computational Efficiency -- Improved ML Algorithms -- Availability of Data Scientists -- Supervised Machine Learning -- Unsupervised Machine Learning -- Semi-supervised Learning -- Reinforcement Learning -- Industrial Application and Challenges -- Retail -- Healthcare -- Finance -- Travel and Hospitality -- Media and Marketing -- Manufacturing and Automobile -- Social Media -- Others -- Conclusion
  • Chapter 4: Linear Regression -- Variables -- Theory -- Interpretation -- Evaluation -- Code -- Conclusion
  • Chapter 5: Logistic Regression -- Probability -- Using Linear Regression -- Using Logit -- Interpretation (Coefficients) -- Dummy Variables -- Model Evaluation -- True Positives -- True Negatives -- False Positives -- False Negatives -- Accuracy -- Recall -- Precision -- F1 Score -- Probability Cut-Off/Threshold -- ROC Curve -- Logistic Regression Code -- Data Info -- Confusion Matrix -- Accuracy -- Recall -- Precision -- Conclusion
  • Chapter 6: Random Forests Using PySpark -- Decision Tree -- Entropy -- Information Gain -- Random Forests -- Code -- Conclusion
  • Chapter 7: Clustering in PySpark -- Applications -- K-Means -- Deciding on the Number of Clusters (K) -- Elbow Method -- Hierarchical Clustering -- Agglomerative Clustering -- Code -- Data Info -- Conclusion
  • Chapter 8: Recommender Systems -- Recommendations -- Popularity-Based RS -- Content-Based RS -- User Profile -- Euclidean Distance -- Cosine Similarity -- Collaborative Filtering-Based RS -- User Item Matrix -- Explicit Feedback -- Implicit Feedback -- Nearest Neighbors-Based CF -- Missing Values -- Latent Factor-Based CF -- Hybrid Recommender Systems -- Code -- Data Info -- Conclusion
  • Chapter 9: Natural Language Processing -- Steps Involved in NLP -- Corpus -- Tokenize -- Stopword Removal -- Bag of Words -- CountVectorizer -- TF-IDF -- Text Classification Using Machine Learning -- Sequence Embeddings -- Embeddings -- Conclusion
  • Index