Machine Learning with PySpark - With Natural Language Processing and Recommender Systems (2nd Edition)

Master the new features in PySpark 3.1 to develop data-driven, intelligent applications. This updated edition covers topics ranging from building scalable machine learning models, to natural language processing, to recommender systems. This book begins with the fundamentals of Apache Spark, includin...

Bibliographic Details
Main Author	Singh, Pramod
Format	eBook
Language	English
Published	Berkeley, CA: Apress, an imprint of Springer Nature, 2021
Edition	2
Subjects	Application software; Artificial Intelligence; Computer Science; General References; Machine Learning; MATHEMATICS; Open Source; Professional and Applied Computing; Python; Python (Computer program language); Software Engineering; SPARK (Computer program language)
ISBN	9781484277768, 1484277767, 9781484277775, 1484277775
DOI	10.1007/978-1-4842-7777-5

Table of Contents:

Title Page -- Introduction -- Table of Contents -- 1. Introduction to Spark -- 2. Manage Data with PySpark -- 3. Introduction to Machine Learning -- 4. Linear Regression -- 5. Logistic Regression -- 6. Random Forests Using PySpark -- 7. Clustering in PySpark -- 8. Recommender Systems -- 9. Natural Language Processing -- Index
Intro -- Table of Contents -- About the Author -- About the Technical Reviewer -- Acknowledgments -- Foreword -- Introduction -- Chapter 1: Introduction to Spark -- Data Generation -- Before the 1990s -- The Internet and Social Media Era -- The Machine Data Era -- Spark -- Setting Up the Environment -- Downloading Spark -- Installing Spark -- Docker -- Databricks -- Spin a New Cluster -- Create a Notebook -- Conclusion -- Chapter 2: Manage Data with PySpark -- Load and Read Data -- Data Filtering Using filter -- Data Filtering Using where -- Pandas UDF -- Drop Duplicate Values -- Writing Data -- CSV -- Parquet -- Data Handling Using Koalas -- Conclusion -- Chapter 3: Introduction to Machine Learning -- Rise in Data -- Increased Computational Efficiency -- Improved ML Algorithms -- Availability of Data Scientists -- Supervised Machine Learning -- Unsupervised Machine Learning -- Semi-supervised Learning -- Reinforcement Learning -- Industrial Application and Challenges -- Retail -- Healthcare -- Finance -- Travel and Hospitality -- Media and Marketing -- Manufacturing and Automobile -- Social Media -- Others -- Conclusion -- Chapter 4: Linear Regression -- Variables -- Theory -- Interpretation -- Evaluation -- Code -- Conclusion -- Chapter 5: Logistic Regression -- Probability -- Using Linear Regression -- Using Logit -- Interpretation (Coefficients) -- Dummy Variables -- Model Evaluation -- True Positives -- True Negatives -- False Positives -- False Negatives -- Accuracy -- Recall -- Precision -- F1 Score -- Probability Cut-Off/Threshold -- ROC Curve -- Logistic Regression Code -- Data Info -- Confusion Matrix -- Accuracy -- Recall -- Precision -- Conclusion -- Chapter 6: Random Forests Using PySpark -- Decision Tree -- Entropy -- Information Gain -- Random Forests -- Code -- Conclusion -- Chapter 7: Clustering in PySpark -- Applications
K-Means -- Deciding on the Number of Clusters (K) -- Elbow Method -- Hierarchical Clustering -- Agglomerative Clustering -- Code -- Data Info -- Conclusion -- Chapter 8: Recommender Systems -- Recommendations -- Popularity-Based RS -- Content-Based RS -- User Profile -- Euclidean Distance -- Cosine Similarity -- Collaborative Filtering-Based RS -- User Item Matrix -- Explicit Feedback -- Implicit Feedback -- Nearest Neighbors-Based CF -- Missing Values -- Latent Factor-Based CF -- Hybrid Recommender Systems -- Code -- Data Info -- Conclusion -- Chapter 9: Natural Language Processing -- Steps Involved in NLP -- Corpus -- Tokenize -- Stopword Removal -- Bag of Words -- CountVectorizer -- TF-IDF -- Text Classification Using Machine Learning -- Sequence Embeddings -- Embeddings -- Conclusion -- Index