From data munging to model evaluation, “Machine Learning on Big Data” is a two-day course covering the entire Data Science pipeline: converting collected Big Data into mathematical data structures, running scalable algorithms that learn distributed regression, classification, and recommender system models, implementing such models with Apache Mahout and Apache Spark, and assessing how well the models perform. The course covers several data mining/machine learning models and scalable learning algorithms, including:
- Generalized Linear Models: Linear Regression and Logistic Regression
- Decision Trees
- Clustering
- Mixed Membership Models: Latent Dirichlet Allocation
- Similarity Analysis
- Matrix Factorization
- Learning Ensembles: Random Forests
The participants will become familiar with cutting-edge open-source Machine Learning libraries and run Machine Learning pipelines on pre-installed Hadoop/Spark clusters provided by Analytics Center.
Hands-on Labs
With hands-on labs and demonstrations, the participants will utilize ecosystem tools to:
- Prepare a prototyping (interactive notebook) environment for large scale data analysis
- Transform Big Data into Machine Learning data structures
- Understand scalable Machine Learning algorithms
- Write programs to learn and evaluate supervised learning models
- Write programs for clustering data
- Write programs for finding mixed memberships to clusters (topic modeling)
- Write programs to learn and evaluate recommender system models
- Design a typical analytics/machine learning pipeline
Course Prerequisites
“Machine Learning on Big Data” is the core course for the Data Scientist learning track. In addition to an appreciation of what Machine Learning is capable of, the attendees are expected to have an understanding of how Big Data Processing technologies work in general.
The attendees should be able to write simple programs in either Scala or Python; however, the amount of programming required is minimal.
Course Coverage
PART I – ESSENTIALS & ECOSYSTEM
Machine Learning Essentials & Big Learning
- Learning from Data
- Common Machine Learning Tasks
- Example Use Cases
- Machine Learning in the Big Data Era
- Machine Learning on Big Data: Challenges
- Machine Learning Pipelines:
  - Data Preparation/Transformation
  - Learning
  - Evaluation
  - Deployment
Big Data Science Ecosystem
- Apache Spark:
  - Apache Spark Basics
  - RDD APIs: Scala API and PySpark
  - Spark DataFrames (and Datasets)
  - Spark ML Pipelines
- Apache Zeppelin for Interactive Data Analysis Notebooks
PART II – ALGORITHMS & INTERNALS
Data Munging
- Summarizing Large Datasets
- Common Data Transformation Tasks
- Data Structures for Machine Learning
- Working with Text Data
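To give a flavor of the text-data transformations in this module, here is a minimal single-machine sketch (plain Python, not the Spark code used in class) of turning raw documents into bag-of-words feature vectors; the function names and toy documents are illustrative only:

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct token to a feature index (illustrative helper)."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def vectorize(doc, vocab):
    """Represent one document as a term-frequency vector over the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts.get(token, 0) for token in vocab]

# Hypothetical toy corpus.
docs = ["big data big models", "data munging"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

In the course itself, the same idea appears at scale via Spark's feature transformers, where vocabulary building and vectorization run as distributed jobs.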
Supervised Learning
- Learning & Evaluation using Spark APIs:
  - Linear Regression
  - Logistic Regression
  - Naive Bayes
  - Decision & Regression Trees
  - Tree Ensembles: Random Forests
  - Evaluation
  - Making Predictions
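The core idea behind these supervised models — fit parameters by minimizing a loss over labeled examples — can be sketched in a few lines. This is a single-machine illustration with hypothetical toy data, not the Spark API used in the labs:

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noise-free toy data generated by y = 2x + 1; the fit should recover it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
```

The distributed versions covered in Part II parallelize exactly these gradient sums across partitions of the data.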
Unsupervised Learning
- Clustering:
  - K-Means
  - Gaussian Mixture
- Mixed Membership: Latent Dirichlet Allocation
- Online Clustering from Data Streams: Streaming K-Means
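As a preview of the clustering module, a minimal pure-Python sketch of Lloyd's K-Means algorithm (the toy 1-D data and seed are illustrative; the course works with Spark's distributed implementation):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternately assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Keep a centroid unchanged if its cluster emptied out.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated toy groups; centroids should land near 1 and 10.
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
centers = kmeans(data, 2)
```

Streaming K-Means, covered next, replaces the full reassignment pass with incremental centroid updates as mini-batches arrive.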
Recommender Systems
- Similarity Based Collaborative Filtering
- Matrix Factorization Based Collaborative Filtering
- Evaluating Recommender Systems
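The similarity-based collaborative filtering covered here rests on comparing rating vectors, typically with cosine similarity. A minimal sketch with a hypothetical toy item-user rating matrix (the course's distributed similarity jobs compute the same quantity at scale):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows = items, columns = users; 0 means "not rated" (toy data).
ratings = {
    "item_a": [5, 3, 0, 1],
    "item_b": [4, 0, 0, 1],
    "item_c": [1, 1, 0, 5],
}

# item_a's rating pattern is closer to item_b's than to item_c's.
sim_ab = cosine(ratings["item_a"], ratings["item_b"])
sim_ac = cosine(ratings["item_a"], ratings["item_c"])
```

An item-based recommender then scores an unseen item for a user by averaging the user's known ratings, weighted by these similarities.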
Large Scale Machine Learning Internals
- Distributed Optimization for Large Scale Supervised Learning
- K-Means/Streaming K-Means at Large Scale
- Variational EM for Learning & Inference in LDA
- Alternating Least Squares (ALS) and Implicit ALS for Matrix Factorization based Collaborative Filtering
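The alternating structure of ALS can be shown with a deliberately simplified rank-1 sketch: with a single latent factor per user and item, each half-step is a closed-form 1-D least-squares solve. This is an unregularized, fully-observed, single-machine illustration; production ALS (e.g. Spark's) uses higher rank, regularization, and distributed factor updates:

```python
def als_rank1(ratings, iters=20):
    """Alternating least squares with one latent factor per user/item:
    fix item factors and solve for each user factor in closed form,
    then swap roles, and repeat."""
    n_users, n_items = len(ratings), len(ratings[0])
    users = [1.0] * n_users
    items = [1.0] * n_items
    for _ in range(iters):
        for u in range(n_users):
            num = sum(ratings[u][i] * items[i] for i in range(n_items))
            den = sum(items[i] ** 2 for i in range(n_items))
            users[u] = num / den
        for i in range(n_items):
            num = sum(ratings[u][i] * users[u] for u in range(n_users))
            den = sum(users[u] ** 2 for u in range(n_users))
            items[i] = num / den
    return users, items

# A toy rating matrix that is exactly rank 1, so the reconstruction
# users[u] * items[i] should match it closely.
R = [[2.0, 4.0],
     [3.0, 6.0]]
users, items = als_rank1(R)
approx = [[users[u] * items[i] for i in range(2)] for u in range(2)]
```

Implicit ALS, also covered here, keeps the same alternation but reweights every (user, item) cell by a confidence derived from implicit feedback rather than fitting explicit ratings directly.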