Dates: March 28-29, 2019
Machine Learning on Big Data
From data munging to model evaluation, "Machine Learning on Big Data" is a two-day course covering the entire Data Science pipeline: converting collected Big Data into mathematical data structures, learning distributed regression, classification, and recommender system models, implementing such models with Apache Mahout and Apache Spark, and assessing how well the models perform. The course covers several data mining/machine learning models and scalable learning algorithms, including:

Generalized Linear Models: Linear Regression and Logistic Regression
Decision Trees
Clustering
Mixed Membership Models: Latent Dirichlet Allocation
Similarity Analysis
Matrix Factorization
Learning Ensembles: Random Forests

Participants will become familiar with cutting-edge open-source Machine Learning libraries and run Machine Learning pipelines on pre-installed Hadoop/Spark clusters provided by Analytics Center.
Hands-on Labs 
Through hands-on labs and demonstrations, participants will use ecosystem tools to:

Prepare a prototyping (interactive notebook) environment for large scale data analysis
Transform Big Data into Machine Learning data structures
Understand scalable Machine Learning algorithms
Write programs to learn and evaluate supervised learning models
Write programs for clustering data
Write programs for finding mixed memberships to clusters (topic modeling)
Write programs to learn and evaluate recommender system models
Design a typical analytics/machine learning pipeline

Course Prerequisites 
"Machine Learning on Big Data" is the core course of the Data Scientist learning track. Besides an appreciation of what Machine Learning can do, attendees are expected to understand how Big Data processing technologies work in general.
Attendees should be able to write simple programs in either Scala or Python, although the amount of programming in the course is minimal.
Course Coverage 
PART I – ESSENTIALS & ECOSYSTEM 
Machine Learning Essentials & Big Learning 

Learning from Data
Common Machine Learning Tasks
Example Use Cases
Machine Learning in the Big Data Era
Machine Learning on Big Data: Challenges
Machine Learning Pipelines:

Data Preparation/Transformation
Learning
Evaluation
Deployment



 
Big Data Science Ecosystem 

Apache Spark

Apache Spark Basics
RDD APIs: Scala API and PySpark
Spark DataFrames (and Datasets)
Spark ML Pipelines


Apache Zeppelin for Interactive Data Analysis Notebooks

 
PART II – ALGORITHMS & INTERNALS 
Data Munging 

Summarizing Large Datasets
Common Data Transformation Tasks
Data Structures for Machine Learning
Working with Text Data

 
Supervised Learning 

Learning & Evaluation using Spark APIs:

Linear Regression
Logistic Regression
Naive Bayes
Decision & Regression Trees
Tree Ensembles: Random Forests


Evaluation
Making Predictions

 
Unsupervised Learning 

Clustering:

K-Means
Gaussian Mixture


Mixed Membership: Latent Dirichlet Allocation
Online Clustering from Data Streams: Streaming K-Means
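
To give a feel for the clustering topic listed above, here is a minimal single-machine sketch of the K-Means (Lloyd's) iteration in plain Python. It is not taken from the course material and does not use Spark. The function name `kmeans`, the toy points, and the initial centers are assumptions chosen for illustration only.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: group points by the index of the nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each center as the coordinate-wise mean.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members))
                )
            else:
                new_centers.append(center)  # leave an empty cluster's center as is
        if new_centers == centers:
            break  # assignments no longer change the centers
        centers = new_centers
    return centers


# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (12.0, 12.0)])
```

In a distributed setting the assignment step is typically computed per data partition and the per-cluster sums are then aggregated, which is the part a Spark-based implementation would take over.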

 
Recommender Systems 

Similarity Based Collaborative Filtering
Matrix Factorization Based Collaborative Filtering
Evaluating Recommender Systems
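
As an informal illustration of similarity-based collaborative filtering, the sketch below ranks items by the cosine similarity of their rating vectors. It is a plain-Python example under assumed names (`cosine`, `item_vectors`, `most_similar`) and made-up ratings, not the API of Apache Mahout or Apache Spark.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: value} dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def item_vectors(ratings):
    """Turn (user, item, rating) triples into one {user: rating} dict per item."""
    vectors = {}
    for user, item, rating in ratings:
        vectors.setdefault(item, {})[user] = rating
    return vectors


def most_similar(item, vectors):
    """Rank all other items by cosine similarity to `item`, best first."""
    scores = [(other, cosine(vectors[item], vec))
              for other, vec in vectors.items() if other != item]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings: (user, item, rating).
ratings = [
    ("ann", "A", 5.0), ("ann", "B", 5.0),
    ("bob", "A", 4.0), ("bob", "B", 4.0), ("bob", "C", 1.0),
    ("cat", "C", 5.0),
]
ranked = most_similar("A", item_vectors(ratings))
```

Here item "B" ranks first for item "A" because the same users rated both in the same way, while "C" shares only one low rating with "A".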

 
Large Scale Machine Learning Internals 

Distributed Optimization for Large Scale Supervised Learning
K-Means/Streaming K-Means at Large Scale
Variational EM for Learning & Inference in LDA
Alternating Least Squares (ALS) and Implicit ALS for Matrix Factorization based Collaborative Filtering
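
The idea behind distributed optimization for supervised learning can be sketched in a few lines: each data partition computes a partial gradient, the partial results are summed, and a single update is applied to the model. The plain-Python example below simulates this for a one-variable linear regression. The function names, the learning rate, the step count, and the toy data are all assumptions for illustration; this is not how any particular library implements it.

```python
from functools import reduce


def partial_gradient(partition, w, b):
    """Map step: sum the squared-error gradients over one data partition."""
    gw = gb = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        gw += error * x
        gb += error
    return gw, gb, len(partition)


def combine(left, right):
    """Reduce step: add two partial results element-wise."""
    return left[0] + right[0], left[1] + right[1], left[2] + right[2]


def fit(partitions, learning_rate=0.05, steps=2000):
    """Gradient descent where every step aggregates per-partition gradients."""
    w = b = 0.0
    for _ in range(steps):
        gw, gb, n = reduce(
            combine, (partial_gradient(p, w, b) for p in partitions)
        )
        # Apply one update using the gradient averaged over all examples.
        w -= learning_rate * gw / n
        b -= learning_rate * gb / n
    return w, b


# Hypothetical data generated from y = 2x + 1, split into two "partitions".
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
w, b = fit(partitions)
```

Because only the small gradient tuples cross partition boundaries, the same pattern carries over to a cluster, where each partition lives on a different worker.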
Istanbul, Turkey
