Machine Learning on Big Data

October 16 - October 18

From data munging to model evaluation, “Machine Learning on Big Data” is a 2-day course covering the entire Data Science pipeline: converting collected Big Data into mathematical data structures; learning distributed regression, classification, and recommender system models; implementing such models with Apache Mahout and Apache Spark; and assessing how well the models perform. The course covers several data mining and machine learning models and scalable learning algorithms, including:

  • Generalized Linear Models: Linear Regression and Logistic Regression
  • Decision Trees
  • Clustering
  • Mixed Membership Models: Latent Dirichlet Allocation
  • Similarity Analysis
  • Matrix Factorization
  • Learning Ensembles: Random Forests

Participants will become familiar with cutting-edge open source Machine Learning libraries and run Machine Learning pipelines on pre-installed Hadoop/Spark clusters provided by Analytics Center.


Hands-on Labs


Through hands-on labs and demonstrations, participants will use ecosystem tools to:

  • Prepare a prototyping (interactive notebook) environment for large-scale data analysis
  • Transform Big Data into Machine Learning data structures
  • Understand scalable Machine Learning algorithms
  • Write programs to learn and evaluate supervised learning models
  • Write programs for clustering data
  • Write programs for finding mixed memberships to clusters (topic modeling)
  • Write programs to learn and evaluate recommender system models
  • Design a typical analytics/machine learning pipeline


Course Prerequisites


“Machine Learning on Big Data” is the core course for the Data Scientist learning track. Attendees are expected to appreciate what Machine Learning is capable of and to have a general understanding of how Big Data processing technologies work.

Attendees should be able to write simple programs in either Scala or Python, although the amount of programming required is minimal.


Course Coverage



Machine Learning Essentials & Big Learning

  1. Learning from Data
  2. Common Machine Learning Tasks
  3. Example Use Cases
  4. Machine Learning in the Big Data Era
  5. Machine Learning on Big Data: Challenges
  6. Machine Learning Pipelines:
    1. Data Preparation/Transformation
    2. Learning
    3. Evaluation
    4. Deployment


Big Data Science Ecosystem

  1. Apache Spark
    1. Apache Spark Basics
    2. RDD APIs: Scala API and PySpark
    3. Spark DataFrames (and Datasets)
    4. Spark ML Pipelines
  2. Apache Zeppelin for Interactive Data Analysis Notebooks



Data Munging

  1. Summarizing Large Datasets
  2. Common Data Transformation Tasks
  3. Data Structures for Machine Learning
  4. Working with Text Data


Supervised Learning

  1. Learning & Evaluation using Spark APIs:
    1. Linear Regression
    2. Logistic Regression
    3. Naive Bayes
    4. Decision & Regression Trees
    5. Tree Ensembles: Random Forests
  2. Evaluation
  3. Making Predictions


Unsupervised Learning

  1. Clustering:
    1. K-Means
    2. Gaussian Mixture
  2. Mixed Membership: Latent Dirichlet Allocation
  3. Online Clustering from Data Streams: Streaming K-Means


Recommender Systems

  1. Similarity Based Collaborative Filtering
  2. Matrix Factorization Based Collaborative Filtering
  3. Evaluating Recommender Systems


Large Scale Machine Learning Internals

  1. Distributed Optimization for Large Scale Supervised Learning
  2. K-Means/Streaming K-Means at Large Scale
  3. Variational EM for Learning & Inference in LDA
  4. Alternating Least Squares (ALS) and Implicit ALS for Matrix Factorization Based Collaborative Filtering


Istanbul Venue
Istanbul, 34345 Turkey
+90 212 217 63 88


Analytics Center
+90 212 217 63 88