"Big Data Essentials" is a 3-days course covering storing, processing, and learning from Big Data, as well as the breadth of the open source Big Data technology: from the core infrastructures to the ecosystem tools for collecting, analyzing, and mining of Big Data.
“Big Data Essentials” is a 3-day introductory course covering the paradigms for storing and processing Big Data, as well as the breadth of open source Big Data technologies: from the core components YARN and HDFS to the ecosystem tools for collecting, analyzing, and learning from Big Data. Attendees will gain a solid understanding of the motivation and use cases for Big Data, and of the components that make up a typical Hadoop v2 cluster, including:
- Apache Spark
- Apache Flume
- Apache Sqoop
- Apache Pig
- Apache Hive
Through hands-on labs and demonstrations, participants will become familiar with the ecosystem tools by implementing solutions on their own laptops and running them on pre-installed Hadoop clusters provided by Analytics Center.
Guided by the instructor, attendees will work through hands-on labs and demonstrations in order to:
- Understand the HDFS block storage
- Understand how MapReduce and Spark work
- Collect log streams into HDFS using Apache Flume
- Import data into HDFS from relational databases using Apache Sqoop
- Write Pig Scripts to process Big Data
- Create and populate Hive tables
- Run SQL queries on Hive tables
Big Data Essentials is the initial, common course for all of our learning tracks (Big Data Developer, Big Data Analyst, Data Scientist), and its content is a prerequisite for further courses (Developing Big Data Applications, Architecting Big Data Solutions, Big Data Analytics, Big Data Science). The course is tailored for both software developers and data analysis professionals.
Basic knowledge of SQL is the only prerequisite for this course; the ability to read Java and Scala code, as well as basic Linux skills, is helpful but not required.
Big Data Analysis Stack
- Big Data Ecosystem
- Big Data Use Cases
- Open Source Big Data Analysis Stack
Big Data Infrastructures
- Hadoop Distributed File System (HDFS)
- Cluster Resources Management (YARN)
Programming Big Data
- Distributed Execution Patterns
- Parallel Execution
- Working with Records of Pairs
- Partitioning (Shuffling)
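The execution pattern covered in this module (map over records to emit key–value pairs, partition and shuffle them by key, then reduce each group) can be sketched in plain Python. This is an illustrative single-process model of the pattern, not actual Hadoop or Spark code; all function names are hypothetical:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for a word count
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs, num_partitions=2):
    # Shuffle: route each pair to a partition by hashing its key,
    # so every reducer sees all values belonging to a given key
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def reduce_phase(partitions):
    # Reduce: aggregate the grouped values for each key
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            result[key] = sum(values)
    return result

lines = ["big data big ideas", "data pipelines"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(sorted(counts.items()))
```

In a real cluster, each phase runs in parallel across many machines and the shuffle moves data over the network; the structure of the computation, however, is the same as in this sketch.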
Distributed Execution Engines
- Apache Hadoop MapReduce
- Apache Spark
Ingesting Big Data
- Collecting Continuous Log Streams (Apache Flume)
- Collecting from External Databases (Apache Sqoop)
Big Data Analytics
- High Level Big Data Programming Abstractions (Apache Pig)
- Big Data Warehouse (Apache Hive)
*Please note that our trainings are conducted in Turkish.