  • This event has passed.

Architecting Big Data Applications

January 1, 2019 @ 8:00 am - 5:00 pm

“Architecting Big Data Applications” is a three-day, in-depth course covering the concepts, tools, and techniques needed to architect Big Data management and analysis software, and to design and implement end-to-end solutions, from ingesting data to serving knowledge. The course focuses on designing a Big Data Analysis Stack by integrating the tools in the Apache Hadoop ecosystem.

Course Coverage

Big Data Analysis Stack

Big Data Infrastructures

  • Distributed Storage (HDFS)
  • Cluster Resources Management (YARN, Mesos)
  • Distributed Execution Engines (MapReduce, Spark)
  • Low Latency Querying of Big Data (HBase)

Distributed Execution Engines

  • Distributed Execution Patterns
  • Parallel Execution
  • Aggregations
  • Working with Records of Pairs
  • Partitioning (Shuffling)
  • Apache Hadoop MapReduce
  • Apache Spark
  • Apache Tez
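
As a rough illustration of the execution patterns listed above (records of pairs, partitioning by key, and aggregation), the following sketch counts words in plain Python. It is not taken from the course material and does not use the Hadoop or Spark APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the byte-sum partitioner are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit one (key, value) pair per word: the 'records of pairs' pattern."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle(pairs, num_partitions):
    """Group values by key and route each key to exactly one partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Illustrative stable partitioner (assumption): sum of the key's bytes.
        # Python's built-in hash() for strings varies between processes.
        index = sum(key.encode("utf-8")) % num_partitions
        partitions[index][key].append(value)
    return partitions


def reduce_phase(partitions):
    """Aggregate the grouped values of every key, partition by partition."""
    totals = {}
    for partition in partitions:
        for key, values in partition.items():
            totals[key] = sum(values)
    return totals


def word_count(lines, num_partitions=2):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    return reduce_phase(shuffle(map_phase(lines), num_partitions))
```

In a real engine, each partition would be reduced on a different worker; here the partitions are processed in a single loop to keep the sketch self-contained.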

Distributed Storage

  • Block Storage
  • Fault Tolerance
  • Big Data Structures
    • Storage for Analytics: Hadoop InputFormat/OutputFormat
    • Storage for Low Latency Querying: HBase
  • (De-)Serialization
  • Coordinate-wise Storage
  • Columnar Storage (ORC)
  • Nested Columnar Storage (Parquet)

Big Data Applications

  • MapReduce and Spark Programs
  • High Level Data Programming Abstractions (Pig)
  • SQL over Big Data (Hive, Spark SQL, …)
  • Machine Learning on Big Data

Ingesting Big Data

  • Collecting Continuous Log Streams (Apache Flume)
  • Collecting from External Databases (Apache Sqoop)

Distributed Stream Processing

  • Processing Streaming Data (Apache Storm, Apache Spark Streaming)
  • Distributed Messaging (Apache Kafka)

Unified Big Data Stack

  • Interoperability of Big Data Applications
  • Distributed Coordination (Apache ZooKeeper)
  • Big Data Abstractions
  • Common Storage
  • Common Schema
  • Serialization/Deserialization

Writing YARN Applications