Building Automated Data Pipelines

Course overview

Big Data is not just about volume. It is also about ingesting data continuously and keeping it up to date. In practice, a Data Engineer collects data feeds day after day to produce fresh business insights. Automated data pipelines take over this routine and make the process maintainable, fault-tolerant, scalable, and predictable. Just as important, they let you react to new data immediately and be the first to learn something new about a changing world.

Course topics

Topic 1: Airflow basics
  • Airflow: purpose and overview
  • What is a DAG (a minimal sketch follows this list)
  • Airflow UI
  • Airflow CLI
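
As a first taste, here is a minimal sketch of a DAG, assuming Airflow 2.x; the DAG id and task are illustrative, not part of the course materials:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A DAG is a collection of tasks plus their dependencies and a schedule.
    with DAG(
        dag_id="hello_pipeline",            # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # One task; real pipelines chain many tasks via dependencies.
        BashOperator(task_id="say_hello", bash_command="echo 'hello, pipeline'")
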
Topic 2: Mastering DAGs
  • The start_date and schedule_interval parameters
  • Backfill and catchup (see the sketch after this list)
  • DAG folder structure
  • Task dependencies
  • Testing DAGs
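
To illustrate Topic 2, a sketch of how start_date, schedule_interval, and catchup interact, again assuming Airflow 2.x (names are illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="catchup_demo",              # hypothetical name
        start_date=datetime(2024, 1, 1),    # first logical date the scheduler considers
        schedule_interval="0 6 * * *",      # run daily at 06:00
        catchup=True,                       # on deploy, schedule a run for every missed interval since start_date
    ) as dag:
        EmptyOperator(task_id="noop")

A one-off backfill for a date range can also be started from the CLI, e.g. airflow dags backfill -s 2024-01-01 -e 2024-01-07 catchup_demo.
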
Topic 3: Advanced DAGs
  • Repetitive patterns with SubDAGs
  • Trigger rules for tasks
  • Variables, Macros, and Templates
  • Data sharing between tasks with XComs (see the sketch after this list)
  • TriggerDagRunOperator
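
A sketch of XCom-based data sharing using the Airflow 2.x TaskFlow API (functions and values are illustrative):

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
    def xcom_demo():
        @task
        def extract():
            return {"rows": 42}     # the return value is pushed to XCom automatically

        @task
        def load(payload: dict):
            print(f"loaded {payload['rows']} rows")  # argument is pulled from XCom

        load(extract())             # also wires the task dependency

    xcom_demo()

Trigger rules are set per task; for example, trigger_rule="all_done" on an operator lets a cleanup step run whether or not its upstream tasks succeeded.
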
Topic 4: Airflow operationalization
  • Logging system
  • Maintenance DAGs (see the example after this list)
  • Metrics
  • CI/CD with Airflow
  • Real-world architecture examples
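
As an example of operationalization, a sketch of a maintenance DAG that prunes old task logs; the log path is an assumption and must match your deployment:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="airflow_log_cleanup",       # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@weekly",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="prune_logs",
            # Delete task-log files older than 30 days (path is an assumption).
            bash_command="find /opt/airflow/logs -type f -mtime +30 -delete",
        )
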
Topic 5: Data pipelines best practices
  • The need for re-processing: operation idempotency, raw-data archiving, and staging of intermediate results (see the sketch after this list).
  • Data quality control. “Dead letter” channel. Alerting.
  • Logging and auditing, data lineage.
  • Separation of pipeline orchestration and data-heavy lifting. Offloading “heavy” operations.
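
A sketch of an idempotent load: the task writes to a partition keyed by the run's logical date (Airflow's templated {{ ds }} macro), so re-processing a day overwrites that partition instead of duplicating data. The my_loader CLI and bucket paths are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="idempotent_load",           # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="load_partition",
            # {{ ds }} renders as the run's logical date, e.g. 2024-01-01.
            bash_command=(
                "my_loader --input s3://raw/events/{{ ds }}/ "          # hypothetical CLI
                "--output s3://staging/events/dt={{ ds }}/ --overwrite"
            ),
        )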
