Course overview
Big Data is not just about volume: it is also about ingesting data continuously and keeping it up to date. In practice, as a Data Engineer you have to collect data feeds day after day in order to derive new business insights. Automated data pipelines eliminate this routine work and make the process maintainable, fault-tolerant, scalable, and predictable. Moreover, it is important to react to new data immediately and be the first to learn about a changing world.

Course topics
Topic 1: Airflow basics
- Airflow: purpose and overview
- What is a DAG (see the first sketch after this list)
- Airflow UI
- Airflow CLI
- The start_date and schedule_interval parameters
- Backfill and Catchup
- DAG folder structure
- Task dependencies
- Testing DAGs
- Repetitive patterns with SubDAGs
- Trigger rules for tasks (see the trigger-rule sketch after this list)
- Variables, macros, and templates (see the templating sketch after this list)
- Data sharing between tasks with XComs (see the XCom sketch after this list)
- TriggerDagRunOperator
- Logging system
- Maintenance DAGs
- Metrics
- CI/CD with Airflow
- Real architecture example(s)
- The need for reprocessing: operation idempotency, raw data archiving, and staging of intermediate results
- Data quality control: the “dead letter” channel and alerting
- Logging, auditing, and data lineage
- Separating pipeline orchestration from data-heavy lifting: offloading “heavy” operations
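
Example sketches

To make the basics concrete, here is a minimal sketch of a daily DAG, assuming Airflow 2.x; the dag_id and bash commands are hypothetical placeholders. It shows a DAG definition, the start_date, schedule_interval, and catchup parameters, and a task dependency.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="daily_feed_example",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),      # first logical run date
    schedule_interval="@daily",           # run once per day
    catchup=False,                        # do not backfill runs before "now"
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    extract >> load  # "load" runs only after "extract" succeeds
```

With catchup=True, the scheduler would instead backfill one run per day between start_date and the current date.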
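A sketch of trigger rules, again assuming Airflow 2.x with hypothetical task names: a cleanup task that runs even when an upstream task has failed.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    dag_id="trigger_rule_example",        # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,               # triggered manually
) as dag:
    branch_a = BashOperator(task_id="branch_a", bash_command="echo a")
    branch_b = BashOperator(task_id="branch_b", bash_command="exit 1")  # fails on purpose

    # "all_done" runs the task once all upstream tasks have finished,
    # regardless of whether they succeeded or failed.
    cleanup = BashOperator(
        task_id="cleanup",
        bash_command="echo cleaning up",
        trigger_rule=TriggerRule.ALL_DONE,
    )

    [branch_a, branch_b] >> cleanup
```

With the default all_success rule, a task does not run if any upstream task fails; all_done is a common choice for cleanup steps.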
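A sketch of Jinja templating in an operator argument, assuming Airflow 2.x and a hypothetical dag_id. Airflow renders templated fields such as bash_command at runtime, and built-in macros like ds expand to the run's logical date.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="template_example",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # {{ ds }} is a built-in macro: the logical date as YYYY-MM-DD.
    report = BashOperator(
        task_id="report",
        bash_command="echo processing feed for {{ ds }}",
    )
```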
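Finally, a sketch of passing a small value between tasks with XComs, assuming Airflow 2.x; the dag_id and payload are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def produce(**context):
    # A task's return value is stored as an XCom under the
    # default key "return_value".
    return {"rows_loaded": 42}  # illustrative payload

def consume(**context):
    payload = context["ti"].xcom_pull(task_ids="produce")
    print(f"upstream reported {payload['rows_loaded']} rows")

with DAG(
    dag_id="xcom_example",                # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,               # triggered manually
) as dag:
    producer = PythonOperator(task_id="produce", python_callable=produce)
    consumer = PythonOperator(task_id="consume", python_callable=consume)
    producer >> consumer
```

XComs are stored in the metadata database, so they suit small control values rather than the data itself.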