Projects of the Data Engineering Nanodegree of Udacity.
This repo contains the four projects of the Data Engineering Nanodegree of Udacity.
The projects’ code is written in Python using the corresponding libraries and sdks for each
tool used: Postgres, Cassandra, Redshift, Spark and Airflow.
Modelling of user activity data to create a database and ETL pipeline in Postgres for a musing straming app.
Modeling of event data to create a non-relational database and ETL pipeline for a music streaming app.
Construction of an ETL pipeline that extracts data from S3, stages them in Redshift and transforms data into a set of dimensional tables for the analytics team.
Construction of an ETL pipeline that extracts data from an S3 bucket, processes it with Spark and loads it back into S3 as a set of dimensional tables in partitioned parquet files.
Construction of an ETL pipeline in an Airflow dag that extracts data from S3, stages it in Redshift and transforms it into a set of dimensional tables for the analytics team. The last step of the dag consists on executing a series of Data Quality checks into the final tables.