Projects completed in the Data Engineering Nanodegree by Udacity.com
Course 1: Data Modeling
Introduction to Data Modeling
➔ Understand the purpose of data modeling
➔ Identify the strengths and weaknesses of different types of databases and data storage techniques
➔ Create a table in Postgres and Apache Cassandra
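A minimal sketch of the table-creation objective, assuming a local Postgres database named sparkifydb (user student) and a single-node Cassandra cluster with a sparkify keyspace already created; all names are placeholders, not the project's actual schema:

```python
import psycopg2
from cassandra.cluster import Cluster

# Postgres: connection parameters are placeholders.
conn = psycopg2.connect(host="127.0.0.1", dbname="sparkifydb",
                        user="student", password="student")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id  VARCHAR PRIMARY KEY,
        title    VARCHAR NOT NULL,
        duration FLOAT
    );
""")

# Cassandra: assumes a local node and an existing keyspace named sparkify.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sparkify")
session.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id  TEXT,
        title    TEXT,
        duration FLOAT,
        PRIMARY KEY (song_id)
    )
""")
```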
Relational Data Models
➔ Understand when to use a relational database
➔ Understand the difference between OLAP and OLTP databases
➔ Create normalized data tables
➔ Implement denormalized schemas (e.g. star, snowflake)
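A hedged sketch of a star schema in Postgres, using a hypothetical songplay fact table and a user dimension (table and column names are illustrative):

```python
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", dbname="sparkifydb",
                        user="student", password="student")
conn.autocommit = True
cur = conn.cursor()

# Dimension table: denormalized user attributes in one place.
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        level      VARCHAR
    );
""")

# Fact table: one row per songplay event, keyed into the dimensions.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT REFERENCES dim_users (user_id),
        song_id     VARCHAR,
        session_id  INT
    );
""")
```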
NoSQL Data Models
➔ Understand when to use NoSQL databases and how they differ from relational databases
➔ Select the appropriate primary key and clustering columns for a given use case
➔ Create a NoSQL database in Apache Cassandra
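In Cassandra the table is modeled around one query. A sketch, assuming the hypothetical query "list the songs played in a given session, ordered by item number": session_id becomes the partition key and item_in_session the clustering column.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("sparkify")

# The partition key routes each query to a single partition; the
# clustering column sorts rows within that partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS session_songs (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song_title      TEXT,
        PRIMARY KEY ((session_id), item_in_session)
    )
""")

rows = session.execute(
    "SELECT artist, song_title FROM session_songs WHERE session_id = 338")
```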
Project 1: Data Modeling with Postgres and Apache Cassandra
Course 2: Cloud Data Warehouses
Introduction to Data Warehouses
➔ Understand Data Warehousing architecture
➔ Run an ETL process to denormalize a database (3NF to star; see the sketch after this list)
➔ Create an OLAP cube from facts and dimensions
➔ Compare columnar vs. row oriented approaches
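A sketch of the denormalization and cube steps together, assuming hypothetical 3NF source tables orders and customers; Postgres's GROUP BY CUBE aggregates over every combination of the listed dimensions:

```python
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", dbname="dwh",
                        user="student", password="student")
conn.autocommit = True
cur = conn.cursor()

# ETL: join normalized 3NF tables into one wide, star-style fact table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales AS
    SELECT o.order_id, o.order_date, c.country, o.amount
    FROM orders o
    JOIN customers c USING (customer_id);
""")

# OLAP cube: totals by country, by month, by both, and the grand total.
cur.execute("""
    SELECT country, date_trunc('month', order_date) AS month, SUM(amount)
    FROM fact_sales
    GROUP BY CUBE (country, date_trunc('month', order_date));
""")
for row in cur.fetchall():
    print(row)
```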
Introduction to the Cloud with AWS
➔ Understand cloud computing
➔ Create an AWS account and understand its services
➔ Set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL
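A minimal boto3 sketch for the S3 piece, assuming AWS credentials are already configured locally (e.g. via aws configure); the bucket name is a placeholder and must be globally unique:

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# LocationConstraint is required for buckets outside us-east-1.
s3.create_bucket(
    Bucket="my-dend-demo-bucket",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Stage a local file into the bucket.
s3.upload_file("data/song_data.json", "my-dend-demo-bucket", "song_data.json")
```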
Implementing Data Warehouses on AWS
➔ Identify components of the Redshift architecture
➔ Run an ETL process to extract data from S3 into Redshift
➔ Set up AWS infrastructure using Infrastructure as Code (IaC)
➔ Design an optimized table by selecting the appropriate distribution style and sorting key
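A sketch tying the three objectives together: provision the cluster as code, create a table with an explicit distribution and sort key, and COPY staged data in from S3. The cluster name, IAM role ARN, bucket, endpoint, and password are all placeholders:

```python
import boto3
import psycopg2

# IaC: provision a Redshift cluster programmatically instead of by console.
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="dwh-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    MasterUsername="dwhuser",
    MasterUserPassword="Passw0rd",  # placeholder
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],  # placeholder ARN
)

# Once the cluster is available, connect via its endpoint (placeholder host).
conn = psycopg2.connect(host="dwh-cluster.example.us-west-2.redshift.amazonaws.com",
                        dbname="dwh", user="dwhuser",
                        password="Passw0rd", port=5439)
cur = conn.cursor()

# DISTKEY co-locates rows that join on user_id; SORTKEY speeds range scans.
cur.execute("""
    CREATE TABLE songplays (
        start_time TIMESTAMP SORTKEY,
        user_id    INT DISTKEY,
        song_id    VARCHAR
    );
""")

# ETL: bulk-load staged JSON from S3 with COPY.
cur.execute("""
    COPY songplays FROM 's3://my-dend-demo-bucket/songplays/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    FORMAT AS JSON 'auto';
""")
conn.commit()
```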
Project 2: Data Infrastructure on the Cloud
Course 3: Data Lakes with Spark
The Power of Spark
➔ Understand the big data ecosystem
➔ Understand when to use Spark and when not to use it
Data Wrangling with Spark
➔ Manipulate data with SparkSQL and Spark Dataframes
➔ Use Spark for ETL purposes
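A minimal sketch of both bullets, assuming a local Spark installation and a hypothetical log file data/sparkify_log_small.json:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("wrangling").getOrCreate()
log = spark.read.json("data/sparkify_log_small.json")

# DataFrame API: filter and aggregate.
log.filter(col("page") == "NextSong").groupBy("userId").count().show(5)

# The same question in SparkSQL.
log.createOrReplaceTempView("log")
spark.sql("""
    SELECT userId, COUNT(*) AS plays
    FROM log
    WHERE page = 'NextSong'
    GROUP BY userId
""").show(5)

# ETL: write the cleaned subset back out as partitioned Parquet.
(log.filter(col("page") == "NextSong")
    .write.mode("overwrite")
    .partitionBy("userId")
    .parquet("output/songplays"))
```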
Debugging and Optimization
➔ Troubleshoot common errors and optimize code using the Spark Web UI
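Two common optimizations whose effect is visible in the Spark Web UI (localhost:4040 by default), sketched under the assumption that the Parquet output from the previous sketch exists:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning").getOrCreate()
df = spark.read.parquet("output/songplays")  # path from the sketch above

# Rebalance skewed data so the UI's stage view shows evenly sized tasks.
df = df.repartition(8)

# Cache a DataFrame reused by several actions: the Storage tab shows it
# materialized once instead of being recomputed per action.
df.cache()
print(df.count())
```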
Introduction to Data Lakes
➔ Understand the purpose and evolution of data lakes
➔ Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
➔ Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages (see the sketch after this list)
➔ Understand the components and issues of data lakes
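A sketch of a lake-style ELT step with Spark: read raw JSON from S3, transform, and write partitioned Parquet back to the lake. The bucket and paths are placeholders; on EMR the s3:// scheme is resolved by EMRFS:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year

spark = SparkSession.builder.appName("datalake").getOrCreate()

# Load raw, schema-on-read event logs straight from the lake.
raw = spark.read.json("s3://my-dend-demo-bucket/log_data/*.json")

# Transform: keep song plays, convert the epoch-millis ts column,
# and derive a partition column.
songplays = (raw.filter(col("page") == "NextSong")
                .withColumn("start_time", (col("ts") / 1000).cast("timestamp"))
                .withColumn("year", year("start_time")))

# Write analytics-ready, partitioned Parquet back to the lake.
(songplays.write.mode("overwrite")
          .partitionBy("year")
          .parquet("s3://my-dend-demo-bucket/analytics/songplays/"))
```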
Project 3: Big Data with Spark
Course 4: Automate Data Pipelines
Data Pipelines
➔ Create data pipelines with Apache Airflow
➔ Set up task dependencies
➔ Create data connections using hooks
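A minimal Airflow 2 sketch covering all three bullets: a DAG with two tasks, an explicit dependency, and a PostgresHook reading credentials from an assumed connection named postgres_conn (the table is hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_data():
    # The hook pulls host/user/password from Airflow's connection store,
    # so no credentials live in the DAG file.
    hook = PostgresHook(postgres_conn_id="postgres_conn")
    hook.run("INSERT INTO stats SELECT COUNT(*) FROM songplays;")


with DAG("example_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily") as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    load = PythonOperator(task_id="load", python_callable=load_data)

    extract >> load  # task dependency: extract runs before load
```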
Data Quality
➔ Track data lineage
➔ Set up data pipeline schedules
➔ Partition data to optimize pipelines
➔ Write tests to ensure data quality (see the sketch after this list)
➔ Backfill data
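A sketch of one data-quality test as an Airflow callable, reusing the assumed postgres_conn connection and hypothetical songplays table from the DAG above:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_not_empty(table: str = "songplays"):
    """Raise (failing the task and the DAG run) if the table is empty."""
    hook = PostgresHook(postgres_conn_id="postgres_conn")
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] == 0:
        raise ValueError(f"Data quality check failed: {table} is empty")
```

Scheduling and backfill come from the DAG definition itself: a start_date in the past combined with catchup=True makes Airflow create one run per missed schedule interval.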
Production Data Pipelines
➔ Build reusable and maintainable pipelines
➔ Build your own Apache Airflow plugins (see the sketch after this list)
➔ Implement subDAGs
➔ Set up task boundaries
➔ Monitor data pipelines
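A sketch of a reusable custom operator, the usual building block of an Airflow plugin: the hard-coded quality check above becomes a parameterized, testable class (all names are illustrative):

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class HasRowsOperator(BaseOperator):
    """Fail the task if the given table has no rows."""

    def __init__(self, table: str,
                 postgres_conn_id: str = "postgres_conn", **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.postgres_conn_id = postgres_conn_id

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        records = hook.get_records(f"SELECT COUNT(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"No rows in {self.table}")
        # Logged output shows up in the task's log view for monitoring.
        self.log.info("Table %s passed with %s rows",
                      self.table, records[0][0])
```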