ETL Project 1

Ola! Welcome to my ETL project. This project takes the Brazilian e-commerce history dataset of recent years, cleans the data, and stores it in a Postgres data warehouse. The following describes how to use this project. Feel free to clone and use it.

Table of Contents

Installation
Usage
Features
Contact
License
Acknowledgement

Installation

To run the ETL process successfully, you need Python and PostgreSQL installed. If you don't have them, install them first. Then install the libraries this project requires. Navigate to the root directory of the project and run the command:

pip install -r requirements.txt

This installs all the libraries listed in the requirements.txt file. The "-r" option tells pip to read the list of packages from that file. The listed libraries do not have pinned versions, so pip installs the latest available release of each.

Usage

As mentioned earlier, this ETL project extracts, transforms, and loads data into a Postgres database. The input data lives in the data directory inside the project's subfolders. transform.py extracts, cleans, and transforms the input data, then writes the result to a new CSV file in the etl directory. load.py creates the Postgres database and its tables (customers, orders, products, and so on).
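To make the clean-and-write step concrete, here is a minimal sketch. It is not the code in transform.py: it uses the standard-library csv module instead of pandas so it runs without extra installs, and the column names (customer_id, city), the cleaning rules, and the function names are assumptions chosen for illustration.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace, drop rows with an empty id, and drop repeated ids."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: (value or "").strip() for key, value in row.items()}
        row_id = row.get("customer_id", "")  # hypothetical key column
        if not row_id or row_id in seen:
            continue
        seen.add(row_id)
        cleaned.append(row)
    return cleaned


def transform_csv(source, target):
    """Read CSV text from `source`, clean it, and write CSV text to `target`."""
    reader = csv.DictReader(source)
    cleaned = clean_rows(reader)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(cleaned)
    return len(cleaned)


# In-memory stand-ins for the input file and the cleaned output file.
raw = io.StringIO("customer_id,city\n a1 , sao paulo \n,rio\na1,sao paulo\nb2,curitiba\n")
out = io.StringIO()
kept = transform_csv(raw, out)
print(kept)
print(out.getvalue())
```

With real files, `source` and `target` would be file handles opened with newline="" as the csv module recommends.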

Finally, to run the ETL pipeline, create a .env file in your parent directory and define these variables:

DB_HOST
DB_PORT
DB_USER
DB_PASSWORD
DB_NAME
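As a sketch of how these five variables might be consumed, the helper below checks that each one is present and converts them into connection settings. The function name read_db_settings and the placeholder values are assumptions, not part of the project. In the project itself, python-dotenv's load_dotenv() would first copy the .env entries into os.environ.

```python
import os

REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_USER", "DB_PASSWORD", "DB_NAME")


def read_db_settings(env=None):
    """Return connection settings from `env`, failing early if a key is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing database settings: " + ", ".join(missing))
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "dbname": env["DB_NAME"],
    }


# Placeholder values for illustration only; real values belong in .env.
example_env = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_USER": "etl_user",
    "DB_PASSWORD": "change-me",
    "DB_NAME": "ecommerce",
}
settings = read_db_settings(example_env)
print(settings["host"], settings["port"], settings["dbname"])
```

The returned keys match keyword arguments that psycopg2.connect accepts, so psycopg2.connect(**settings) would be one way to open the connection.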

Then run the command python ProjectI/etl.py from the directory that contains the ProjectI folder, or adjust the path to suit your folder structure. The code in etl.py imports the transform and load scripts and runs them sequentially.
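The sequential run can be pictured with the sketch below. It is an illustration under assumptions rather than the contents of etl.py: run_pipeline is a hypothetical helper, and the two functions are stand-ins for whatever the transform and load scripts expose.

```python
def run_pipeline(steps):
    """Run each (name, callable) step in order; an exception stops the run."""
    completed = []
    for name, step in steps:
        print(f"Running {name}...")
        step()
        completed.append(name)
    return completed


def transform():
    # Stand-in for the project's transform script.
    print("transform: cleaned CSV written")


def load():
    # Stand-in for the project's load script.
    print("load: tables created and filled")


completed = run_pipeline([("transform", transform), ("load", load)])
print(completed)
```

Running load only after transform returns means a failed transform never leaves the database loaded with stale or partial data.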

Features

The pipeline features the use of pandas for formatting data, psycopg2 for running SQL queries, and python-dotenv to reference hidden credentials.

Contact

For any inquiries or issues, please contact me via LinkedIn.

License

This project is licensed under the MIT License.

Acknowledgement

Special thanks to the Compass team; their guidance was wind to my wings.

ETL Project 2

Ola! This is my second ETL project. It combines data about movies and series from different sources into an S3 data lake. The refined layer was meant to be stored in Parquet format and shaped to answer questions surrounding the Hollywood actor Richard Radcliff.

Installation

To run the ETL process successfully, you need Python installed. If you don't have it, install it first. Then install the libraries this project requires. Navigate to the root directory of the project and run the command:

pip install -r requirements.txt

This installs all the libraries listed in the requirements.txt file. The "-r" option tells pip to read the list of packages from that file. The listed libraries do not have pinned versions, so pip installs the latest available release of each.

Usage

This project fetches data from different sources: CSV files on my local machine, enriched with data from TMDB API calls. Unfortunately, I couldn't upload the CSV files to my AWS Organisation's account because I kept getting an error about a service control policy. Details are provided below:

(.venv) macbookpro@Universe-2 Compass Projects % python ProjectII/s3Upload.py
Error uploading file: Failed to upload ProjectII/files/movies.csv to hogans-project2/Raw/local/CSV/Movies/2024/14/09/movies.csv: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: User: arn:aws:iam::985539759506:user/compass is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::hogans-project2/Raw/local/CSV/Movies/2024/14/09/movies.csv" with an explicit deny in a service control policy
Error uploading file: Failed to upload ProjectII/files/series.csv to hogans-project2/Raw/local/CSV/Series/2024/14/09/series.csv: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: User: arn:aws:iam::985539759506:user/compass is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::hogans-project2/Raw/local/CSV/Series/2024/14/09/series.csv" with an explicit deny in a service control policy

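In an attempt to solve this problem, I used my root account credentials instead of the IAM user to which I had assigned s3FullAccess permissions, but the error persisted. I also tried writing an explicit Allow policy for my S3 bucket, but that didn't work either. Finally, I tried a manual upload of the CSV files from the AWS console, but that threw an error about multipart uploading. This is consistent with how service control policies behave: an explicit deny in an SCP overrides any Allow in IAM or bucket policies and also applies to the root user of a member account, so the fix most likely has to come from whoever manages the organization's policies.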
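The object keys in the error output above follow a fixed layout: a Raw/local/CSV prefix, a category folder, three date folders, and the file name. The sketch below rebuilds such a key from a date. The function name build_raw_key is hypothetical and this is not the code in s3Upload.py. Since 14 cannot be a month, the logged path reads as year/day/month, and the sketch reproduces that ordering.

```python
from datetime import date


def build_raw_key(category, filename, day):
    """Build an object key shaped like the ones in the error output.

    The logged keys read as year/day/month (2024/14/09), so that ordering is
    reproduced here; year/month/day would list in chronological order.
    """
    return (
        f"Raw/local/CSV/{category}/"
        f"{day:%Y}/{day:%d}/{day:%m}/{filename}"
    )


key = build_raw_key("Movies", "movies.csv", date(2024, 9, 14))
print(key)
```

Switching the two date fields to {day:%m}/{day:%d} would give year/month/day keys, which group a month's uploads under one prefix.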

However, I was able to fetch the needed data from TMDB with the code in my lambda.py file, which I called from my Lambda function in AWS. The data came back in JSON format, and I stored it in files of no more than 100 objects each. These can be found in my S3 bucket.
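The 100-objects-per-file rule can be sketched as below. This is an illustration, not the code in lambda.py: the helper names, the tmdb file-name prefix, and the sample records are assumptions, and the JSON text is kept in memory instead of being written to S3.

```python
import json


def chunk(items, size=100):
    """Split `items` into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[start:start + size] for start in range(0, len(items), size)]


def to_json_files(records, prefix="tmdb", size=100):
    """Map hypothetical file names to JSON text, one entry per chunk."""
    return {
        f"{prefix}_{index:04d}.json": json.dumps(batch)
        for index, batch in enumerate(chunk(records, size), start=1)
    }


# 250 placeholder records produce two full files and one partial file.
records = [{"id": n} for n in range(250)]
files = to_json_files(records)
print({name: len(json.loads(body)) for name, body in files.items()})
```

Each value in `files` is a JSON array that could be uploaded as one object, so no file ever holds more than 100 records.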

Features

The project builds and manages a data lake on AWS services (S3 for storage and Lambda for fetching TMDB data), with the aim of supporting analysis of the collected data.

Contact

For any inquiries or issues, please contact me via LinkedIn.

License

This project is licensed under the MIT License.

Acknowledgement

As always, special thanks to the Compass team for their guidance.
