sudohogan/ETL-Project

ETL Project 1

Ola! Welcome to my ETL project. It takes the Brazilian e-commerce history dataset of recent years, cleans the data, and stores it in a Postgres data warehouse. The following describes how to use the project. Feel free to clone and use it.

Installation

To run the ETL process successfully, you need Python and PostgreSQL installed. If you don't have them, install them first. Then install the libraries this project requires: navigate to the root directory of the project and run the command:

pip install -r requirements.txt
This installs all the libraries specified in the requirements.txt file. The "-r" argument tells pip to read the list of packages from that file. The listed libraries do not have pinned versions, so pip installs the latest available release of each.

Usage

As mentioned earlier, this is an ETL project set to extract, transform, and load data into my Postgres database. The data directory in the subfolders contains the input data. The transform.py file contains the code to extract, clean, transform, and write the input data to a new CSV file and store it in the etl directory. load.py contains the script to create a Postgres database and tables such as customers, orders, products, etc.

Finally, to run the ETL pipeline, create a .env file in your parent directory and specify these variables:

  • DB_HOST
  • DB_PORT
  • DB_USER
  • DB_PASSWORD
  • DB_NAME
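
As a minimal sketch of how these five values might be consumed, the function below reads them from the environment and fails early if any is missing. It assumes python-dotenv's load_dotenv() has already copied the .env values into os.environ; the name read_db_settings is hypothetical and does not come from this repository. The returned keyword names are chosen to match those that psycopg2.connect() accepts.

```python
import os

REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_USER", "DB_PASSWORD", "DB_NAME")


def read_db_settings(env=None):
    """Collect the five DB_* values and fail early if any is missing.

    Hypothetical helper: `env` defaults to os.environ, which is assumed
    to have been populated from the .env file beforehand.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    # Keyword names follow what psycopg2.connect() accepts.
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "dbname": env["DB_NAME"],
    }
```

Checking for missing keys up front gives one clear message instead of a connection error halfway through the load step.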

Then run the command python Project1/etl.py from the parent directory of this repository, or tweak the command to suit your folder structure. The code in the etl.py file imports the transform and load scripts and runs them sequentially.
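
The sequential run described above can be sketched roughly as follows. This is an illustration only, not the actual contents of etl.py: run_pipeline and the two stand-in functions are hypothetical names, and the real transform and load scripts may expose different entry points.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Report which step broke and what had already finished.
            raise RuntimeError(
                f"Step '{name}' failed after {completed}"
            ) from exc
        completed.append(name)
    return completed


# Stand-ins for the real transform and load entry points (hypothetical).
def transform():
    print("transform: cleaned CSV files written")


def load():
    print("load: tables created and filled")


if __name__ == "__main__":
    run_pipeline([("transform", transform), ("load", load)])
```

Stopping at the first failure means load never runs against CSV files that transform did not finish writing.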

Features

The pipeline features the use of pandas for formatting data, psycopg2 for running SQL queries, and python-dotenv to reference hidden credentials.

Contact

For any inquiries or issues, please contact me via LinkedIn.

License

This project is licensed under the MIT License.

Acknowledgement

A special thanks to the Compass team; their guidance was wind to my wings.

ETL Project 2

Ola! This is my second ETL project. It takes in data about movies and series from different sources to form an S3 data lake. The final result of my refined layer was to be in Parquet format and shaped to answer questions surrounding the Hollywood actor Richard Radcliff.

Installation

To run the ETL process successfully, you need Python installed. If you don't have it, install it first. Then install the libraries this project requires: navigate to the root directory of the project and run the command:

pip install -r requirements.txt
This installs all the libraries specified in the requirements.txt file. The "-r" argument tells pip to read the list of packages from that file. The listed libraries do not have pinned versions, so pip installs the latest available release of each.

Usage

This project fetches data from different sources: CSV files from my local machine, enriched with data from TMDB API calls. Unfortunately, I couldn't upload the CSV files to my AWS Organisation's account because I kept getting an error about a service control policy. The details are provided below:

(.venv) macbookpro@Universe-2 Compass Projects % python ProjectII/s3Upload.py
Error uploading file: Failed to upload ProjectII/files/movies.csv to hogans-project2/Raw/local/CSV/Movies/2024/14/09/movies.csv: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: User: arn:aws:iam::985539759506:user/compass is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::hogans-project2/Raw/local/CSV/Movies/2024/14/09/movies.csv" with an explicit deny in a service control policy
Error uploading file: Failed to upload ProjectII/files/series.csv to hogans-project2/Raw/local/CSV/Series/2024/14/09/series.csv: An error occurred (AccessDenied) when calling the CreateMultipartUpload operation: User: arn:aws:iam::985539759506:user/compass is not authorized to perform: s3:PutObject on resource: "arn:aws:s3:::hogans-project2/Raw/local/CSV/Series/2024/14/09/series.csv" with an explicit deny in a service control policy

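In an attempt to solve this problem, I used my root account credentials instead of the IAM user to whom I had assigned s3FullAccess permissions, but the error persisted. I also tried writing an explicit Allow policy for my S3 bucket, but that didn't work either. Finally, I tried a manual upload of the CSV files from the AWS console, but that threw an error about multipart uploading. This is consistent with the error message: an explicit deny in a service control policy overrides any Allow granted by an IAM or bucket policy, so permission changes inside the account cannot lift it.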
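
To tell this kind of denial apart from an ordinary missing permission, an upload script could inspect the error text before retrying with different credentials. The check below is a small sketch based only on the wording of the messages quoted above; denied_by_scp is a hypothetical helper, and the assumption is that the message string is available to the script (with boto3, for example, from str() of the raised botocore ClientError).

```python
def denied_by_scp(error_message):
    """Return True when an AccessDenied message names a service control policy.

    Hypothetical helper: it matches the phrasing seen in the upload log
    and is not an official AWS classification.
    """
    text = error_message.lower()
    return (
        "accessdenied" in text
        and "explicit deny in a service control policy" in text
    )


sample = (
    "An error occurred (AccessDenied) when calling the "
    "CreateMultipartUpload operation: User: ... is not authorized to "
    "perform: s3:PutObject on resource: ... with an explicit deny in a "
    "service control policy"
)
print(denied_by_scp(sample))
```

When the check returns True, the block has to be lifted at the organisation level rather than in the account's own IAM or bucket policies.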

However, I was able to fetch the needed data from TMDB with the code in my lambda.py file, which I called from my Lambda function in AWS. The data returned was in JSON format, and I stored it in files of no more than 100 objects each. These can be found in my S3 bucket.
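
The 100-object limit can be illustrated with a short batching sketch. This is not the code in lambda.py: write_in_batches, the "tmdb" file prefix, and writing to a local directory are all assumptions made for illustration, whereas the project itself stores the files in S3.

```python
import json
from pathlib import Path


def write_in_batches(records, out_dir, prefix="tmdb", batch_size=100):
    """Write `records` as JSON files holding at most `batch_size` objects each.

    Hypothetical helper; returns the list of file paths it created.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Files are numbered from 001 in the order the batches are cut.
        path = out_dir / f"{prefix}_{start // batch_size + 1:03d}.json"
        path.write_text(json.dumps(batch), encoding="utf-8")
        paths.append(path)
    return paths
```

With 250 records and the default batch size, this produces three files holding 100, 100, and 50 objects.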

Features

The project covers creating and managing an S3 data lake and analysing its data using AWS services.

Contact

For any inquiries or issues, please contact me via LinkedIn.

License

This project is licensed under the MIT License.

Acknowledgement

As always, a special thanks to the Compass team for their guidance.

About

ETL pipeline using Python to extract CSV files, transform them into a preferred data structure, and store them in a Postgres DB.
