This markdown file guides you through the whole installation process of Ubuntu and the correct configuration of Airflow and its dependencies.
- Requirements and Clone Repository
- Setup Ubuntu WSL for Windows Users
- Configure Airflow and its Dependencies
To get the repository running just check the following requirements.
Requirements
- Python 3.8
- tensorflow >= 2.3.0
- tfx == 0.24.0
- apache-beam == 2.24.0
- apache-airflow[celery] == 1.10.12
- psycopg2 == 2.8.6
- tensorflow_advanced_segmentation_models
- albumentations
- numpy
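To confirm that the exact pins from the list above are met, a short script along these lines can help. It is a minimal sketch, not part of the repository: the `PINS` table and the helper names are made up for illustration, and the `tensorflow >= 2.3.0` range is left out because it needs a real version comparison rather than an equality test.

```python
from importlib import metadata  # part of the standard library since Python 3.8

# Exact pins copied from the requirements list; None means "any version".
# This table is illustrative and does not ship with the repository.
PINS = {
    "tfx": "0.24.0",
    "apache-beam": "2.24.0",
    "apache-airflow": "1.10.12",
    "psycopg2": "2.8.6",
    "numpy": None,
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return a list of readable problems; an empty list means every pin matches."""
    problems = []
    for package, wanted in pins.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed")
        elif wanted is not None and found != wanted:
            problems.append(f"{package}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_pins(PINS) or ["all pinned packages match"]:
        print(line)
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment.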
Then run the following command to clone the git repository.
Clone Repository
$ git clone https://github.com/JanMarcelKezmann/Apache-Airflow-Beam-TensorFlow-Examples.git
Take a look at the Markdown file for a detailed setup tutorial for Ubuntu on Windows and for the correct configuration of Airflow.
The setup process for Ubuntu and Airflow is heavily based on the Medium article written by Ryan Roline. The main differences are the Python version and the installation of apache-airflow including the Celery package. I therefore recommend reading the full article if you run into problems with the steps below, but make sure to use the correct versions and Ubuntu instance for the setup. The URL reference can be found at the end of the README.
Steps:
- Installing Ubuntu on Windows
  - Go to the Microsoft Store on your computer and search for Ubuntu 20.04.
  - Download and install Ubuntu 20.04.
  - Enable Developer Mode on Windows:
    - Type "Developer" into the Windows search bar and select the option that says "Developer Settings".
    - On the page that appears, select the bubble next to the "Developer Mode" option.
  - Enable Windows Subsystem for Linux (WSL):
    - Type "Windows Feature" into the Windows search bar and select the option that says "Turn Windows features on or off".
    - Scroll down to "Windows Subsystem for Linux" and check the box.
    - Click OK and restart your computer.
  - Initialize Ubuntu:
    - Run Ubuntu and wait until the initial installation process finishes.
    - Ubuntu will then ask you for a **username** and a **password**. Type in your credentials and confirm them (be careful: remember them or write them down somewhere).
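Once the Ubuntu shell is open, it can be reassuring to confirm that it really runs under WSL. The sketch below relies on a common heuristic, namely that WSL kernels report the word "microsoft" in `/proc/version`; the function names are invented for this example.

```python
from pathlib import Path


def looks_like_wsl(proc_version_text):
    """Heuristic: WSL kernels identify themselves with 'microsoft' in /proc/version."""
    return "microsoft" in proc_version_text.lower()


def running_under_wsl(path="/proc/version"):
    """Read the kernel version string and apply the heuristic; False if unreadable."""
    try:
        return looks_like_wsl(Path(path).read_text())
    except OSError:
        return False


if __name__ == "__main__":
    print("WSL detected" if running_under_wsl() else "WSL not detected")
```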
Once the above steps are done, we can install all the dependencies that are necessary to run Airflow.
Steps:
- Installing PIP
  - Run the following sequence of commands in the Ubuntu CLI:

        sudo apt-get install software-properties-common
        sudo apt-add-repository universe
        sudo apt-get update
        sudo apt-get install python3-setuptools
        sudo apt install python3-pip
        sudo -H pip install --upgrade pip

  - Verify the installation:

        pip -V

- Installing Dependencies
  - Run the following commands:

        sudo apt-get install libmysqlclient-dev
        sudo apt-get install libssl-dev
        sudo apt-get install libkrb5-dev
        sudo apt-get install libsasl2-dev

  - Install PostgreSQL for Airflow (our robust backend database):

        sudo apt-get install postgresql postgresql-contrib

    - Start the PostgreSQL service with the following command:

          sudo service postgresql start

    - Check the status of the cluster and make sure that it is running:

          pg_lsclusters

    - From the output of the above command, take the "Ver" and "Cluster" values and insert them into the following command (when run, the output should be something like *Cluster is already running*):

          sudo pg_ctlcluster <version> <cluster> start

    - Now create a database for Airflow to use. Open a psql session:

          sudo -u postgres psql

    - Create a profile and assign the correct privileges:

          CREATE ROLE ubuntu;
          CREATE DATABASE airflow;
          GRANT ALL PRIVILEGES ON DATABASE airflow TO ubuntu;
          ALTER ROLE ubuntu SUPERUSER;
          ALTER ROLE ubuntu CREATEDB;
          ALTER ROLE ubuntu LOGIN;
          GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ubuntu;

    - Set up a password for the ubuntu role (again, remember it or write it down):

          \password ubuntu

    - Finally, confirm the password and type \q to quit.
    - Connect to the Airflow database and verify the connection information (if you quit psql in the previous step, reopen it with sudo -u postgres psql first):

          postgres-# \c airflow

    - Run the following to receive a response confirming a valid, working connection:

          \conninfo

    - Type \q to leave the session (Ctrl + Z only suspends psql), then navigate to the directory containing the config files:

          cd /etc/postgresql/*/main/
          ls

    - Open the file pg_hba.conf:

          sudo nano pg_hba.conf

    - Modify the file: in the line underneath the "IPv4 local connections" comment, change the value in the ADDRESS column to 0.0.0.0/0
    - Press Ctrl + S to save and Ctrl + X to exit.
    - Now open postgresql.conf:

          sudo nano postgresql.conf

    - Modify the file: under "CONNECTIONS AND AUTHENTICATION", set listen_addresses = '*'
    - Press Ctrl + S to save and Ctrl + X to exit.
    - Finally, restart PostgreSQL to load the changes:

          sudo service postgresql restart

    - Go back to your home directory by executing the command:

          cd ~

- Installing Apache Airflow (for a quick start guide, look here)
  - To install Airflow, run the following command:

        sudo SLUGIFY_USES_TEXT_UNIDECODE=yes pip install apache-airflow[celery]

  - Add the install location to PATH within the terminal, replacing the username placeholder in the following command with your own username:

        export PATH=$PATH:/home/<username>/.local/bin

  - Apache Airflow is now installed. Close the Ubuntu instance and reopen it.
- Apache Airflow Setup
  - Initialize the database:

        airflow initdb

  - When this completes, the necessary config files have been created in the airflow directory. Now open the airflow.cfg file to make changes:

        cd airflow
        ls
        sudo nano airflow.cfg

  - Make the following changes to the config file. You can change dags_folder and base_log_folder to any directory you want. In the sql_alchemy_conn value, insert the password you created previously.

        dags_folder = /mnt/c/dags
        base_log_folder = /mnt/c/dags/logs
        executor = CeleryExecutor
        load_examples = False
        expose_config = True
        sql_alchemy_conn = postgresql+psycopg2://ubuntu:<password>@localhost:5432/airflow
        broker_url = amqp://guest:guest@localhost:5672//
        result_backend = amqp://guest:guest@localhost:5672//

  - Press Ctrl + S to save and Ctrl + X to exit.
  - Once the above step is finished, initialize Airflow again:

        airflow initdb

  - Error handling: if you receive an error relating to the psycopg2 package, run the following commands:

        sudo apt-get update -y
        sudo apt-get install -y libpq-dev
        pip install psycopg2

- Install RabbitMQ

        sudo apt install rabbitmq-server

  - Go to the config file of RabbitMQ:

        sudo nano /etc/rabbitmq/rabbitmq-env.conf

  - Change the node IP address to: NODE_IP_ADDRESS=0.0.0.0
  - Now start the RabbitMQ server:

        sudo service rabbitmq-server start

  - Run airflow initdb one last time:

        airflow initdb

- Final Steps to launch Webserver, Scheduler and Celery Worker
  - In the first terminal run:

        airflow webserver -p 8080

  - Open a new Ubuntu terminal and run:

        airflow scheduler

  - Open another Ubuntu terminal and run:

        airflow worker

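Before launching the three processes, it can be worth checking that airflow.cfg really holds the values from the setup step. The following is a minimal sketch under a few assumptions: it ignores which section each option lives in and simply flattens the file, it checks only some of the listed options, and the names `EXPECTED`, `read_options` and `check_config` are invented for illustration.

```python
import configparser
from urllib.parse import urlsplit

# Values taken from the airflow.cfg changes described in this guide.
EXPECTED = {
    "executor": "CeleryExecutor",
    "load_examples": "False",
    "expose_config": "True",
}


def read_options(cfg_text):
    """Flatten every section of an airflow.cfg into one {option: value} dict."""
    # interpolation=None keeps '%' characters (e.g. in passwords) literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    options = {}
    for section in parser.sections():
        options.update(parser.items(section))
    return options


def check_config(cfg_text):
    """Return a list of readable problems; an empty list means the checks pass."""
    options = read_options(cfg_text)
    problems = [
        f"{key} is {options.get(key)!r}, expected {wanted!r}"
        for key, wanted in EXPECTED.items()
        if options.get(key) != wanted
    ]
    url = urlsplit(options.get("sql_alchemy_conn", ""))
    if url.scheme != "postgresql+psycopg2" or url.path != "/airflow":
        problems.append(
            "sql_alchemy_conn does not point at the PostgreSQL airflow database"
        )
    return problems
```

Assuming the config file sits in the airflow directory under your home directory, as in the steps above, a call such as `check_config(open("/home/<username>/airflow/airflow.cfg").read())` returns an empty list when everything matches.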
You have now finished setting up Airflow and its dependencies. Once the Airflow webserver has started, open localhost:8080 in a new browser tab. A local page showing the current DAGs should load. All DAGs located in the "dags_folder" configured above should appear here (provided the DAG code contains no errors).
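If the page does not load, a quick look at which services accept connections can narrow the problem down. The sketch below only tests whether the TCP ports used in this guide (8080 for the webserver, 5432 for PostgreSQL, 5672 for RabbitMQ) are reachable on localhost. It says nothing about whether the services are configured correctly, and the helper names are made up for this example.

```python
import socket

# Ports that appear in the configuration used throughout this guide.
SERVICES = {"Airflow webserver": 8080, "PostgreSQL": 5432, "RabbitMQ": 5672}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host="localhost", services=SERVICES):
    """Map each service name to whether its port accepted a connection."""
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```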