Installation and Setup

This markdown file guides you through the whole installation process of Ubuntu and the correct configuration of Airflow and its dependencies.

Requirements and Clone Repository

To get the repository running, make sure the following requirements are met.

Requirements

  1. Python 3.8
  2. tensorflow >= 2.3.0
  3. tfx == 0.24.0
  4. apache-beam == 2.24.0
  5. apache-airflow[celery] == 1.10.12
  6. psycopg2 == 2.8.6
  7. tensorflow_advanced_segmentation_models
  8. albumentations
  9. numpy
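
Once the packages are installed, the pinned versions can be compared against what is actually present in the environment. The following is a minimal sketch using only the Python standard library; the names `PINNED`, `installed_version` and `find_mismatches` are hypothetical helpers introduced for illustration and are not part of this repository.

```python
from importlib import metadata

# Exact pins copied from the requirements list above.
PINNED = {
    "tfx": "0.24.0",
    "apache-beam": "2.24.0",
    "apache-airflow": "1.10.12",
    "psycopg2": "2.8.6",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a tuple (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so that the comparison can be exercised without the packages being installed; in normal use it can be left at its default.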

Then run the following command to clone the Git repository.

Clone Repository

$ git clone https://github.com/JanMarcelKezmann/Apache-Airflow-Beam-TensorFlow-Examples.git

Setup Ubuntu WSL for Windows Users

Take a look at the Markdown file for a detailed setup tutorial for Ubuntu on Windows and for the correct configuration of Airflow.

The setup process for Ubuntu and Airflow is heavily based on the Medium article written by Ryan Roline. Its main differences are the Python version and the installation of apache-airflow including the Celery package. I therefore recommend reading the full article if problems occur with the steps below, but make sure to use the correct versions and Ubuntu instance for the setup. The URL reference can be found at the end of the README.

Steps:

  1. Installing Ubuntu on Windows
    1. Go to the Microsoft Store on your computer and search for Ubuntu 20.04
    2. Download and Install Ubuntu 20.04
    3. Enable Developer Mode on Windows:
      1. Type "Developer" into the Windows search bar and select the option that says "Developer Settings"
      2. On the page that appears, select the bubble next to the "Developer Mode" option.
    4. Enable Windows Subsystem for Linux (WSL):
      1. Type "Windows Feature" into the Windows search bar and select the option that says "Turn Windows features on or off"
      2. Scroll down to "Windows Subsystem for Linux" and check the box.
      3. Click ok and restart your computer
  2. Initialize Ubuntu:
    1. Run Ubuntu and wait until the initial installation process finishes
    2. Ubuntu will then ask you for a **username** and a **password**. Type in and enter your credentials (be careful: remember them or write them down)
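
To confirm that the shell you opened is really running under WSL, one option is to inspect the kernel release string, which on WSL kernels contains the word "microsoft". The sketch below relies on that convention; `is_wsl` is a hypothetical helper name, and the release strings in the usage lines are illustrative examples rather than output captured from this setup.

```python
import platform


def is_wsl(release=None):
    """Return True if the kernel release string looks like a WSL kernel.

    When no release string is passed, the one reported by the running
    system is used.
    """
    if release is None:
        release = platform.uname().release
    return "microsoft" in release.lower()


if __name__ == "__main__":
    print("Running under WSL:", is_wsl())
    # Illustrative release strings (assumed, not taken from this setup):
    print(is_wsl("5.10.16.3-microsoft-standard-WSL2"))
    print(is_wsl("5.15.0-91-generic"))
```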

Now that the above steps are done, we can install all dependencies that are necessary to run Airflow.

Configure Airflow and its Dependencies

Steps:

  1. Installing PIP
    1. Run the following sequence of commands in the Ubuntu CLI
    2.     sudo apt-get install software-properties-common  
          sudo apt-add-repository universe
          sudo apt-get update
          sudo apt-get install python3-setuptools
          sudo apt install python3-pip
          sudo -H pip install --upgrade pip
      
    3. Verify the installation:
    4.     pip -V
      
  2. Installing Dependencies
    1. Run the following commands
    2.     sudo apt-get install libmysqlclient-dev 
          sudo apt-get install libssl-dev 
          sudo apt-get install libkrb5-dev 
          sudo apt-get install libsasl2-dev 
      
    3. Install PostgreSQL for Airflow (Our robust backend database)
          sudo apt-get install postgresql postgresql-contrib
      
      1. Start the PostgreSQL service with the following command:
      2.     sudo service postgresql start
        
      3. Check the status of the cluster and make sure that it is running by using the following command:
      4.     pg_lsclusters
        
      5. Take the "Ver" and "Cluster" values from the output of the above command and insert them into the following command (when run, the output should be something like *Cluster is already running*):
      6.     sudo pg_ctlcluster <version> <cluster> start
        
    4. Now create a Database for Airflow to use, execute:
          sudo -u postgres psql
      
      1. Create a profile and assign the correct privileges:
      2.     CREATE ROLE ubuntu;
            CREATE DATABASE airflow;
            GRANT ALL PRIVILEGES on database airflow to ubuntu;
            ALTER ROLE ubuntu SUPERUSER;
            ALTER ROLE ubuntu CREATEDB; 
            ALTER ROLE ubuntu LOGIN;
            GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public to ubuntu;
        
      3. Set a password for the ubuntu role (again, remember it or write it down):
      4.     \password ubuntu
        
      5. Finally, confirm the password. Stay in the psql session for the next step.
    5. Connect to the Airflow database and verify the connection information:
          \c airflow
      
      1. Run the following to confirm that the connection works:
      2.     \conninfo
        
      3. Type \q to quit the psql session (Ctrl + Z would only suspend it), then enter the following commands to navigate to the directory containing the config files:
      4.     cd /etc/postgresql/*/main/
            ls
        
      5. Open the file pg_hba.conf:
      6.     sudo nano pg_hba.conf
        
      7. Modify the file: in the line underneath "# IPv4 local connections", change the value in the ADDRESS column to 0.0.0.0/0
      8. Press Ctrl + S to save and Ctrl + X to exit
      9. Now, open postgresql.conf
      10.     sudo nano postgresql.conf
        
      11. Modify the file: in the "CONNECTIONS AND AUTHENTICATION" section, set the following: listen_addresses = '*'
      12. Press Ctrl + S to save and Ctrl + X to exit
    6. Finally, restart PostgreSQL to apply the changes:
          sudo service postgresql restart
      
      1. Go back to your home directory by executing the command:
      2.     cd ~
        
  3. Installing Apache Airflow, for a quick start guide look here.
    1. To install Airflow in the version pinned in the requirements above, run the following command:
    2.     sudo SLUGIFY_USES_TEXT_UNIDECODE=yes pip install "apache-airflow[celery]==1.10.12"
      
    3. Add the local bin directory to PATH within the terminal, replacing <username> with your username (append the line to ~/.bashrc as well if it should persist across terminal sessions):
    4.     export PATH=$PATH:/home/<username>/.local/bin
      
    5. Apache Airflow is now installed. Close the Ubuntu instance and reopen it.
  4. Apache Airflow Setup
    1. Initialize the Database
    2.     airflow initdb
      
    3. Once this completes, the necessary config files have been created in the airflow directory. Now make changes to the airflow.cfg file:
          cd airflow
          ls
          sudo nano airflow.cfg
      
      1. Make the following changes to the config file. You can set dags_folder and base_log_folder to any directory you want. In the sql_alchemy_conn value, replace <password> with the password you created previously.
      2.     dags_folder = /mnt/c/dags
            base_log_folder = /mnt/c/dags/logs
            executor = CeleryExecutor
            load_examples = False
            expose_config = True
            sql_alchemy_conn = postgresql+psycopg2://ubuntu:<password>@localhost:5432/airflow
            broker_url = amqp://guest:guest@localhost:5672//
            result_backend = amqp://guest:guest@localhost:5672//
        
      3. Enter Ctrl + S to save and Ctrl + X to exit
    4. Once the above step is finished, initialize airflow again:
          airflow initdb
      
      1. Error handling: if you receive an error relating to the psycopg2 package, run the following commands:
      2.     sudo apt-get update -y
            sudo apt-get install -y libpq-dev
            pip install psycopg2==2.8.6
        
    5. Install RabbitMQ
    6.     sudo apt install rabbitmq-server
      
      1. Open the RabbitMQ config file:
      2.     sudo nano /etc/rabbitmq/rabbitmq-env.conf
        
      3. Change the node IP address to: NODE_IP_ADDRESS=0.0.0.0
      4. Now start the RabbitMQ Server
      5.     sudo service rabbitmq-server start
        
    7. Run Airflow initdb one last time:
    8.     airflow initdb
      
  5. Final Steps to launch Webserver, Scheduler and Celery Worker
    1. In the first terminal run:
    2.     airflow webserver -p 8080
      
    3. Open a new Ubuntu Terminal and run:
    4.     airflow scheduler
      
    5. Open another Ubuntu Terminal and run:
    6.     airflow worker
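
The airflow.cfg values from step 4 can also be checked with a short script. The sketch below is a hedged illustration using only the standard library: `build_sql_alchemy_conn` and `read_settings` are hypothetical helper names, and the sample password is made up. It assembles the sql_alchemy_conn value with URL-encoded credentials (useful when the password contains characters such as "@" or "/") and reads selected keys back from config text, whichever section they appear in.

```python
import configparser
from urllib.parse import quote_plus


def build_sql_alchemy_conn(password, user="ubuntu", host="localhost",
                           port=5432, database="airflow"):
    """Assemble the sql_alchemy_conn value, URL-encoding the credentials."""
    # quote_plus escapes characters such as "@" or "/" that would otherwise
    # be read as URL delimiters.
    return (f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


def read_settings(cfg_text, keys):
    """Return {key: value} for every requested key found in any section."""
    # interpolation=None keeps literal "%" characters in the file untouched.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    found = {}
    for section in parser.sections():
        for key in keys:
            if parser.has_option(section, key):
                found[key] = parser.get(section, key)
    return found


if __name__ == "__main__":
    # Small stand-in for the real file; the values mirror step 4.
    sample = """
[core]
executor = CeleryExecutor
load_examples = False
[celery]
broker_url = amqp://guest:guest@localhost:5672//
"""
    print(read_settings(sample, ["executor", "load_examples", "broker_url"]))
    print(build_sql_alchemy_conn("p@ss/word"))  # made-up example password
```

To check the real file, pass its contents to `read_settings`, for example the text of airflow.cfg in the airflow directory created by `airflow initdb`.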
      

You have now finished setting up Airflow and its dependencies. Once the Airflow webserver has started, open localhost:8080 in a new browser tab. A local page showing the current DAGs should load. All DAGs located in the "dags_folder" configured above should appear here (provided the DAG pipeline code has no bugs).
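
If the page does not load, a quick way to narrow down the problem is to check whether each service accepts TCP connections on the port used in this guide (5432 for PostgreSQL, 5672 for RabbitMQ, 8080 for the webserver). The following is a minimal sketch with the standard library; `SERVICES`, `port_open` and `service_status` are hypothetical names, and an open port only shows that something is listening there, not that the service is configured correctly.

```python
import socket

# Ports taken from the configuration in this guide.
SERVICES = {
    "PostgreSQL": 5432,
    "RabbitMQ": 5672,
    "Airflow webserver": 8080,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def service_status(services, host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, is_up in service_status(SERVICES).items():
        print(f"{name}: {'listening' if is_up else 'not reachable'}")
```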