Skip to content

Latest commit

 

History

History
154 lines (120 loc) · 6.31 KB

File metadata and controls

154 lines (120 loc) · 6.31 KB

Sentiment Analysis Workflow

This serverless workflow is inspired by the application presented in the HEFTless paper. It uses the Amazon Review Dataset to train and test a sentiment analysis application.

This is an opinionated implementation of the Sentiment Analysis application, we designed to test Serverledge functionality while using realistic workloads.

Workflow Description

The workflow definition differs from the application design presented in HEFTless paper since our framework does not currently support fork/join constructs.

The Sentiment Analysis (SA) workflow combines function tasks and choice tasks. SA consists of the following tasks:

  • RetrieveState (sa_retrieve): Retrieves the dataset;

  • ExtractState (sa_extract): Preprocess the dataset;

  • ChoiceState: Choose whether to train and test either a low or a high accuracy model; If the input parameter max_features is below 10000, the low-accuracy model will be used in the remainder of the workflow;

  • LATrainState (sa_train): The training task of the low-accuracy sentiment analysis model;

  • LAEvaluateFinalState (sa_evaluate): The final task of the low-accuracy sentiment analysis model;

  • HATrainState (sa_train): The training task of the high-accuracy sentiment analysis model;

  • HAEvaluateFinalState (sa_evaluate): The final task of the high-accuracy sentiment analysis model.

    +----------+ +---------+ +--------+ +---------+ +------------+ | Retrieve | -> | Extract | -> | Choice | -+-> | HATrain | -> | HAEvaluate |
    +----------+ +---------+ +--------+ | +---------+ +------------+ | +---------+ +------------+ +-> | LATrain | -> | LAEvaluate | | +---------+ +------------+ | +---------+ +-> | Fail | +---------+

Requirements

This SA workflow retrieves a dataset from AWS, stores it on MinIO, and runs machine learning tasks on it.

To run MinIO using docker containers, run:

docker run -p 9000:9000 -p 9001:9001 \                                   
    -e "MINIO_ROOT_USER=minio" \
    -e "MINIO_ROOT_PASSWORD=minio123" \
    quay.io/minio/minio server /data --console-address ":9001"

Build the Sentiment Analysis Tasks

This SA workflow comes with a Dockerfile, which simplifies the application deployment. The Dockerfile enables building the container image of the different tasks, through an environment variable HANDLER_ENV.

  • HANDLER_ENV="retrieve": to build the image for the retriever sa-retrieve;
  • HANDLER_ENV="extract": to build the image for the extractor sa-extract;
  • HANDLER_ENV="train": to build the image for the training tasks sa-train;
  • HANDLER_ENV="evaluate": to build the image for the evaluation tasks sa-evaluate.

To build the container, run the following command:

cd ./src
docker build --build-arg HANDLER_ENV="retrieve" -t  sa-retrieve .      
docker build --build-arg HANDLER_ENV="extract" -t  sa-extract .      
docker build --build-arg HANDLER_ENV="train" -t  sa-train .      
docker build --build-arg HANDLER_ENV="evaluate" -t  sa-evaluate .      

Launch the Server

The SA workflow creates an HTTP Server that executes different functions according to the received REST call. By default, the server listens to 8080. The server needs MinIO as object storage to save intermediary data.

API of the Retrieve Task

POST localhost:8080/invoke

{
    "Params" : {
        "minio_endpoint": "172.17.0.1:9000",
        "minio_access_key": "minio",
        "minio_secret_key": "minio123",
        "data_url": "https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz", 
        "local_dir": "./amazon_review_polarity_csv.tgz", 
        "object_name": "raw/amazon_review_polarity_csv.tgz"
    }
}

API of the Extract Task

POST localhost:8080/invoke

{
    "Params" : {
        "minio_endpoint": "172.17.0.1:9000",
        "minio_access_key": "minio",
        "minio_secret_key": "minio123",
        "tgz_input_object_name": "data/test.csv",
        "subset" : 0.002,
        "local_dataset_file": "./amazon_review_polarity_csv.tgz", 
        "local_output_dir": "./data", 
        "output_train_object_name": "data/train.csv",
        "output_test_object_name": "data/test.csv"
    }
}

API of the Train Task

POST localhost:8080/invoke

{
  "Params" : {
      "minio_endpoint": "172.17.0.1:9000",
      "minio_access_key": "minio",
      "minio_secret_key": "minio123",
      "subset": 0.001, 
      "max_features": 2, 
      "train_object_data": "data/train.csv", 
      "local_train_file": "train.csv", 
      "local_model_file": "sentiment_model.pkl", 
      "local_vectorizer_file": "tfidf_vectorizer.pkl",
      "output_model_object": "model/sentiment_model.pkl", 
      "output_vectorizer_object": "model/tfidf_vectorizer.pkl",
      "reuse_trained_model" : false
  }
}

API of the Evaluate Task

POST localhost:8080/invoke

{
    "Params" : {
        "minio_endpoint": "172.17.0.1:9000",
        "minio_access_key": "minio",
        "minio_secret_key": "minio123",
        "test_object_data": "data/test.csv", 
        "local_test_file": "test.csv", 
        "subset": 0.0002, 
        "local_model_file": "sentiment_model.pkl", 
        "local_vectorizer_file": "tfidf_vectorizer.pkl", 
        "input_model_object": "model/sentiment_model.pkl", 
        "input_vectorizer_object": "model/tfidf_vectorizer.pkl"
    }
}

Setting MinIO Parameters

Each docker image enables the customization of the MinIO connection string. We can set information for connecting to MinIO using environment variables.

MINIO_ENDPOINT="172.17.0.1:9000"
MINIO_ACCESS_KEY=minio
MINIO_SECRET_KEY=minio123
MINIO_BUCKET=serverledge
MINIO_SECURE=false