Documentation

Prerequisites

Python 2.6+
boto
simplejson
prettytable
setuptools
dateutil

Installing the packages above requires easy_install, which is provided by the Python setuptools package. setuptools is not installed by default on some systems (Ubuntu, for example), so you may need to download and install it first. See http://pypi.python.org/pypi/setuptools for more info.

easy_install simplejson
easy_install boto
easy_install prettytable
easy_install setuptools
easy_install python-dateutil

Setting Environment Variables to Specify AWS Credentials

You must specify your AWS credentials when using stratus. The simplest way to do this is to set the environment variables:

AWS_ACCESS_KEY_ID: Your AWS Access Key ID
AWS_SECRET_ACCESS_KEY: Your AWS Secret Access Key
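
Before running stratus, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch, not part of stratus itself; the function name missing_aws_credentials is hypothetical.

```python
import os

# The two variables named in this section.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("AWS credentials are set.")
```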

Configuration

To configure stratus, create a directory called .stratus in your home directory (note the leading period "."). In that directory, create a file called clusters.cfg that contains a section for each cluster you want to control. Start each section with a unique name for the section enclosed in square brackets. Each key/value pair must be on its own line. Keys are separated from values by an equals sign. For example:

[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2

Each cluster requires the following key/value pairs:

service_type: One of [cassandra, hadoop, hadoop_cassandra_hybrid]
cloud_provider: Only ec2 is supported
image_id: The Amazon EC2 image ID for your cluster nodes
instance_type: The type of EC2 instance to run (small, medium, large, etc.; see the EC2 documentation for the list of valid types)
key_name: Key name to use
availability_zone: The zone to place your instance in (see EC2 documentation)
private_key: Path to your private key for password-less SSH commands
user_data_file: Path to a bootstrap script that will be executed on each node after the instance is started
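
A quick way to catch an incomplete section is to compare it against the required keys listed above. The sketch below is an illustration only and says nothing about how stratus validates its configuration; missing_keys is a hypothetical helper. It is written for Python 3, where the module is named configparser (on Python 2 the module is ConfigParser and read_string is not available).

```python
import configparser

# Required keys, as listed in this section.
REQUIRED_KEYS = (
    "service_type", "cloud_provider", "image_id", "instance_type",
    "key_name", "availability_zone", "private_key", "user_data_file",
)


def missing_keys(cfg_text):
    """Map each section name to the required keys it lacks."""
    # RawConfigParser leaves %(name)s references untouched.
    parser = configparser.RawConfigParser()
    parser.read_string(cfg_text)
    report = {}
    for section in parser.sections():
        absent = [k for k in REQUIRED_KEYS if not parser.has_option(section, k)]
        if absent:
            report[section] = absent
    return report


# The short example from the Configuration section.
EXAMPLE = """\
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
"""

if __name__ == "__main__":
    for section, absent in missing_keys(EXAMPLE).items():
        print("%s is missing: %s" % (section, ", ".join(absent)))
```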

Optional key/value pairs:

ssh_options: Options to supply to ssh and scp
security_groups: Any user-defined security groups to authorize your cluster to use (separated by newlines)
env: List of user-defined key/value pairs to be set in your node's environment (separated by newlines)
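
Because security_groups and env hold several entries separated by newlines, a consumer of the file has to split them. A minimal sketch of that splitting follows; the helper names split_lines and parse_env are hypothetical and are not taken from stratus.

```python
def split_lines(value):
    """Split a newline-separated value into a list, dropping blank lines."""
    return [line.strip() for line in value.splitlines() if line.strip()]


def parse_env(value):
    """Turn newline-separated KEY=VALUE lines into a dict.

    Only the first "=" on each line separates key from value, so the
    value itself may contain "=".
    """
    pairs = {}
    for line in split_lines(value):
        key, _, val = line.partition("=")
        pairs[key.strip()] = val.strip()
    return pairs


if __name__ == "__main__":
    print(split_lines("security-group-1\nsecurity-group-2"))
    print(parse_env("AWS_ACCESS_KEY_ID=abc\nAWS_SECRET_ACCESS_KEY=xyz"))
```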

NOTES

It's best practice to give your cluster a unique, identifiable name so that other users know who owns it.
security_groups lets you define custom security groups for your cluster. This is useful if you have multiple clusters that need to communicate over their internal/private network.
See Cloudera CDH for other AMIs to use with Stratus.
Be sure that your clusters.cfg file uses the proper line feed characters.

Configuring Cassandra Clusters

The following example shows how to specify an i386 Fedora OS as the AMI in a clusters.cfg file for a Cassandra cluster:

[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
storage_conf_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
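
The %(private_key)s reference and the indented continuation lines in the example above follow the conventions of Python's standard config parser. The sketch below assumes the file is read with that parser (the source does not say how stratus loads it) and shows how a trimmed copy of the example is interpreted under Python 3.

```python
import configparser

# A trimmed copy of the example above.
CLUSTERS_CFG = """\
[my-cassandra-cluster]
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
"""

parser = configparser.ConfigParser()
parser.read_string(CLUSTERS_CFG)
section = parser["my-cassandra-cluster"]

# %(private_key)s is replaced with the value of private_key in the same section.
ssh_options = section["ssh_options"]

# Indented lines continue the previous value, joined by newlines.
security_groups = section["security_groups"].split("\n")

if __name__ == "__main__":
    print(ssh_options)
    print(security_groups)
```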

NOTES

storage_conf_file is the path to your storage-conf.xml file. This file is copied to each node in your cluster, and Cassandra uses it for its configuration. See the Storage Conf File section for details.

Configuring Hadoop Clusters

The following example shows how to specify an i386 Fedora OS (ami-6159bf08) as the AMI in a clusters.cfg file for a Hadoop cluster:

[my-hadoop-cluster]
service_type=hadoop
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/hadoop-ec2-init-remote.sh
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE

NOTES

storage_conf_file is not used for Hadoop clusters and is not present here.

Configuring Hadoop/Cassandra Hybrid Clusters

Hybrid Hadoop/Cassandra clusters operate exactly like Hadoop clusters: one node acts as the namenode, secondary namenode, and job tracker, and one or more nodes act as data nodes and task trackers. The only difference is that Cassandra is installed and started on the Hadoop nodes designated as data nodes. The commands used to operate a Cassandra cluster are also available, but they only affect the data nodes that run Cassandra services.

[my-hadoop-cassandra-cluster]
service_type=hadoop_cassandra_hybrid
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
user_data_file=file:///path/to/hadoop-cassandra-hybrid-ec2-init-remote.sh
storage_conf_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE

NOTES

storage_conf_file is the same as in a pure Cassandra cluster.

Storage Conf File

The storage_conf_file parameter in your clusters.cfg file points to a local copy of a Cassandra storage-conf.xml file. You are responsible for configuring things like Keyspaces, ReplicationFactors, etc., but your storage-conf.xml file must also meet the requirements listed below. An example file is available in the repository at stratus/cassandra/data/storage-conf.xml.

The Seeds section MUST be replaced with %SEEDS% for proper variable substitution. This will be replaced with the dynamically generated IP addresses at cluster launch time.
The InitialToken section MUST be replaced with %INITIAL_TOKEN% for proper variable substitution. This will be replaced with a generated token for each node in the cluster for proper key distribution.
CommitLogDirectory MUST be set to /mnt/cassandra-logs
DataFileDirectory MUST be set to /mnt/cassandra-data
ListenAddress and ThriftAddress values should be empty. This will cause the Cassandra service to bind to the appropriate internal IP address, which is very important for proper inter-node communication and ring creation.
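
The requirements above can be checked mechanically before launching a cluster. The following sketch is an assumption-laden illustration: check_storage_conf is a hypothetical helper, and it assumes each setting appears as a plain XML element named exactly as in the list (for example <CommitLogDirectory>...</CommitLogDirectory>).

```python
import re


def check_storage_conf(text):
    """Return a list of problems found in a storage-conf.xml template."""
    problems = []
    # Both placeholders must be present for substitution at launch time.
    for placeholder in ("%SEEDS%", "%INITIAL_TOKEN%"):
        if placeholder not in text:
            problems.append("missing placeholder " + placeholder)
    # Directories must match the mount points the nodes expect.
    expected = {
        "CommitLogDirectory": "/mnt/cassandra-logs",
        "DataFileDirectory": "/mnt/cassandra-data",
    }
    for tag, path in sorted(expected.items()):
        pattern = r"<%s>\s*%s\s*</%s>" % (tag, re.escape(path), tag)
        if not re.search(pattern, text):
            problems.append("%s must be %s" % (tag, path))
    # Address elements, when present, should have no content.
    for tag in ("ListenAddress", "ThriftAddress"):
        match = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.S)
        if match and match.group(1).strip():
            problems.append(tag + " should be empty")
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for problem in check_storage_conf(handle.read()):
            print(problem)
```

Run as a script, it takes the path to your storage-conf.xml as its only argument and prints one line per problem.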

NOTE: If you would like the logs and data directories to be different you will need to modify the storage-conf.xml file, user_data_file, and EBS storage spec file if EBS volumes will be used.

Installing and Configuring Cloud Scripts

Check out the package, browse to that project's root directory, and run the following:

% sudo python setup.py install

Running a Basic Cloud Script

After specifying an AMI, you can run stratus. It will display usage instructions when you invoke it without arguments.

You can test that the script can connect to your cloud provider by typing:

% stratus list --all

This lists the cluster name, service type, and cloud provider for all clusters that are defined or currently running in EC2.

Launching a Cluster

After you install stratus and set up your EC2 account information, you can start a 10-node Cassandra cluster with a single command:

% stratus CLUSTER_NAME launch-cluster 10 # (where CLUSTER_NAME is a defined cluster in your ~/.stratus/clusters.cfg file)

Using Persistent Clusters

Create a new section in your clusters.cfg file. (This step is optional. Most users will want EBS storage, so you can reuse an existing cluster config if you prefer.)
Create storage for the new cluster by creating a temporary EBS volume, formatting it, and saving it as a snapshot in S3. This way, you only have to do the formatting once and can use the snapshot to clone cluster volumes later. NOTE: You only have to do this step once unless you remove the snapshot later. All snapshots of a given size are identical, so you can just reuse one if one already exists in the size you want.
Create a JSON spec file that defines how storage volumes will be created and assigned for your cluster. This spec file should reference the snapshot ID you created in the previous step. Remember that if you already have a formatted snapshot, you may use that ID instead. IMPORTANT CASSANDRA INFO: All Cassandra cluster nodes expect two separate storage devices to be defined. One volume stores the Cassandra log files (/dev/sdj) and the other stores the Cassandra data (/dev/sdk). The automatic configuration of the nodes will try to mount these volumes at /mnt/cassandra-logs and /mnt/cassandra-data respectively; both volumes MUST exist for persistent storage. A sample JSON spec file can be found in the stratus/cassandra/data directory of the project and is shown below in the "Sample Cassandra JSON spec file" section.
Use the create-storage command to create the storage volumes defined in your spec file for the number of nodes your cluster will have. The example below creates storage for a 3-node Cassandra cluster. Assuming your spec defines the required two volumes per node, the command creates 6 volumes (2 for each node).
Launch your cluster with the appropriate number of nodes (should be the same number from the previous step).
When all nodes have finished launching, configuration of your nodes begins. This consists of assigning the devices for your storage volumes to the appropriate nodes, mounting those volumes at the proper mount points, and launching the Cassandra services. You can test your persistent storage by:
- writing data to the Cassandra services
- terminating your cluster as usual: % stratus CLUSTER_NAME terminate-cluster
- re-launching the cluster: % stratus CLUSTER_NAME launch-cluster N
- retrieving the data previously written to Cassandra
- logging in to your cluster over SSH: % stratus CLUSTER_NAME login

Example:

The following example shows how to create a 100GB snapshot, create storage for a 3-node cluster, and then launch the cluster.

% stratus CLUSTER_NAME create-formatted-snapshot 100
% stratus CLUSTER_NAME create-storage 3 ~/.stratus/my-cassandra-ebs-cluster-storage-spec.json
% stratus CLUSTER_NAME launch-cluster 3
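
The number of volumes that create-storage produces follows from the spec: the volumes defined for a role, multiplied by the number of nodes. The sketch below illustrates only that arithmetic; volumes_to_create is a hypothetical helper, not a stratus function, and the inline spec is a shortened copy of the Cassandra sample.

```python
import json

# Shortened copy of the Cassandra sample spec (two volumes per node).
SPEC_TEXT = """
{
    "cn": [
        {"device": "/dev/sdj", "mount_point": "/mnt/cassandra-logs",
         "size_gb": "100", "snapshot_id": "snap-xxxxxx"},
        {"device": "/dev/sdk", "mount_point": "/mnt/cassandra-data",
         "size_gb": "100", "snapshot_id": "snap-xxxxxx"}
    ]
}
"""


def volumes_to_create(spec_text, role, node_count):
    """Volumes defined for one role in the spec, times the node count."""
    spec = json.loads(spec_text)
    return len(spec.get(role, [])) * node_count


if __name__ == "__main__":
    # A 3-node Cassandra cluster with two volumes per node.
    print(volumes_to_create(SPEC_TEXT, "cn", 3))
```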

JSON Spec File Keys

nn = Hadoop name node
snn = Hadoop secondary name node
dn = Hadoop data node
tt = Hadoop task tracker
jt = Hadoop job tracker
cn = Cassandra node
hcn = Hadoop/Cassandra node
Prefix Hadoop-specific keys with "hybrid_" for Hadoop/Cassandra hybrid keys (e.g., hybrid_nn)
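
The prefix rule can be expressed in a few lines. This is an illustrative sketch only; the HADOOP_KEYS table restates the list above, and hybrid_key is a hypothetical helper.

```python
# Hadoop-specific spec keys and their meanings, from the list above.
HADOOP_KEYS = {
    "nn": "Hadoop name node",
    "snn": "Hadoop secondary name node",
    "dn": "Hadoop data node",
    "tt": "Hadoop task tracker",
    "jt": "Hadoop job tracker",
}


def hybrid_key(key):
    """Return the hybrid-cluster form of a Hadoop-specific spec key."""
    if key not in HADOOP_KEYS:
        raise KeyError("not a Hadoop-specific key: " + key)
    return "hybrid_" + key


if __name__ == "__main__":
    for key in sorted(HADOOP_KEYS):
        print("%s -> %s" % (key, hybrid_key(key)))
```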

Sample Cassandra JSON spec file

{
    "cn": [
        {
          "device": "/dev/sdj",
          "mount_point": "/mnt/cassandra-logs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        },
        {
          "device": "/dev/sdk",
          "mount_point": "/mnt/cassandra-data",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}

For the automatic configuration to work correctly, two volumes must be defined, and they must reference the devices /dev/sdj and /dev/sdk. The sdj device must have the mount point /mnt/cassandra-logs and the sdk device must have the mount point /mnt/cassandra-data.
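
The device and mount point pairing can be verified before calling create-storage. The following is a minimal sketch under the assumption that the spec has the layout of the sample above; check_cassandra_spec is a hypothetical helper and not part of stratus.

```python
import json

# Device to mount point pairing required for Cassandra nodes.
REQUIRED_CASSANDRA_VOLUMES = {
    "/dev/sdj": "/mnt/cassandra-logs",
    "/dev/sdk": "/mnt/cassandra-data",
}


def check_cassandra_spec(spec_text):
    """Return a list of problems with the "cn" volumes in a JSON spec."""
    spec = json.loads(spec_text)
    found = dict(
        (vol.get("device"), vol.get("mount_point"))
        for vol in spec.get("cn", [])
    )
    problems = []
    for device, mount in sorted(REQUIRED_CASSANDRA_VOLUMES.items()):
        if device not in found:
            problems.append("missing volume for " + device)
        elif found[device] != mount:
            problems.append("%s must mount at %s" % (device, mount))
    return problems


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:
        for problem in check_cassandra_spec(handle.read()):
            print(problem)
```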

Sample Hadoop JSON spec file

{
    "nn": [
        {
          "device": "/dev/sdh",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "dn": [
        {
          "device": "/dev/sdi",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}

Sample Hadoop/Cassandra Hybrid JSON spec file

{
    "hybrid_nn": [
        {
          "device": "/dev/sdh",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "hybrid_dn": [
        {
          "device": "/dev/sdi",
          "mount_point": "/mnt/hadoop-ebs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ],
    "cn": [
        {
          "device": "/dev/sdj",
          "mount_point": "/mnt/cassandra-logs",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        },
        {
          "device": "/dev/sdk",
          "mount_point": "/mnt/cassandra-data",
          "size_gb": "100",
          "snapshot_id": "snap-xxxxxx"
        }
    ]
}
