Documentation
- Python 2.6+
- boto
- simplejson
- prettytable
- setuptools
- dateutil
To install the packages above, you may first need to download and install Python setuptools, which is not installed by default on some systems (Ubuntu, for example). Without it, you cannot run easy_install. See http://pypi.python.org/pypi/setuptools for more info.
easy_install simplejson
easy_install boto
easy_install prettytable
easy_install setuptools
easy_install python-dateutil
You must specify your AWS credentials when using stratus. The simplest way to do this is to set the environment variables:
- AWS_ACCESS_KEY_ID: Your AWS Access Key ID
- AWS_SECRET_ACCESS_KEY: Your AWS Secret Access Key
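Before running stratus, it can help to confirm that both variables are visible to the process. The following is a minimal sketch, not part of stratus; the helper name `missing_aws_credentials` is hypothetical.

```python
import os

# The two variable names come from the list above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("AWS credentials found in the environment.")
```

Passing a plain dictionary instead of the real environment makes the check easy to try out without touching your shell settings.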
To configure stratus, create a directory called .stratus in your home directory (note the leading period "."). In that directory, create a file called clusters.cfg that contains a section for each cluster you want to control. Start each section with a unique name for the section enclosed in square brackets. Each key/value pair must be on its own line. Keys are separated from values by an equals sign. For example:
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
Each cluster requires the following key/value pairs:
- service_type: One of [cassandra, hadoop, hadoop_cassandra_hybrid]
- cloud_provider: Only ec2 is supported
- image_id: The Amazon EC2 image ID for your cluster nodes
- instance_type: The type of EC2 instance to run (small, medium, large, etc.; see the EC2 documentation for a valid list of these)
- key_name: Key name to use
- availability_zone: The zone to place your instance in (see EC2 documentation)
- private_key: Path to your private key for password-less SSH commands
- user_data_file: Path to a bootstrap script that will be executed on each node after the instance is started
Optional key/value pairs:
- ssh_options: Options to supply to ssh and scp
- security_groups: Any user-defined security groups to authorize your cluster to use (separated by newlines)
- env: List of user-defined key/value pairs to be set in your node's environment (separated by newlines)
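The required keys listed above can be checked mechanically before launching anything. The sketch below uses Python 3's standard `configparser` module; that stratus itself reads clusters.cfg this way is an assumption, and the function name `check_cluster_section` is hypothetical.

```python
import configparser

# Required keys and allowed values as listed in this README.
REQUIRED_KEYS = (
    "service_type", "cloud_provider", "image_id", "instance_type",
    "key_name", "availability_zone", "private_key", "user_data_file",
)
SERVICE_TYPES = ("cassandra", "hadoop", "hadoop_cassandra_hybrid")


def check_cluster_section(cfg_text, cluster_name):
    """Return a list of problems found in one cluster section (empty if none)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_section(cluster_name):
        return ["no section named [%s]" % cluster_name]
    # raw=True skips %(...)s interpolation; only key presence matters here.
    options = dict(parser.items(cluster_name, raw=True))
    problems = ["missing key: " + key for key in REQUIRED_KEYS if key not in options]
    if options.get("service_type") not in SERVICE_TYPES:
        problems.append("service_type must be one of: " + ", ".join(SERVICE_TYPES))
    if options.get("cloud_provider") != "ec2":
        problems.append("cloud_provider must be ec2")
    return problems
```

To check a real file, read it with `open(path).read()` and pass the text together with the section name you plan to launch.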
NOTES
- It's best practice to define your cluster with a unique and identifiable name so that other users will know who owns this cluster.
- security_groups allow you to define custom security groups for your cluster. This is useful if you have multiple clusters that need to communicate via their internal/private network.
- See Cloudera CDH for other AMIs to use with Stratus.
- Be sure that your clusters.cfg file uses the proper line feed characters.
The following example shows how to specify an i386 Fedora OS as the AMI in a clusters.cfg file for a Cassandra cluster:
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
storage_conf_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- storage_conf_file is the location of your storage-conf.xml file. This file will be copied to each node in your cluster and Cassandra will use it for its configuration. See the Storage Conf File section for details.
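The `%(private_key)s` reference and the newline-separated values in the example match the conventions of Python's standard `configparser` module. Assuming the file is read that way (this README does not say so explicitly), the sketch below shows how such values resolve. Note that `configparser` only treats a line as a continuation of the previous value when it is indented, so the sample indents them.

```python
import configparser

# A trimmed copy of the Cassandra example; continuation lines are indented.
SAMPLE = """\
[my-cassandra-cluster]
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
    security-group-2
    security-group-3
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
    AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
section = parser["my-cassandra-cluster"]

# %(private_key)s is replaced with the private_key value from the same section.
ssh_options = section["ssh_options"]

# Multi-line values come back as one string with embedded newlines.
security_groups = section["security_groups"].splitlines()

# Each env line is itself a KEY=VALUE pair; split on the first "=" only.
env = dict(line.split("=", 1) for line in section["env"].splitlines())

print(ssh_options)
print(security_groups)
print(env)
```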
The following example shows how to specify an i386 Fedora OS (ami-6159bf08) as the AMI in a clusters.cfg file for a Hadoop cluster:
[my-hadoop-cluster]
service_type=hadoop
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- storage_conf_file is not used for Hadoop clusters and is not present here.
Hybrid Hadoop/Cassandra clusters operate exactly like Hadoop clusters where there will be one node that acts as a namenode, secondary namenode and job tracker, and one or more nodes act as data nodes and task trackers. The only difference is that Cassandra will be installed and started on the Hadoop nodes designated as data nodes. The same commands to operate a Cassandra cluster will also be available, but will only manipulate data nodes with Cassandra services on them.
[my-hadoop-cassandra-cluster]
service_type=hadoop_cassandra_hybrid
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/hadoop-cassandra-hybrid-ec2-init-remote.sh
storage_conf_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- storage_conf_file is the same as in a pure Cassandra cluster.
The storage_conf_file parameter in your clusters.cfg file points to a local copy of a Cassandra storage-conf.xml for configuring Cassandra. You are responsible for configuring things like Keyspaces, ReplicationFactors, etc. There are a few things to note about what needs to be in your storage-conf.xml file, though. There is an example storage-conf.xml file in the repository at stratus/cassandra/data/storage-conf.xml.
- The Seeds section MUST be replaced with %SEEDS% for proper variable substitution. This will be replaced with the dynamically generated IP addresses at cluster launch time.
- The InitialToken section MUST be replaced with %INITIAL_TOKEN% for proper variable substitution. This will be replaced with a generated token for each node in the cluster for proper key distribution.
- CommitLogDirectory MUST be set to /mnt/cassandra-logs
- DataFileDirectory MUST be set to /mnt/cassandra-data
- ListenAddress and ThriftAddress values should be empty. This will cause the Cassandra service to bind to the appropriate internal IP address, which is very important for proper inter-node communication and ring creation.
NOTE: If you would like the logs and data directories to be different you will need to modify the storage-conf.xml file, user_data_file, and EBS storage spec file if EBS volumes will be used.
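To make the %SEEDS% and %INITIAL_TOKEN% substitution concrete, here is a rough sketch of what such a step could look like. It is not stratus's actual implementation. The function names are hypothetical, the XML that replaces each placeholder is an assumed shape, and the evenly spaced tokens assume Cassandra's RandomPartitioner, whose token range is 0 to 2**127.

```python
def evenly_spaced_tokens(node_count, ring_size=2 ** 127):
    """Return one token per node, spaced evenly around the ring.

    ring_size defaults to the RandomPartitioner range (an assumption here).
    """
    return [i * ring_size // node_count for i in range(node_count)]


def render_storage_conf(template, seed_ips, initial_token):
    """Fill the %SEEDS% and %INITIAL_TOKEN% placeholders in a template.

    The generated element layout is illustrative only.
    """
    seeds_xml = "<Seeds>%s</Seeds>" % "".join(
        "<Seed>%s</Seed>" % ip for ip in seed_ips)
    token_xml = "<InitialToken>%d</InitialToken>" % initial_token
    return (template
            .replace("%SEEDS%", seeds_xml)
            .replace("%INITIAL_TOKEN%", token_xml))


if __name__ == "__main__":
    template = "<Storage>\n%SEEDS%\n%INITIAL_TOKEN%\n</Storage>\n"
    tokens = evenly_spaced_tokens(3)
    # Render one configuration per node, each with its own token.
    for token in tokens:
        print(render_storage_conf(template, ["10.0.0.1", "10.0.0.2"], token))
```

Spacing the tokens evenly gives each node an equal share of the key range, which is the "proper key distribution" the placeholder is meant to achieve.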
Check out the package, browse to that project's root directory, and run the following:
% sudo python setup.py install
After specifying an AMI, you can run stratus. It will display usage instructions when you invoke it without arguments.
You can test that the script can connect to your cloud provider by typing:
% stratus list --all
This will list the cluster name, service type, and cloud provider for ALL clusters that have been defined or are currently running in EC2.
After you install stratus and set up your EC2 account information, you can start a Cassandra cluster with 10 nodes using one command:
% stratus CLUSTER_NAME launch-cluster 10 # (where CLUSTER_NAME is a defined cluster in your ~/.stratus/clusters.cfg file)
- Create a new section in your clusters.cfg file. (This is completely optional. Most users will want EBS, so you can use an existing cluster config if you would like.)
- Create storage for the new cluster by creating a temporary EBS volume, formatting it, and saving it as a snapshot in S3. This way, you only have to do the formatting once and can use the snapshot to clone cluster volumes later. NOTE: You only have to do this step once unless you remove the snapshot later. All snapshots of a given size are identical, so you can just reuse one if one already exists in the size you want.
- Create a JSON spec file that defines how storage volumes will be created and assigned for your cluster. This spec file should reference the snapshot ID you created in the previous step. Remember that if you already have a formatted snapshot you may use that ID instead. IMPORTANT CASSANDRA INFO: All Cassandra cluster nodes expect to have two separate storage devices defined. One storage volume will be used to store Cassandra log files (/dev/sdj) and the second will be used to store Cassandra data (/dev/sdk). The automatic configuration of the nodes will try to mount these volumes to /mnt/cassandra-logs and /mnt/cassandra-data respectively, and both volumes MUST exist for persistent storage. A sample JSON spec file can be found in the stratus/cassandra/data directory of the project and is referenced below in the "Sample JSON spec file" section.
- Use the create-storage command to create the storage volumes defined in your spec file for the number of nodes your cluster will have. The following example creates storage for a 3-node Cassandra cluster. Assuming your spec defines the required two volumes per node, this command will create 6 volumes (2 for each node).
- Launch your cluster with the appropriate number of nodes (should be the same number from the previous step).
- Once all nodes have finished launching, the configuration of your nodes will begin. This consists of assigning the devices for your storage volumes to the appropriate nodes, mounting those volumes to the proper mount points, and launching the Cassandra services. You can test your persistent storage by:
- writing data to the Cassandra services
- terminating your cluster as usual:
% stratus CLUSTER_NAME terminate-cluster
- re-launching the cluster:
% stratus CLUSTER_NAME launch-cluster N
- retrieving the data previously written to Cassandra
- SSH into your cluster:
% stratus CLUSTER_NAME login
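The volume arithmetic in the create-storage step (volumes per node times number of nodes) can be checked against a spec file before any EBS volumes are created. This is a small sketch with hypothetical helper names; the embedded spec mirrors the Cassandra sample from this README.

```python
import json

# Two volumes per Cassandra node, as in the sample spec.
SPEC_TEXT = """
{"cn": [
  {"device": "/dev/sdj", "mount_point": "/mnt/cassandra-logs",
   "size_gb": "100", "snapshot_id": "snap-xxxxxx"},
  {"device": "/dev/sdk", "mount_point": "/mnt/cassandra-data",
   "size_gb": "100", "snapshot_id": "snap-xxxxxx"}
]}
"""


def total_volumes(spec, role, node_count):
    """Number of volumes created for node_count nodes of the given role."""
    return len(spec[role]) * node_count


def total_size_gb(spec, role, node_count):
    """Total storage in GB; size_gb is stored as a string in the spec."""
    per_node = sum(int(volume["size_gb"]) for volume in spec[role])
    return per_node * node_count


spec = json.loads(SPEC_TEXT)
print(total_volumes(spec, "cn", 3))   # 2 volumes per node x 3 nodes
print(total_size_gb(spec, "cn", 3))   # 200 GB per node x 3 nodes
```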
Example:
The following example shows how to create a 100GB snapshot, create storage for a 3-node cluster, and then launch the cluster.
% stratus CLUSTER_NAME create-formatted-snapshot 100
% stratus CLUSTER_NAME create-storage 3 ~/.stratus/my-cassandra-ebs-cluster-storage-spec.json
% stratus CLUSTER_NAME launch-cluster 3
- nn = Hadoop name node
- snn = Hadoop secondary name node
- dn = Hadoop data node
- tt = Hadoop task tracker
- jt = Hadoop job tracker
- cn = Cassandra node
- hcn = Hadoop/Cassandra node
- Prefix Hadoop-specific keys with "hybrid_" for Hadoop/Cassandra hybrid keys (e.g., hybrid_nn)
{
  "cn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/mnt/cassandra-logs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/mnt/cassandra-data",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ]
}
- For the automatic configuration to work correctly, two volumes must be defined, and they must reference the devices /dev/sdj and /dev/sdk. The sdj device must have the mount point /mnt/cassandra-logs and the sdk device must have the mount point /mnt/cassandra-data.
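The device and mount point requirement for Cassandra volumes is easy to get wrong when editing a spec by hand. The check below is a sketch under the rules stated in this README; the function name `check_cassandra_volumes` is hypothetical and stratus may validate specs differently, if at all.

```python
# Device-to-mount-point pairs required for automatic configuration.
EXPECTED_CASSANDRA_MOUNTS = {
    "/dev/sdj": "/mnt/cassandra-logs",
    "/dev/sdk": "/mnt/cassandra-data",
}


def check_cassandra_volumes(spec):
    """Return a list of problems with the "cn" entry of a parsed spec."""
    volumes = spec.get("cn", [])
    actual = dict((v.get("device"), v.get("mount_point")) for v in volumes)
    problems = []
    if len(volumes) != 2:
        problems.append("expected 2 volumes, found %d" % len(volumes))
    for device, mount_point in sorted(EXPECTED_CASSANDRA_MOUNTS.items()):
        if actual.get(device) != mount_point:
            problems.append("%s must be mounted at %s" % (device, mount_point))
    return problems


if __name__ == "__main__":
    import json
    import sys
    # Usage: python check_spec.py path/to/spec.json
    with open(sys.argv[1]) as handle:
        for problem in check_cassandra_volumes(json.load(handle)):
            print(problem)
```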
{
  "nn": [
    {
      "device": "/dev/sdh",
      "mount_point": "/mnt/hadoop-ebs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ],
  "dn": [
    {
      "device": "/dev/sdi",
      "mount_point": "/mnt/hadoop-ebs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ]
}
{
  "hybrid_nn": [
    {
      "device": "/dev/sdh",
      "mount_point": "/mnt/hadoop-ebs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ],
  "hybrid_dn": [
    {
      "device": "/dev/sdi",
      "mount_point": "/mnt/hadoop-ebs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ],
  "cn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/mnt/cassandra-logs",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/mnt/cassandra-data",
      "size_gb": "100",
      "snapshot_id": "snap-xxxxxx"
    }
  ]
}
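The role keys and the "hybrid_" prefix rule can be expressed in a few lines, which is handy for catching a mistyped key in a spec file. This sketch only encodes the key names listed in this README; the helper names are hypothetical, and whether stratus rejects unknown keys is not stated here.

```python
# Role abbreviations from the key list in this README.
HADOOP_ROLES = ("nn", "snn", "dn", "tt", "jt")
OTHER_ROLES = ("cn", "hcn")


def known_spec_keys():
    """All keys a spec may use: plain roles plus hybrid_-prefixed Hadoop roles."""
    keys = set(HADOOP_ROLES) | set(OTHER_ROLES)
    keys |= set("hybrid_" + role for role in HADOOP_ROLES)
    return keys


def unknown_keys(spec):
    """Return the spec's top-level keys that are not recognized, sorted."""
    return sorted(set(spec) - known_spec_keys())


if __name__ == "__main__":
    hybrid_spec = {"hybrid_nn": [], "hybrid_dn": [], "cn": []}
    print(unknown_keys(hybrid_spec))         # no unrecognized keys
    print(unknown_keys({"namenode": []}))    # "namenode" is not a listed key
```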