Documentation
- Python 2.5+
- boto
- simplejson
- prettytable
- setuptools
- dateutil
- cElementTree
- elementtree
- PyYAML
To install the packages above, you may first need to download and install the Python setuptools package, which is not installed by default on some distributions (Ubuntu, for example). Without it, you cannot run easy_install. See http://pypi.python.org/pypi/setuptools for more info.
easy_install simplejson
easy_install boto
easy_install prettytable
easy_install setuptools
easy_install python-dateutil
easy_install PyYAML
easy_install cElementTree
easy_install elementtree
You must specify your AWS credentials when using stratus. The simplest way to do this is to set the environment variables:
- AWS_ACCESS_KEY_ID: Your AWS Access Key ID
- AWS_SECRET_ACCESS_KEY: Your AWS Secret Access Key
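Assuming a POSIX shell, the two variables can be set like this (the values shown are placeholders, not real credentials):

```shell
# Placeholder values -- substitute your real AWS credentials.
export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
```

Adding these lines to your shell startup file (e.g. ~/.bashrc) makes them available in every session.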
To configure stratus, create a directory called .stratus in your home directory (note the leading period "."). In that directory, create a file called clusters.cfg that contains a section for each cluster you want to control. Start each section with a unique name for the section enclosed in square brackets. Each key/value pair must be on its own line. Keys are separated from values by an equals sign. For example:
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
Each cluster requires the following key/value pairs:
- service_type: One of cassandra, hadoop, or hadoop_cassandra_hybrid
- cloud_provider: Only ec2 is supported
- image_id: The Amazon EC2 image ID for your cluster nodes
- instance_type: The type of EC2 instance to run (m1.small, m1.large, etc.; see the EC2 documentation for a valid list)
- key_name: The EC2 key pair name to use
- availability_zone: The availability zone to place your instances in (see the EC2 documentation)
- region: The region to place your instances in (see the EC2 documentation)
- private_key: Path to your private key for password-less SSH commands
- user_data_file: Path to a bootstrap script that will be executed on each node after the instance is started (see http://aws.amazon.com/articles/1085)
Optional keys:
- ssh_options: Options to supply to ssh and scp
- security_groups: Any user-defined security groups to authorize your cluster to use (separated by newlines)
- env: List of user-defined key/value pairs to be set in your node's environment (separated by newlines)
NOTES
- It's best practice to give your cluster a unique and identifiable name so that other users will know who owns it.
- security_groups allows you to define custom security groups for your cluster. This is useful if you have multiple clusters that need to communicate via their internal/private network.
- See Cloudera CDH for other AMIs to use with stratus.
- Be sure that your clusters.cfg file uses the proper line feed characters.
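As an aside, clusters.cfg follows the standard INI format, so it can be read with Python's configparser module, which also performs the %(private_key)s-style interpolation used in the ssh_options examples below. This is a hedged sketch, not stratus's own loading code, and it assumes a Python 3 interpreter (stratus itself targets Python 2, where the module is named ConfigParser):

```python
# Sketch: parse a clusters.cfg section with the stdlib configparser.
# The section and keys mirror the examples in this document.
import configparser

CFG = """
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
"""

parser = configparser.ConfigParser()
parser.read_string(CFG)

section = parser["my-cassandra-cluster"]
print(section["service_type"])   # cassandra
# %(private_key)s is interpolated from the same section:
print(section["ssh_options"])    # -i /path/to/key/file -o StrictHostKeyChecking=no
```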
The following example shows how to specify an i386 Fedora OS as the AMI in a clusters.cfg file for a Cassandra cluster:
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
cassandra_config_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- cassandra_config_file is the location of your storage-conf.xml file. This file will be copied to each node in your cluster and Cassandra will use it for its configuration. See the Cassandra 0.6.x Config File section for details.
The following example shows a clusters.cfg file for a Cassandra 0.7.x cluster:
[my-cassandra-cluster]
service_type=cassandra
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1d
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
cassandra_config_file=file:///path/to/cassandra.yaml
keyspace_definitions_file=file:///path/to/keyspace_definitions
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
CASSANDRA_URL=http://apache.mirrors.pair.com//cassandra/0.7.0/apache-cassandra-0.7.0-beta3-bin.tar.gz
NOTES
- cassandra_config_file is the location of your cassandra.yaml file. This file will be copied to each node in your cluster and Cassandra will use it for its configuration. See the Cassandra 0.7.x Config File section for details.
- keyspace_definitions_file points to a text file containing a batch of Thrift API commands that will be used to set up your keyspaces initially. Cassandra 0.7 allows for dynamic keyspaces and you are now required to use the API to manage them. (See the Keyspace Definitions File section for an example.)
- CASSANDRA_URL in the env section overrides which version of Cassandra is pulled and installed on each node of your cluster. See the cassandra-ec2-init-remote.sh file in cassandra/data for how this variable is used to configure Cassandra.
The following example shows how to specify an i386 Fedora OS (ami-6159bf08) as the AMI in a clusters.cfg file for a Hadoop cluster:
[my-hadoop-cluster]
service_type=hadoop
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/cassandra-ec2-init-remote.sh
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- cassandra_config_file is not used for Hadoop clusters and is not present here.
Hybrid Hadoop/Cassandra clusters operate exactly like Hadoop clusters: one node acts as the namenode, secondary namenode, and job tracker, and one or more nodes act as data nodes and task trackers. The only difference is that Cassandra is installed and started on the Hadoop nodes designated as data nodes. The same commands used to operate a Cassandra cluster are also available, but they only manipulate the data nodes running Cassandra services.
[my-hadoop-cassandra-cluster]
service_type=hadoop_cassandra_hybrid
cloud_provider=ec2
image_id=ami-6159bf08
instance_type=m1.small
key_name=your_key_name
availability_zone=us-east-1c
region=us-east-1
private_key=/path/to/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
security_groups=security-group-1
security-group-2
security-group-3
user_data_file=file:///path/to/hadoop-cassandra-hybrid-ec2-init-remote.sh
cassandra_config_file=file:///path/to/storage-conf.xml
env=AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_GOES_HERE
AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY_GOES_HERE
NOTES
- cassandra_config_file is the same as in a pure Cassandra 0.6.x or 0.7.x cluster.
- For Cassandra 0.7.x, remember to supply your keyspace_definitions_file.
Cassandra 0.6.x Config File
The cassandra_config_file parameter in your clusters.cfg file points to a local copy of a storage-conf.xml file for Cassandra v0.6.x that will be pushed out to each node in your cluster. You are responsible for configuring settings in this file, but keep in mind that stratus will automatically copy this file and modify various parameters before it pushes it out. The modifications for storage-conf.xml files are:
- The Seeds element will contain valid Seed elements containing the private IP addresses of the seed nodes. Stratus arbitrarily chooses the first two nodes to be seeds.
- InitialToken will contain a generated token for proper key distribution.
- CommitLogDirectory will be /mnt/cassandra-logs.
- DataFileDirectories will contain one DataFileDirectory element with the value /mnt/cassandra-data.
- ListenAddress and ThriftAddress will be null.
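To make the rewrite concrete, here is an illustrative sketch of the kind of transformation described above, using Python's standard xml.etree.ElementTree. The element names come from Cassandra 0.6.x's storage-conf.xml; the rewriting code itself is an illustration, not stratus's actual implementation:

```python
# Illustrative sketch only: apply the storage-conf.xml modifications
# listed above. Element names match Cassandra 0.6.x; this is not
# stratus's real code.
import xml.etree.ElementTree as ET

TEMPLATE = """<Storage>
  <Seeds><Seed>127.0.0.1</Seed></Seeds>
  <InitialToken></InitialToken>
  <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
    <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
  <ListenAddress>localhost</ListenAddress>
  <ThriftAddress>localhost</ThriftAddress>
</Storage>"""

def rewrite(xml_text, seed_ips, token):
    root = ET.fromstring(xml_text)
    seeds = root.find("Seeds")
    seeds.clear()                      # drop template seeds
    for ip in seed_ips:                # private IPs of the first two nodes
        ET.SubElement(seeds, "Seed").text = ip
    root.find("InitialToken").text = str(token)
    root.find("CommitLogDirectory").text = "/mnt/cassandra-logs"
    dirs = root.find("DataFileDirectories")
    dirs.clear()
    ET.SubElement(dirs, "DataFileDirectory").text = "/mnt/cassandra-data"
    root.find("ListenAddress").text = None   # emitted empty ("null")
    root.find("ThriftAddress").text = None
    return ET.tostring(root, encoding="unicode")

print(rewrite(TEMPLATE, ["10.0.0.1", "10.0.0.2"], 0))
```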
Cassandra 0.7.x Config File
The cassandra_config_file parameter in your clusters.cfg file points to a local copy of a cassandra.yaml file for Cassandra v0.7.x that will be pushed out to each node in your cluster. You are responsible for configuring settings in this file, but keep in mind that stratus will automatically copy this file and modify various parameters before it pushes it out. The modifications for cassandra.yaml files are:
- seeds will contain a list of the private IP addresses of the seed nodes. Stratus arbitrarily chooses the first two nodes to be seeds.
- initial_token will contain a generated token for proper key distribution.
- commitlog_directory will be /mnt/cassandra-logs.
- data_file_directories will contain a single-item list with the value /mnt/cassandra-data.
- listen_address and rpc_address will be null.
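The same idea for the YAML case can be sketched as follows. This is a hedged illustration, not stratus's actual code; it assumes the file has already been parsed into a dict (e.g. with PyYAML's yaml.safe_load, PyYAML being one of the dependencies listed above):

```python
# Illustrative sketch only: the cassandra.yaml modifications listed
# above, applied to a config already parsed into a dict (e.g. via
# PyYAML's yaml.safe_load). Key names match Cassandra 0.7.x defaults.
def rewrite_conf(conf, seed_ips, token):
    conf = dict(conf)                          # leave the caller's dict intact
    conf["seeds"] = list(seed_ips)             # first two nodes become seeds
    conf["initial_token"] = token              # generated for key distribution
    conf["commitlog_directory"] = "/mnt/cassandra-logs"
    conf["data_file_directories"] = ["/mnt/cassandra-data"]
    conf["listen_address"] = None              # serialized as null in YAML
    conf["rpc_address"] = None
    return conf

parsed = {"cluster_name": "Test", "seeds": ["127.0.0.1"]}
print(rewrite_conf(parsed, ["10.0.0.1", "10.0.0.2"], 0))
```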
Keyspace Definitions File (Cassandra 0.7.x only)
The following is a sample file pulled from http://wiki.apache.org/cassandra/LiveSchemaUpdates that shows how you can use Thrift API commands in a batch style to build up your keyspaces. It creates a keyspace Keyspace1 with two column families, Standard1 and Standard2. If this file is passed in through the keyspace_definitions_file parameter of your clusters.cfg file, it will be executed on ONE node via the cassandra-cli utility after the Cassandra service has started.
/* Create a new keyspace */
create keyspace Keyspace1 with replication_factor = 3 and placement_strategy = 'org.apache.cassandra.locator.RackUnawareStrategy'
/* Switch to the new keyspace */
use Keyspace1
/* Create new column families */
create column family Standard1 with column_type = 'Standard' and comparator = 'BytesType'
create column family Standard2 with column_type = 'Standard' and comparator = 'UTF8Type' and rows_cached = 10000
Check out the package, browse to the project's root directory, and run the following:
% sudo python setup.py install
After specifying an AMI, you can run stratus. It will display usage instructions when you invoke it without arguments.
You can test that the script can connect to your cloud provider by typing:
% stratus list --all
This will list the cluster name, service type, and cloud provider for ALL clusters that have been defined or are currently running in EC2.
After you install stratus and setup your EC2 account information, starting a Cassandra cluster with 10 nodes is easy by using one command:
% stratus exec CLUSTER_NAME launch-cluster 10
(where CLUSTER_NAME is a defined cluster in your ~/.stratus/clusters.cfg file)
To use EBS-backed persistent storage with your cluster:
- Create a new section in your clusters.cfg file. (This is completely optional. Most users will want EBS, so you can use an existing cluster config if you would like.)
- Create storage for the new cluster by creating a temporary EBS volume, formatting it, and saving it as a snapshot in S3. This way, you only have to do the formatting once and can use the snapshot to clone cluster volumes later. NOTE: You only have to do this step once unless you remove the snapshot later. All snapshots of a given size are identical, so you can just reuse one if one already exists in the size you want.
- Create a JSON spec file that defines how storage volumes will be created and assigned for your cluster. This spec file should reference the snapshot ID you created in the previous step. Remember that if you already have a formatted snapshot you may use that ID instead. IMPORTANT CASSANDRA INFO: All Cassandra cluster nodes expect to have two separate storage devices defined. One storage volume will be used to store Cassandra log files (/dev/sdj) and the second will be used to store Cassandra data (/dev/sdk). The automatic configuration of the nodes will try to mount these volumes to /mnt/cassandra-logs and /mnt/cassandra-data respectively, and they MUST exist for persistent storage. A sample JSON spec file can be found in the stratus/cassandra/data directory of the project and is referenced below in the "Sample JSON spec file" section.
- Use the create-storage command to create the storage volumes defined in your spec file for the number of nodes your cluster will have. The following example creates storage for a 3-node Cassandra cluster -- assuming your spec defines the required two volumes per node, this command will create 6 volumes (2 for each node).
- Launch your cluster with the appropriate number of nodes (should be the same number from the previous step).
- When all nodes have finished launching, the configuration of your nodes will begin. This consists of assigning the devices for your storage volumes to the appropriate nodes, mounting those volumes to the proper mount points, and launching the Cassandra services. You can test your persistent storage by:
- writing data to the Cassandra services
- terminating your cluster like normal: % stratus CLUSTER_NAME terminate-cluster
- re-launching the cluster: % stratus CLUSTER_NAME launch-cluster N
- retrieving data previously written to Cassandra
- SSH into your cluster:
% stratus CLUSTER_NAME login
Example:
The following example shows how to create a 100GB snapshot, create storage for a 3-node cluster, and then launch the cluster.
% stratus exec CLUSTER_NAME create-formatted-snapshot 100
% stratus exec CLUSTER_NAME create-storage 3 ~/.stratus/my-cassandra-ebs-cluster-storage-spec.json
% stratus exec CLUSTER_NAME launch-cluster 3
Node type abbreviations used in the JSON spec files:
- nn = Hadoop name node
- snn = Hadoop secondary name node
- dn = Hadoop data node
- tt = Hadoop task tracker
- jt = Hadoop job tracker
- cn = Cassandra node
- hcn = Hadoop/Cassandra node
- Prefix Hadoop-specific keys with "hybrid_" for Hadoop/Cassandra hybrid keys (e.g., hybrid_nn)
Sample JSON spec file for a Cassandra cluster:
{
"cn": [
{
"device": "/dev/sdj",
"mount_point": "/mnt/cassandra-logs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
},
{
"device": "/dev/sdk",
"mount_point": "/mnt/cassandra-data",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
]
}
- For the automatic configuration to work correctly there need to be two volumes defined, and they must reference the devices /dev/sdj and /dev/sdk. The sdj device must have the mount point /mnt/cassandra-logs and the sdk device must have the mount point /mnt/cassandra-data.
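The requirement above can be checked before you create storage. The helper below is a hedged sketch, not part of stratus itself; it validates that a Cassandra spec defines exactly the two required device/mount-point pairs:

```python
# Illustrative sketch only: verify that a Cassandra JSON spec satisfies
# the requirement above (two "cn" volumes on /dev/sdj and /dev/sdk with
# the expected mount points). Not part of stratus itself.
import json

REQUIRED = {
    "/dev/sdj": "/mnt/cassandra-logs",
    "/dev/sdk": "/mnt/cassandra-data",
}

def check_cassandra_spec(spec_text):
    spec = json.loads(spec_text)
    volumes = spec.get("cn", [])
    found = {v["device"]: v["mount_point"] for v in volumes}
    return found == REQUIRED

SAMPLE = """{
  "cn": [
    {"device": "/dev/sdj", "mount_point": "/mnt/cassandra-logs",
     "size_gb": "100", "snapshot_id": "snap-xxxxxx"},
    {"device": "/dev/sdk", "mount_point": "/mnt/cassandra-data",
     "size_gb": "100", "snapshot_id": "snap-xxxxxx"}
  ]
}"""

print(check_cassandra_spec(SAMPLE))  # True
```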
Sample JSON spec file for a Hadoop cluster:
{
"nn": [
{
"device": "/dev/sdh",
"mount_point": "/mnt/hadoop-ebs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
],
"dn": [
{
"device": "/dev/sdi",
"mount_point": "/mnt/hadoop-ebs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
]
}
Sample JSON spec file for a hybrid Hadoop/Cassandra cluster:
{
"hybrid_nn": [
{
"device": "/dev/sdh",
"mount_point": "/mnt/hadoop-ebs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
],
"hybrid_dn": [
{
"device": "/dev/sdi",
"mount_point": "/mnt/hadoop-ebs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
],
"cn": [
{
"device": "/dev/sdj",
"mount_point": "/mnt/cassandra-logs",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
},
{
"device": "/dev/sdk",
"mount_point": "/mnt/cassandra-data",
"size_gb": "100",
"snapshot_id": "snap-xxxxxx"
}
]
}