General Cassandra Maintenance
The Datastax documentation provides detailed guidance on how to keep track of the health of Cassandra.
Genesys recommends that you use the nodetool utility bundled with your Cassandra installation package, and that you run the following nodetool commands regularly to monitor the state of your Cassandra cluster.
ring
Displays node status and information about the cluster, as determined by the node being queried. This gives you an idea of the load balance and shows whether any nodes are down. If your cluster is not properly configured, different nodes may report different views of the cluster, so running this command against each node is a good way to check that every node views the cluster the same way.
nodetool -h <HOST_NAME> -p <JMX_PORT> ring
status
Displays cluster information.
nodetool -h <HOST_NAME> -p <JMX_PORT> status
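The status output can also be checked from a script. The following is a minimal sketch, assuming that each node line in the status output begins with a two-letter state code whose first letter is U (up) or D (down). The function name count_down_nodes is hypothetical and is not part of nodetool.

```shell
# Count the nodes reported as down in `nodetool status` output read from stdin.
# Assumption: node lines start with a two-letter state code such as UN or DN,
# where the first letter is U (up) or D (down).
count_down_nodes() {
    awk '$1 ~ /^D[NLJM]$/ { n++ } END { print n + 0 }'
}

# Example usage (requires a reachable Cassandra node):
#   nodetool -h <HOST_NAME> -p <JMX_PORT> status | count_down_nodes
```

A non-zero result suggests that at least one node needs attention before you carry out other maintenance.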
compactionstats
Displays compaction statistics.
nodetool -h <HOST_NAME> -p <JMX_PORT> compactionstats
getcompactionthroughput / setcompactionthroughput
getcompactionthroughput displays the compaction throughput on the selected Cassandra instance, and setcompactionthroughput changes it. By default it is 32 MB/s.
You can increase this parameter if you observe that the database keeps growing after the time-to-live (TTL) and grace periods have passed. Note that increasing compaction throughput affects memory and CPU consumption, so you need to make sure that you have sufficient hardware to support the rate that you select.
nodetool -h <HOST_NAME> -p <JMX_PORT> getcompactionthroughput
To increase compaction throughput to 64 MB/s, for example, use the following command:
nodetool -h <HOST_NAME> -p <JMX_PORT> setcompactionthroughput 64
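The two commands can be combined so that the throughput is raised only when the current value is lower than the target. This is a minimal sketch, assuming that getcompactionthroughput prints a line of the form "Current compaction throughput: 32 MB/s". The function names parse_throughput and raise_throughput and the NODETOOL override variable are hypothetical.

```shell
# NODETOOL can be overridden, for example to point at a wrapper script.
NODETOOL="${NODETOOL:-nodetool}"

# Extract the numeric value from getcompactionthroughput output read from stdin.
# Assumption: the output contains a line such as
# "Current compaction throughput: 32 MB/s".
parse_throughput() {
    sed -n 's/^Current compaction throughput: *\([0-9][0-9]*\).*/\1/p'
}

# Raise the throughput to the target only if the current value is lower.
# Arguments: host name, JMX port, target throughput in MB/s.
raise_throughput() {
    current=$("$NODETOOL" -h "$1" -p "$2" getcompactionthroughput | parse_throughput)
    if [ -n "$current" ] && [ "$current" -lt "$3" ]; then
        "$NODETOOL" -h "$1" -p "$2" setcompactionthroughput "$3"
    fi
}

# Example usage (requires a reachable Cassandra node):
#   raise_throughput <HOST_NAME> <JMX_PORT> 64
```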
Backup and recovery
With its distributed architecture, Cassandra is fault-tolerant, and the integrity of data is guaranteed, even in the event of an entire site failure.
Cassandra also provides the ability to take a snapshot. Snapshots are not designed for data integrity, but they can be used as a backup, for example, to retrieve data that was accidentally deleted.
Rather than dedicating a machine to storing snapshots, use that machine to run an additional redundant Cassandra node.
See Backing up and restoring data in the Cassandra documentation.
An incremental backup is also available and is quicker than a full backup. Datastax Enterprise also publishes best-practice documentation that you can download.
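If you do decide to take snapshots, nodetool provides the snapshot and clearsnapshot commands. The following is a minimal sketch that wraps them. The function names take_snapshot and clear_snapshot, the NODETOOL override variable, and the date-based default tag are assumptions made for illustration. Substitute your own keyspace name.

```shell
# NODETOOL can be overridden, for example to point at a wrapper script.
NODETOOL="${NODETOOL:-nodetool}"

# Take a snapshot of one keyspace.
# Arguments: host name, JMX port, keyspace name, optional tag.
# If no tag is given, a date-based tag such as backup-20240131 is used.
take_snapshot() {
    tag="${4:-backup-$(date +%Y%m%d)}"
    "$NODETOOL" -h "$1" -p "$2" snapshot -t "$tag" "$3"
}

# Remove the snapshot with the given tag to free disk space.
# Arguments: host name, JMX port, tag.
clear_snapshot() {
    "$NODETOOL" -h "$1" -p "$2" clearsnapshot -t "$3"
}

# Example usage (requires a reachable Cassandra node):
#   take_snapshot <HOST_NAME> <JMX_PORT> <KEYSPACE>
```

A snapshot is taken on the node that you query, so a cluster-wide backup means repeating the call for every node.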
Depending on the replication factor and consistency levels of a Cassandra cluster configuration, the UCS Cluster can handle the failure of one or more Cassandra nodes in the data center without any special recovery procedures and without interrupting service or losing functionality. When the failed node is back up, the UCS Cluster automatically reconnects to it.
- Therefore, if the number of failed nodes is within what your replication factor and consistency levels can tolerate, you only need to restart the failed nodes.
However, if too many of the Cassandra nodes in your cluster have failed or stopped, you will lose functionality. To ensure a successful recovery from failure of multiple nodes, Genesys recommends that you:
- Stop every node, one at a time, with at least two minutes between operations.
- Then restart the nodes one at a time, with at least two minutes between operations.
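The stop and restart sequence above can be sketched as a small script. This is an illustration only: the function names are hypothetical, and the ssh and systemctl commands and the cassandra service name are assumptions that depend on how Cassandra is installed on your hosts.

```shell
# Run an action against each node in turn, pausing between nodes.
# Arguments: pause in seconds, action (a command or function name), host names.
# Stops at the first node for which the action fails.
for_each_node() {
    pause="$1"; action="$2"; shift 2
    first=1
    for host in "$@"; do
        [ "$first" -eq 1 ] || sleep "$pause"
        first=0
        "$action" "$host" || return 1
    done
}

# Hypothetical site-specific actions. The service name and the use of
# ssh and systemctl are assumptions; adapt them to your environment.
stop_node()  { ssh "$1" 'sudo systemctl stop cassandra'; }
start_node() { ssh "$1" 'sudo systemctl start cassandra'; }

# Example usage, with 120 seconds (two minutes) between operations:
#   for_each_node 120 stop_node node1 node2 node3
#   sleep 120
#   for_each_node 120 start_node node1 node2 node3
```

Stopping at the first failure is a deliberate choice here, so that a problem on one node can be investigated before the remaining nodes are touched.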