
Cassandra and Elasticsearch Monitoring

All connected applications are monitored on every metric available for them. The registered metrics are listed in the console output.

Example

2017-01-25T18:27:27.700Z [debug] Registered GenericMetricThreshold for RAP_ucs1 on sys_CPU with threshold: 90.0 (on:324020000 off:3240200001)
2017-01-25T18:27:27.706Z [debug] Registered GenericMetricThreshold for RAP_ucs2 on sys_CPU with threshold: 90.0 (on:325020000 off:3250200001)
2017-01-25T18:27:28.446Z [debug] Registered GenericMetricThreshold for RAP_cass3 on sys_CPU with threshold: 90.0 (on:323020000 off:3230200001)
2017-01-25T18:27:28.453Z [debug] Registered GenericMetricThreshold for RAP_cass2 on sys_CPU with threshold: 90.0 (on:322020000 off:3220200001)
2017-01-25T18:27:28.551Z [debug] Registered GenericMetricThreshold for RAP_cass1 on sys_CPU with threshold: 90.0 (on:321020000 off:3210200001)

LMS message IDs are generated from the metric message ID and the application DBID, so that different alarms can be defined, enabled, and disabled for each application.

A message ID has the form <dbid>0<alarmid>; for example, 320020000 is alarm 20000 (sys_CPU) for application 320 (SerMon). In the log excerpts on this page, the message sent when an alarm clears carries the alarm ID plus one (for example, 322020001).
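The <dbid>0<alarmid> scheme can be sketched as follows. This is a minimal illustration, not the product's own code: the function names are hypothetical, and it assumes that alarm IDs always have five digits, as every alarm ID on this page does.

```python
from typing import Tuple

ALARM_ID_DIGITS = 5  # assumption: alarm IDs are always five digits, as on this page


def make_message_id(dbid: int, alarm_id: int) -> int:
    """Concatenate the DBID, a literal 0, and the zero-padded alarm ID."""
    return int(f"{dbid}0{alarm_id:0{ALARM_ID_DIGITS}d}")


def split_message_id(message_id: int) -> Tuple[int, int]:
    """Recover (dbid, alarm_id) from a <dbid>0<alarmid> message ID."""
    text = str(message_id)
    alarm_id = int(text[-ALARM_ID_DIGITS:])
    prefix = text[:-ALARM_ID_DIGITS]
    # The prefix must be the DBID followed by the separator digit 0.
    if len(prefix) < 2 or not prefix.endswith("0"):
        raise ValueError(f"not a <dbid>0<alarmid> message ID: {message_id}")
    return int(prefix[:-1]), alarm_id
```

For example, make_message_id(320, 20000) gives 320020000, and split_message_id(322020001) gives (322, 20001).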

When an alarm condition is reached or cleared, a log entry like the ones below is written and a corresponding message is sent to MessageServer:

2017-01-25T18:39:27.668Z [error] [322020000] Alarm condition for sys_CPU was reached on RAP_cass2 with 99.0 > 90.0
 
2017-01-25T18:39:57.671Z [warn ] [322020001] Alarm condition for sys_CPU was cleared on RAP_cass2 with 0.51 < 90.0
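When these entries need to be processed by a script, they can be parsed with a regular expression. The sketch below is modelled only on the two sample lines above; the AlarmEvent type and the parse_alarm_line function are hypothetical names, and other log layouts are not covered.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the two sample log lines; an assumption, not a documented format.
ALARM_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+\[(?P<level>\w+)\s*\]\s+\[(?P<message_id>\d+)\]\s+"
    r"Alarm condition for (?P<metric>\S+) was (?P<state>reached|cleared) "
    r"on (?P<application>\S+) with (?P<value>[-\d.]+) [<>] (?P<threshold>[-\d.]+)$"
)


class AlarmEvent(NamedTuple):
    timestamp: str
    level: str
    message_id: int
    metric: str
    application: str
    reached: bool  # True when the alarm was reached, False when it was cleared
    value: float
    threshold: float


def parse_alarm_line(line: str) -> Optional[AlarmEvent]:
    """Return the alarm event described by a log line, or None if it does not match."""
    match = ALARM_LINE.match(line.strip())
    if match is None:
        return None
    return AlarmEvent(
        timestamp=match["timestamp"],
        level=match["level"],
        message_id=int(match["message_id"]),
        metric=match["metric"],
        application=match["application"],
        reached=match["state"] == "reached",
        value=float(match["value"]),
        threshold=float(match["threshold"]),
    )
```

Lines that are not alarm entries, such as the registration messages, return None and can be skipped.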

Alarms

The alarms shown below are available for all application types:

sys_CPU

Percentage of the CPU load used by this process.

  • Alarm ID: 20000
  • Alarm threshold: 90

If this process consistently uses more than 90% of the machine CPU, an upgrade might be necessary.

sys_RAM_usage

Percentage of the machine memory used by this process.

  • Alarm ID: 20002
  • Alarm threshold: 90

Memory usage reaching 90% indicates that something is going wrong.

sys_GC_ParNew

Time spent by the latest garbage collection, in milliseconds.

  • Alarm ID: 20004
  • Alarm threshold: 2000

A garbage collection lasting 2 seconds (2000 ms) may indicate that something is going wrong.

Dynamically updating the alarm threshold

The threshold value can now be updated in the options of the application.

[Figure: UCSMonMaint4.png, threshold overrides in the application options]

In this example, the threshold, which is configured at 90, is overridden with the value 49.3 for most applications. The value 83.3 applies to all applications of type WCC, and the value 74.1 applies only to the application named "UCS9". This configuration is dynamic.
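The override order in this example can be sketched as a lookup in which the most specific setting wins. This is an illustration only: the function name, its parameters, and the dictionaries are hypothetical and do not reflect the actual option names, and the precedence (application name, then application type, then the general override, then the configured default) is an assumption inferred from the description above.

```python
from typing import Dict, Optional


def resolve_threshold(
    app_name: str,
    app_type: str,
    default: float,
    general_override: Optional[float] = None,
    by_type: Optional[Dict[str, float]] = None,
    by_name: Optional[Dict[str, float]] = None,
) -> float:
    """Pick the threshold for one application, most specific override first."""
    if by_name and app_name in by_name:
        return by_name[app_name]      # override for a single named application
    if by_type and app_type in by_type:
        return by_type[app_type]      # override for every application of a type
    if general_override is not None:
        return general_override       # override for all remaining applications
    return default                    # threshold configured for the alarm


# Values from the example above; the application type "UCS" is an assumed label.
overrides = dict(
    default=90.0,
    general_override=49.3,
    by_type={"WCC": 83.3},
    by_name={"UCS9": 74.1},
)
ucs9_threshold = resolve_threshold("UCS9", "UCS", **overrides)
```

With these values, "UCS9" resolves to 74.1, any application of type WCC to 83.3, and every other application to 49.3.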

Cassandra metrics and alarms

The following alarms are related to Cassandra performance.

cass_DiskSpaceUsed

Disk space used by Cassandra.

  • Alarm ID: 20126
  • Alarm threshold: 450

The threshold is in GiB. Disk usage must remain under 50% of the disk size to leave room for compaction.
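Because the default of 450 GiB only matches the 50% rule on a disk of about 900 GiB, the threshold may need to be adjusted to the actual disk size. The helper below is a small sketch of that arithmetic; the function names and the max_fraction parameter are hypothetical.

```python
def max_disk_threshold_gib(disk_size_gib: float, max_fraction: float = 0.5) -> float:
    """Largest threshold (in GiB) that keeps usage within max_fraction of the disk."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in the range (0, 1]")
    return disk_size_gib * max_fraction


def threshold_fits_disk(
    threshold_gib: float, disk_size_gib: float, max_fraction: float = 0.5
) -> bool:
    """True if the alarm fires no later than the point where the limit is crossed."""
    return threshold_gib <= max_disk_threshold_gib(disk_size_gib, max_fraction)
```

For example, threshold_fits_disk(450, 900) is True, while threshold_fits_disk(450, 800) is False, which suggests lowering the threshold to 400 on an 800 GiB disk.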

cass_ReadLatency

Average latency of Cassandra read requests over the last minute.

  • Alarm ID: 20100
  • Alarm threshold: 1000

cass_WriteLatency

Average latency of Cassandra write requests over the last minute.

  • Alarm ID: 20102
  • Alarm threshold: 1000

cass_DroppedMessages

Average number of messages dropped by Cassandra over the last minute.

  • Alarm ID: 20104
  • Alarm threshold: 0.0001

cass_HintsInProgress

Number of hints currently in progress. Ideally 0; the maximum is 1024.

  • Alarm ID: 20108
  • Alarm threshold: 10

cass_ReadRepairRepairedBlocking

Number of blocking read repairs in the last minute.

  • Alarm ID: 20114
  • Alarm threshold: 0.01

Should be 0 ideally.

cass_ReadRepairRepairedBackground

Number of background read repairs in the last minute.

  • Alarm ID: 20116
  • Alarm threshold: 0.01

Should be 0 ideally.

cass_ClientRequestReadFailures

Number of Cassandra read request failures in the last minute.

  • Alarm ID: 20118
  • Alarm threshold: 0.001

Should be 0 ideally.

cass_ClientRequestReadTimeouts

Number of Cassandra read request timeouts in the last minute.

  • Alarm ID: 20120
  • Alarm threshold: 0.001

Should be 0 ideally.

cass_ClientRequestWriteFailures

Number of Cassandra write request failures in the last minute.

  • Alarm ID: 20122
  • Alarm threshold: 0.001

Should be 0 ideally.

cass_ClientRequestWriteTimeouts

Number of Cassandra write request timeouts in the last minute.

  • Alarm ID: 20124
  • Alarm threshold: 0.001

Should be 0 ideally.
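For reference, the Cassandra alarms listed above can be collected into a single table and checked against a set of metric readings. This is only a sketch: the alarm names, IDs, and default thresholds are taken from this page, while the exceeded_alarms function and the shape of the sample dictionary are hypothetical. The strict "greater than" comparison follows the "99.0 > 90.0" form of the alarm log entries.

```python
from typing import Dict, List, Tuple

# metric name -> (alarm ID, default threshold), as listed on this page
CASSANDRA_ALARMS: Dict[str, Tuple[int, float]] = {
    "cass_DiskSpaceUsed": (20126, 450),
    "cass_ReadLatency": (20100, 1000),
    "cass_WriteLatency": (20102, 1000),
    "cass_DroppedMessages": (20104, 0.0001),
    "cass_HintsInProgress": (20108, 10),
    "cass_ReadRepairRepairedBlocking": (20114, 0.01),
    "cass_ReadRepairRepairedBackground": (20116, 0.01),
    "cass_ClientRequestReadFailures": (20118, 0.001),
    "cass_ClientRequestReadTimeouts": (20120, 0.001),
    "cass_ClientRequestWriteFailures": (20122, 0.001),
    "cass_ClientRequestWriteTimeouts": (20124, 0.001),
}


def exceeded_alarms(sample: Dict[str, float]) -> List[str]:
    """Return the metrics in the sample whose value is above the default threshold."""
    return [
        name
        for name, (_alarm_id, threshold) in CASSANDRA_ALARMS.items()
        if name in sample and sample[name] > threshold
    ]


# Hypothetical readings: only the hints count is above its threshold of 10.
sample = {"cass_HintsInProgress": 12, "cass_ReadLatency": 250.0, "cass_DroppedMessages": 0.0}
alarming = exceeded_alarms(sample)
```

Metrics missing from the sample are ignored, so a partial reading can be checked without errors.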


There is more information here:

Elasticsearch alarms

The alarms below are available for Elasticsearch nodes:

es_IndexingTime

Time spent indexing an entry.

  • Alarm ID: 24000
  • Alarm threshold: 500

Indexing time should be as low as possible.

es_DiskSpaceUsed

Disk space used by Elasticsearch.

  • Alarm ID: 24002
  • Alarm threshold: 750

Threshold is in GiB. Must remain under 80% of total disk space.

This page was last modified on May 18, 2018, at 07:01.