
AD Troubleshooting

General

Important
  • Workbench uses the hostname for component configuration.
  • Ensure that hostname resolution between Workbench components, including AD and Engage hosts, is accurate and robust.
  • If the Workbench hosts have multiple NICs, ensure the hostname resolves to the desired IP address before installing Workbench.
  • Verify that the network ports used by AD are open in the firewall and not already in use by other applications.
  • AD Nodes/Hosts require a minimum of 8 CPU cores.
  • Install the AD components on dedicated hosts, not on the same Nodes/Hosts as the Workbench core components.
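
The checks in the list above can be partly automated before installation. The following is a minimal sketch, not part of Workbench: the function names are hypothetical, and the port check only tells you whether a port is already bound on the local host. It cannot tell you whether a firewall between hosts blocks that port.

```python
import os
import socket

MIN_CPU_CORES = 8  # minimum stated above for AD Nodes/Hosts


def resolve_hostname(hostname):
    """Return the sorted, de-duplicated IP addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to the port on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission problem.
            return False
    return True


def has_enough_cores(minimum=MIN_CPU_CORES):
    """Return True if the host reports at least `minimum` CPU cores."""
    count = os.cpu_count()  # may be None if it cannot be determined
    return count is not None and count >= minimum
```

On a host with multiple NICs, comparing the output of `resolve_hostname(socket.gethostname())` with the intended IP address shows whether the hostname resolves as expected.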


Logs for Troubleshooting

AD automatically creates the file ad_monitoring.log in the configured {LOG_PATH} folder.

Each entry in this log file uses the following format:

'%(asctime)s | %(levelname)s | %(processName)s | %(message)s'

  • Time format: 2021-09-20 03:15:46,291
  • The default Log_Level is INFO. DEBUG mode shows details about the processes executed by AD, but it reduces the performance of some components, such as the streaming consumers and collectors.
  • processName identifies the AD component that generated the event.
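
Because the fields are separated by " | ", a log line can be split into its four parts with a few lines of Python. This is a sketch for ad hoc analysis; `parse_ad_log_line` is a hypothetical helper, not something shipped with AD.

```python
from datetime import datetime

FIELD_SEPARATOR = " | "
TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2021-09-20 03:15:46,291


def parse_ad_log_line(line):
    """Split one ad_monitoring.log line into time, level, process and message.

    Returns None for lines that do not match the layout, such as the
    continuation lines of a multi-line traceback.
    """
    parts = line.rstrip("\n").split(FIELD_SEPARATOR, 3)
    if len(parts) != 4:
        return None
    timestamp, level, process, message = parts
    try:
        parsed_time = datetime.strptime(timestamp, TIME_FORMAT)
    except ValueError:
        return None
    return {
        "time": parsed_time,
        "level": level,
        "process": process,
        "message": message,
    }
```

Filtering the parsed records on `level == "ERROR"` or on a specific `process` value narrows a large log down to one component quickly.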

Below are a few log excerpts that help with troubleshooting:

  • AD start: check if AD is running as a primary or additional node

2021-09-08 16:16:21,446 | INFO | application_manager | WB-AD starting

2021-09-08 16:16:21,447 | INFO | application_manager | AD compilation time: 210908-192852

2021-09-08 16:16:21,447 | INFO | application_manager | configuration path: configs

2021-09-08 16:16:21,447 | INFO | application_manager | main path: /Installation/path

2021-09-08 16:16:22,172 | INFO | application_manager | App Manager started

2021-09-08 16:16:22,173 | INFO | application_manager | local data storage initialized

2021-09-08 16:16:22,173 | INFO | application_manager | AD --.--.--.-- as primary node

2021-09-08 16:16:22,173 | INFO | application_manager | app_manager class initialized


  • AD components are started in this order: ad_api, streaming consumer, collector, model_manager, anomaly_detector and alarm_monitoring.

2021-09-20 03:03:47,939 | INFO | application_manager | New ad_api process started with pid 49852

2021-09-20 03:03:47,941 | INFO | ad_api | starting AD API: -------:8182

2021-09-20 03:03:47,943 | INFO | application_manager | New streaming_consumer_logstash0 process started with pid 49853

2021-09-20 03:03:47,952 | INFO | application_manager | New collector process started with pid 49854

2021-09-20 03:03:47,953 | INFO | streaming_consumer_logstash0 | Streaming Consumer initialized

2021-09-20 03:03:47,957 | INFO | collector | AD Collector initialized

2021-09-20 03:03:48,021 | INFO | model_manager | Model Manager initialized

2021-09-20 03:03:48,011 | INFO | application_manager | New model_manager process started with pid 49855

2021-09-20 03:03:48,036 | INFO | application_manager | New anomaly_detector process started with pid 49856

2021-09-20 03:03:48,059 | INFO | application_manager | New alarm_monitoring process started with pid 49857

2021-09-20 03:03:48,063 | INFO | anomaly_detector | Anomaly Analyzer initialized

2021-09-20 03:03:48,072 | INFO | application_manager | modules initialized

2021-09-20 03:03:48,072 | INFO | anomaly_detector | Anomaly Detector initialized

2021-09-20 03:03:48,087 | INFO | alarm_monitoring | Alarm Monitoring initialized
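
To confirm that every component came up, the "process started with pid" lines can be matched against the expected start order. The sketch below assumes the message wording shown in the excerpt above and matches component names by prefix, since the streaming consumer appears with a suffix (streaming_consumer_logstash0). The helper names are hypothetical.

```python
import re

# Start order as listed above; streaming consumers are matched by prefix.
EXPECTED_COMPONENTS = [
    "ad_api",
    "streaming_consumer",
    "collector",
    "model_manager",
    "anomaly_detector",
    "alarm_monitoring",
]

STARTED_PATTERN = re.compile(r"New (\S+) process started with pid (\d+)")


def started_components(lines):
    """Map each started process name to the pid reported in the log."""
    pids = {}
    for line in lines:
        match = STARTED_PATTERN.search(line)
        if match:
            pids[match.group(1)] = int(match.group(2))
    return pids


def missing_components(lines):
    """Return the expected components with no 'process started' line."""
    started = started_components(lines)
    return [
        name
        for name in EXPECTED_COMPONENTS
        if not any(process.startswith(name) for process in started)
    ]
```

An empty list from `missing_components` means all six components logged a start event in the lines examined.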

  • Common errors detected:
    • Failure to connect to the Logstash TCP server: this must be confirmed by an alarm generated by AD.

2021-09-20 02:32:44,139 | ERROR | streaming_consumer_logstash0 | error collecting messages.
Traceback (most recent call last):
  File "core/streaming_consumer.py", line 133, in main
    SC.streaming_process()
  File "core/streaming_consumer.py", line 67, in streaming_process
    message = self.broker.get_message()
  File "core/streaming_consumer.py", line 24, in get_message
    message = self.socketFile.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

error collecting messages.
Traceback (most recent call last):
  File "streaming_consumer.py", line 131, in main
  File "streaming_consumer.py", line 60, in set_broker
  File "streaming_consumer.py", line 19, in __init__
ConnectionRefusedError: [Errno 111] Connection refused
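
When these errors appear, a quick way to separate a network problem from an AD problem is to open a TCP connection to the Logstash server from the AD host. The probe below is a sketch: the function name is hypothetical, and the host and port must be taken from your own configuration. It only exercises the connection phase, so it can reproduce a refused connection but not a read timeout on an already established connection.

```python
import socket


def probe_tcp_server(host, port, timeout=5.0):
    """Try to open a TCP connection and report the outcome as a string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except ConnectionRefusedError:
        # Nothing is listening on that port (Errno 111 on Linux).
        return "refused"
    except socket.timeout:
        # No answer within the timeout, often a firewall dropping packets.
        return "timed out"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and similar.
        return f"error: {exc}"
```

A "refused" result points to the Logstash TCP input not running or listening on a different port, while "timed out" more often points to a firewall or routing issue.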


Important events:

  • New source detected
  • AD model trained or updated
  • New alarm sent
  • New anomaly (insight) created
  • Change detected in AD config file
  • Restarting AD components
  • AD component terminated
  • Additional AD Nodes:
    • new source added from primary
    • AD model updated from primary
    • Primary node is not responding
    • sending request to update primary node in WB

AD API for Troubleshooting


Additional AD API endpoints were added to monitor AD status. By default, ad_api runs on port 8182.

  • /ad_api/status: returns the current status of AD and its components.
  • /ad_api/get_sources_summary: returns the list of sources (metrics) collected by AD.
  • /ad_api/get_alarms: returns the alarms generated by AD that are in open status.
  • /ad_api/get_last_error: returns the last error detected in the logs.
  • /ad_api/get_last_insight: returns basic information about the source of the last insight detected.
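
The endpoints above can be queried from a script. The sketch below assumes plain HTTP, a GET request and no authentication, none of which is stated on this page, and it returns the response body as text without assuming a format. The helper names and the host name in the usage note are placeholders.

```python
import urllib.request

DEFAULT_AD_API_PORT = 8182  # default ad_api port stated above

STATUS_ENDPOINTS = [
    "status",
    "get_sources_summary",
    "get_alarms",
    "get_last_error",
    "get_last_insight",
]


def ad_api_url(host, endpoint, port=DEFAULT_AD_API_PORT, scheme="http"):
    """Build the URL for one AD API endpoint (the scheme is an assumption)."""
    return f"{scheme}://{host}:{port}/ad_api/{endpoint}"


def fetch_ad_endpoint(host, endpoint, port=DEFAULT_AD_API_PORT, timeout=10.0):
    """Send a GET request to the endpoint and return the body as text."""
    url = ad_api_url(host, endpoint, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `fetch_ad_endpoint("ad-host.example", "status")` would request the status endpoint on the default port, and looping over `STATUS_ENDPOINTS` collects all five responses for a support ticket.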
This page was last edited on November 3, 2021, at 12:01.