
System Monitoring and Logging

Monitoring JOP and its Containers

JOP uses a number of logs to track the status of the various containers: Tango (which includes Gunicorn processes), NGINX, and MongoDB. JOP also uses ElasticSearch, so you might find status updates relating to ElasticSearch in the logs. This section covers logging for each container.

Tango Container Logging

The Tango container logs record the status of the Predictive Routing application.

There are two types of logs generated by the Tango container:

Application logs: Gunicorn (Python WSGI HTTP Server) logs. For example:

2017-10-03 18:09:06 [44] [INFO] Booting worker with pid: 44

The Application logs provide the following information (a worked example follows this list):

  • The time of the event in YYYY-MM-DD HH:MM:SS,mmm format.
  • The [pid] (process ID) of the process that handled the request.
  • The log level (in the example above, this is INFO).
  • The application name (for GPR this is <BOTTLE>).
  • The module and the line in the module.
  • The actual log message.
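
For example, a line in this format (a similar line appears in the worker log excerpt later on this page) breaks down as follows; the annotation is illustrative:

2017-10-03 23:53:00,670 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:25 started

  • 2017-10-03 23:53:00,670 is the time of the event.
  • [13] is the process ID of the process that handled the request.
  • INFO is the log level.
  • <BOTTLE> is the application name.
  • start_stop_machine.py:170 is the module and the line in the module.
  • JobsConsumer pid:25 started is the log message.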

Web server logs: These logs start with an IP address. For example:

172.18.0.5 - - [03/Oct/2017:18:10:09] "GET / HTTP/1.0" 302 241 "-" "curl/7.29.0"
Important
  • Anything with a log level of ERROR needs immediate investigation. In case of unexpected runtime errors, the ERROR message is usually followed by a traceback, which can be used for troubleshooting.
  • You should also investigate repeated WARNING messages.

Checking Tango Status

Tango container logs are sent to systemd. To check them, query the journal using journalctl with the following syntax:

  • To show all the available log entries:
    [root@centosbox ~]# journalctl CONTAINER_NAME=tango -o cat
  • To show the last 100 log entries:
    [root@centosbox ~]# journalctl CONTAINER_NAME=tango -n 100 -o cat
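
In addition, you can combine journalctl with standard text tools to follow the log live or to surface the ERROR and WARNING entries called out in the note above. These examples show generic journalctl and grep usage rather than GPR-specific commands:

  • To follow the Tango log in real time:
    [root@centosbox ~]# journalctl CONTAINER_NAME=tango -f -o cat
  • To show only ERROR and WARNING entries:
    [root@centosbox ~]# journalctl CONTAINER_NAME=tango -o cat | grep -E 'ERROR|WARNING'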


Example Tango Log Messages

...
2017-10-03 18:09:05 [1] [INFO] Starting gunicorn 18.0
2017-10-03 18:09:05,805 [1] INFO  <I/O> facebook.py:31   pre-importing facebook dependencies
2017-10-03 18:09:05,806 [1] INFO  <I/O> stats.py:21   pre-importing stats dependencies
2017-10-03 18:09:05,806 [1] INFO  <I/O> twitter.py:13   pre-importing twitter dependencies
2017-10-03 18:09:05,806 [1] INFO  <I/O> __init__.py:31   pre-importing heavy modules
2017-10-03 18:09:05,806 [1] INFO  <I/O> nlp.py:15   warming up NLP code
2017-10-03 18:09:05,995 [1] INFO  <I/O> nlp.py:29   done
2017-10-03 18:09:05 [1] [INFO] Listening at: http://0.0.0.0:3031 (1)
2017-10-03 18:09:05 [1] [INFO] Using worker: gevent
2017-10-03 18:09:06 [41] [INFO] Booting worker with pid: 41
2017-10-03 18:09:06 [42] [INFO] Booting worker with pid: 42
2017-10-03 18:09:06 [43] [INFO] Booting worker with pid: 43
2017-10-03 18:09:06 [44] [INFO] Booting worker with pid: 44
172.18.0.5 - - [03/Oct/2017:18:10:09] "GET / HTTP/1.0" 302 241 "-" "curl/7.29.0"
172.18.0.5 - - [03/Oct/2017:18:10:28] "GET / HTTP/1.0" 302 241 "-" "curl/7.29.0" 
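
The lines that begin with an IP address are the web server (access) log entries. Assuming the layout shown above, where the quoted request string contains the method, path, and protocol and is followed by the status code, you can produce a rough summary of the returned HTTP status codes with standard tools; this one-liner is illustrative:

    [root@centosbox ~]# journalctl CONTAINER_NAME=tango -o cat | awk '$1 ~ /^[0-9]+\./ {print $8}' | sort | uniq -c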

MongoDB Container Logs

MongoDB container logs are sent to systemd. To check them, query the journal using journalctl with the following syntax:

  • To show the last one hundred lines logged:
    [root@centosbox ~]# journalctl CONTAINER_NAME=mongo -n 100 -o cat
  • To show all lines logged:
    [root@centosbox ~]# journalctl CONTAINER_NAME=mongo -o cat

Worker (Processes) Logging

Workers container logs are sent to systemd. To check them, query the journal using journalctl with the following syntax:

  • To show the last one hundred lines logged:
    [root@centosbox ~]# journalctl CONTAINER_NAME=workers -n 100 -o cat
  • To show all lines logged:
    [root@centosbox ~]# journalctl CONTAINER_NAME=workers -o cat

Example Worker Log Messages

...
Using gunicorn timeout value = [600]
2017-10-03 23:53:00,670 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:25 started
2017-10-03 23:53:00,680 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:26 started
2017-10-03 23:53:00,688 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:27 started
2017-10-03 23:53:00,697 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:28 started
2017-10-03 23:53:00,709 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:29 started
2017-10-03 23:53:00,721 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:30 started
2017-10-03 23:53:00,735 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:35 started
2017-10-03 23:53:00,763 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:42 started
2017-10-03 23:53:00,784 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:49 started
2017-10-03 23:53:00,808 [13] INFO  <BOTTLE> start_stop_machine.py:170  JobsConsumer pid:52 started
[model62_3] Loading /opt/tango/solariat_nlp/src/solariat_nlp/intention_cls/en/model62_3/vectorizer.pkl...
[model62_3] Loading confusion_matrix and avg_confidence matrices from cache...
...

Monitoring Agent State Connector

Agent State Connector writes log data to Message Server, using the standard Genesys logging parameters. These are configured using the options in the Agent State Connector [log] section.

Genesys recommends that you configure alarms to notify you when the Agent State Connector generates the following Standard-level log messages:

  • 60400|STANDARD|error1|error... %s
  • 60401|STANDARD|error2|%s error... %s
  • 60402|STANDARD|exception1|exception caught and processed... %s
  • 60403|STANDARD|exception2|%s exception caught and processed... %s
  • 60404|STANDARD|failed_exc|%s failed, exception caught and processed... %s
  • 60701|STANDARD - Stat Server has experienced multiple switchovers during the period specified in the ss-monitoring-reconnect-min option.
  • 60702|STANDARD - Configuration Server has experienced multiple switchovers during the period specified in the confserv-monitoring-reconnect-min option.
  • 60703|STANDARD - Stat Server is losing connection with ASC. The number of times the connection is lost before this alarm is triggered is set in the ss-monitoring-reconnect-count option.
  • 60704|STANDARD - The cancel event for 60703.
  • 60706|STANDARD - Configuration Server is losing connection with ASC. The number of times the connection is lost before this alarm is triggered is set in the confserv-monitoring-reconnect-count option.
  • 60707|STANDARD - The cancel event for 60706.
Important

Alarm conditions are configured for all ASC instances.

  • If you receive alarm condition 60402 (AgentStateConnectorException1) or 60403 (AgentStateConnectorException2), ASC cannot recover and you must restart it. Genesys recommends that you also contact Genesys to evaluate your environment and prevent such conditions in the future.
  • If you receive alarm condition 60400 (AgentStateConnectorError1), 60401 (AgentStateConnectorError2), or 60404 (AgentStateConnectorExceptionCaught), the application continues to operate normally.

Example ASC Log Messages

  • ASC starts to read Person profiles from Configuration Server:
    04:16:52.702 Trc 09900 (AgentStateMonitor).(run):Main loop just started!
    04:16:52.703 Dbg 09900 (ConfigServerQueryEngine).(readAllAgents):Trying to query config server
  • One Person record was added to the ASC internal cache:
    04:22:48.367 Dbg 09900 (AgentStateMonitor).(initializeAgentData):Adding agent into current map: 6003880
  • The data for an Agent Group was added to the appropriate Agent Profiles in the cache:
    04:30:06.171 Dbg 09900 (AgentStateMonitor).(initializeAgentData):Adding group: EWT_ROG_VO_QA_CTI_01 to agent in the current map: T_QA_CTI_01
  • ASC starts loading agent configuration data to JOP:
    04:30:06.210 Dbg 09900 (JOPConnector).(upsertAgentBatch):Upsert request:
    [{lastName=Johnston, loginId=, native_id=8582230, RS_TPV_ALPCC00001=2, RS_SUP_851457=2, RS_SID_52=2, employeeId=8582230, attached_data={groupNames=[AG_LOC_ALPCC, AG_SID_52, AG_TPV_ALPCC00001, AG_SUP_851457, TP_ALP, DORENE_MOORE_851457, KAREN_PAVICIC_848751, NATASHA_MCMURRAY_890166, PETER_FINLAY_861575, COLLIN_MASON_874176, AG_TPW_CLB, DD_Not_skilled], RS_TPV_ALPCC00001=2, RS_SUP_851457=2, RS_SID_52=2, RS_LOC_ALPCC=2, RS_TPW_CLB=2}, userName=8582230, loginStatus=-1, on_call=false, skills={RS_TPV_ALPCC00001=2, RS_SUP_851457=2, RS_SID_52=2, RS_LOC_ALPCC=2, RS_TPW_CLB=2}, groupNames=[AG_LOC_ALPCC, AG_SID_52, AG_TPV_AL…
  • The agent configuration data was submitted successfully:
    04:30:58.164 Trc 09900 (JOPConnector).(upsertAgentBatch):Status ok: true// 0.0//200.0//0.0
  • ASC finished reading agent configuration data and starts to subscribe to agent login statistics:
    05:40:20.899 Trc 09900 (AgentStateMonitor).(initializeAgentData):TimeDelta: Total time to push 17641 agents to JOP=4214727
    05:40:20.906 Trc 09900 (StatServerCL).(subscribeAgents):need to subscrib agent count:17641
  • Subscription to agent login statistics is completed:
    05:40:21.635 Trc 09900 (StatServerCL).(registerEventCallbacks):TimeDelta: Took a total of 736 to register Stat Server callbacks for 17641 agents.
  • ASC startup is completed:
    05:40:21.635 Trc 09900 (AgentStateMonitor).(run):Driver main loop - starting a new iteration.

Logging Strategy Subroutines Performance

The Predictive Routing strategy subroutines write both error messages and informational messages into the URS log file and attach data to the processed interactions for reporting purposes.

A macro, PRRLog, which is called by the subroutines supplied with Genesys Predictive Routing, logs messages in the following situations:

  • No agents are returned for the skill expression.
  • There is no response from Predictive Routing within an acceptable amount of time.
  • There is an exception of some kind from the Predictive Routing scoring engine.

In addition, you can configure URS to log HTTP requests and responses in a separate file.

The PrrIxnLog subroutine captures which agent an interaction was actually routed to and which predictive model was used for scoring. This is essential to properly conduct A/B testing, which leads to improved models and predictions.

Troubleshooting the Strategy Subroutines Using the URS Log

For IRD strategy monitoring in a URS-based Predictive Routing environment, watch for the following alarm conditions (a simple log search follows this list):

  • Alarm condition 23001 in URS indicates an authentication failure when attempting to connect to the Journey Optimization Platform.
  • Alarm condition 23002 in URS indicates that the scoring engine returned an empty list of agents in response to a scoring request.
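
A quick, illustrative way to check whether either of these message IDs appears in a URS log is to search for them directly; replace the path below with your actual URS log file location:

    [root@centosbox ~]# grep -E '23001|23002' /path/to/urs_log_file.log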

Model Training Logs

For model training, retraining, and analysis jobs, the logs are located here:

/apps/geninst/logs/jobs-prod/jobs_consumer_0.log
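
To watch a training, retraining, or analysis job as it runs, you can follow this file with standard tools, for example:

    [root@centosbox ~]# tail -F /apps/geninst/logs/jobs-prod/jobs_consumer_0.log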

Monitoring Queue-Level Statistics

You can configure a Pulse dashboard to monitor queue-level statistics for the interactions processed by Predictive Routing. Templates for real-time Predictive Routing reporting are available from the Genesys Dashboard Community Center.
