How to use the Workbench 8.5 GDPR Data Purge Utility
The Workbench 8.5 GDPR Data Purge Utility (DPU) deletes data older than a set number of days from the Workbench Cassandra database tables and from the Workbench log files. It is a standalone Java application named DataPurgeUtility.jar and must be run from the command line.
DPU provides two functions in one tool:
- To delete any records older than 30 days, from all main Cassandra DB tables, and
- To delete all Workbench log files older than 30 days.
Steps to follow when running Data Purge Utility:
1. Stop the Workbench application and confirm that all the processes on the Workbench server have terminated. On Windows, check Task Manager; on Linux, run ps -aelf | grep java. In both cases, make sure that the java processes for karaf and Cassandra are no longer listed. If they still persist, manually kill them using the following instructions:
   a. On Windows:
      - Start Task Manager
      - Locate the java process for Cassandra or karaf
      - Right-click it and select End Process
   b. On Linux:
      - Run ps -aelf | grep java
      - Identify the process ID for karaf or Cassandra and then execute: kill -9 <process id>
2. Start Cassandra:
   a. On Windows:
      - Open a command prompt at <WORKBENCH_INSTALLATION_FOLDER>
      - Run: cassandra\bin\cassandra.bat
   b. On Linux:
      - cd to <WORKBENCH_INSTALLATION_FOLDER>
      - Run: cassandra/bin/cassandra
3. Run the Data Purge Utility and wait until it completes its execution.
4. Stop Cassandra manually by following 1.a or 1.b above.
5. Start Workbench.
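The Linux check from the steps above (ps -aelf | grep java) can be scripted so that it only reports the processes that matter. The following is a minimal sketch, not part of the DPU: the function names workbench_java_procs and workbench_stopped are hypothetical, and it assumes that the karaf and Cassandra command lines contain the words "karaf" and "cassandra".

```shell
# Hypothetical helper (not shipped with Workbench): reads `ps` output on
# stdin and prints only the java processes that mention karaf or cassandra.
# Assumption: the process command lines contain those names.
workbench_java_procs() {
  grep -i 'java' | grep -Ei 'karaf|cassandra'
}

# Reads `ps` output on stdin. Returns 0 when no karaf/Cassandra java
# process is listed, 1 when at least one is still present.
workbench_stopped() {
  if workbench_java_procs >/dev/null; then
    return 1
  fi
  return 0
}
```

For example, ps -aelf | workbench_java_procs lists any remaining processes, and ps -aelf | workbench_stopped && echo "OK to continue" prints the message only when none remain.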
Below is an example of the syntax to be used. This command should be run from a command prompt in the directory of the DPU tool:
java -jar DataPurgeUtility.jar -h 127.0.0.1 -p 9042 -d 30 -f table_specs.txt -log_path "<WORKBENCH_INSTALLATION_DIRECTORY>" -cassandra_bin_path c:\GCTI\WB_server\cassandra\bin -perform_compaction -cassandra_jmx_port 17199
This command deletes all database records and Workbench log files older than 30 days and performs database compaction.
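The example above uses a Windows path for -cassandra_bin_path. On Linux it can help to print the assembled command and review it before running it. The sketch below does only that; dpu_command is a hypothetical helper, the install folder passed to it stands in for <WORKBENCH_INSTALLATION_FOLDER>, and it assumes the Cassandra utilities live in cassandra/bin under that folder, as in the startup steps. All flags and values are taken from the example above.

```shell
# Hypothetical helper: print (do not run) the DPU command for a given
# Workbench install folder and retention period in days.
# Usage: dpu_command INSTALL_FOLDER DAYS
# Assumption: Cassandra's bin folder is INSTALL_FOLDER/cassandra/bin.
dpu_command() {
  printf 'java -jar DataPurgeUtility.jar -h 127.0.0.1 -p 9042 -d %s -f table_specs.txt -log_path "%s" -cassandra_bin_path %s/cassandra/bin -perform_compaction -cassandra_jmx_port 17199\n' \
    "$2" "$1" "$1"
}
```

Calling dpu_command /opt/workbench 30 (where /opt/workbench is a made-up path) prints the full command line with -d 30 and the Linux-style bin path, ready to be checked and then run from the DPU directory.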
Following are descriptions of the parameters for this utility.
- -h <ip>: The IP address of the host where the Cassandra database server is running. Typically, this is the same as the Workbench host.
- -p <port>: The port used by the Cassandra database server. This is the value of the Database Transport Port set during Workbench installation.
- -d <# of days>: The retention period in days. Table records older than the specified number of days will be deleted.
- To examine and remove Cassandra table data, use one of the following two sets of options:
  - For an individual table, that is, when data from only one table needs to be examined:
    - -t <tablename>: The name of the table from which data will be removed if it satisfies the criteria.
    - -c <columnname>: The name of the column that holds the timestamp.
    - -a <additional key names>: The name(s) of the other columns that are part of the primary key, up to the column that holds the timestamp.
    - -pk <0 or 1>: Whether the column that holds the timestamp is part of the partition key. 0 indicates No and 1 indicates Yes.
    - -prk <0 or 1>: Whether the column that holds the timestamp is part of the primary key. 0 indicates No and 1 indicates Yes.
  - For more than one table:
    - -f <filename>: Full path of a file that contains one line per table. Each line contains the five parameters explained above for that table. If this option is specified, the five single-table options above (-t, -c, -a, -pk, -prk) are ignored even if they are also specified. (See File Format section below for details.)
- -set_ttl: If present, the Time-To-Live property (in seconds) corresponding to the specified number of days is set for the tables.
- -tombstone_seconds <integer value>: Optional. The value to use for the gc_grace_seconds Cassandra table property. The property is temporarily set to the specified value and restored to the default value of 8640000 after the data is deleted. Use this option if an error is encountered during data deletion; it helps work around failures caused by exceeding the maximum allowed number of tombstone entries. See Troubleshooting for more details.
- -cassandra_bin_path <full path>: The full path of the bin folder where the Cassandra utilities are located.
- -perform_compaction: If present, nodetool compact from the Cassandra bin folder is executed at the end of the operation.
- -cassandra_jmx_port: The JMX port that was used when installing Workbench. If not specified, the default value of 7199 is used when invoking the nodetool command.
- -log_path <full path>: Optional. The full (not relative) path to a folder. Any file in that folder with a last-modified timestamp older than the specified number of days will be removed. This option can be used to remove Workbench logs older than the specified number of days.
- -recursive: If this flag is specified, the utility also looks for log files in sub-directories when deleting log files.
- -log_file_extensions <extensions>: A comma-separated list of file extensions that are considered to be log files. When deleting log files, the utility looks for files with these extensions, and their rotated files, and deletes them only if they are older than the specified number of days.
Cassandra Database Table Deletion Parameters
For example, to perform Cassandra compaction and set the TTL (time-to-live):
java -jar DataPurgeUtility.jar -h 126.96.36.199 -p 9042 -cassandra_bin_path c:\GCTI\WB_server\cassandra\bin -perform_compaction -cassandra_jmx_port 17199 -set_ttl
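The Time-To-Live set by -set_ttl is expressed in seconds, while the retention period is given in days. As a quick sanity check of the expected value, the conversion is plain arithmetic (one day is 86400 seconds). The sketch below is not DPU code, and ttl_seconds is a hypothetical name; it assumes the utility converts days to seconds in this straightforward way.

```shell
# Convert a retention period in days to seconds (1 day = 86400 s).
# Usage: ttl_seconds DAYS
ttl_seconds() {
  echo $(( $1 * 86400 ))
}

ttl_seconds 30   # prints 2592000
```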
Cassandra Compaction Parameters
Using these compaction options is recommended to reduce Cassandra's disk usage, but it is not mandatory.
Log File Deletion Parameters
For example, -log_file_extensions log looks for all *.log and *.log.<count> files for deletion. The *.log.<count> files are rotated files (for example, def.log.9).
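Before running the utility on Linux, it can be useful to preview which files such a setting is likely to cover. The sketch below approximates the match with find; it is not the DPU's actual logic. list_old_logs is a hypothetical helper, it handles a single extension and does not descend into sub-directories, and find's -mtime +N counts whole 24-hour periods, so its cutoff may differ slightly from the utility's.

```shell
# Hypothetical preview helper: list files directly under DIR that are named
# *.EXT or *.EXT.<count> and were last modified more than DAYS days ago.
# Usage: list_old_logs DIR DAYS EXT
list_old_logs() {
  find "$1" -maxdepth 1 -type f \
    \( -name "*.$3" -o -name "*.$3.[0-9]*" \) \
    -mtime +"$2" | sort
}
```

For example, list_old_logs "<WORKBENCH_INSTALLATION_FOLDER>" 30 log only prints the matching file names; it does not delete anything.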