Bachelor Thesis - Automatic Parallelization of Process Mining computations on a Scientific Workflow Management System
Description:
Process mining is a set of techniques (including process discovery and conformance checking) that exploit the event data recorded by the information systems supporting the execution of organizational processes. Recently, scripting libraries in Python and R have been proposed to make process mining more user-friendly. With these libraries, complex process mining pipelines are encoded as sequential scripts that execute the required operations one after another. An example follows:
- The log is read from the disk
- Then, a sound workflow net is discovered from the event log with the Inductive Miner algorithm (requiring the output of the previous operation)
- The token-based replay algorithm is applied between the event log and the discovered process model (requiring both the event log and the discovered model)
In this example, every operation depends on the output of the previous one, so the operations must be performed sequentially and no parallelization is possible. Parallelization is possible, however, in the following more advanced script:
- The event log is read from the disk
- For each of the noise thresholds 0%, 10%, and 20%:
  - A sound workflow net is discovered from the event log with the given noise threshold.
  - The token-based replay algorithm is applied between the event log and the discovered process model.
  - The alignments algorithm is applied between the event log and the discovered process model.
Here, the same computations are repeated independently for the different noise thresholds. Moreover, within the computations performed for a given noise threshold, token-based replay and alignments are independent of each other and can therefore run in parallel. This improves performance when the operations are non-trivial and several cores are available.
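The structure of this script can be sketched in Python with the standard library alone. The functions `discover_model`, `token_based_replay`, and `alignments` are hypothetical placeholders that only return labels; they stand in for the corresponding library calls and are not an actual process mining API. Two thread pools are used so that tasks waiting for results never occupy the workers needed to compute them.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real process mining operations.
def discover_model(log, noise_threshold):
    return f"model(noise={noise_threshold})"

def token_based_replay(log, model):
    return f"tbr[{model}]"

def alignments(log, model):
    return f"align[{model}]"

def evaluate_threshold(log, noise_threshold, inner_pool):
    # Discovery must finish before either conformance check can start.
    model = discover_model(log, noise_threshold)
    # The two checks are independent of each other: submit both at once.
    tbr_future = inner_pool.submit(token_based_replay, log, model)
    align_future = inner_pool.submit(alignments, log, model)
    return noise_threshold, tbr_future.result(), align_future.result()

def run_pipeline(log, thresholds=(0.0, 0.1, 0.2)):
    # One outer worker per threshold, two inner workers per threshold.
    with ThreadPoolExecutor(max_workers=len(thresholds)) as outer:
        with ThreadPoolExecutor(max_workers=2 * len(thresholds)) as inner:
            futures = [
                outer.submit(evaluate_threshold, log, t, inner)
                for t in thresholds
            ]
            return [f.result() for f in futures]

results = run_pipeline("event-log-placeholder")
```

Threads only illustrate the dependency structure here. For CPU-bound operations in CPython, separate processes or separate cluster jobs would be needed to obtain an actual speed-up.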
The goal of the thesis is to take as input a process mining script that executes its operations sequentially, and to:
- Convert it to a directed acyclic graph (DAG) representing the execution of a scientific workflow.
- Execute the scientific workflow as jobs on the SLURM system of RWTH Aachen University.
In this way, a “sequential” script becomes a collection of scripts with interdependencies that perform the same operations concurrently wherever the dependencies allow; the scripts are then executed on the SLURM system and the results are collected.
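As an illustration, the noise-threshold script can be written down by hand as a DAG and grouped into "waves" of tasks that may run concurrently. The task names are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library `graphlib` module; deriving the dependencies automatically from a script is the actual subject of the thesis and is not shown here.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it needs.
dependencies = {"read_log": set()}
for noise in (0, 10, 20):
    dependencies[f"discover_{noise}"] = {"read_log"}
    dependencies[f"replay_{noise}"] = {"read_log", f"discover_{noise}"}
    dependencies[f"align_{noise}"] = {"read_log", f"discover_{noise}"}

def execution_waves(deps):
    """Group tasks into waves; all tasks within one wave are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if deps is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

For this DAG the log is read first, the three discoveries form the second wave, and the six conformance checks form the third.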
For analytical purposes, a further requirement is to collect an event log describing the execution of the implemented workflows and to analyze it with process mining techniques. Different tools can be used for this purpose (e.g., ProM, pm4py, Celonis).
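One possible way to record such an execution log is a CSV file with one row per task event. The sketch below is an assumption about how this could look, not a prescribed format: the column names follow common XES attribute keys, and the run and task identifiers are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed column names, modelled on common XES attribute keys.
FIELDS = [
    "case:concept:name",     # one case per workflow run
    "concept:name",          # the executed task
    "lifecycle:transition",  # "start" or "complete"
    "time:timestamp",
]

class ExecutionLogger:
    """Writes one CSV row per task event to the given text stream."""

    def __init__(self, stream):
        self._writer = csv.DictWriter(stream, fieldnames=FIELDS)
        self._writer.writeheader()

    def record(self, run_id, task, transition):
        self._writer.writerow({
            "case:concept:name": run_id,
            "concept:name": task,
            "lifecycle:transition": transition,
            "time:timestamp": datetime.now(timezone.utc).isoformat(),
        })

# An in-memory buffer is used here; a real run would write to a file.
buffer = io.StringIO()
logger = ExecutionLogger(buffer)
logger.record("run-1", "read_log", "start")
logger.record("run-1", "read_log", "complete")
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```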
The student should limit the scope of the thesis as follows:
- A scripting language and, with it, a process mining library should be chosen (e.g., pm4py for Python or bupaR for R).
- Only a subset of the language's syntax should be targeted for the conversion. The conditions for detecting the input/output variables of a method, and the code blocks to be supported (e.g., IF/WHILE/FOR/probabilistic executions), should be identified during the thesis.
- The output should be executable on the SLURM system, and support for other workflow management systems is not required.
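A minimal sketch of how task dependencies could be expressed for SLURM: the function below generates a shell script that submits one job per task and chains them with `sbatch --dependency=afterok:...`, capturing job IDs via `--parsable`. The task names and the per-task job scripts (`read_log.sh` and so on) are hypothetical, and the sketch assumes a single-cluster setup in which `--parsable` prints only the job ID.

```python
def submission_script(deps):
    """Build a bash script submitting one SLURM job per task.

    `deps` maps each task to the tasks it must wait for and must list
    every task after its prerequisites (dicts keep insertion order).
    """
    lines = ["#!/bin/bash", "set -e"]
    for task, parents in deps.items():
        dep_flag = ""
        if parents:
            ids = ":".join(f"${{jid_{p}}}" for p in sorted(parents))
            # afterok: start only once all listed jobs finished successfully
            dep_flag = f" --dependency=afterok:{ids}"
        lines.append(f"jid_{task}=$(sbatch --parsable{dep_flag} {task}.sh)")
    return "\n".join(lines)

# Hypothetical plan for a single noise threshold. The conformance checks
# wait for discovery, which in turn waits for the log to be read.
plan = {
    "read_log": [],
    "discover_10": ["read_log"],
    "replay_10": ["discover_10"],
    "align_10": ["discover_10"],
}
script = submission_script(plan)
print(script)
```

Since `replay_10` and `align_10` depend only on `discover_10`, the scheduler is free to run them at the same time once discovery has succeeded.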
Different types of code-based techniques can be used to reach the goals of the thesis, including:
- Operating on the textual content/indentation of the script
- Static analysis techniques
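As an example of the static analysis route, Python's standard `ast` module can report which variables each top-level statement reads and writes, from which read-after-write dependencies follow. This is only a sketch under simplifying assumptions: it ignores in-place mutation, control-flow blocks, and side effects such as file access, and the function names in the sample script are hypothetical.

```python
import ast

def statement_dependencies(source):
    """For each top-level statement, list the indices of earlier
    statements that last assigned a variable it reads."""
    tree = ast.parse(source)
    last_writer = {}  # variable name -> index of the last assigning statement
    deps = []
    for index, stmt in enumerate(tree.body):
        reads, writes = set(), set()
        for node in ast.walk(stmt):
            if isinstance(node, ast.Name):
                if isinstance(node.ctx, ast.Store):
                    writes.add(node.id)
                else:
                    reads.add(node.id)
        deps.append(sorted({last_writer[n] for n in reads if n in last_writer}))
        for name in writes:
            last_writer[name] = index
    return deps

# The script is only parsed, never executed, so the called functions
# do not need to exist.
sample = """
log = read_log("log.xes")
model = discover(log)
fitness = replay(log, model)
alignment = align(log, model)
"""
deps = statement_dependencies(sample)
print(deps)
```

In the result, the last two statements depend on statements 0 and 1 but not on each other, which is exactly the information needed to schedule them in parallel.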
Prerequisites:
The candidate will be chosen among BSc students with:
- Good knowledge of Python (obtained, for example, in the SPP “Introduction to Process Discovery using Python” or “Introduction to Process Conformance Checking using Python”).
- Good knowledge of Linux and shell scripting.
- Basic knowledge of process mining concepts (event logs, process models, process discovery, conformance checking).
Pointers:
- RWTH High Performance Computing (Linux) https://help.itc.rwth-aachen.de/en/service/rhr4fjjutttf/
- PM4Py documentation https://pm4py.fit.fraunhofer.de/docs
- Python tutorial https://docs.python.org/3/tutorial/
- Python interactive tutorial https://www.learnpython.org/
- Process Mining: Data Science in Action https://www.academia.edu/40551325/Process_Mining_Wil_van_der_Aalst_Data_Science_in_Action_Second_Edition
Supervisor:
Prof.dr.ir. Wil van der Aalst
For more information:
Send an email to the following addresses, making sure to include detailed information about your background and scores for completed courses: