Bachelor Thesis - GPU-enabled Process Mining in Python
Most organizations, in a variety of fields such as banking, insurance and healthcare, execute several different (business) processes. Modern information systems allow us to track, store and retrieve the data related to the execution of such processes, in the form of event logs. The field of process mining is concerned with the study and analysis of the data stored in such event logs. The main goal of process mining is to improve an organization's knowledge of its own processes by analyzing their execution, as captured in the data. In this way, the organization gains insights into its processes on the basis of data describing what actually happened.
Within process mining we identify three main topics. Firstly, in process discovery, we aim to discover a process model that accurately describes the process of the organization, solely based on the data observed in an event log. The main challenge of process discovery is to discover process models that are human-interpretable, i.e. in some way simple, yet at the same time general enough to describe the data. Secondly, in conformance checking, we aim to assess to what degree a given process model and event data correspond to each other. Note that conformance checking allows us to identify whether the process is actually executed as intended. Finally, in process enhancement, we aim to enhance the overall model we have of the process, e.g. by computing where bottlenecks occur, finding out which data elements determine decision points in the process, or predicting the remaining process execution time.
The majority of process mining algorithms developed in academia have a corresponding implementation in either ProM or Apromore. Both tools are Java-based and are built on a relatively complex underlying architecture. On the one hand, these tools provide a very modular ecosystem for developing process mining algorithms. On the other hand, the complexity of the underlying architectures often hampers the quick adoption of prototypes and the fast development of new algorithms. Furthermore, these tools do not embrace the most recent database technologies (e.g. NoSQL and columnar databases).
Therefore, the PADS Chair of RWTH, in cooperation with the process mining group of Fraunhofer FIT, has been building PM4Py, a process mining library in Python. This library aims at seamless integration with any kind of database and technology. In particular, we are looking for Bachelor students who are interested in integrating process mining techniques with cutting-edge GPU technology. PM4Py already has a GPU spin-off, named PM4PYGPU, that computes the Directly-Follows Graph (DFG) on the GPU, although with several limitations.
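To make the Directly-Follows Graph concrete: it counts, for every pair of activities, how often the second activity immediately follows the first within the same case. The sketch below computes these counts with column-wise DataFrame operations in pandas on the CPU. It is an illustration only and not the PM4PYGPU implementation; the function name and the column names `case_id`, `activity` and `timestamp` are assumptions made for this example. GPU DataFrame libraries such as cuDF offer a pandas-like API, so a similar formulation may carry over, but that is not verified here.

```python
import pandas as pd


def directly_follows_graph(log, case_col="case_id", act_col="activity",
                           ts_col="timestamp"):
    """Count directly-follows pairs in an event log.

    Column names are hypothetical defaults chosen for this sketch.
    Returns a dict mapping (source, target) to the number of occurrences.
    """
    # Order events by case and, within each case, by time.
    ordered = log.sort_values([case_col, ts_col])
    # For every event, look up the next activity in the same case.
    successor = ordered.groupby(case_col)[act_col].shift(-1)
    pairs = pd.DataFrame({"source": ordered[act_col], "target": successor})
    # The last event of each case has no successor; drop those rows.
    pairs = pairs.dropna(subset=["target"])
    counts = pairs.groupby(["source", "target"]).size()
    return {(s, t): int(n) for (s, t), n in counts.items()}


# A tiny, made-up event log with three cases (rows deliberately unordered).
example_log = pd.DataFrame({
    "case_id":   [1, 2, 1, 3, 1, 2, 3, 3],
    "activity":  ["a", "a", "b", "a", "c", "c", "b", "c"],
    "timestamp": [1, 1, 2, 1, 3, 2, 2, 3],
})
example_dfg = directly_follows_graph(example_log)
```

The whole computation consists of a sort, a grouped shift and a grouped count over a few columns, which is the kind of bulk, data-parallel work that the tasks below ask to be evaluated on a GPU.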
- Evaluate GPU databases to find the best choice for process mining (e.g. support for filtering and for computing the DFG)
- Understand why and where a GPU performs better than a CPU
- Evaluate the advantages of columnar formats (in particular, why they are GPU-friendly)
- Evaluate GPU clustering algorithms (cuML)
- Evaluate graph algorithms on GPU
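As a small illustration of the filtering and columnar-format points in the list above, the following sketch keeps only the cases that contain a given activity. It is written with pandas on the CPU; the function name and the column names `case_id` and `activity` are assumptions made for this example, not part of any existing library. The point is that the filter reads only two columns of the log, which is the access pattern that columnar layouts are designed to serve.

```python
import pandas as pd


def filter_cases_containing(log, activity, case_col="case_id",
                            act_col="activity"):
    """Keep all events of the cases in which `activity` occurs.

    Column names are hypothetical defaults chosen for this sketch.
    """
    # Step 1: scan a single column to find the matching events.
    matches = log[act_col] == activity
    # Step 2: collect the identifiers of the cases they belong to.
    matching_cases = log.loc[matches, case_col].unique()
    # Step 3: select every event of those cases.
    return log[log[case_col].isin(matching_cases)]


# A tiny, made-up event log: cases 1 and 3 contain "b", case 2 does not.
sample_log = pd.DataFrame({
    "case_id":  [1, 1, 1, 2, 2, 3, 3, 3],
    "activity": ["a", "b", "c", "a", "c", "a", "b", "c"],
})
filtered_log = filter_cases_containing(sample_log, "b")
```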
Knowledge of basic computer science concepts, good programming skills (Java/Python) and an interest in theoretical and practical aspects of process mining (e.g. conformance checking) are recommended.
- Process Mining Book
- Coursera Process Mining Course
- PM4Py documentation
- NVIDIA Rapids
- RAPIDS Open GPU Data Science
- Apache Arrow
Prof. Wil van der Aalst
Dr. Sebastiaan van Zelst
Dr. Seran Uysal