Bachelor Thesis - Process Mining on the Cloud (using Python)
Most organizations, in a variety of fields such as banking, insurance and healthcare, execute several different (business) processes. Modern information systems allow us to track, store and retrieve the data related to the execution of such processes, in the form of event logs. The field of process mining is concerned with the study and analysis of the data stored in such event logs. The main goal of process mining is to improve the organization’s knowledge of its own processes by analyzing their execution, as captured in the data. In this way, the organization gains insights into its processes on the basis of data describing what actually happened.
Within process mining we identify three main topics. Firstly, in process discovery, we aim to discover a process model that accurately describes the process of the organization, solely based on the data observed in an event log. The main challenge of process discovery is to discover process models that are human-interpretable, i.e. in some way simple, yet at the same time general enough in describing the data. Secondly, in conformance checking, we aim to assess to what degree a given process model and event data correspond to each other. Conformance checking thus allows us to identify whether the process is actually executed as intended. Finally, in process enhancement, we aim to enhance the overall model we have of the process, e.g. by computing where bottlenecks occur, finding out which data elements determine decision points in the process, or predicting the remaining process execution time.
The majority of process mining algorithms developed in academia have a corresponding implementation in either ProM or Apromore. Both tools are Java-based and built on a relatively complex underlying architecture. On the one hand, these tools provide a very modular ecosystem to develop process mining algorithms. On the other hand, the complexity of the underlying architectures often hampers the quick adoption of prototypes and the fast development of new algorithms. Furthermore, these tools do not embrace the most recent technologies in databases (e.g. NoSQL, columnar).
Therefore, the PADS Chair of RWTH, in cooperation with the process mining group of Fraunhofer FIT, has been building a process mining library in Python. This library aims at seamless integration with any kind of database and technology. In particular, we are looking for Bachelor students who are interested in integrating process mining techniques with cutting-edge cloud technology (e.g. Google Cloud, Amazon, Azure …).
- Find the best way to upload a process mining event log to the cloud.
- Find the best way to compute the Directly-Follows Graph on the cloud in a scalable way.
- Perform filtering operations on the log (e.g. attributes, variants, endpoints), staying on the cloud.
- Deploy PM4Py as a Docker container on the cloud.
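To make two of the tasks above more concrete, the following is a minimal sketch in plain Python of what computing a Directly-Follows Graph and filtering a log amount to. It does not use PM4Py or any cloud API. The toy log, the (case id, activity, timestamp) layout and the function names are assumptions made purely for illustration. In a cloud setting, the per-case grouping and counting would presumably be pushed down to the database or processing engine instead of being done in memory.

```python
from collections import Counter

# Hypothetical toy event log: one (case id, activity, timestamp) tuple per event.
# The values are illustrative only and not taken from a real data set.
EVENTS = [
    ("c1", "register", 1),
    ("c1", "check", 2),
    ("c1", "pay", 3),
    ("c2", "register", 1),
    ("c2", "pay", 2),
    ("c3", "register", 1),
    ("c3", "check", 2),
    ("c3", "pay", 3),
]


def traces(events):
    """Group events by case and order each case by timestamp."""
    by_case = {}
    for case_id, activity, _ts in sorted(events, key=lambda e: (e[0], e[2])):
        by_case.setdefault(case_id, []).append(activity)
    return by_case


def directly_follows_graph(events):
    """Count how often activity a is directly followed by activity b in a case."""
    dfg = Counter()
    for trace in traces(events).values():
        # zip(trace, trace[1:]) yields all consecutive activity pairs.
        dfg.update(zip(trace, trace[1:]))
    return dfg


def filter_cases_with_activity(events, activity):
    """Keep all events of those cases that contain the given activity."""
    keep = {case_id for case_id, act, _ts in events if act == activity}
    return [e for e in events if e[0] in keep]


if __name__ == "__main__":
    for (a, b), count in sorted(directly_follows_graph(EVENTS).items()):
        print(f"{a} -> {b}: {count}")
```

The counting step only needs, for each event, the next event of the same case. This is why the computation can in principle be distributed per case, which is one possible starting point for the scalability question raised in the task list.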
Knowledge of basic computer science concepts, good programming skills (Java/Python) and an interest in theoretical and practical aspects of process mining (e.g. conformance checking) are recommended.
- Process Mining Book
- Coursera Process Mining Course
- PM4Py documentation
- Uploading objects
- Usage of Google BigQuery
Prof. Wil van der Aalst
Dr. Sebastiaan van Zelst
Dr. Seran Uysal