Bachelor Thesis - Big Data Process Mining in Python
Most organizations, in a variety of fields such as banking, insurance, and healthcare, execute several different (business) processes. Modern information systems allow us to track, store, and retrieve the data related to the execution of such processes in the form of event logs. The field of process mining is concerned with the study and analysis of the data stored in such event logs. The main goal of process mining is to improve an organization's knowledge of its own processes by analyzing their execution as captured in the data. In this way, the organization gains insights into the process on the basis of data describing what actually happened.
Within process mining we identify three main topics. Firstly, in process discovery, we aim to discover a process model that accurately describes the process of the organization, based solely on the data observed in an event log. The main challenge of process discovery is to discover process models that are human-interpretable, i.e. in some way simple, yet at the same time general enough to describe the data. Secondly, in conformance checking, we aim to assess to what degree a given process model and event data correspond to each other. Observe that conformance checking allows us to identify whether the process is actually executed as intended. Finally, in process enhancement, we aim to enhance the overall model we have of the process, e.g. by computing where bottlenecks occur, finding out which data elements determine decision points in the process, predicting the remaining process execution time, etc.
The majority of process mining algorithms developed in academia have a corresponding implementation in either ProM or Apromore. Both tools are Java-based and built on a relatively complex underlying architectural framework. On the one hand, these tools provide a very modular ecosystem in which to develop process mining algorithms. On the other hand, the complexity of the underlying architectures often hampers the quick adoption of prototypes and the fast development of new algorithms. Furthermore, these tools do not embrace the big data world (e.g. Hadoop, Spark) and are not able to handle huge amounts of data.
Therefore, the PADS Chair of RWTH, in cooperation with the process mining group of Fraunhofer FIT, has been building a process mining library in Python. This library aims at seamless integration with any kind of database and technology. In particular, we are looking for Bachelor students who are interested in developing the big data connectors for PM4Py, with a particular focus on the Spark ecosystem:
- Importing CSV files into Apache Spark
- Calculating the Directly-Follows Graph efficiently on top of Apache Spark dataframes
- Managing filtering operations (timeframe, attributes, endpoints, paths, variants) on top of Apache Spark
- Using Apache Spark's graph and ML capabilities for process mining applications
- Managing XES (XML) parsing using Apache Spark
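To illustrate the kind of computation the second task involves, the sketch below computes a Directly-Follows Graph (DFG) from a toy event log in plain Python; the event log, its column layout, and the helper function are hypothetical examples, not part of PM4Py. On Spark dataframes, the same idea maps naturally to a window partitioned by case id and ordered by timestamp, using `lead` to pair each activity with its successor, followed by a `groupBy(...).count()`.

```python
from collections import Counter

def directly_follows_graph(events):
    """Compute the Directly-Follows Graph of an event log.

    `events` is a list of (case_id, activity, timestamp) tuples.
    The DFG counts, for each ordered pair of activities (a, b),
    how often b directly follows a within the same case.
    """
    # Group events by case, keeping (timestamp, activity) pairs.
    by_case = {}
    for case_id, activity, ts in events:
        by_case.setdefault(case_id, []).append((ts, activity))

    dfg = Counter()
    for trace in by_case.values():
        trace.sort()  # order events within a case by timestamp
        # Count each consecutive pair of activities in the trace.
        for (_, a), (_, b) in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

# Tiny hypothetical event log: two cases of a request-handling process.
log = [
    ("c1", "register", 1), ("c1", "check", 2), ("c1", "approve", 3),
    ("c2", "register", 1), ("c2", "check", 2), ("c2", "reject", 3),
]
print(directly_follows_graph(log))
# ("register", "check") is counted twice (once per case); the two
# final transitions ("check", "approve") and ("check", "reject") once each.
```

Doing this efficiently on Spark means avoiding the per-case grouping in driver memory, which is exactly why the window/`lead` formulation matters for large logs.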
Knowledge of basic computer science concepts, good programming skills (Java/Python), and an interest in the theoretical and practical aspects of process mining (e.g. conformance checking) are recommended.
Sebastiaan van Zelst
Process Mining Book: Data Science in Action
Coursera: Process Mining Course
PM4Py documentation
Apache Spark documentation
Prof. Wil van der Aalst
Alessandro Berti (primary advisor)
Sebastiaan van Zelst (secondary advisor)