Master Thesis – Creating Projected Directly Follows Graphs and Precomputing Results in Celonis

 

Description

Most commercial process mining systems (including Celonis) first produce a Directly-Follows Graph (DFG) when starting to analyze an event log. DFGs are very simple and scalable. Therefore, this is a good starting point before going to more advanced approaches. Each case corresponds to a sequence of activities, also called trace or process variant. In a DFG, activities a and b are connected if a is followed by b at least once. The corresponding arcs can be annotated with frequencies and times, e.g., activity a was 754 times directly followed by activity b and this took 2.3 hours on average. To simplify the graph, there are three types of filtering: (1) variant-based filtering, (2) arc-based filtering, and (3) activity-based filtering. The first two are straightforward, but the third one is not. One can remove activities from the graph. However, consider the event log with 10x <a,b,d> and 10x <a,c,d>. The DFG has the connections a-10->b, a-10->c, b-10->d, c-10->d. Now suppose we only want to include activities that happened at least 15 times, i.e., we want to remove b and c from the DFG. If we do this naively without creating a new log, none of the four connections survives. However, if we remove b and c from the event log, we get the new event log 20x <a,d> and the DFG has connection a-20->b. Going back to the event log is semantically clearer, but also more time-consuming.

In the example, b and c were low frequent. However, the event log may also have high-frequent activities that appear randomly. Such high-frequent activities destroy the DFG because all activities will be connected to these high-frequent random activities. These examples show that projecting the event log onto subsets of activities is a key functionality in creating DFGs.

The goal of this assignment is to speed up the computation of projected Directly-Follows Graphs (DFG). If there are 50 unique activities, then there are 2^50 = 1.13e+15 possible projected event logs to consider. Creating a DFG is linear in the size of the event logs. The question is: (1) what can be precomputed and (2) what are the interesting subsets of activities. A baseline approach is to sort activities based on frequency and remove them in order. (Note that Celonis uses a different order.) However, this may not lead to the best DFGs.

Concretely the assignment consists of the following parts:

  • Analyzing the way that DFGs are currently created in Celonis (i.e., studying PQL - Process Query Language) and testing these on several data sets.
  • Designing algorithms and data structures that shift computation from run-time to load-time.
  • Implementing these approaches in Celonis and analyzing their performance on large data sets.
  • Designing smart selection approaches to pick subsets of activities that are structured and relevant, e.g., leave out infrequent and random activities to find the underlying structures.
  • Implementing these approaches in Celonis. Here PyCelonis, a Python package that allows you to connect to Celonis from Python, may be of help to do experiments. However, the ultimate goal is to extend the engine (PQL and Saola DB).
  • Conducting experiments using event data from different systems.
  • If time allows, related topics may be investigated. For example, how can other process discovery techniques benefit from the new algorithms and data structures, and can ideas to shift computation from run-time to load-time be used in other areas?

Prerequisites

  • Data science, software engineering, and algorithmic skills to develop new approaches to compute projected DFGs faster and select proper subsets, implement a prototype solution, and evaluate the approach.
  • Programming / scripting languages
    • Python, C, and Java.
    • Celonis PQL knowledge is a plus

Pointers

  • https://celonis.github.io/pycelonis/
  • T. Vogelgesang, J. Kaufmann, D. Becher, R. Seilbeck, J. Geyer-Klingeberg, M. Klenk. A Query Language for Process Mining Celonis PQL. Process Querying Methods. Springer, 2021.
  • Wil M.P. van der Aalst, A practitioner’s guide to process mining: Limitations of the directly-follows graph, Procedia Computer Science, Volume 164, 2019, Pages 321-328, https://doi.org/10.1016/j.procs.2019.12.189.
  • Wil M. P. van der Aalst: Decomposing Petri nets for process mining: A generic approach. Distributed Parallel Databases 31(4): 471-507 (2013), https://doi.org/10.1007/s10619-013-7127-5
  • Advanced Event Log Filtering in ProM (AdvancedEventLogFiltering package in ProM's Nightly Builds).

Supervisor

prof.dr.ir. Wil van der Aalst (formal thesis supervisor)

Daily advisor(s)

TBD

Important!

This a thesis project in collaboration with Celonis. Do not approach people in Celonis directly, without first discussing this with Dr. Seran Uysal () and/or prof. Wil van der Aalst (). It is NOT allowed to organize your own thesis project in a company! We only do joint thesis projects with companies we already work with (e.g., Celonis). The project needs to be defined in the context of this collaboration to ensure good supervision and avoid later confusion.

Since external assignments are particularly challenging, we will only allow good students to do this. Concurrently meeting the expectations of Celonis and ensuring a good academic level will be demanding. Hence, we only consider students with an average of 2.0 or better for such assignments.

If you are interested, make sure to include detailed information about your background (including a detailed CV), scores for completed courses, and your motivation to do this project. After an initial screening from the RWTH side, we will connect you to the person responsible on the Celonis side.

About Celonis

Celonis believes that every company can unlock its full execution capacity using process mining. The company has grown 5,000% in 4 years and 300% in the past year. Celonis now counts as a Decacorn, having raised $1 billion in our most recent funding round in June 2021, valuing the company at more than $11 billion. Powered by its market-leading process mining core, the Celonis Execution Management System provides a set of applications and developer studio and platform capabilities for business executives and users to eliminate billions in corporate inefficiencies. Celonis has thousands of customers, including ABB, AstraZeneca, Bosch, Coca-Cola, Citibank, Danaher Corporation, Dell, GSK, John Deere, L’Oréal, Siemens, Uber, Vodafone and Whirlpool. Celonis is headquartered in Munich, Germany and New York City, USA and has 15 offices worldwide.

  Celonis Logo