A Framework for RCA of SLA Violations in Processes with Inter-Case Dynamics

Contact

Mahsa Bafrani
Research Associate

Phone: +49 241 80 21906


Description (Joint master thesis with ServiceNow)

A Service Level Agreement (SLA) is a commitment between a service provider and a client. In business processes, an SLA often defines the maximum time to resolve a case, per process and/or case priority. For example, an SLA can allow no more than 8 hours for resolving a case with priority High, and no more than 48 hours for resolving a case with priority Normal. The number of SLA violations is an important metric for process analysis because systematic SLA violations can indicate process inefficiencies. Their Root Cause Analysis (RCA) can reveal opportunities for process optimization.
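The priority-based SLA described above can be sketched as a small lookup plus a deadline check. This is only an illustration: the table `SLA_LIMITS`, the function names, and the dates are assumptions, and only the 8-hour and 48-hour limits come from the example in the text.

```python
from datetime import datetime, timedelta

# Hypothetical SLA table taken from the example in the text:
# priority -> maximum allowed resolution time.
SLA_LIMITS = {
    "High": timedelta(hours=8),
    "Normal": timedelta(hours=48),
}


def sla_deadline(created: datetime, priority: str) -> datetime:
    """Return the point in time at which the SLA of a case is breached."""
    return created + SLA_LIMITS[priority]


def is_sla_violated(created: datetime, resolved: datetime, priority: str) -> bool:
    """A case violates its SLA if it is resolved after its deadline."""
    return resolved > sla_deadline(created, priority)


# Made-up creation time, used only for illustration.
created = datetime(2024, 1, 1, 9, 0)
print(is_sla_violated(created, datetime(2024, 1, 1, 16, 0), "High"))    # resolved after 7 h
print(is_sla_violated(created, datetime(2024, 1, 1, 18, 0), "High"))    # resolved after 9 h
print(is_sla_violated(created, datetime(2024, 1, 2, 18, 0), "Normal"))  # resolved after 33 h
```

A real SLA function may also depend on the process, on business calendars, or on pause conditions such as on-hold periods, none of which are modeled here.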

Most commercial process mining systems focus on case characteristics when identifying SLA violation Root Causes (RCs), i.e., they consider a case of interest in isolation. However, in practice, delays leading to SLA violations are often caused by other cases of the same or related processes. For example, cases can compete for limited shared resources, or wait for the completion of cases of other processes. As a result, the SLA violation RCA of a case in isolation can neither reveal these factors nor infer correct results, so an approach that considers all related cases and processes together should be used instead.


Figure 1. Incident I1 interacts with the instances of the same and other processes.

Example. Let us consider an SLA violation example in the IT Service Management (ITSM) process. Customer Mary bought a new smartphone manufactured by vendor X. When she started using a headset Y for watching movies on this smartphone, the smartphone started crashing. Mary posted a ticket to the X technical support. At X, such tickets are processed by the ITSM process. For Mary’s ticket, an incident I1 of the Incident process (within the ITSM process) was created at time t1 with a default priority Normal and assigned to assignment group A, which handles smartphone-related issues (node Created in Figure 1). Later, at time t2, John (assignment group A) became available, picked it up, and started working (node WIP (Work-in-Progress)). When he realized it was a compatibility issue, he reassigned the incident to assignment group B, which handles this issue type (Reassign at t3). When Mike (assignment group B) became available, he picked it up for investigation (WIP at t4). Mike needed the same model of the smartphone and headset as Mary to reproduce the issue, so he requested them from the IT department. For that, he created two tasks, T1 (for the smartphone) and T2 (for the headset), within the Incident task process (ITSM process), and put incident I1 on hold (t5) until the equipment was delivered. In the IT department, two different resources handled these tasks. Kate provided the smartphone faster (at t6) than Jack provided the headset (at t8) because Jack was overloaded with other orders (which are not shown in Figure 1). In the meantime, Mike picked up another incident, I2, with a higher priority High at t7. As a result, he could not resume work on I1 when the equipment was delivered but had to complete I2 first (t9). Afterward, he picked up I1 and resolved it at t10. However, the SLA for I1 was breached earlier (at tSLA). That happened due to the following factors:

  1. Incident I1 was not assigned to the right group initially, so time (t2-t1) was wasted waiting for an available resource of assignment group A.
  2. Time (t4-t3) was spent waiting for an available resource of assignment group B.
  3. Assuming that the average time for equipment delivery is (t6-t5), extra time (t8-t6) was spent due to a higher load on Jack.
  4. A higher-priority incident I2 was handled first and delayed I1 execution by (t9-t8).

Assuming some real values for t1-t10, the final RC can be formulated as follows. The SLA for I1 was violated

  • due to wrong initial assignment (the impact on the SLA breach is 45%),
  • due to longer execution of T2, caused by the load peak on Jack, which caused an extra dependency on incident I2 with a higher priority (the total impact on the SLA breach is 55%).

Note that none of factors 1-4 alone caused the SLA breach; they caused it together.
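To make the arithmetic behind factors 1-4 concrete, the following sketch computes each delay from the timestamps and expresses it as a share of the total avoidable delay. All timestamp values are hypothetical, and the proportional split is only one possible attribution rule. It is not the method that produced the 45% and 55% figures above, which also group the factors differently.

```python
# Hypothetical timestamps t1..t10 in hours since incident I1 was created.
# These values are made up for illustration and do not come from the text.
t = {1: 0, 2: 8, 3: 9, 4: 11, 5: 12, 6: 16, 7: 18, 8: 22, 9: 26, 10: 28}

# Delay attributed to each factor, following the list of factors above.
delays = {
    "factor 1: wrong initial assignment (t2-t1)": t[2] - t[1],
    "factor 2: waiting for group B (t4-t3)": t[4] - t[3],
    "factor 3: extra delivery time of T2 (t8-t6)": t[8] - t[6],
    "factor 4: higher-priority I2 first (t9-t8)": t[9] - t[8],
}

# Assumed attribution rule: each factor's share is proportional to its delay.
total_delay = sum(delays.values())
shares = {name: 100 * d / total_delay for name, d in delays.items()}

for name, share in shares.items():
    print(f"{name}: {delays[name]} h, {share:.0f}% of the avoidable delay")
```

With these assumed values the total avoidable delay is 20 hours out of a 28-hour resolution time. A proportional split treats the factors as independent, whereas the text stresses that they interact, so a full RCA would need a more careful attribution.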

This example shows how SLA violations can happen due to (1) intra-case issues like wrong group assignment (factor 1, assignment group A instead of B), (2) interactions on shared limited resources within the same process (factor 2, incident I1, all other incidents of assignment group B, and resource Mike), including priority policies (factor 4, incidents I1, I2, and Mike), and (3) dependencies on other processes (factor 3, incident I1 waited for the completion of tasks T1 and T2). Note that related process instances (I) can have 1:n relations (one incident can have multiple tasks, like I1 with T1 and T2), and (II) can form longer dependency chains through inter-process dependencies or shared resources. For example, in the ITSM process, an incident can be attached to an instance of the Problem process, and a resource (e.g., Jack) can also serve multiple processes. Further, the resource behavior analysis needed for inferring RCs requires investigating waiting and service times, which depend on assignment group and resource queue sizes, priorities of enqueued cases, ordering policies, fairness, SLAs of other enqueued cases, resource calendars, and so on. As a result, factors 2-4 cannot be explained from the analysis of I1 in isolation.
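The effect of a priority policy on a shared resource (factor 4) can be illustrated with a minimal single-resource simulation. It assumes a non-preemptive policy in which the resource, once free, picks the ready case with the highest priority. The function `simulate`, the rank encoding, and all times are hypothetical and serve only to show why the waiting time of I1 is invisible when I1 is analyzed alone.

```python
import heapq


def simulate(cases):
    """Simulate one resource serving cases non-preemptively by priority.

    cases: list of (name, ready_time, priority_rank, service_time),
           where a lower priority_rank means a higher priority.
    Returns {name: (start_time, end_time)}.
    """
    pending = sorted(cases, key=lambda c: c[1])  # order by ready time
    queue = []     # heap of (priority_rank, ready_time, name, service_time)
    schedule = {}
    clock = 0
    i = 0
    while i < len(pending) or queue:
        # Enqueue every case that has become ready by now.
        while i < len(pending) and pending[i][1] <= clock:
            name, ready, rank, service = pending[i]
            heapq.heappush(queue, (rank, ready, name, service))
            i += 1
        if not queue:
            clock = pending[i][1]  # resource idles until the next case is ready
            continue
        rank, ready, name, service = heapq.heappop(queue)
        schedule[name] = (clock, clock + service)
        clock += service
    return schedule


# Hypothetical values: I2 (High, rank 0) is ready at hour 18 and takes 8 h;
# I1 (Normal, rank 1) becomes ready again at hour 22 and takes 2 h.
I1 = ("I1", 22, 1, 2)
I2 = ("I2", 18, 0, 8)

together = simulate([I1, I2])
alone = simulate([I1])
print("I1 wait with I2 present:", together["I1"][0] - I1[1], "h")
print("I1 wait in isolation:   ", alone["I1"][0] - I1[1], "h")
```

Because the policy is non-preemptive, I1 would be delayed here even if it had the higher priority, since the resource is already busy when I1 becomes ready. Resource calendars, multiple resources per group, and fairness rules are deliberately left out.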

Problem. In general, the SLA violation RCA can be performed given (1) an event log describing the execution of multiple processes with 1:n relations, case priorities, and resource/group assignments, (2) a process model describing the processes, resource/group lifecycles, resource/group queues, and their synchronization, and (3) an SLA function. The goal of this assignment is to design a framework capable of performing this analysis for business processes like ITSM, CSM, HR onboarding, etc. Additionally, the required process model must be designed and defined, and a visual analytics technique for visually checking the analysis results (i.e., RCs) must be created.
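As a rough illustration of input (1), an event log with 1:n relations could be represented as follows. The `Event` fields, the activity names "On hold" and "Completed", and all timestamps are assumptions made for this sketch; the actual log schema is part of what the thesis has to define.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Event:
    case_id: str
    process: str
    activity: str
    timestamp: int                      # hypothetical unit: hours
    priority: Optional[str] = None
    group: Optional[str] = None
    resource: Optional[str] = None
    parent_case: Optional[str] = None   # 1:n link, e.g. task -> incident


# Hypothetical log fragment for incident I1 and its tasks T1 and T2.
LOG: List[Event] = [
    Event("I1", "Incident", "Created", 0, "Normal", "A"),
    Event("I1", "Incident", "WIP", 8, "Normal", "A", "John"),
    Event("I1", "Incident", "Reassign", 9, "Normal", "B", "John"),
    Event("I1", "Incident", "WIP", 11, "Normal", "B", "Mike"),
    Event("I1", "Incident", "On hold", 12, "Normal", "B", "Mike"),
    Event("T1", "Incident task", "Created", 12, parent_case="I1"),
    Event("T2", "Incident task", "Created", 12, parent_case="I1"),
    Event("T1", "Incident task", "Completed", 16, resource="Kate", parent_case="I1"),
    Event("T2", "Incident task", "Completed", 22, resource="Jack", parent_case="I1"),
    Event("I1", "Incident", "WIP", 26, "Normal", "B", "Mike"),
    Event("I1", "Incident", "Resolved", 28, "Normal", "B", "Mike"),
]


def children_of(log: List[Event], case_id: str) -> Set[str]:
    """Return the ids of all cases that reference case_id as their parent."""
    return {e.case_id for e in log if e.parent_case == case_id}


def blocking_child(log: List[Event], case_id: str) -> Tuple[str, int]:
    """Return the child case that completed last, i.e. the one the parent waited for."""
    completions = [
        (e.timestamp, e.case_id)
        for e in log
        if e.parent_case == case_id and e.activity == "Completed"
    ]
    timestamp, child = max(completions)
    return child, timestamp


print(sorted(children_of(LOG, "I1")))
print(blocking_child(LOG, "I1"))
```

Following the `parent_case` links is what lets an analysis move from I1 to T2 and from there to the load on Jack, which a per-case view of I1 would not expose.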

Important!


The thesis has the potential for a good salary during the thesis phase.

About ServiceNow

ServiceNow is making the world of work, work better for people. Our cloud-based platform and solutions deliver digital workflows that create great experiences and unlock productivity for employees and the enterprise. We're growing fast, innovating faster, and making an impact on our customers' and employees' lives in significant and important ways. With over 6,900 customers, we serve approximately 80% of the Fortune 500, and we're on the 2020 list of FORTUNE World's Most Admired Companies.

Prerequisites

  • Data science, software engineering, and algorithmic skills to develop new approaches, implement a prototype solution and evaluate the approach on datasets.
  • Programming languages: Java.

Pointers

Supervisor

Prof.dr.ir. Wil van der Aalst

Advisors

  • Mahsa Pourbafrani
  • Vadim Denisov (ServiceNow)

For more Information

Send an e-mail to . Make sure to include detailed information about your background and scores for completed courses.