Review Article
A Review on Software Process Mining Using Petri Nets
Department of Computer Science, Bharath University, 600 073 Chennai, India
Kumaravel Appavoo
School of Computing, Bharath University, Tambaram, 600 073 Chennai, India
Today, there is a bunch of techniques that help to routinely come up with process models from a sequence of activities that are executed in an enterprise1. Typically, such sequences come from the log of a workflow management system or some standard software which is used for executing these processes. There are many different novel algorithms and methods that help to obtain faithful and valuable process models; some techniques come up with an initial model rather fast and the process models are incrementally improved by new interpretations2. All these techniques can be summarized by the term process mining. Here interest in process mining came from the area of software engineering.
Software engineering processes are often not well-documented, though good engineers have the processes in their minds. In the Capability Maturity Model (CMM), this level of maturity of a software company is called repeatable3-7. Therefore, it is looked for methods for automatically mining these process models from the observed study. The main source for observing the work of software engineers are the logs of the version management systems and document management systems that are used in the development process. The problem, however, is that these systems are aware of documents only and not of the underlying activities. Basically, they see the creation, modification and checking of documents, but they are not aware of the activities and to which activity these events belong to Assar et al.8, Baeza-Yates9 and Baeza-Yates et al.10. Therefore, the standard mining algorithms do not work; the activities must identify from the event logs of the document management systems before: Here it is call this activity mining (Fig. 1).
Here, it could easily obtain a transition system for the underlying processes, where the transitions are the activities of the processes. So, basically, deriving a process model from the result of the activity mining algorithm means deriving a Petri Net from a transition system, which is a well-known area of Petri Net theory called Petri Net synthesis. It was established by the seminar paper on regions and later extended and elaborated by other researchers. In this study, we show that our activity mining algorithm in combination with the tool petrify2,11-13 can be used for faithfully mining process models from logs of document management systems and version management systems. The focus of this study is on the use of synthesis algorithm for details on the activity mining algorithms.
The practical relevance of process mining is increasing as more and more event data becomes available. Process mining techniques aim to discover, monitor and improve real processes by extracting knowledge from event logs. The two most prominent process mining tasks are: (1) Process discovery: Learning a process model from example behavior recorded in an event log and (2) Conformance checking: Diagnosing and quantifying discrepancies between observed behavior and modeled behavior14-16.
Most of the study done in conformance checking in the literature focuses on the control-flow of the underlying process, i.e., the ordering of activities. There are various approaches to compute the fraction of events or traces in the log that can be replayed by the model17-19.
Petri Nets are popular due to their inherent ability to express concurrency, choice and causality between events in a system, without explicit enumeration of global states.
Fig. 1: | Mining and synthesis schema |
Although checking properties of Petri Nets could be difficult in general, for some subclasses of Petri Nets there are efficient verification algorithms20.
The work presented in this study is related to process mining, i.e., discovering a process model based on some event log. Process mining techniques focus on discovering behavioral aspects from log data. The idea of applying process mining in the context of workflow management was first introduced by Agrawal et al.21.
A similar idea was used in the context of automating the detection of process models using probabilistic and algorithmic methods by Chen and Yun22.
Cook and Wolf described a Markov method that they developed exactly for process discovery. This model is integrally successive. Chow23, Christie24, Chulef et al.25, Clauzel et al.26 and Conklin and Begeman27 the exploring techniques that can use basic event data captured from an on-going process to generate a formal model of process behavior. In this kind of data analysis process discovery, they describe using methods: Algorithmic grammar inference, Markov models and neural networks. Note that the results presented in28-34 are limited to successive behavior35.
Healthcare is another famous application domain for process mining. The applicability of process mining in healthcare was demonstrated using a real case of a gynecological oncology process in the AMC hospital in the Netherlands36-39. The log data contained information about a representative group of 627 gynecological oncology patients. The goalmouth of using process mining was to discover the care paths followed by individual patients and whether certain procedures are followed or not. After applying process mining techniques, many useful results became visible to the people at the hospital. For example, it was found that patients who undergo several chemotherapy sessions often need to visit the dietician. This was not immediately clear to everyone and illustrates the value of creating transparency using process mining40-45.
The above two mentioned projects were implemented with the process mining tool named ProM46. The ProM contains more than 250 plug-ins that implement different process mining algorithms. However, it is not clear how to use ProM in process redesign projects. In the above two projects, the used different plug-ins were used but viewed each plug-in result alone. Although, ProM allows the results from some algorithms to be integrated in a Colored Petri Net (CPN) that support analysis and simulation, there was no guidance from ProM on how to improve the business processes. Instead, the researchers concluded the redesign ideas from viewing the simulated models. It is hard to make process redesign using process mining a repeatable service.
Process mining has been applied in a variety of organizations covering many application domains. In the process mining was used to analyze the test process in ASML. The ASML makes so-called wafer scanners that are used to manufacture processors in devices ranging from mobile phones to desktop computers. Wafer scanners are really complex machines that use a photographic process to image nanometric circuit patterns onto a silicon wafer. The testing of the manufactured wafer scanners is a time-consuming process. So, the goal of the analysis was to reduce the testing time. Each wafer scanner in the ASML factory produces a log of the software tests that are executed on it. Process mining was used to visualize the actual flow of the test process and confront this visualization with the idealized view of the tests according to engineers. It was found that as soon as one test fails, a fix is made to the scanner and all other tests are put on hold (idle time) and often after the fix is made, some tests are re-executed again. Visualizing this loop-backs caused by some tests gave the engineers a useful view on what was causing the time loss in the test process. Hence, allowed them to make changes to the test process to reduce the time (for example, execute some tests at earlier phases)47-53.
Nabil R. Adam proposed modeling and analysis of workflows using Petri Nets" in which he has demonstrated the use of PN as an effective tool for modeling workflows at a conceptual level and then analyzing them.
Cintra and Ruggiero presented a simulation technique for performance analysis of Generic Petri Net models of computer systems" in which he presented a simulation algorithm tp observe that the simulator performed reasonably well, even on a modest machine.
Boucheneb and Hadjidj proposed model verification techniques of time Petri Nets54-59. They used temporal logic model checking to represent the behavior of a system.
Olivier and Roy introduced an approach to implement a distributed monitor of real-time system properties and then introduced a new formalism, adaptive Petri Nets, that allows to model such complex, distributed and real time systems60-69.
The performance analysis of the model illustrates the behavior of the system. Falko Bause proposed the concept of stochastic Petri Nets with various examples. In Stochastic Petri Nets (SPN), random string delays are attached to the transition. The SPN is used for performance analysis of the system by Markovian techniques.
Michael K. Molloy proposed the performance analysis of the system using Stochastic Petri Nets70. They used Stochastic Petri Net for performance analysis of alternating bit protocol.
Bernardi proposed a structural performance evaluation methodology for Timed Petri Nets (TPNs) and their stochastic extensions71.
The W.M.P. van der Aalst proposed a methodology to verify the business process using Petri Nets72-76. In which he verified the liveness and boundedness of workflow net using Petri Nets.
Boucheneb and Hadjidj proposed model verification techniques of time Petri Nets77. They used temporal logic model checking to represent the behavior of a system.
Other recent work has sought to measure process conformance, having obtained a sequence of events, by comparing the models obtained from observations to theoretical models. This study is in the early stages having come from the foundations of methods that simply tested conformance. Fitness and appropriateness are proposed as metrics that can be assessed by incrementally replaying the events and measuring unused features of the model, respectively78-83. Two further metrics precision (how many invalid steps occur) and recall (how many steps are enabled) have also been defined from a machine behaviour perspective. In this study, here it is focus on the method proposed, which represents the process model as a Petri Net84-90. The log of a series of events will refer as a task. When the task is executed a token is produced at the start state, then when an event is executed if an edge with the same name is enabled (that is to say there is a token on the preceding state) then the token is consumed and one produced at each connected state. The fact that an edge may lead to more than one state allows for parallel execution. If there is no enabled edge matching the event then one which has the same name is chosen at random and the same process is followed, without the consumption. The method described twofold based on two metrics fitness and appropriateness84-89. Fitness is a measurement of the extent to which the tasks can be fitted to a Petri Net captivating into account the number of times tokens are produced but not consumed and the number of missing tokens where they had to be introduced. Appropriateness is a measurement of over fitting, or in other words how much of the process model was not used by the tasks executed. Behaviour suitability measures if the model allows too many possible paths. This is achieved by calculating all possible orderings of events in both the model and the log and discounting those that always or never follow each other. The metric calculates the remaining set (those states which sometimes follow each other) by considering the difference between the model and log sets. The calculation is founded on a count of alternate duplicate events and unnecessary imperceptible events that could be removed without otherwise moving the behaviour of the model. The alternate duplicate events are those that never happen together in one execution sequence. This measurement is made on the model only and so will not be affected by changes to the logs.
This algorithm was proposed to rebuild the causality in the Petri Nets workflow from the existing relations in the event log. The α-algorithm takes the event logs as input, rebuilds process models by using simple XOR and splits and joins; thereby creates the workflow nets as output. The α-algorithm cannot handle noise and certain complicated routing constructs of workflow nets such as, loops and long-term dependencies, particularly during complex situations. A more robust but less precise approach was then proposed to deal with the issues of α-algorithm. To overcome this difficulty a protracted algorithm, α++ algorithm was introduced to generate new relationships between event logs to handle long-term or implicit dependencies91-96.
Hierarchical clustering this technique separates a set of event logs for a given process into clusters and finds the dependency graph for each log97-109. It structures the clusters of event logs into a hierarchy tree. For each cluster, a workflow model is constructed and finally all the models are merged into a single one. Some clustering techniques use theory of regions to discover processes. The advantage of the theory of regions is that the characteristics of the resulting model can be influenced before the mining starts (e.g., the number of places in the Petri Net or the number duplicate task can be deter-mined in advance). A mining tool has been developed for discovering hierarchically structured workflow processes that need to balance splits and joins110-118.
Processes are frequently expressed as a form of directed transition system119-124. These are composed by events or activities. Approaches to forming these from observations vary based on the type of events captured. Principally there are three observation approaches discussed in the literature: In the first case the developers document the process in detail as they complete it although this is maybe unreliable; secondly a research group observes the team, perhaps recomposing the process by interview125-127; lastly the final method involves the analysis of logs post hoc128-131. In the former instance some knowledge of the process model and the relation between the states is known, whereas in the latter two all that is available is the event series in the timeline. In the later case where the data is reconstructed, business rules must be identified in order to associate a log entry with the activity stereotype.
Genetic algorithm132:This technique provides process models (Petri Nets) built on causal matrix, i.e., input and output dependencies for each activity. This technique tackles problems such as, noise, incomplete data, non-free-choice constructs 4, hidden activities, concurrency and duplicate activities. Nevertheless, it requires the configuration of many parameters to deal with irrelevant data, which is a complex task.
Heuristic algorithm133-136:This technique is based on α-algorithm. It calculates the frequencies of relations between the tasks, e.g., causal dependency, loops, etc and construct dependency/frequency tables and dependency/frequency graphs. This technique can detect irrelevant logs. However, like the Genetic algorithm, Heuristic algorithm needs a complex configuration phase.
DEVELOPMENT OF THE TCPN SIMULATION MODEL FOR THE PATIENT CARE FLOW PROCESSES
The CPN Tools (version 4.0.1) was used in constructing a timed colored Petri Nets simulation model for the considered patient care flow processes. The proposed TCPN simulation model consists of 16 places and 12 transitions.
Simulation model, places are draw as ovals while transitions are drawn as rectangle. Places and transitions are connected with directed arcs which model the relations among the individual elements of the developed model. The arcs with their arc expressions define the flows of tokens in the net. The descriptions of the major places and transitions in the proposed TCPN simulation model are stated in Table 1 and 2, respectively. The color sets, variables, initial parameters and functions that are needed in developing the TCPN simulation model of the patient care flow processes are depicted in Fig. 2.
The color sets, variables, initial parameters and functions that are needed in developing the TCPN simulation model of the patient care flow processes are depicted in Fig. 3. The simulation of the proposed TCPN model was carried out using CPN tools. Due to the fact that the simulation model is stochastic, it is necessary to execute several simulation runs with the proposed model in order to compute mean value. Hence, several replications were run for each day and average of each was calculated.
Fig. 2: | Petri Net flow diagram of patient health care process |
Table 1: | Description of major places in the TCPN model |
Table 2: | Description of major transitions in the model |
Fig. 3: | Comparison in between the inpatient and outpatients simulation of the TPCN model for 5 days |
Table 3: | Simulation output of the TPCN model for 5 days |
Besides, validation is important for the correctness and credibility of the model136-139. Validation is to determine the model which will be a representation of the real system.
Thus, the proposed timed colored Petri Net model was validated by comparing the output of the simulation model (i.e., number of inpatients and number of outpatients) for five consecutive days with the number of inpatients and of outpatients (Table 3) of the actual system.
EXPERIMENTAL ANALYSIS
Figure 2 shows the developed TCPN simulation model of patient care flow processes of the patient health centre under consideration. Figure 2 depicts the main page which models the arrival of patients and process at the medical record area of the case study. The following, is the depicts a subnet layer (Treatment sub-module) of the main model. It models operation in the examination room of the health centre under study.
From our data, entry of each patient to the health centre is modelled by a token on the place next patient. This place has the color set UNIT and the color set UNIT is defined to be equal to unit timed type as depicted in the declarations block of the developed model. The color set UNIT is used to model arrival time of patient based on the time stamp such as @++Day1AT() attached to the arc that runs from transition In it to place Next patient. Based on the evaluation of the distribution expression: ~0.5+weibull (12.3, 0.964), function Day1AT() is used to generate the arrival time of new patient into the system. From Fig. 3, color set patient type is used to represent types of patient entering the patient health centre. It is enumerated type of Critical Patient (CP) and Non-Critical Patient (NCP). The place waiting patients has the color set Patients defined to be set of list patient. The color set patients is used to model the queue of patients to be attended to. The color set patient models a patient as a record consisting of two fields. The first field denoted with PatientType is of type PatientType and represents type of the patient. Second field is denoted with the title AT is of type real and represents arrival time of a patient. The color set Attendant×Patient is a product color set defined as product Attendant×Patient timed. This color set is used to represent the attendant when he/she is busy serving patient. Also, the color set Doctor×Patient is to represent the doctor when he/she is busy treating patient. The function Day1AT() uses weibull distribution with the expression: ~0.5+weibull(12.3, 0.964) to generate arrival times of patients for day 1. This distribution is used instead of lognormal distribution because currently CPN tools (Version 4.0.1) does not support lognormal distribution. The place waiting patients and the place pwfdasse are used to model the queue of patients at the registration counter unit and examination room respectively. The single token on each of the places waiting patients and the place pwfdasse represents the queue of patients. In the initial marking the lists are empty. The places free and busy are used to represent the status of the medical attendant. A token on the place free indicates that the medical attendant is not serving a patient at that time. The parameter hospref_no_of_attendant = 5 and function fun initAttendants() = (!num_of_attendants)attendant in Fig. 4, show that there are 4 tokens on the place free in the initial marking. A token on the place busy indicates that the medical attendant is busy attending to a patient and the value of the token indicates which patient is being processed. The initial marking of busy is empty.
The medical attendant can start attending to patient (transition start service), if the medical attendant is free and if there is at least one patient in the queue of patients (patient::patients on the arc from place waiting patients to transition start service).
COMPARATIVE STUDY OF RESEARCH WORKS
Here, compare with several research on the basis of mining loops, hidden tasks, delta analysis, visualizing, process rediscovery, duplication tasks, noise and concurrent processes (Table 4).
More recently, to deal with less structured, i.e., very diverse or flexible processes, dynamically adaptive process simplification algorithms have been proposed67. The approach demonstrates that for some subclasses, it is possible to discover the accurate workflow model using α-algorithm.
Fig. 4: | Declarations for the TCPN model of the CNP |
Table 4: | Research works dealing with process mining issues |
In another work, an extended version of α-algorithm is used to include the timing information.
Much study has to be done in put on the mining and synthesis algorithms to different document management systems in different application areas and making practical assessment of them both in the area of business process management and software process engineering. Since this method is also pertinent to the area of mining the activity logs, in the future, it should also compare it to the existing approaches in this area. This study aims at making the first step from the well-developed theory of Petri Net synthesis to the practically relevant research domain of process mining.