The input for Process Mining
After extraction from the information system, the event data can be explored and preprocessed before being fed to process mining techniques, so that they can produce the desired results.
What type of data preprocessing is required?

Process mining is impossible without proper event logs. The requirements imposed on such logs vary depending on the process mining technique used. The challenge is to extract such data from a variety of data sources, e.g., databases, flat files, message logs, transaction logs, ERP systems, and document management systems. When merging and extracting data, both syntax and semantics play an important role. Moreover, depending on the questions one seeks to answer, different views on the available data are needed. Process mining, like any other data-driven analysis approach, needs to deal with data quality problems.

Data sources

A data source may be a simple flat file, an Excel spreadsheet, a transaction log, or a database table. However, one should not expect all the data to be in a single well-structured data source. The reality is that event data is typically scattered over different data sources, and often considerable effort is needed to collect the relevant data.
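To make this concrete, here is a minimal sketch in Python (record layouts, field names, and timestamps are all invented for illustration) that merges events about the same process from two hypothetical sources, a CRM export and a billing log, into one chronologically ordered event list:

```python
from datetime import datetime

# Hypothetical records from two different sources (layouts are made up).
crm_events = [
    {"order": "O1", "action": "register request", "when": "2024-03-01T09:00"},
    {"order": "O1", "action": "check stock",      "when": "2024-03-01T10:30"},
]
billing_log = [
    # Same process, but different field names and a different timestamp format.
    ("O1", "send invoice", "01-03-2024 11:15"),
]

def normalize_crm(rec):
    return {"case": rec["order"], "activity": rec["action"],
            "timestamp": datetime.fromisoformat(rec["when"])}

def normalize_billing(rec):
    case, activity, when = rec
    return {"case": case, "activity": activity,
            "timestamp": datetime.strptime(when, "%d-%m-%Y %H:%M")}

# Merge both sources into one event log, ordered by time.
log = sorted(
    [normalize_crm(r) for r in crm_events] +
    [normalize_billing(r) for r in billing_log],
    key=lambda e: e["timestamp"],
)
for e in log:
    print(e["case"], e["activity"], e["timestamp"])
```

Note that most of the work is not the merge itself but normalizing syntax (field names, timestamp formats) and semantics (which field identifies the case) across sources.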


Data extraction
The first step in the preprocessing of the data that is required for process mining is data extraction. Data sources may be structured and well-described by metadata. Unfortunately, in many situations, the data is unstructured or important metadata is missing. Data may originate from web pages, emails, PDF documents, scanned text, screen scraping, etc. Even if data is structured and described by metadata, the sheer complexity of enterprise information systems may be overwhelming; there is no point in trying to exhaustively extract event logs from thousands of tables and other data sources. Data extraction should be driven by questions rather than the availability of lots of data.
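As an illustration of question-driven extraction, the following sketch uses an in-memory SQLite database (all table and column names are invented). Rather than dumping every table, only the tables needed to answer a question about order handling are queried, and each row is turned into a (case, activity, timestamp) event:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# A tiny, invented slice of an enterprise database: many tables may exist,
# but only two are relevant to the question "how are orders handled?".
cur.executescript("""
CREATE TABLE orders(order_id TEXT, created TEXT);
CREATE TABLE deliveries(order_id TEXT, shipped TEXT);
CREATE TABLE hr_payroll(emp_id TEXT, salary REAL);  -- irrelevant to the question
INSERT INTO orders VALUES ('O1', '2024-03-01 09:00'), ('O2', '2024-03-02 08:15');
INSERT INTO deliveries VALUES ('O1', '2024-03-03 14:00');
""")

# Extract only question-relevant events; each row becomes (case, activity, time).
events = cur.execute("""
SELECT order_id, 'create order', created FROM orders
UNION ALL
SELECT order_id, 'ship order', shipped FROM deliveries
ORDER BY 3
""").fetchall()
for case, activity, ts in events:
    print(case, activity, ts)
```

The payroll table is deliberately left untouched: the question being asked, not the volume of available data, determines what is extracted.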
Filtering data
Once the event logs are extracted from the data sources, the next step is to filter them. Filtering is an iterative process. Coarse-grained scoping was done when extracting the data into an event log; filtering corresponds to fine-grained scoping based on initial analysis results. For example, for process discovery, one can decide to focus on the 10 most frequent activities to keep the model manageable. Based on the filtered log, the different types of process mining techniques can be applied: discovery, conformance, and enhancement.
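The frequency-based filter mentioned above can be sketched as follows (keeping the top 2 activities here instead of 10, on a made-up toy log):

```python
from collections import Counter

# A made-up event log: (case, activity) pairs.
log = [
    ("C1", "register"), ("C1", "check"), ("C1", "pay"),
    ("C2", "register"), ("C2", "check"),
    ("C3", "register"), ("C3", "escalate"),  # rare activity
]

# Keep only the k most frequent activities to keep the discovered model manageable.
k = 2
top = {a for a, _ in Counter(a for _, a in log).most_common(k)}
filtered = [(c, a) for c, a in log if a in top]
print(filtered)
```

Because filtering is iterative, one would typically inspect the discovered model on the filtered log and then adjust k (or switch to a different filter, e.g., on case attributes) based on the result.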
Cleaning
To apply the process mining techniques to the extracted and filtered event logs, the events need to be related to cases. A process model describes the life cycle of a case of a particular type. All activities in a conventional process model correspond to status changes of such a case. The data should therefore be cleaned and transformed so that each event is related to a case, yielding a log of many cases that together represent the real-life process.
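A sketch of such a cleaning step (all identifiers are invented): suppose payment events refer only to an invoice, not to an order. An invoice-to-order mapping is then used to relate them to the case (the order) they belong to, and events that cannot be correlated to any case are set aside:

```python
# Invented mapping extracted from the billing tables: invoice -> order (case).
invoice_to_order = {"I7": "O1", "I8": "O2"}

raw_events = [
    {"order": "O1", "activity": "register order"},
    {"invoice": "I7", "activity": "receive payment"},  # no order id yet
    {"invoice": "I9", "activity": "receive payment"},  # cannot be correlated
]

cleaned, discarded = [], []
for e in raw_events:
    case = e.get("order") or invoice_to_order.get(e.get("invoice"))
    if case is None:
        discarded.append(e)  # events without a case cannot be mined
    else:
        cleaned.append({"case": case, "activity": e["activity"]})
print(cleaned)
```

In practice the discarded events are a data quality signal worth investigating rather than silently dropping.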