To apply process mining techniques, it is essential to extract event logs from data sources (e.g., databases, transaction logs, and audit trails). XES (eXtensible Event Stream) is the standard format for process mining and is supported by the majority of process mining tools. XES was adopted in 2010 by the IEEE Task Force on Process Mining as the standard format for logging events. It is now in the process of becoming an official IEEE standard. Besides XES, other target formats supported by ProM are MXML (Mining eXtensible Markup Language) and CSV files.
There are several tools to extract XES logs from various data sources. Besides ProM itself, one can use XESame, ProMimport, or commercial tools like Disco.
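To give a feel for what such an extraction produces, the following is a minimal sketch of writing a tiny XES-style document with Python's standard library. It is not how any of the tools above work internally. The function name `to_xes` and the sample data are hypothetical, and the output is simplified: it uses the commonly seen `concept:name` and `time:timestamp` attribute keys but omits extension declarations, global attributes, and classifiers, and it has not been validated against the XES schema.

```python
import xml.etree.ElementTree as ET


def to_xes(traces):
    """Serialize {case_id: [(activity, iso_timestamp), ...]} as simplified XES-style XML.

    Assumption: a bare-bones log/trace/event layout; a complete XES file
    would also declare extensions, globals, and classifiers.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        # The case identifier is stored as the trace's name.
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        for activity, timestamp in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return ET.tostring(log, encoding="unicode")


# Hypothetical sample data: one case with two events.
xes_text = to_xes({
    "case-1": [
        ("register request", "2024-01-01T09:00:00+00:00"),
        ("check ticket", "2024-01-01T10:00:00+00:00"),
    ]
})
print(xes_text)
```

For real data sets, a dedicated tool or library is the safer choice, since it handles the parts of the format this sketch leaves out.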
Process mining assumes the existence of an event log where each event refers to a case, an activity, and a point in time. An event log can be seen as a collection of cases and a case can be seen as a trace/sequence of events.
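The view of an event log as a collection of cases, each being a sequence of events, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the events, case identifiers, activity names, and the helper `build_traces` are all hypothetical and not taken from any particular tool.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events, each referring to a case, an activity, and a point in time.
events = [
    ("case-2", "register request", datetime(2024, 1, 1, 9, 30)),
    ("case-1", "register request", datetime(2024, 1, 1, 9, 0)),
    ("case-1", "check ticket", datetime(2024, 1, 1, 10, 0)),
    ("case-2", "decide", datetime(2024, 1, 1, 11, 0)),
    ("case-1", "decide", datetime(2024, 1, 1, 12, 0)),
]


def build_traces(events):
    """Group events by case and order each case's events by timestamp."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))
    return {
        case_id: [activity for _, activity in sorted(items, key=lambda item: item[0])]
        for case_id, items in by_case.items()
    }


traces = build_traces(events)
for case_id, trace in sorted(traces.items()):
    print(case_id, trace)
```

Each value in `traces` is one trace: the ordered list of activities observed for that case.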
Event data may come from a wide variety of sources:
- a database system (e.g., patient data in a hospital),
- a comma-separated values (CSV) file or spreadsheet,
- a transaction log (e.g., a trading system),
- a business suite/ERP system (SAP, Oracle, etc.),
- a message log (e.g., from IBM middleware),
- an open API providing data from websites or social media,
- …
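As an illustration of the CSV case in the list above, the following sketch reads events from comma-separated text using Python's standard library. The column names (`case_id`, `activity`, `timestamp`), the sample rows, and the helper `read_events` are assumptions made for this example; real exports will use different headers and timestamp formats.

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV content; in practice this would come from a file or export.
CSV_TEXT = """case_id,activity,timestamp
1,register request,2024-01-01T09:00:00
1,check ticket,2024-01-01T10:00:00
2,register request,2024-01-01T09:30:00
"""


def read_events(handle):
    """Parse rows into events carrying a case, an activity, and a point in time."""
    events = []
    for row in csv.DictReader(handle):
        events.append({
            "case": row["case_id"],
            "activity": row["activity"],
            # Assumes ISO 8601 timestamps; other formats need datetime.strptime.
            "time": datetime.fromisoformat(row["timestamp"]),
        })
    return events


events = read_events(io.StringIO(CSV_TEXT))
print(len(events), "events loaded")
```

To read from disk instead, pass a file opened with `open(path, newline="")` in place of the `io.StringIO` object.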
The presentation "What kind of data does process mining require?" illustrates the requirements using several concrete examples.
For people new to the field, it is interesting to experiment with various data sets. Therefore, this website contains pointers to various example datasets:
- There is a set of event logs used in the process mining book. This set is used to illustrate the various process mining techniques. See, for example, the event log reviewing.xes.
- There is a set of event logs used in the course on process mining given at TU/e in 2007-2008.
- Also see the ProM Tutorials providing various example datasets. See for example the event log in repairExample.zip.
- The 4TU.Datacentrum also collects event logs, partitioned into two categories: real-life event logs and synthetic event logs. The repository contains many benchmark data sets, including event data from hospitals, government agencies, and banks.