Event-Case Correlation for Process Mining using Probabilistic Optimization
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 20.06.2022 |
Summary: | Process mining supports the analysis of the actual behavior and performance
of business processes using event logs, such as sales transactions
recorded by an ERP system. An essential requirement is that every event in the
log must be associated with a unique case identifier (e.g., the order ID of an
order-to-cash process). In reality, however, this case identifier may not
always be present, especially when logs are acquired from different systems or
extracted from non-process-aware information systems. In such settings, the
event log needs to be pre-processed by grouping events into cases -- an
operation known as event correlation. Existing techniques for correlating
events have worked with assumptions to make the problem tractable: some assume
the generative processes to be acyclic, while others require heuristic
information or user input. Moreover, these techniques abstract the log to
activities and timestamps, and miss the opportunity to use data
attributes. In this paper, we lift these assumptions
and propose a new technique called EC-SA-Data based on probabilistic
optimization. The technique takes as inputs a sequence of timestamped events
(the log without case IDs), a process model describing the underlying business
process, and constraints over the event attributes. Our approach returns an
event log in which every event is associated with a case identifier. The
technique allows users to incorporate rules on process knowledge and data
constraints flexibly. The approach minimizes the misalignment between the
generated log and the input process model, maximizes the support of the given
data constraints over the correlated log, and minimizes the variance between activity
durations across cases. Our experiments with various real-life datasets show
the advantages of our approach over the state of the art. |
---|---|
DOI: | 10.48550/arxiv.2206.10009 |
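
To make the idea of scoring a candidate correlation concrete, here is a minimal sketch in Python. It is not the EC-SA-Data implementation: the toy log, the single allowed trace standing in for a process model, and the function names are all assumptions made for illustration. The data-constraint term and the optimization loop that searches over assignments are left out. The sketch only shows how two candidate assignments of case IDs can be compared by a cost that combines a crude model-mismatch count with the variance of activity durations across cases.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical toy log without case IDs: (timestamp in minutes, activity).
events = [(0, "A"), (1, "A"), (5, "B"), (6, "B"), (10, "C"), (11, "C")]


def group_into_cases(events, assignment):
    """Build {case_id: time-ordered trace} from one case ID per event."""
    cases = defaultdict(list)
    for event, case_id in zip(events, assignment):
        cases[case_id].append(event)
    return {case_id: sorted(trace) for case_id, trace in cases.items()}


def model_violations(cases, allowed_trace=("A", "B", "C")):
    """Crude stand-in for misalignment (an assumption, not a real alignment):
    count the cases whose activity sequence differs from the one allowed trace."""
    return sum(
        tuple(activity for _, activity in trace) != allowed_trace
        for trace in cases.values()
    )


def duration_variance(cases):
    """Sum over activities of the variance of the elapsed time since the
    previous event in the same case."""
    durations = defaultdict(list)
    for trace in cases.values():
        for (t_prev, _), (t, activity) in zip(trace, trace[1:]):
            durations[activity].append(t - t_prev)
    return sum(pvariance(values) for values in durations.values())


def cost(events, assignment):
    """Lower is better: model mismatches plus duration variance."""
    cases = group_into_cases(events, assignment)
    return model_violations(cases) + duration_variance(cases)


# Two candidate correlations: one case ID per event, in log order.
consistent = [1, 2, 1, 2, 1, 2]  # every activity takes 5 minutes in both cases
swapped = [1, 2, 2, 1, 1, 2]     # the two B events are assigned to the other case

print(cost(events, consistent))
print(cost(events, swapped))
```

Both candidates produce traces that fit the allowed sequence, so only the duration term separates them: the swapped assignment makes activity durations uneven across cases and therefore scores worse. A probabilistic optimizer would repeatedly propose such reassignments and keep those that lower the cost.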