Parquet for Process Mining

Contact

Alessandro Berti, Software Engineer
Phone (work): +49 241 80 21949

Parquet is a modern columnar storage format for tabular data. It works well with event data when the log carries no meta-attributes and the schema does not need to be validated, although working with it this way assumes that the entire event log is loaded into memory.

Advantages:

  • Smaller memory footprint (thanks to efficient compression techniques available for columnar data)
  • Smaller size on disk
  • Significantly lower loading time
  • Fast preprocessing
  • Ability to split the dataset according to different criteria
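
The last point, splitting the dataset, can be illustrated without any Parquet library. The sketch below uses only the Python standard library and assumes one possible criterion: hashing the case identifier, so that every event of a case lands in the same partition. The function name partition_by_case and the event fields are hypothetical. Other criteria (for example, time ranges) would work along the same lines.

```python
import zlib
from collections import defaultdict


def partition_by_case(events, num_partitions):
    """Group events so that all events of one case share a partition."""
    partitions = defaultdict(list)
    for event in events:
        # zlib.crc32 gives the same value on every run, unlike the
        # built-in hash() for strings, so the assignment is reproducible.
        index = zlib.crc32(event["case_id"].encode("utf-8")) % num_partitions
        partitions[index].append(event)
    return dict(partitions)


# Hypothetical events; the field names are assumptions.
events = [
    {"case_id": "c1", "activity": "register"},
    {"case_id": "c2", "activity": "register"},
    {"case_id": "c1", "activity": "approve"},
    {"case_id": "c3", "activity": "register"},
    {"case_id": "c2", "activity": "reject"},
]

parts = partition_by_case(events, num_partitions=2)
for index, chunk in sorted(parts.items()):
    print(index, [(e["case_id"], e["activity"]) for e in chunk])
```

Keeping each case inside a single partition means that per-case computations can run on every partition independently, which is what makes this kind of split useful for distributed processing. Each partition could then be written to its own Parquet file.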

Parquet is particularly well suited to handling larger amounts of data and to distributed computations. Our team has expertise in extracting, preprocessing and analysing data stored in the Parquet format.

Apache Parquet website

Pre-print: Increasing Scalability of Process Mining using Event Dataframes: How Data Structure Matters

PM4Py Distributed Engine