Parquet for Process Mining

Contact

Alessandro Berti
Software Engineer
Phone (work): +49 241 80 21949

Parquet is a modern columnar storage format for tabular data. It works well with event data when there are no meta-attributes and no need to validate the schema, although it assumes that the entire event log is loaded into memory.

Advantages:

  • Smaller memory footprint (thanks to efficient compression techniques available for columnar data)
  • Smaller size on disk
  • Significantly lower loading time
  • Fast preprocessing
  • Ability to split the dataset according to different criteria
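
The size advantages listed above come largely from encoding each column separately. As a rough illustration, the sketch below applies dictionary encoding followed by run-length encoding to the case column of a toy event log; these are two of the encodings Parquet supports. It is not Parquet's actual implementation: the event log, the function names and the data layout are all hypothetical and chosen only for illustration.

```python
from itertools import groupby

# Hypothetical toy event log: one (case id, activity) pair per event,
# already sorted by case, as event logs often are.
events = [
    ("c1", "register"), ("c1", "check"), ("c1", "pay"),
    ("c2", "register"), ("c2", "check"), ("c2", "check"), ("c2", "pay"),
]


def dictionary_encode(column):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run length) pairs."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def decode(dictionary, runs):
    """Invert both encodings and rebuild the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Work on one column at a time, which is what a columnar layout allows.
case_column = [case for case, _ in events]
case_dict, case_codes = dictionary_encode(case_column)
case_runs = run_length_encode(case_codes)

print(case_dict)   # distinct case ids
print(case_runs)   # (code, run length) pairs
```

In this toy log, seven case-id strings shrink to a two-entry dictionary plus two runs. A row-oriented layout interleaves case ids with activities, so runs like these do not form.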

It is particularly suited to handling larger volumes of data and to distributed computation. Our team has expertise in extracting, preprocessing and analysing data in the Parquet format.
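
To illustrate why splitting the dataset helps distributed computation, here is a minimal sketch that assumes the splitting criterion is the case identifier. If every event of a case lands in the same partition, a per-case statistic such as directly-follows counts can be computed on each partition independently and then merged. The code is plain Python and does not use Parquet or the PM4Py Distributed Engine; the event log and all function names are hypothetical.

```python
import zlib
from collections import Counter

# Hypothetical toy event log: (case id, activity) pairs in timestamp order.
events = [
    ("c1", "register"), ("c2", "register"), ("c1", "check"),
    ("c3", "register"), ("c2", "check"), ("c1", "pay"),
    ("c3", "pay"), ("c2", "pay"),
]


def partition_by_case(log, n_partitions):
    """Assign each event to a partition based on a stable hash of its case id."""
    partitions = [[] for _ in range(n_partitions)]
    for case, activity in log:
        # crc32 is deterministic across runs, unlike the built-in str hash.
        idx = zlib.crc32(case.encode("utf-8")) % n_partitions
        partitions[idx].append((case, activity))
    return partitions


def directly_follows(log):
    """Count how often activity a is immediately followed by b within a case."""
    counts = Counter()
    last_activity = {}
    for case, activity in log:
        if case in last_activity:
            counts[(last_activity[case], activity)] += 1
        last_activity[case] = activity
    return counts


# Each partition could be handled by a different worker; here they run in turn.
partitions = partition_by_case(events, 2)
merged = Counter()
for part in partitions:
    merged.update(directly_follows(part))

# Merging the per-partition counts matches the result on the whole log.
print(merged == directly_follows(events))
```

The merge is exact because no case is split across partitions and the event order within each case is preserved. Splitting by another criterion, such as a time window, would need extra handling for cases that cross partition boundaries.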

Apache Parquet website

Pre-print: Increasing Scalability of Process Mining using Event Dataframes: How Data Structure Matters

PM4Py Distributed Engine