Efficient clustering of massive data with MapReduce
Fries, Sergej; Seidl, Thomas (Thesis advisor); Rahm, Erhard (Thesis advisor)
Aachen : Publikationsserver der RWTH Aachen University (2015)
Dissertation / PhD Thesis
Over the past several decades, following the agrarian society and the Machine Age, mankind has entered the Information Age. Information, and even more importantly knowledge, has become one of the most valuable resources. The usual way to generate knowledge is to analyze observations or raw data, and the more data is available and the more interconnected it is, the more insights can be gained from it. Consequently, the past decade has seen an overwhelming trend toward gathering all possible information in all areas of life, industry, and science. Moreover, technological advances in storage and sensor systems have allowed the volume of stored data to grow even faster. As stated by Peter Hirshberg (Global Pulse Summit), the amount of data generated in the year 2011 alone exceeded the amount previously generated since the beginning of mankind’s history. The importance of knowledge extraction led to the development of the Knowledge Discovery in Databases process (KDD process) in 1996. The KDD process describes a workflow that leads from the gathering of raw data through its preprocessing and analysis to a final visualization for further interpretation. In past decades, knowledge extraction mainly followed a model-driven approach: the gathered data was used to accept or reject a hypothesis developed by a human expert. The predictive quality of the model therefore depended heavily on the expertise of the specialist, and even good models could miss several aspects of the problem at hand. In recent years, the data-driven approach to knowledge extraction has gained a lot of attention. The idea is to let the data "speak for themselves", i.e., to generate novel models from the given data and to validate them afterwards. As the models are not known in advance, the goal is to find unknown patterns in the data. In the KDD process, this task is usually solved by a group of data mining techniques called unsupervised learning or cluster analysis.
However, cluster analysis is often computationally expensive, and efficient techniques for huge amounts of data are indispensable. The usual way to process large amounts of data is to parallelize individual tasks on multi-core machines or in cluster environments. In this work, the author follows the parallelization approach and investigates and presents novel techniques for processing and analyzing huge datasets in the widely used MapReduce framework. MapReduce is a parallelization framework for data-intensive tasks that was proposed by Google Inc. in 2004 and has since developed into one of the most prevalent technologies for batch processing of huge amounts of data. More precisely, this thesis deals with two classes of cluster analysis: density-based approaches, in particular the DBSCAN algorithm, and projected clustering techniques, where the P3C algorithm was investigated and further developed for processing huge datasets. As part of the density-based approaches, the author proposes efficient approaches to the similarity self-join in vector spaces and to the determination of connected components in huge graphs in the MapReduce framework.
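
To make the notion of finding connected components in the MapReduce model more concrete, the following is a minimal single-machine sketch of one generic strategy, iterative minimum-label propagation. It is not the algorithm proposed in the thesis, which the abstract does not describe. The function names (map_phase, shuffle, reduce_phase, connected_components) are hypothetical, and a real job would run on a framework such as Hadoop rather than in plain Python.

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Map step: each node emits its own label, and every edge sends
    the label of each endpoint to the other endpoint."""
    for node, label in labels.items():
        yield node, label
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]


def shuffle(pairs):
    """Group emitted values by key, as the framework would do
    between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: each node adopts the smallest label it received."""
    return {node: min(candidates) for node, candidates in groups.items()}


def connected_components(nodes, edges):
    """Repeat map/shuffle/reduce rounds until no label changes.
    Nodes that end up with the same label form one component."""
    labels = {node: node for node in nodes}
    while True:
        new_labels = reduce_phase(shuffle(map_phase(labels, edges)))
        if new_labels == labels:
            return labels
        labels = new_labels


if __name__ == "__main__":
    nodes = [1, 2, 3, 4, 5, 6]
    edges = [(1, 2), (2, 3), (4, 5)]
    print(connected_components(nodes, edges))
```

In this simple scheme the number of rounds grows with the diameter of the largest component, which is one reason why more refined approaches are needed for huge graphs.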