Efficient density-based methods for knowledge discovery in databases

Krieger, Ralph; Seidl, Thomas (Thesis advisor)

Aachen : Publikationsserver der RWTH Aachen University (2008)
Dissertation / PhD Thesis


Today's data storage facilities allow recording of billions of transactions from business applications, scientific sensor readings, monitoring systems etc. Scientists developing new drugs, system administrators monitoring complex technical processes, and decision makers being responsible for complex social or technical systems require an overview and even a deeper understanding of their respective data. The knowledge discovery in databases (KDD) process has been designed to identify hidden patterns in large data resources. A central step of the KDD process is the data mining task. Major data mining tasks are clustering and classification. Density-based approaches have proven to be very effective for many data mining methods. However, the good effectiveness often comes at the cost of a high runtime complexity. This thesis presents new efficient density-based approaches for different data mining applications whereas the effectiveness of the new developed methods is always kept in mind. The first part of this thesis is concerned with new density-based clustering methods. Clustering is a data mining task for summarizing data such that similar objects are grouped together while dissimilar ones are separated. Density-based approaches have shown to successfully mine arbitrary shaped clusters even in the presence of noise. In multi-dimensional or high dimensional data, clusters are typically hidden by irrelevant attributes and do not show across the full space. As relevance of attributes is not globally uniform for all clusters, global dimensionality reduction approaches are not adequate. Subspace clustering aims at automatically detecting clusters and their relevant attribute projections. This work presents a new clustering model DUSC which guarantees a comparable and redundancy free subspace clustering result. As the number of possible subspaces is exponential in the number of dimensions subspace clustering is a computationally challenging task. The algorithm eDUSC developed in this work is based on a filter-and-refinement architecture which avoids repeated database scans. Further on, this work proposes a new visualization technique for subspace clusters and a specialized clustering technique for multi-dimensional sequence databases. The second part of this thesis proposes new density-based methods for classification. Classification aims at assigning a class label to unknown objects. Various approaches for classifying objects have been investigated in the last decades. Classifiers based on statistical approaches have been most intensively studied in the literature and results like asymptotical behavior and classification bias have been derived. To apply statistical classifiers the density of objects has to be estimated. In this work, a hierarchy of density estimators is proposed which makes the classification of objects possible anytime. Additionally, a new classification method using subspace clusters for higher dimensionalities is developed in this thesis. The proposed density-based clustering and classification methods are evaluated in terms of both efficiency and effectiveness in thorough experiments on real world and synthetic data.