Combined clustering of graph and attribute data
Boden, Brigitte; Seidl, Thomas (Thesis advisor)
1. Aufl.. - Aachen : Apprimus-Verl. (2014)
Dissertation / PhD Thesis
In: Edition Wissenschaft Apprimus
Page(s)/Article-Nr.: III, 232, XXI S. : graph. Darst.
In recent years, a rapidly increasing amount of data is collected and stored for various applications. As modern storage systems provide increasing disk space at decreasing costs, databases storing huge amounts of information of different types are ubiquitous. The task of automatically extracting useful and previously unknown knowledge out of such data is called data mining. This thesis focuses on the data mining task of clustering, i.e. grouping objects into clusters such that objects assigned to the same cluster are similar to each other, while objects assigned to different clusters are dissimilar. Two of the most common data types are vector data, where each object is represented as a vector containing different attributes of the object, and graph data, which represents relationships between different objects as edges in a graph. In many applications, data of both types is available simultaneously: for the vertices or the edges of a graph, additional information is available which can be described as an attribute vector. The aim of this thesis is to develop combined clustering approaches that use graph data and attribute data simultaneously in order to detect clusters that are densely connected in the graph and at the same time show similarity in the attribute space. As for high-dimensional vector data, clusters usually exist only in subspaces of the attribute space, we follow the principle of subspace clustering to enable the detection of clusters which show similarity only in a subset of the attributes. In this thesis, we introduce combined clustering approaches for graphs with vertex attributes, graphs with edge attributes and heterogeneous networks with attributed vertices. For all of those data types, our approaches focus on realizing an unbiased combination of graph and attribute data and avoiding redundancy in the clustering result.