Joint Cluster Analysis of Attribute Data and Relationship Data: Problems, Algorithms and Applications

Attribute data and relationship data are two principle types of data, representing the intrinsic and extrinsic properties of entities. While attribute data has been the main source of data for cluster analysis, relationship data such as social networks or metabolic networks are becoming increasingly available. In many cases these two data types carry complementary information, which calls for a joint cluster analysis of both data types in order to achieve more natural clusterings. For example, when identifying research communities, relationship data could represent co-author relationships and attribute data could represent the research interests of scientists. Communities could then be identified as clusters of connected scientists with similar research interests.

Our introduction of joint cluster analysis is part of a recent, broader trend to consider as much background information as possible in the process of cluster analysis, and in general, in data mining. In this talk, we briefly review related work including constrained clustering, semi-supervised clustering and multi-relational clustering. We then propose the Connected k-Center (CkC) problem, which aims at finding k connected clusters minimizing the radius with respect to the attribute data. We sketch the main ideas of the proof of NP-completeness and present a constant factor approximation algorithm for the CkC problem. Since this algorithm does not scale to large datasets, we have also developed NetScan, a heuristic algorithm that is efficient for large, real databases. We report experimental results from two applications, community identification and document clustering, both based on DBLP data. Our experiments demonstrate that NetScan finds clusters that are more meaningful and accurate than the results of existing algorithms. We conclude the talk with other promising applications and new problems of joint cluster analysis. In particular, we discuss the clustering of gene expression data and the hotspot analysis of crime data as well as a joint cluster analysis problem that does not require the user to specify the number of clusters in advance.

Speaker Details

Martin Ester received a PhD in Computer Science from Swiss Federal Institute of Technology (ETH Zurich) in 1990 with a thesis on knowledge-based systems and logic programming. From 1990 to 1993, he has been working as a Senior and Advisory Systems Engineer for Swissair, exploring the potential of knowledge-based systems and developing expert systems for the cargo, marketing and engineering departments of the Swiss airline. In 1993, he joined University of Munich as an Assistant Professor and co-founded their Data Mining lab. Since November 2001, he is an Associate Professor at the School of Computing Science of Simon Fraser University where he co-directs the Database and Data Mining research lab. He has published extensively in the top conferences and journals of his field such as ACM SIGKDD, VLDB, ICDM and ICDE. Martin Ester’s current research interests include clustering, spatial data mining, mining biological databases, text mining and multi-relational data mining. He has investigated the practical applications of his research in various collaborations with scientists from other disciplines, government agencies and industry.

Date:
Speakers:
Martin Ester
Affiliation:
School of Computing Science, Simon Fraser University
    • Portrait of Jeff Running

      Jeff Running