Project
Summarising Categorical Data by Clustering Attributes
with Michael Mampaey

Abstract. For a book, its title and abstract provide a good first impression of what to expect from it. For a database, obtaining a good first impression is typically not so straightforward. While low-order statistics only provide very limited insight, downright mining the data quickly provides too much detail for such a quick glance.

In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering - without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can also be used as surrogates for the data, and can easily be queried.
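The idea of choosing an attribute clustering by description length can be illustrated with a minimal sketch. This is not the paper's actual encoding: the model cost used here (one count of `log2(n + 1)` bits per observed value combination) is a simplifying assumption, and the names `cluster_cost` and `greedy_cluster` are hypothetical. The sketch starts from singleton clusters and greedily merges the pair whose merge saves the most bits, stopping when no merge shortens the description.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, attrs):
    """Bits needed to encode the columns in `attrs` jointly (simplified two-part code)."""
    n = len(rows)
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    # Data cost: optimal code lengths under the empirical joint distribution.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Assumed model cost: log2(n + 1) bits per observed value combination.
    model_bits = len(counts) * log2(n + 1)
    return data_bits + model_bits


def greedy_cluster(rows):
    """Merge attribute clusters bottom-up while the total description length drops."""
    clustering = [(a,) for a in range(len(rows[0]))]
    while len(clustering) > 1:
        best_gain, best_pair = 0.0, None
        for c1, c2 in combinations(clustering, 2):
            merged = tuple(sorted(c1 + c2))
            gain = (cluster_cost(rows, c1) + cluster_cost(rows, c2)
                    - cluster_cost(rows, merged))
            if gain > best_gain:
                best_gain, best_pair = gain, (c1, c2)
        if best_pair is None:
            break  # no merge saves bits, so stop
        clustering = [c for c in clustering if c not in best_pair]
        clustering.append(tuple(sorted(best_pair[0] + best_pair[1])))
    return sorted(clustering)
```

On toy data in which two attributes are identical copies and a third is independent of them, this sketch groups the two copies and leaves the third attribute in its own cluster. No distance measure between attributes is involved; the only criterion is the number of bits saved.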

Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. Our models are also employed as fast surrogates for the data, and are shown to be able to accurately estimate the supports of frequent itemsets.
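To illustrate how such a summary might serve as a surrogate for the data, the sketch below stores the empirical distribution of value combinations for each attribute cluster and estimates the relative support of an itemset as the product of its per-cluster marginals. Treating clusters as independent of one another is the assumption made here; the function names `build_summary` and `estimate_support` and the dictionary-based itemset format are hypothetical and are not taken from the authors' C++ code.

```python
from collections import Counter


def build_summary(rows, clustering):
    """Per cluster, store the empirical distribution over its value combinations."""
    n = len(rows)
    summary = []
    for attrs in clustering:
        counts = Counter(tuple(row[a] for a in attrs) for row in rows)
        summary.append((attrs, {combo: c / n for combo, c in counts.items()}))
    return summary


def estimate_support(summary, itemset):
    """Estimate the relative support of `itemset` (dict: attribute index -> value).

    Assumes clusters are mutually independent, so the per-cluster
    marginal probabilities are multiplied.
    """
    estimate = 1.0
    for attrs, dist in summary:
        # Positions within this cluster that the itemset constrains.
        constrained = [(i, itemset[a]) for i, a in enumerate(attrs) if a in itemset]
        if not constrained:
            continue
        estimate *= sum(
            p for combo, p in dist.items()
            if all(combo[i] == v for i, v in constrained)
        )
    return estimate
```

A query only touches the small per-cluster tables, never the data itself, which is what makes this kind of estimate fast. Its accuracy depends on how well the clustering captures the dependencies between attributes.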

Implementation

The C++ source code (October 2011), by Michael Mampaey.

Related Publications

Mampaey, M. & Vreeken, J. Summarizing Categorical Data by Clustering Attributes. Data Mining and Knowledge Discovery, vol. 26(1), pp. 130-173, Springer, 2013. (IF 2.877)
Mampaey, M. & Vreeken, J. Summarising Data by Clustering Items. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp. 321-336, Springer, 2010.