Mining Subjectively Interesting Structure in Real-Valued Data
with Kleanthis-Nikolaus Kontonasios and Tijl De Bie

Abstract. In exploratory data mining it is important to assess the significance of results. Because analysts have only limited time, we need to measure this significance with regard to what we already know. That is, we want to be able to measure whether a result is interesting from a subjective point of view.

With this as our goal, we formalise how to probabilistically model real-valued data by the Maximum Entropy principle, where we allow statistics on arbitrary sets of cells as background knowledge. As statistics, we consider means and variances, as well as histograms. The resulting models allow us to assess the likelihood of values, and can be used to verify the significance of (possibly overlapping) structures discovered in the data. As we can feed those structures back in, our model enables iterative identification of subjectively interesting structures.
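As a rough illustration of the modelling step, the Maximum Entropy problem can be written in its standard generic form. The notation here ($D$ for the data matrix, $f_i$ for a statistic, $\hat{f}_i$ for its value given as background knowledge, $\lambda_i$ for the Lagrange multipliers, $B$ for a set of cells) is assumed for this sketch and is not taken from the papers listed below.

```latex
% Generic Maximum Entropy problem: among all distributions that reproduce
% the background statistics in expectation, pick the one with maximal entropy.
\[
  p^{*} = \arg\max_{p} \; -\int p(D)\,\log p(D)\,\mathrm{d}D
  \quad \text{s.t.} \quad
  \mathbb{E}_{p}\!\left[f_i(D)\right] = \hat{f}_i, \qquad i = 1,\dots,k .
\]
% The solution has exponential-family form, with one multiplier per statistic.
\[
  p^{*}(D) = \frac{1}{Z(\lambda)} \exp\!\Big( \sum_{i=1}^{k} \lambda_i f_i(D) \Big).
\]
% Example statistics on a cell set B that encode its mean and variance:
\[
  f_B^{(1)}(D) = \sum_{(r,c)\in B} D_{rc},
  \qquad
  f_B^{(2)}(D) = \sum_{(r,c)\in B} D_{rc}^{2}.
\]
```

With statistics of this kind the exponent is a sum of per-cell quadratics, so the model factorises into one normal distribution per cell, provided the quadratic coefficients are negative. The parameters of each cell are determined by the multipliers of all sets that contain it, which is what allows the sets to overlap.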

To show the flexibility of our model, we propose a subjective informativeness measure for tiles, i.e. rectangular sub-matrices, in real-valued data. The Information Ratio weighs how strongly the knowledge of a structure reduces our uncertainty about the data against the amount of effort it would cost to consider it. Empirical evaluation shows that iterative scoring effectively reduces redundancy in ranking candidate tiles, which demonstrates the applicability of our model for a range of data mining fields aimed at discovering structure in real-valued data.
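The idea of scoring a tile and then feeding it back can be sketched in a few lines of Python. This is a toy under strong assumptions, not the method of the papers: cells are treated as independent Gaussians, the description length is a hypothetical count of row and column identifiers plus two summary statistics, and the background update simply overwrites the tile's cells with the tile's own mean and variance instead of refitting a Maximum Entropy model. All function names are made up for illustration.

```python
import math


def gaussian_surprisal(x, mean, var):
    """Negative log-density (in nats) of x under a normal N(mean, var)."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var)


def information_ratio(data, rows, cols, mean, var):
    """Toy Information Ratio of the tile rows x cols.

    Assumption: information content is the summed surprisal of the tile's
    cells, and description length is len(rows) + len(cols) + 2.
    Being based on densities, the score can be negative; only the
    comparison between scores is meaningful in this sketch.
    """
    info = sum(
        gaussian_surprisal(data[r][c], mean[r][c], var[r][c])
        for r in rows
        for c in cols
    )
    desc_len = len(rows) + len(cols) + 2
    return info / desc_len


def feed_back(data, rows, cols, mean, var, min_var=1e-6):
    """Simplified update: set the background of the tile's cells to the
    tile's empirical mean and variance (floored to avoid division by zero)."""
    values = [data[r][c] for r in rows for c in cols]
    m = sum(values) / len(values)
    v = max(sum((x - m) ** 2 for x in values) / len(values), min_var)
    for r in rows:
        for c in cols:
            mean[r][c] = m
            var[r][c] = v


# Small made-up matrix with a block of high values in the top-left corner.
data = [
    [5.0, 5.2, 0.1, -0.3],
    [4.8, 5.1, 0.2, 0.0],
    [0.1, -0.2, 0.3, 0.1],
    [0.0, 0.2, -0.1, 0.4],
]
# Initial background knowledge: every cell has mean 0 and variance 1.
mean = [[0.0] * 4 for _ in range(4)]
var = [[1.0] * 4 for _ in range(4)]

tile_rows, tile_cols = [0, 1], [0, 1]
before = information_ratio(data, tile_rows, tile_cols, mean, var)
feed_back(data, tile_rows, tile_cols, mean, var)
after = information_ratio(data, tile_rows, tile_cols, mean, var)
```

Once the tile has been fed back, the same tile (and any candidate that overlaps it heavily) scores much lower, which is the mechanism by which iterative scoring suppresses redundant candidates in a ranking.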



Related Publications

Kontonasios, K-N, Vreeken, J & De Bie, T. Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 256-271, Springer, 2013.
Kontonasios, K-N, Vreeken, J & De Bie, T. Maximum Entropy Modelling for Assessing Results on Real-Valued Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 350-359, IEEE, 2011.