Projects & Implementations

by Jilles Vreeken

phpBibLib is a PHP library for easily parsing and displaying entries from bibtex files, including the possibility of using citations in a webpage and displaying the corresponding references.

Comparing Apples and Oranges

with Nikolaj Tatti

Measuring the difference between data mining results is an important open problem in exploratory data mining. We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data.

Tatti, N & Vreeken, J Comparing Apples and Oranges – Measuring Differences between Exploratory Data Mining Results. Data Mining and Knowledge Discovery vol.25(2), pp 173-207, Springer, 2012.

Summarising Categorical Data by Clustering Attributes

with Michael Mampaey

A good first impression of a dataset is paramount to how we proceed with our analysis. We discuss mining high-quality high-level descriptive summaries for binary and categorical data. Our approach builds summaries by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering.

Mampaey, M & Vreeken, J Summarizing Categorical Data by Clustering Attributes. Data Mining and Knowledge Discovery vol.26(1), pp 130-173, Springer, 2013.

Cumulative Mutual Information

An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection

Nguyen, H-V, Müller, E, Vreeken, J, Keller, F & Böhm, K CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 198-206, SIAM, 2013.

Fast and Reliable Anomaly Detection in Categorical Data

CompreX discovers anomalies in data using pattern-based compression. Informally, it finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags points dissimilar to the norm – those with high compression cost – as anomalies.

Akoglu, L, Tong, H, Vreeken, J & Faloutsos, C Fast and Reliable Anomaly Detection in Categorical Data. In: Proceedings of ACM Conference on Information and Knowledge Management (CIKM), pp 415-424, ACM, 2012.
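The idea of flagging points with a high compression cost can be illustrated with a much simpler stand-in model. The sketch below is not CompreX: it assumes one independent frequency table per attribute instead of the dictionaries described above, and the function name `compression_costs` and the toy data are hypothetical. Each row costs the sum of the Shannon code lengths of its values, so rare values make a row expensive to encode.

```python
from collections import Counter
from math import log2


def compression_costs(rows):
    """Return the encoding cost in bits of every row.

    Assumption: attributes are encoded independently, each value with a
    Shannon-optimal code of length -log2(relative frequency).
    """
    n = len(rows)
    n_attrs = len(rows[0])
    # One frequency table per attribute (column).
    counts = [Counter(row[j] for row in rows) for j in range(n_attrs)]
    return [
        sum(-log2(counts[j][row[j]] / n) for j in range(n_attrs))
        for row in rows
    ]


# Toy data: seven identical rows and one row that deviates from the norm.
rows = [("red", "small")] * 7 + [("blue", "large")]
costs = compression_costs(rows)

# The row with the highest cost is the anomaly candidate.
most_anomalous = max(range(len(rows)), key=lambda i: costs[i])
```

In this toy example the deviating row needs 3 bits per attribute, while each common row needs well under one bit in total, so ranking rows by cost puts the outlier first.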

Spotting Culprits in Epidemics: How many and Which ones?

with B. Aditya Prakash

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? With NetSleuth, we answer this question affirmatively for the Susceptible-Infected virus propagation model.

Prakash, BA, Vreeken, J & Faloutsos, C Spotting Culprits in Epidemics: How many and Which ones?. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 11-20, IEEE, 2012.
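For readers unfamiliar with the Susceptible-Infected model, the following minimal sketch simulates the forward process that produces such a snapshot. It is not part of NetSleuth, which addresses the reverse problem of recovering the seeds. The function name `simulate_si`, the parameter name `beta` for the per-edge infection probability, and the adjacency-dict graph format are assumptions made for illustration.

```python
import random


def simulate_si(adjacency, seeds, beta, steps, rng=None):
    """Simulate Susceptible-Infected spreading for a fixed number of steps.

    In every step each infected node infects each susceptible neighbour
    independently with probability beta. Infected nodes never recover.
    """
    rng = rng or random.Random(0)
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for u in infected:
            for v in adjacency[u]:
                if v not in infected and rng.random() < beta:
                    newly_infected.add(v)
        infected |= newly_infected
    return infected


# A path graph 0 - 1 - 2 - 3 - 4 with node 0 as the only culprit.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate_si(path, seeds={0}, beta=1.0, steps=2)
```

With `beta=1.0` the infection advances exactly one hop per step, so after two steps the snapshot contains nodes 0, 1 and 2. For smaller `beta` the snapshot is random, which is what makes identifying the culprits non-trivial.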

Connection Pathways in Large Graphs

with Leman Akoglu, Hanghang Tong, Polo Chau, Nikolaj Tatti & Christos Faloutsos

Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? Or, if segregated, how many groups do they form? We approach this problem by trying to find simple connection pathways between sets of marked nodes, using MDL to identify the optimal result. We propose the efficient dot2dot algorithm for approximating this goal.

Akoglu, L, Vreeken, J, Tong, H, Chau, DH, Tatti, N & Faloutsos, C Mining Connection Pathways for Marked Nodes in Large Graphs. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 37-45, SIAM, 2013.

Detecting Bicliques in GF[q]

with Jan Ramon and Pauli Miettinen

Ramon, J, Miettinen, P & Vreeken, J Detecting Bicliques in GF[q]. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 509-524, Springer, 2013.

Mining Itemsets that Compress

with Matthijs van Leeuwen & Arno Siebes

The Krimp algorithm mines sets of itemsets by the MDL principle, defining the best set of patterns as the set that compresses the data best. The resulting code tables are orders of magnitude smaller than the number of (closed) frequent itemsets. They are highly characteristic of the data, and obtain high accuracy on many data mining tasks.

Vreeken, J, van Leeuwen, M & Siebes, A Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery vol.23(1), pp 169-214, Springer, 2011.
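To make "the set that compresses the data best" concrete, here is a minimal sketch of how the encoded size of a database under a given code table can be computed. It is an illustration under simplifying assumptions, not the Krimp implementation: the code table is taken as an already ordered list of itemsets, only the cost of the data given the code table is computed (the cost of the code table itself is left out), and the names `cover` and `encoded_size` are hypothetical.

```python
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets,
    trying the itemsets in the order in which the code table lists them."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    if remaining:
        raise ValueError("code table must contain every singleton item")
    return used


def encoded_size(database, code_table):
    """Bits needed to encode the database, with Shannon-optimal code
    lengths derived from how often each itemset is used in the cover."""
    usage = {itemset: 0 for itemset in code_table}
    for transaction in database:
        for itemset in cover(transaction, code_table):
            usage[itemset] += 1
    total = sum(usage.values())
    return sum(-u * log2(u / total) for u in usage.values() if u > 0)


database = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

bits_singletons = encoded_size(database, singletons)
bits_with_pattern = encoded_size(database, with_pattern)
```

On this toy database, adding the itemset {a, b} lowers the encoded size from roughly 14.3 bits to 6 bits, which is the kind of gain that leads an MDL-based search to keep a pattern.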

MDL for Boolean Matrix Factorization

with Pauli Miettinen

Boolean Matrix Factorization has many desirable properties, such as high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate.

Miettinen, P & Vreeken, J mdl4bmf: Minimum Description Length for Boolean Matrix Factorization. Transactions on Knowledge Discovery from Data vol.8(4), pp 1-30, ACM, 2014.

Summarizing Data with the Most Informative Itemsets

Winner of the ACM SIGKDD 2011 Best Student Paper Award — with Michael Mampaey & Nikolaj Tatti

mtv is a well-founded approach for summarizing data with itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides the most new information, and update our model accordingly. We can either mine top-k patterns, or identify the best summarisation by MDL or BIC.

Mampaey, M, Vreeken, J & Tatti, N Summarizing Data Succinctly with the Most Informative Itemsets. Transactions on Knowledge Discovery from Data vol.6(4), pp 1-44, ACM, 2012.

Mining Subjectively Interesting Structure in Real-Valued Data

with Kleanthis-Nikolaus Kontonasios and Tijl De Bie

We formalise how to probabilistically model real-valued data by the Maximum Entropy principle, where we allow statistics on arbitrary sets of cells as background knowledge in terms of means and variances, or histograms.

Kontonasios, K-N, Vreeken, J & De Bie, T Maximum Entropy Modelling for Assessing Results on Real-Valued Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 350-359, IEEE, 2011.

Finding Good Itemsets by Packing Data

with Nikolaj Tatti

We aim to find itemsets that characterise the data well. To this end, we construct decision trees by which we can pack the data succinctly, and from which we can subsequently identify the most important itemsets. The Pack algorithm can either filter a candidate collection or mine its models directly from data.

Tatti, N & Vreeken, J Finding Good Itemsets by Packing Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 588-597, IEEE, 2008.

Directly Mining Descriptive Patterns

Slim mines high-quality Krimp code tables directly from data, as opposed to filtering a candidate collection. By doing so, Slim obtains smaller code tables that provide better compression ratios, while also improving classification accuracy and runtime, and reducing memory complexity by orders of magnitude.

Smets, K & Vreeken, J Slim: Directly Mining Descriptive Patterns. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 236-247, SIAM, 2012.

Summarising Event Sequences

with Nikolaj Tatti

We consider mining informative serial episodes (subsequences allowing for gaps) from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter-free mining of such models directly from data.

Tatti, N & Vreeken, J The Long and the Short of It: Summarising Event Sequences with Serial Episodes. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 462-470, ACM, 2012.

Discovering Descriptive Tile Trees

with Nikolaj Tatti

Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles: rectangles with a significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining optimal sub-tiles in just O(nm min(n,m)) time.

Tatti, N & Vreeken, J Discovering Descriptive Tile Trees by Fast Mining of Optimal Geometric Subtiles. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 9-24, Springer, 2012.

Tiles and Tilings

by Jilles Vreeken, based on the paper 'Tiling Databases' by Geerts, Goethals & Mielikainen.

Remmerie, N, De Vijlder, T, Valkenborg, D, Laukens, K, Smets, K, Vreeken, J, Mertens, I, Carpentier, S, Panis, B, De Jaeger, G, Prinsen, E & Witters, E Unraveling Tobacco BY-2 Protein Complexes with BN PAGE/LC-MS/MS and Clustering Methods. Journal of Proteomics vol.74(8), pp 1201-1217, Elsevier, 2011.