Fast and Reliable Anomaly Detection in Categorical Data

Abstract. Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce CompreX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags those points dissimilar to the norm – with high compression cost – as anomalies.

Our approach exhibits four key features: 1) it is parameter free; it builds dictionaries directly from data, and requires no user specified parameters such as distance functions or density and similarity thresholds, 2) it is general; we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features, 3)it is scalable; its running time grows linearly with respect to both database size as well as number of dimensions, and 4) it is effective; experiments on a broad range of datasets show large improvements in both compression, as well as precision in anomaly detection, outperforming its state-of-the-art competitors.


MatLab source code (Dec 2012) by Leman Akoglu.

Related Publications

Akoglu, L, Tong, H, Vreeken, J & Faloutsos, C Fast and Reliable Anomaly Detection in Categoric Data. In: Proceedings of ACM Conference on Information and Knowledge Management (CIKM), pp 415-424, ACM, 2012. (full paper, 13.4% acceptance rate; 27% overall)