Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors

by Robert A Bridges, Jessie D Jamieson, Joel W Reed

Publication Type

Conference Paper

Book Title

2017 IEEE International Conference on Big Data (Big Data)

Publication Date

December, 2017

Page Numbers

1071 to 1078

Conference Name

2017 IEEE International Conference on Big Data (BIGDATA)

Conference Location

Boston, Maryland, United States of America

Conference Sponsor

IEEE

Conference Date

Dec 11, 2017 - Dec 14, 2017

View DOI Listing

Abstract

Cyber operations now manage a high volume of heterogeneous log data. Anomaly Detection (AD) in such operations involves multiple (e.g., per IP, per data type) ensembles of detectors modeling heterogeneous characteristics (e.g., rate, size, type) often with adaptive online models producing alerts in near real time. Because of the high data volume, setting the threshold for each detector in such a system is an essential yet underdeveloped configuration issue that, if slightly mistuned, can leave the system useless, either producing a myriad of alerts (and flooding downstream systems) or giving none. In this work, we build on the foundations of Ferragut et al. to provide a set of rigorous results for understanding the relationship between threshold values and alert quantities for probabilistic detectors. This informs an algorithm for setting the threshold of multiple, heterogeneous, possibly dynamic detectors completely a priori, in principle. Indeed, if the underlying distribution of the incoming data is known, the algorithm provides provably manageable thresholds. If the distribution is unknown (poorly estimated), our analysis gives insight into how the model distribution differs from the actual distribution, indicating refitting is necessary. We provide empirical experiments, regulating the alert rate of a system with ≈2,500 adaptive detectors scoring over 1.5M events in 5 hours of timestamps. Further, we demonstrate on real network data and detection framework of Harshaw et al. the alternative case, demonstrating that the inability to regulate alerts indicates how the detection model is not a good fit to the data.

Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors

Abstract

Researchers

Organizations