Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabi...

by Robert A Bridges, Jessie D Jamieson, Joel W Reed
Publication Type
Conference Paper
Book Title
2017 IEEE International Conference on Big Data (Big Data)
Page Numbers
1071 to 1078
Conference Name
2017 IEEE International Conference on Big Data (BIGDATA)
Conference Location
Boston, Massachusetts, United States of America

Cyber operations now manage a high volume of heterogeneous log data. Anomaly Detection (AD) in such operations involves multiple (e.g., per IP, per data type) ensembles of detectors modeling heterogeneous characteristics (e.g., rate, size, type), often with adaptive online models producing alerts in near real time. Because of the high data volume, setting the threshold for each detector in such a system is an essential yet underdeveloped configuration issue: a slightly mistuned threshold can leave the system useless, either producing a myriad of alerts (and flooding downstream systems) or producing none. In this work, we build on the foundations of Ferragut et al. to provide a set of rigorous results for understanding the relationship between threshold values and alert quantities for probabilistic detectors. These results inform an algorithm for setting the thresholds of multiple, heterogeneous, possibly dynamic detectors completely a priori, in principle. Indeed, if the underlying distribution of the incoming data is known, the algorithm provides provably manageable thresholds. If the distribution is unknown or poorly estimated, our analysis gives insight into how the model distribution differs from the actual distribution, indicating that refitting is necessary. We provide empirical experiments regulating the alert rate of a system with ≈2,500 adaptive detectors scoring over 1.5M events in 5 hours of timestamps. Further, using real network data and the detection framework of Harshaw et al., we demonstrate the alternative case: an inability to regulate alerts indicates that the detection model is not a good fit to the data.
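
The relationship between threshold and alert quantity described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a single detector with a Gaussian model, assumes the p-value style anomaly score associated with Ferragut et al. (an event x is flagged when P(f(X) <= f(x)) <= alpha, so a well-fit model alerts on at most a fraction alpha of events), and uses hypothetical helper names (`p_value`, `alert_rate`, `threshold_for_budget`) and made-up event counts and budgets.

```python
import math
import random


def p_value(x, mu=0.0, sigma=1.0):
    """P(f(X) <= f(x)) for X ~ Normal(mu, sigma), i.e. the two-sided tail mass."""
    z = abs(x - mu) / sigma
    return math.erfc(z / math.sqrt(2.0))


def alert_rate(samples, alpha, mu=0.0, sigma=1.0):
    """Fraction of samples flagged when alerting on p_value <= alpha."""
    flagged = sum(1 for x in samples if p_value(x, mu, sigma) <= alpha)
    return flagged / len(samples)


def threshold_for_budget(alert_budget, expected_events):
    """Assumed rule: pick alpha so that alpha * expected_events <= alert_budget."""
    return min(1.0, alert_budget / expected_events)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_events = 200_000      # illustrative event count, not from the paper

# Operator tolerates about 200 alerts over these events.
alpha = threshold_for_budget(alert_budget=200, expected_events=n_events)

# Case 1: data really follows the model (standard normal).
well_fit = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
# Case 2: data is wider than the model assumes (model misfit).
misfit = [rng.gauss(0.0, 2.0) for _ in range(n_events)]

rate_fit = alert_rate(well_fit, alpha)
rate_misfit = alert_rate(misfit, alpha)

print(f"alpha = {alpha}")
print(f"alert rate, well-fit model: {rate_fit:.5f}")
print(f"alert rate, misfit model:   {rate_misfit:.5f}")
```

Under these assumptions, the well-fit case should alert at a rate close to alpha, while the misfit case should alert far more often. An observed alert rate well above the chosen threshold is the kind of signal the abstract describes as indicating that the model needs refitting.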