Achievement: We present a rigorous mathematical analysis of the isolation random forest algorithm for outlier detection. For one-dimensional data, we show convergence of the algorithm and establish conclusiveness for outliers defined in terms of relative distances. We also prove that, under the similar criterion on higher dimensions, the isolation tree forest is not conclusive.
Significance and Impact: The isolation random forest methods are popular randomized algorithms for the detection of outliers from datasets, since they do not require the computation of distances nor the parametrization of sample probability distributions. To our knowledge, this is the first attempt to a rigorous analysis of the algorithm.
Research Details
- The probabilistic structure of the set of all possible isolation trees is characterized in the case of one-dimensional data.
- We establish the convergence of the isolation forest algorithm to the expected tree height.
- We prove a theorem showing that the expected tree height in an isolation tree can asymptotically related to the relative distance of a data point to the rest of the dataset.
- Via a counterexample, we show that the extension of the isolation random forest method to two dimensions cannot be conclusive to detect points that are uniformly distant from the dataset.
Sponsor/Funding: Universidad Nacional de Colombia
PI and affiliation: Jorge Ramirez Osorio, Systems and Decision Sciences Group, Computer Science and Mathematics Division, ORNL
Team: Fernando Morales, Edgar Ramos (Universidad Nacional de Colombia)
Citation and DOI: Morales, FA, RamÃrez, JM, Ramos, EA. A mathematical assessment of the isolation random forest method for anomaly detection in big data. Math Meth Appl Sci. 2022; 1- 22. doi:10.1002/mma.8570
Summary: We present the mathematical analysis of the Isolation Random Forest Method (IRF Method) for anomaly detection, proposed by Liu F.T., Ting K.M. and Zhou Z. H. in their seminal work as a heuristic method for anomaly detection in Big Data. We prove that the IRF space can be endowed with a probability induced by the Isolation Tree algorithm (iTree). In this setting, the convergence of the IRF method is proved, using the Law of Large Numbers. A couple of counterexamples are presented to show that the method is inconclusive and no certificate of quality can be given, when using it as a means to detect anomalies. Hence, an alternative version of the method is proposed whose mathematical foundation is fully justified. Furthermore, a criterion for choosing the number of sampled trees needed to guarantee confidence intervals of the numerical results is presented. Finally, numerical experiments are presented to compare the performance of the classic method with the proposed one.