Modern computer network defense systems rely primarily on signature-based intrusion detection tools, which generate alerts when patterns predetermined to be malicious appear in network data streams. Signatures are created reactively, only after manual analysis of a network intrusion. As a result, these tools cannot detect new intrusions or variants of existing attacks, and their detectors cannot be adapted to the patterns unique to a given network environment.

The Oak Ridge Cyber Analytics (ORCA) Attack Variant Detector (AVD) is a sensor that uses machine learning technology to analyze behaviors in channels of communication between individual computers. Using examples of attack and non-attack traffic in the target environment (see Figure 1), the ORCA sensor is trained to recognize and discriminate between malicious and normal traffic types. The machine learning provides an insight that would be difficult for a human to explicitly code as a signature because it evaluates many interdependent metrics simultaneously.

ORCA-AVD can detect zero-day attacks because it classifies traffic by its similarity to learned traffic types rather than by matching specific signature patterns. It also detects attack variants: small changes to a known attack vector made to bypass signature-based sensors. This reduces the burden on organizations to account for every attack variant combination through signatures. False alerts are minimized because the machine learning has adapted to the network’s traffic patterns. ORCA-AVD has been verified on experimental military networks, where it demonstrated the ability both to complement signature-based tools and to detect variants of network attacks with very few false positives.
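The idea of classifying by similarity rather than by exact pattern can be illustrated with a small sketch. The source does not name the learning algorithm used in ORCA-AVD; the nearest-centroid classifier below is a stand-in chosen for brevity, and the two features (packets per window, mean packet size) and all function names are hypothetical.

```python
import math

def centroid(rows):
    """Average each feature column across the example vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train(examples):
    """Build one centroid per traffic type from labeled feature vectors.

    `examples` maps a label to a list of feature vectors. This is an
    illustrative stand-in, not the ORCA-AVD training procedure.
    """
    return {label: centroid(rows) for label, rows in examples.items()}

def classify(model, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical features: [packets per window, mean packet size in bytes]
examples = {
    "normal": [[10, 500], [12, 520]],
    "attack": [[200, 60], [220, 64]],
}
model = train(examples)

# A variant that matches no training example exactly is still
# assigned to the nearest traffic type.
variant_label = classify(model, [190, 75])
```

A signature keyed to one exact feature pattern would miss the variant above, whereas a distance-based decision still places it with the attack examples.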

The system architecture has two primary components: a data acquisition component that handles the high-speed translation of network data into sets of features, and an analysis pipeline that processes those feature sets to classify the observed network traffic.

The data acquisition portion of the system collects raw network traffic using a commercial network capture card and stores it on disk as packet capture (PCAP) files. In our solution, the card is not configured to perform any aggregation or analytics; its only role is to ensure that traffic is captured at the full data rate and persisted to disk. Once captured, the data is aggregated at the flow level, where a flow is identified by a unique 5-tuple (source address, destination address, source port, destination port, and protocol), over a given time window (e.g., 2 seconds). The aggregated data is then made available through a web service for use by the analysis pipeline.
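The flow-level aggregation step can be sketched as follows. This is a minimal illustration that assumes packets have already been parsed from the PCAP files into simple records; the `Packet` type, the `aggregate_flows` function, and the two per-flow features (packet count and byte count) are hypothetical and not taken from the ORCA implementation.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical parsed packet record."""
    timestamp: float  # seconds since the start of capture
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str
    length: int       # packet size in bytes

WINDOW_SECONDS = 2.0  # window length used as the example in the text

def aggregate_flows(packets, window=WINDOW_SECONDS):
    """Group packets by (window index, 5-tuple) and total simple features."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        window_index = int(p.timestamp // window)
        key = (window_index, p.src_addr, p.dst_addr,
               p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.length
    return dict(flows)

packets = [
    Packet(0.1, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 100),
    Packet(1.9, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 200),
    Packet(2.1, "10.0.0.1", "10.0.0.2", 1234, 80, "tcp", 50),
    Packet(0.5, "10.0.0.3", "10.0.0.2", 4321, 53, "udp", 60),
]
flows = aggregate_flows(packets)
```

Each key in `flows` pairs a window index with a 5-tuple, so the same connection observed in two consecutive windows yields two separate feature records.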

The core analysis occurs in the analysis pipeline, a software construct whose components can be plugged in or out through a configuration specification. Examples of pipeline components include data summation blocks, machine-learning-based feature extraction/derivation blocks, and alerting blocks. These can be arranged in almost any order, although in normal operation feature extraction and summation blocks precede alerting blocks. The pipeline is structured so that each stage builds on the analysis applied in prior stages. Downstream components are informed of upstream decisions through a common medium of exchange: a shared message structure that extends a feature set composed of feature names and associated values. The current implementation emits alerts in a log-file style, on the assumption that a log-file aggregation system is in place.
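A configurable pipeline of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the block names, the `REGISTRY` lookup, the list-of-names configuration format, and the fixed threshold that stands in for the machine-learning block are not taken from the ORCA implementation. The shared message is modeled as a dictionary of feature names and values that each stage extends.

```python
import logging
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]            # feature names mapped to values
Stage = Callable[[Message], Message]

logger = logging.getLogger("orca_sketch")  # hypothetical logger name

def summation(msg: Message) -> Message:
    """Derive a summary feature from raw counts."""
    out = dict(msg)
    out["bytes_per_packet"] = msg["bytes"] / max(msg["packets"], 1)
    return out

def classify(msg: Message) -> Message:
    """Stand-in for a learned classifier: a fixed threshold, for illustration."""
    out = dict(msg)
    out["label"] = "attack" if out["bytes_per_packet"] > 1000 else "normal"
    return out

def alert(msg: Message) -> Message:
    """Write a log-style alert based on the decision made upstream."""
    out = dict(msg)
    out["alerted"] = out.get("label") == "attack"
    if out["alerted"]:
        logger.warning("ALERT label=%s bytes_per_packet=%.1f",
                       out["label"], out["bytes_per_packet"])
    return out

REGISTRY: Dict[str, Stage] = {
    "summation": summation,
    "classify": classify,
    "alert": alert,
}

def build_pipeline(config: List[str]) -> List[Stage]:
    """Resolve an ordered list of block names into callable stages."""
    return [REGISTRY[name] for name in config]

def run_pipeline(stages: List[Stage], msg: Message) -> Message:
    """Pass the message through each stage in order."""
    for stage in stages:
        msg = stage(msg)
    return msg

stages = build_pipeline(["summation", "classify", "alert"])
result = run_pipeline(stages, {"packets": 2, "bytes": 4000})
```

Because each stage returns an extended copy of the message, the alerting block can read the label set upstream without knowing which block produced it, and reordering or removing blocks only requires a change to the configuration list.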

Principal Investigator

Justin M Beaver