In my last post on the Active Authentication project I described how to use Microsoft Detours to collect a trace of system calls (also known as system events) for a single process. At Coveros Labs we built on an example program provided with Detours to create our own prototype system that validates the identity of a computer user. As described in the last post, this system operates in two modes: learning and authentication. In learning mode the system collects a system event trace and uses it to build a model of normal behavior. In authentication mode the system compares new events to the normal model to determine whether the current user of the system is legitimate. However, my last post did not provide the details of how these events are stored and used. In this post I will describe the attributes that can be used to represent an event and provide a simple example of how the identity matching algorithm can be used to identify anomalous behavior.
To collect large traces of system events for experimentation we used Procmon (Process Monitor), a Microsoft program that captures all of the registry, file, network, and process/thread events that occur on a computer. We chose Procmon over the prototype system for experimentation because Procmon captures events for all processes, while our prototype can currently monitor the events of only one process. This allowed us to collect a large amount of data and begin experimenting with anomaly detection algorithms without first having to extend the prototype. Here is an example of the information that Procmon collects for a single event, taken directly from an event trace:
"10:11:44.8056212 AM", "chrome.exe", "2940", "RegQueryKey", "HKCU", "SUCCESS", "Query: HandleTags, HandleTags: 0x0"
The attributes of this event are, in order, “Time of Day”, “Process Name”, “Process ID (PID)”, “Operation”, “Path”, “Result”, and “Detail”. Operation is the name of the event; in this example, “RegQueryKey” is a registry event. Path indicates the part of the file system or registry that the event accessed. In this case Path is “HKCU”, the abbreviation for HKEY_CURRENT_USER, which is one of the root keys of the Windows registry. Result reports whether the event succeeded (as it did here) or failed, in which case “SUCCESS” is replaced with an error condition (e.g., “FILE_NOT_FOUND”). Finally, Detail contains extra information describing the event.
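To make the attribute layout concrete, here is a minimal Python sketch that splits one trace line into named attributes. It assumes the line is comma-separated with double-quoted fields, as in the example above; the `FIELDS` list and the `parse_event` helper are names made up for this illustration and are not part of Procmon or of our prototype.

```python
import csv

# Attribute names in the order given in the post (assumed column order).
FIELDS = ["Time of Day", "Process Name", "PID", "Operation",
          "Path", "Result", "Detail"]

def parse_event(line):
    """Split one quoted, comma-separated trace line into a dict of attributes."""
    # skipinitialspace lets the reader accept a space after each comma;
    # quoting keeps the comma inside the Detail field from splitting it.
    row = next(csv.reader([line], skipinitialspace=True))
    if len(row) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
    return dict(zip(FIELDS, row))

line = ('"10:11:44.8056212 AM", "chrome.exe", "2940", "RegQueryKey", '
        '"HKCU", "SUCCESS", "Query: HandleTags, HandleTags: 0x0"')
event = parse_event(line)
print(event["Operation"], event["Path"], event["Result"])
```

Using a CSV reader rather than a plain `split(",")` matters here because the Detail attribute can itself contain commas.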
Although we could potentially use all of these attributes to represent an event, some pieces of information may not be useful. For example, the Result of the event may not give us any additional information to identify the user. We can speculate about how useful each part of the event is, but the best way to find out is through experimentation. So far we have represented events using Time of Day, Process Name, and Operation, with varying levels of success. In the remainder of this post I will describe the identity matching algorithm and, for the sake of creating a simple example, use only Operation in the event representation.
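One way to picture the choice of representation is as a projection of the full event onto a few chosen attributes. The sketch below is only an illustration under that assumption; the `represent` function and the dictionary layout are hypothetical and do not describe how our experiments are actually coded.

```python
def represent(event, attributes=("Operation",)):
    """Reduce an event to a tuple holding only the chosen attributes."""
    return tuple(event[name] for name in attributes)

# The example event from the trace, written out as a dict (assumed layout).
event = {
    "Time of Day": "10:11:44.8056212 AM",
    "Process Name": "chrome.exe",
    "PID": "2940",
    "Operation": "RegQueryKey",
    "Path": "HKCU",
    "Result": "SUCCESS",
    "Detail": "Query: HandleTags, HandleTags: 0x0",
}

# The representation used in the rest of this post: Operation only.
simple = represent(event)
# A richer candidate representation to compare against experimentally.
richer = represent(event, ("Process Name", "Operation"))
print(simple, richer)
```

Keeping the attribute list as a parameter makes it cheap to rerun the same experiment with different representations.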
The identity matching algorithm is one of the simplest anomaly detection algorithms. In learning mode the algorithm stores sequences of system events in a table. Then, in authentication mode, the algorithm compares each new sequence to the sequences that are in the table. If the new sequence is present in the table, then the algorithm considers it to be normal. If it is not, then the identity matcher considers the sequence to be abnormal and registers it as an anomaly. For example, consider the following table with system event sequences of length three:
ReadFile, ReadFile, WriteFile
RegCloseKey, RegOpenKey, RegSetInfoKey
ReadFile, WriteFile, RegCloseKey
RegQueryKey, ReadFile, ReadFile
WriteFile, RegCloseKey, RegOpenKey
Suppose that the following stream of events occurs in authentication mode:
ReadFile, WriteFile, RegCloseKey, RegOpenKey, RegSetInfoKey, RegQueryKey, ReadFile, ReadFile
Sliding a window of length three across this stream, one event at a time, produces the following six sequences. The sequences that the identity matching algorithm identifies as anomalous, because they do not appear in the table, are marked with an (X):
ReadFile, WriteFile, RegCloseKey
WriteFile, RegCloseKey, RegOpenKey
RegCloseKey, RegOpenKey, RegSetInfoKey
RegOpenKey, RegSetInfoKey, RegQueryKey (X)
RegSetInfoKey, RegQueryKey, ReadFile (X)
RegQueryKey, ReadFile, ReadFile
As you can see, two of these six sequences are anomalous. The advantages of the identity matching algorithm are that it is simple to implement, runs quickly (checking a sequence is just a table lookup), and has a low false accept rate, so it rarely lets an illegitimate user through. The downside is that, because the identity matcher is “exact”, it will often reject sequences from the legitimate user that simply did not occur during learning, resulting in a high rate of false rejects. In case you are not familiar with these terms, a false accept refers to the algorithm failing to label a sequence of events from the illegitimate user as anomalous, and a false reject refers to the algorithm incorrectly identifying a sequence of events from the legitimate user as anomalous. The frequency with which the algorithm accepts or rejects sequences is an important consideration in the design of the overall system that I will discuss in a future blog entry.
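The example above can be reproduced with a few lines of Python. This is a minimal sketch of identity matching as described in this post, not the code from our prototype: the function names `windows`, `learn`, and `authenticate` are invented for illustration, and the table is stored as a Python set, which is an assumption about the data structure.

```python
def windows(events, n=3):
    """Return every run of n consecutive events, in order of occurrence."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]

def learn(trace, n=3):
    """Learning mode: store every length-n sequence seen in the trace."""
    return set(windows(trace, n))

def authenticate(table, stream, n=3):
    """Authentication mode: pair each new sequence with an anomaly flag."""
    return [(w, w not in table) for w in windows(stream, n)]

# The table of normal sequences from the example in this post.
table = {
    ("ReadFile", "ReadFile", "WriteFile"),
    ("RegCloseKey", "RegOpenKey", "RegSetInfoKey"),
    ("ReadFile", "WriteFile", "RegCloseKey"),
    ("RegQueryKey", "ReadFile", "ReadFile"),
    ("WriteFile", "RegCloseKey", "RegOpenKey"),
}

# The stream of events observed in authentication mode.
stream = ["ReadFile", "WriteFile", "RegCloseKey", "RegOpenKey",
          "RegSetInfoKey", "RegQueryKey", "ReadFile", "ReadFile"]

results = authenticate(table, stream)
anomalies = [w for w, is_anomalous in results if is_anomalous]
for w, is_anomalous in results:
    print(", ".join(w), "(X)" if is_anomalous else "")
```

In a full run the table would come from `learn(trace)` on a trace collected in learning mode; here it is written out by hand so that it matches the example exactly.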
In this post I described how Coveros Labs is collecting traces of system events to use for experimentation and how the identity matching algorithm can be used to validate a user’s identity. In future posts I will describe two other algorithms that we are using for anomaly detection: a one-class support vector machine (SVM) and a neural network. Intuitively, these are “learning” algorithms. Unlike identity matching, which simply “memorizes” correct sequences, these algorithms attempt to find patterns in the event trace so that they can correctly accept unseen sequences produced by the normal user. I also plan to describe other parts of our system for anomaly detection, such as the frequency with which sequences of events are judged to be normal or anomalous and the use of the leaky bucket algorithm to reduce the number of anomalies reported to the user.