Heuristics for Packet Field Identification
When performing any type of protocol fuzzing, one must obtain three key pieces of information about the target protocol: structure, state and semantics. The structure of a network protocol is the format of the messages, which contains a series of fields which, at the simplest level, are integers and strings. When dealing with any protocol with public specifications, this information is easily obtained. However, what is one to do when the specs are not publicly available, say in the case of a proprietary industrial control and automation protocol?
For the past few years, I have spent a good deal of time researching techniques to improve the automation of packet field identification, mainly for the purposes of intelligent protocol fuzzing. Some of you may have seen my work titled, Protocol Informatics at various conferences. Protocol Informatics utilizes techniques borrowed from the field of Bioinformatics to identify the locations of the start and end of each field in a protocol message (this is analogous to identifying the locations of specific genes in a genome). The technology borrowed from Bioinformatics revolves around the use of sequence alignment algorithms. The idea is pretty simple: Take two sequences, each of arbitrary length, and insert gaps at the appropriate offsets to obtain two new sequences, each being of the same length. What does this buy us? The algorithms used are designed to insert gaps where the bindings between the characters in each sequence are strongest, resulting in an alignment with gaps inserted in places where data is missing.
GET /_________ HTTP/1.0\r\n
GET /index.html HTTP/1.0\r\n
This leads us to an observation that from offset n to offset m is a place where variable length data may be present. Using this information, one can start to understand the format of the particular message under analysis. For example, if when analyzing a binary protocol, there is a long sequence of ASCII characters in one of the sequences and a series of gaps in the other, one could start to look to see if any size value is prefixed to the string. The true power of this technique arises when combining it in a multiple alignment. A multiple alignment involves aligning multiple sequences to one another. Once a series of sequences are aligned to each other, informational techniques can be used to extract more information about where fields exist. What the analyzer does is then go down each column and look at some key pieces of data: what is the range of characters used, how often does it change in relation to itself and how often does it change in relation to its neighbors. If two columns change at the exact same rate and contain the same character set, it may be a protocol identifier like an ip id field.
The Protocol Informatics project is becoming more active as this year goes on, so expect more code and examples on the blog. In the meantime, to satisfy any curiosity, the old PI code is available here .
In my next post, I will explain the concept of sequence logos and how they are used to preserve information content that is previously lost when doing simple consensus sequences when trying to understand packet formats.
February 6th, 2007 at 4:06 pm
Hi Marshall!
I’ve been your README file and there you show a example using the icmp protocol. I did the same example here, but I didn’t understand this:
Ungapped Consensus:
CONS x08 x00 x3f x18 x05 xbe x00 ???
DT BBB ZZZ BBB BBB BBB BBB ZZZ AAA
MT 000 000 081 089 000 000 000 100
Step 3: Analyze Consensus Sequence
Pay attention to datatype composition and mutation rate.
Offset 0: Binary data, 0% mutation rate
Offset 1: Zeroed data, 0% mutation rate
Offset 2: Binary data, 81% mutation rate
Offset 3: Binary data, 89% mutation rate
Offset 4: Binary data, 0% mutation rate
Offset 5: Binary data, 0% mutation rate
Offset 6: Zeroed data, 0% mutation rate
Offset 7: ASCII data, 100% mutation rate
Using this information we can construct the structure of the format:
[ 1 byte ] [ 1 byte ] [ 2 byte ] [ 2 byte ] [ 1 byte ] [ 1 byte ]
So, based in the “Ungapped Consensus”, how did you realize the structure and byte-size of the messagens?
Thanks in advance!
August 28th, 2007 at 6:16 am
Hi Marshall!
I’ve seen your “Network Protocol Analysis using Bioinformatics Algorithms”.
It’s an excellent idea!
I do want a try, but i don’t know python…
Can i have a code written in c or c++?
my email: joeasnake@hotmail.com
Thank you very much!