Saturday 6 August 2016

Pattern matching with degenerate characters and don't-cares using bit-vector encoding

The pattern matching problem falls into two practical categories: exact and approximate pattern matching.

In pattern matching we have a text T of length n and a pattern X of length m ≤ n. In exact pattern matching we want to find X in T exactly as it is. In approximate pattern matching we want to find X in T but it doesn't need to match exactly. We define a threshold value k which is the maximum number of optimal changes that can be made to X to make it match a part of T. If more than k changes are required to make X match a part of T then we conclude that it is not in fact a factor of T and move on to look at the next position.

Exact pattern matching is easily solved using the naive method in worst-case O(nm) time. Of course, better algorithms exist to reduce this complexity. The KMP algorithm is famous but it is not the best one to use. Other algorithms include Shift-And/Shift-Or, Horspool, BNDM and BOM, of which the last two are the most effective. They achieve their efficiency using a combination of text skipping - moving past parts of the text that is impossible to hold the pattern - and bit-parallel manipulation. Bit-parallelism takes advantage of the computer processor's ability to quickly recall items from registers and cache and manipulate it in parallel using simple shift operations and logic gate calculations. Exact pattern matching is also quickly achieved using on-line indexing methods such as Aho-Corasick Automata or the suffix tree/array approach which speed up multiple pattern look up.

Approximate pattern matching traditionally relies on Dynamic Programming approaches which have an O(nm) time complexity. Improvements often rely on bit-parallelism, avoiding unnecessary processing by filtering out parts of the text and various clever tricks relying on low k.

In this blog post I will show how to simulate approximate pattern matching using the naive exact pattern matching algorithm. One presumes it is possible to use better exact pattern-matching algorithms to achieve faster results, but the goal of this post is not to make the ultimate algorithm. Instead, it is to explain how to use bit-parallelism to achieve pattern matching for degenerate and don't-care characters.

I am using the DNA/RNA alphabet for this example. DNA consists of the characters A, C, G, and T. RNA substitutes T for U, making its alphabet A, C, G, U. The characters represent nucleotide bases, for example, A stands for Adenine. We introduce a few extra degenerate characters and a don't care character (N) as defined in the IUPAC Alphabet standard:

A.................Adenine
C.................Cytosine
G.................Guanine
T.................Thymine
U.................Uracil
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base

I call a character like R or B degenerate, meaning it can represent more than one single alphabet character. I call N a don't care character because it can represent any base. Note that there are 4 core letters (A[T|U]CG), assuming that T is equivalent to U and 15 characters all together considering. So why do we have this special IUPAC standard? Well, one reason is because there are enzymes that recognise a segment of DNA with some flexibility. The restriction enzyme EcoRII's specificity subunit recognises and cuts DNA when it finds the site CCWGG. W could be A or T. It is useful for biologists to know where it makes its cuts in the genome. Another reason the IUPAC alphabet might be useful is to define DNA sequences where the exact identity of a base is unclear or may contain errors when it is first sequenced.

We can represent every IUPAC character using a binary representation different to the ASCII standard. Through this binary representation and clever use of bit masks it is possible to make R = A or R = G, or N = any base.

First I create an array containing 122 elements whose index match to the ASCII numerical value of all the letters above. For example 'A' is index 65, 'B' is index 66 and so on. I could have made this a smaller array taking only the range A - Y, but I do not like using subtract all the time in my code. Thus, most of the positions will hold the value 0, but for A, I define its value to be 1, 0001 in binary. In fact, I define the 4 primary nucleotide bases as such:

NUC['A'] = 1; //00000001
NUC['C'] = 2; //00000010
NUC['G'] = 4; //00000100
NUC['T'] = 8; //00001000
NUC['U'] = 8//00001000

Notice that the position of the binary ones make each character distinct. For each distinct character a new bit is needed. In our case with an effective alphabet size of 4, it can all fit into the first nibble of a byte.

Now we design the degenerate characters by making them bit-masks of the characters they can represent.

NUC['R'] = 5;  //00000101 - a/g
NUC['Y'] = 10; //00001010 - c/t
NUC['S'] = 6;  //00000110 - g/c
NUC['W'] = 9;  //00001001 - a/t
NUC['K'] = 12; //00001100 - g/t
NUC['M'] = 3;  //00000011 - a/c
NUC['B'] = 14; //00001110 - c/g/t
NUC['D'] = 13; //00001101 - a/g/t
NUC['H'] = 11; //00001011 - a/c/t
NUC['V'] = 7;  //00000111 - a/c/g
NUC['N'] = 15; //00001111 - a/c/g/t

Before we can use this, I need to describe how to find an exact pattern in a text using the regular method. Let's say we have a text T = "GACCAGGAG" and a pattern X = "CCWGG" and we need to find X in T. But because the naive method cannot recognise the degenerate W character in T, it would not be sensible to use X as is. In fact, you would be forced to do two searches, once using CCAGG and the next time CCTGG. This is the naive search method:

i = 0
while i <= n - m:
    j = 0
    while j < m and T[i+j] == X[j]:
        j += 1
        if j == m:
            print "Match found at position: ", i
    i += 1

We use the variable i to keep track of the position in T and j the position in X. The outer while loop moves the pointer i through T one position at a time. So for every position in T, the inner while loop tries to move through X. It only continues to loop through X as long as the current character in T and subsequent characters match that of X until j is equal to the length of the pattern, in which case we record a match. If before j == m there is a mismatch, it breaks out of the inner loop and moves to the next position in T to keep looking for the pattern.

OK. If that is sufficiently clear, we can modify the above exact pattern matching strategy to match degenerate characters and don't cares based on the following principles.

Each character A[T|U]CG in the NUC array we defined above has a unique binary code with the position of its binary 1 differing. We use logical AND (&) to check if one character matches another.

  NUC['A']: 00000001
& NUC['A']: 00000001

--------------------
            00000001

  NUC['A']: 00000001
& NUC['C']: 00000010
--------------------
            00000000

So when A matches A, the resulting character is A and when it doesn't, it gives a different character (0). With degenerate characters we do:

  NUC['A']: 00000001
& NUC['W']: 00001001
--------------------
            00000001

  NUC['T']: 00001000
& NUC['W']: 00001001
--------------------
            00001000

  NUC['C']: 00000010
& NUC['W']: 00001001
--------------------
            00000000

As you can see, W matches either A or T to give the character it was tested with, and when compared with C, a character it does not represent, it does not match it. The don't care character N will match A, C, G or T/U because it's bit pattern is 00001111. So if we modify our naive algorithm above, we can change the condition of the inner loop so that if a character from X, whether it is A[T|U]CG or a degenerate character, when ANDed with a character from T results in the same character from T, then it is a match.

For ease, I create an array P to hold the binary representation of the characters in X as defined in the NUC array.

P = []
for j = 0 to m - 1:
    P.append(NUC[X[j]])

Then I can easily use it in the modified naive method, which this time supports degenerate and don't care characters:

i = 0
while i <= n - m:
    j = 0
    while j < m and (NUC[T[i+j]] & P[j]) == NUC[T[i+j]]:
        j += 1
        if j == m:
            print "Match found at position: ", i
    i += 1

So there it is. I searched with X = "CCWGG" and it found CCAGG in T. If I had CCTGG in T it would have found it as well. Also, if I didn't care about the value of one or two of the letters, then I could have used N in the pattern and it would have found the pattern I was looking for.

The naive method is limited to searching at O(nm) time complexity in the worst case, but if this technique is incorporated into a better exact string matching algorithm, it will be able to do it much quicker and with the added benefit of finding patterns using a degenerate character set which supports don't cares.