Friday, 18 December 2015

Sampling k-combinations with linear memory and time - the bitmask/shift trick

k-combinations are like permutations but what they achieve is a string of length n made up of an alphabet of sigma (s) characters. Imagine we are dealing with RNA (alphabet = {a, u, t, g}) and we want the different combinations of triplet codons (n = 3), we know that there are sn or 43 = 64 combinations: (aaa, aau, aat, aag, uaa, uau, ..., etc). While we could generate the combinations and save them all to an array, we might come across a problem if we want a longer string. Let's say we increase n to 30, the combinations of 430 is 1152921504606846976 combinations. That's significantly more than 64, and we would struggle to generate and store that many combinations in memory. Basically, the size of our array grows exponentially with the increasing n. The longer the string you need, the more time and memory is required to produce and store all the combinations.

What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.

Let's say we want combinations of length n:

int n = 3;

First we stick the alphabet in an array:

int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};

Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:

int bits = ceil(√s); //sqrt(4) = 2

The number of combinations is of course sn:

int numCombinations = pow(s, n); //64

So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:

unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
    c = i;
    for (j = 0; j < n; j++) {
        printf("%c", alphabet[ (mask & c) ]);
        c = c >> bits;
    }
    printf("\n");
}

Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.

Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:

000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)

When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.

This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.

No comments:

Post a Comment