Showing posts with label permutations. Show all posts
Showing posts with label permutations. Show all posts

Friday, 18 December 2015

Sampling k-combinations with linear memory and time - the bitmask/shift trick

k-combinations are like permutations but what they achieve is a string of length n made up of an alphabet of sigma (s) characters. Imagine we are dealing with RNA (alphabet = {a, u, t, g}) and we want the different combinations of triplet codons (n = 3), we know that there are sn or 43 = 64 combinations: (aaa, aau, aat, aag, uaa, uau, ..., etc). While we could generate the combinations and save them all to an array, we might come across a problem if we want a longer string. Let's say we increase n to 30, the combinations of 430 is 1152921504606846976 combinations. That's significantly more than 64, and we would struggle to generate and store that many combinations in memory. Basically, the size of our array grows exponentially with the increasing n. The longer the string you need, the more time and memory is required to produce and store all the combinations.

What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.

Let's say we want combinations of length n:

int n = 3;

First we stick the alphabet in an array:

int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};

Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:

int bits = ceil(√s); //sqrt(4) = 2

The number of combinations is of course sn:

int numCombinations = pow(s, n); //64

So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:

unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
    c = i;
    for (j = 0; j < n; j++) {
        printf("%c", alphabet[ (mask & c) ]);
        c = c >> bits;
    }
    printf("\n");
}

Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.

Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:

000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)

When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.

This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.

Friday, 18 April 2014

Permutations

The other day a friend was seeking advice about passwords and password length and such and I just told him the strength of a password is based on however many characters (n) he has available to him to the power of the length of the string (l), n^l, and that he should also use non-alphanumeric characters as well and make sure it's long enough and memorable to him. Using the formula, a 4 letter password using just digits is 10^4, or 10000 possible combinations.

That got me thinking about cracking and finding combinations of passwords if you know which characters were used and in what frequency. Basically you would need to jumble the characters up and try the different combinations but because you know which characters were used the entropy is far less, l^2 - l. This would mean a 4 letter passwords using the characters abcd would have 4^2 - 4, or 12 possible combinations.

I had to put a little thought into how I was going to code this. The first thing you do when you have to write an algorithm is figure out how you would do it in real life and I thought up the idea of swivelling characters round. So, if we have the starting characters abcd, you would lock the first character and swivel the other three round:

abcd
adbc
acdb

Then, you would shift the first character (a) off the original array and push it onto the end, lock the new first character (b), and do the swivelling of the other three again:

bcda
bacd
bdac

and so on. So when I first tried to solve this problem I did it the easy way by using the php array functions:

<?php

/**
 * Produce permutations of a string using php array functions
 * @author Ahmad Retha
 * @license public domain
 */

$s = "abcd"; //starting letters
$a = str_split($s); //converted to char array
$l = count($a); //length of array
$i = 0; //initialize counter for loop
$r = array(); //result array holds the combinations

while ($i < $l) {
   
    $b = array_slice($a, 1); //new array missing first element
    $j = 0; //initialize counter for inter-swivel
    while ($j < $l - 1) {
        array_push($b, array_shift($b)); //swivel smaller array

        //store result in $r
        $r[] = implode('', array_merge((array)$a[0], $b));
        $j++;
    }

   
    array_push($a, array_shift($a)); //swivel initial character
   
    $i++;
}
?>


But I was not satisfied with this and set about rewriting it in a more efficient way using array index manipulation to move elements around. This code is a little longer but more efficient and runs a tad faster:

<?php

/**
 * Produce permutations of a string through array manipulation
 * @author Ahmad Retha
 * @license public domain
 */

$s = "abcd"; //starting letters
$a = str_split($s); //converted to char array
$l = count($a); //length of array
$r = array(); //result array holds the combinations
$c = 0; //counter
$t = null; //temporarily holds first char

while ($c < $l) {

    $t = $a[0]; //store first char

    $n = 1; //initialize counter for inner loop

    for ($i = 1; $i < $l; $i++) {

        while ($n < $l) {

            //temporarily store first char of inner substring
            $u = $a[1];
           
            //shift chars along in substring
            for ($j = 2; $j < $l; $j++) {
                $a[$j - 1] = $a[$j];
            }
           
            $a[$l - 1] = $u; //push substring temp char to end
           
            $r[] = implode('', $a); //store the permutation
           
            $n++;
        }
        

        //shift chars along in whole string
        $a[$i - 1] = $a[$i];
    }

    $a[$l - 1] = $t; //push temp char to end

    $c++;
}


?>


This is sample output of the 12 combinations held in $r:

acdb
adbc
abcd
bdac
bacd
bcda
cabd
cbda
cdab
dbca
dcab
dabc