What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.
Let's say we want combinations of length n:
int n = 3;
Let's say we want combinations of length n:
int n = 3;
First we stick the alphabet in an array:
int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};
Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:
int bits = ceil(√s); //sqrt(4) = 2
The number of combinations is of course sn:
int numCombinations = pow(s, n); //64
So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:
unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
c = i;
for (j = 0; j < n; j++) {
printf("%c", alphabet[ (mask & c) ]);
c = c >> bits;
}
printf("\n");
}
Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.
Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:
000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)
When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.
This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.
No comments:
Post a Comment