Showing posts with label C++. Show all posts
Showing posts with label C++. Show all posts

Tuesday, 12 April 2016

Counting Sort (index-based sort) reinvented

So the past weekend I independently came up with what I would later find out was the counting sort and I went as far as implementing it before I found out it already existed. Awkward. I initially called it the quick dictionary sort and afterwards renamed it to the quick tally sort (qtsort) because a "dictionary" is a list of words and not individual characters and numbers which is what this algorithm is used for.

The counting sort is a linear time O(n) sorting algorithm because it does not work by an element value comparison technique, which has a proven best worst case complexity of O(n log n). Many sorting algorithms are O(n2) in the worst case, but satisfyingly the counting sort is linear. That's a revelation when it comes to sorting because it is so fast! The best thing about the counting sort is that it is so elegantly simple and can take up little extra memory too. The space complexity is also very small if the range of the elements is small O(nmax - nmin).

Let's go through an example with a character array to clarify. Imagine a string t made of a set of characters: t = {A,D,B,E,A}. We know that the ascii value of character A is 65, B is 66 and the rest of the alphabet is ordered sequentionally in the same way. We can create an array in the range of characters in t to hold all the unique values. The character A is the smallest value (65) in t and the character E is the largest (69).

We create an array T, which stands for tally, and it should hold the frequency each character occurs in text t. As arrays start at index 0, we can make any character in t, A through D, be represented by an index of our array by subtracting the minimum value from each characters. Also, our array T, will be of size = tmax - tmin + 1, or 5 = 69 - 65 + 1. To let the letter A be represented as the first index (0) we do: index = A - tmin. T looks like this when we initialize it and we haven't gone through the text yet:

 idx|val
+---+---+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+---+---+

Notice that there is position 2 in the array even when we don't have the character C in t. We go through each character in t and each time we increment the value at the calculated index of that character, so eventually our T array looks like this. We have two A's, and one each of B, D and E.

 idx|val
+---+---+
| 0 | 2 |
| 1 | 1 |

| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
+---+---+

The T array is filled out using a simple loop which goes through each character of t and increments the counter, so by the end of the loop it has made a tally of how often each character occurs. This action is represented by this bit of psuedo code:

foreach char in t:
    T[char - tmin] = T[char - tmin] + 1

Now that we have filled the tally array, we can go through it in sequential order and wherever the value is 1 or more, we print that many of the character. This can be implemented with a simple for loop through the range of the items in T. With each index in the array, we work out which character it really represents by adding tmin back to it, so with the first element, c = idx + tmin, A= 65 = 0 + 65; and with the next index, B = 66 = 1 + 65. If the value at an index in T is 0, meaning no characters exist for it in t, then we skip it and go to the next index. Here is the psuedo code:

foreach idx,val in T:
    if val > 0:
        c = idx + tmin
        while val > 0:
            print c
            val = val - 1


This simple bit of code would then print out t in the correct order: AABDE. The algorithm runs linear in time O(n), but is limited by the range of the values, so if you had an huge range of numbers it would occupy a lot of memory for T and would need to go through all of T in order to print it out. Therefore the Counting sort is best applied to long strings or arrays of numbers within a small range of values. It is unfortunately unsuitable for use with other types of data such as floats or words. It's role lies in sorting lists of numbers, such as table indexes, and characters very quickly and using little memory. It is better to use it than other sorts for certain datasets and will yield a result faster than O(n log n) time.

You can check out a working example on github written in C and works only with characters currently, but can be modified with ease to work with larger integers. You don't need to specify tmin or tmax because the program works that out in linear time too! Here it is: https://github.com/webmasterar/qtsort

Friday, 18 December 2015

Sampling k-combinations with linear memory and time - the bitmask/shift trick

k-combinations are like permutations but what they achieve is a string of length n made up of an alphabet of sigma (s) characters. Imagine we are dealing with RNA (alphabet = {a, u, t, g}) and we want the different combinations of triplet codons (n = 3), we know that there are sn or 43 = 64 combinations: (aaa, aau, aat, aag, uaa, uau, ..., etc). While we could generate the combinations and save them all to an array, we might come across a problem if we want a longer string. Let's say we increase n to 30, the combinations of 430 is 1152921504606846976 combinations. That's significantly more than 64, and we would struggle to generate and store that many combinations in memory. Basically, the size of our array grows exponentially with the increasing n. The longer the string you need, the more time and memory is required to produce and store all the combinations.

What I'm going to briefly explain is a way to get any combination without an expensive algorithm to generate all the combinations in order to store it in main memory or on disk. There is no need for a large array because our combinations are generated on the fly. The time complexity of getting any particular permutation is linear so it is useful for sampling or getting a small set of combinations without having to generate all of them.

Let's say we want combinations of length n:

int n = 3;

First we stick the alphabet in an array:

int s = 4;
char alphabet[s] = {'a', 'u', 'c', 'g'};

Then we work out how many bits are needed to represent the alphabet characters by rounding-up the square-root of the alphabet size:

int bits = ceil(√s); //sqrt(4) = 2

The number of combinations is of course sn:

int numCombinations = pow(s, n); //64

So imagine we want to print out all the combinations, all we have to do is to loop numCombinations times and decode the index of the loop (i) into a combination using a bitmask:

unsigned long long int i, j, c;
unsigned long long int mask = (1 << bits) - 1; //00000011
for (i = 0; i < numCombinations; i++) {
    c = i;
    for (j = 0; j < n; j++) {
        printf("%c", alphabet[ (mask & c) ]);
        c = c >> bits;
    }
    printf("\n");
}

Basically, if you wanted a single random combination, just pick a random number between 0 and 63 inclusive. Create the bitmask and for each character in n shuffle the bit along a set number of bits.

Imagine we chose the number 57 at random. This is 111001 in binary. We can read two bits at a time by doing a bitwise AND using the mask (000011) and then we do a bitwise right-shift two positions:

000011 & 111001 = 000001 //index 1 (u)
shift c right 2 bits.
000011 & 001110 = 000010 //index 2 (c)
shift c right 2 bits.
000011 & 000011 = 000011 //index 3 (g)

When we print out the characters from our alphabet at index 1, 2 and 3 and together we get: ucg.

This technique is really straightforward while (bits * n) is less than or equal to the number of bits in a computer word (w). That is the immediate technical limitation. If you use a unsigned long long int to store your combinations then you can use up to 64 bits on a 64 bit computer. If you need to use more than that then it will get a little more complicated as you juggle bits across multiple computer words.

Wednesday, 17 December 2014

Tips for those starting C programming

Since starting my PhD I have been coding in Perl again (yay) but I have also started coding in C. I've found C more difficult than I thought to get to grips with for such a familiar syntax. Here are some things that caught me out and need to be remembered.

Remember to cast your results, arguments and expressions to exactly what is requested/expected
I was using the ceil() function found in the <math.h> header include. This has a signature: double ceil(double x); My initial thoughts were it would do the casting automatically since what is an int except a double with the decimal numbers attached and if I'm dividing a double by an int then it will just evaluate the result of the calculation as a double. I was wrong. I've been coding PHP way too long. I tried:

x = ceil ( y / z ); //x was 0 no matter what numbers I passed!

The problem was a type issue. I had to cast z to a double and change the output of ceil to int:

x = (int) ceil ( y / (double) z ); //now it works!

Always initialize variables with a starting value
Another thing I expected the language to do was to initialize variables for me like it does with PHP, so when I declared: unsigned int i; I expected i to hold a value of 0. It doesn't! C just grabs a block of memory for the x variable but it doesn't clear it. YOU have to do that! This is what I did initially:

unsigned int i; //but what is held in i?
while ( j > i ) {
    //do something
    i++;
}

Every time I ran this bit of code it did something a different numbers of times and sometimes not at all. That's because the memory occupied for the variable i wasn't initialized and the previous value present in that memory block was being used as the starting value. It was usually something random like 65288424572589. The correct thing to do is to initialize a variable with a value at declaration or shortly before you need to use it: unsigned int i = 0;

Choose calloc over malloc

Calloc is memory allocation but it initializes values to 0 as expected. It is always better to know what values you expect before you change them. The best way to allocate memory from the stack is:

unsigned int * arr;
unsigned int length = 10;
if ( ( arr = ( unsigned int * ) calloc ( length , sizeof ( unsigned int ) ) ) == NULL )
{
    fprintf ( stderr, " Error: arr could not be allocated!\n");
    return ( EXIT_FAILURE );
}


Don't forget to free( arr ) before returning. Every time, check you free things before returning.

Don't trouble yourself with extra compilation flags (preference)

Just because you like to code a certain way, it shouldn't mean you need the burden of more compilation flags. Only bother with more flags if you really need the performance boost. For example, declaring and initializing a variable in a for loop requires the -std=c99 flag to be added to compile, otherwise it gives this error for the code:

//'for' loop initial declaration used outside C99 mode
for (int i = 0; i < x; i++){/*bla*/}

To get past this problem why not just declare the variable before and initialize it in the loop like so:

int i;
for ( i = 0; i < x ; i++){/*bla*/}

The reason why I prefer not to use more flags is because you want your code to be portable and compilable everywhere and by novices who don't know about flags.

Read up on string handling in C

Strings are virtually ubiquitous in coding. You just need them. In C a string is declared as a char array or using the pointer shortcut:

char * s = "Hello";

This not only automatically adds the string terminator (NUL or '\0'), it is a pointer. But because it's a pointer you can't do very much with it unless you use the <string.h> header functions such as strcmp, strcpy and concat. Two examples:

if ( s ==  "Hello); //wrong!
if ( strcmp (s, "Hello") == 0); //right! 

char * a = "Hello A";
char * b = a; //wrong!
strcpy ( b, a ); //right!

And another thing, long string literals, an alternative to HEREDOC syntax, is declared like so in C:

char * r = "ABC"
           "DEF"
           "GHI";

Pointers store the starting memory address of variables

It's really straightforward once you figure it out. Consider a variable i and it's pointer. i holds the actual value, while the pointer holds the memory address of i. To modify the contents of i through it's pointer you need to dereference it:

int i = 3; //initialise i to 3
int *iPtr = &i; //Ampersand returns a variables address to be stored in pointer
( *iPtr ) = 4; //dereference the pointer to access it's associated variable
printf("%d\n", i); //4

Pointers are used for arrays mainly as shown in the calloc example above. The pointer of an array is the start position of the first element in the array. You should know that if you pass an array to a function you are actually only passing its reference.