Tuesday 12 April 2016

Counting Sort (index-based sort) reinvented

So the past weekend I independently came up with what I would later find out was the counting sort and I went as far as implementing it before I found out it already existed. Awkward. I initially called it the quick dictionary sort and afterwards renamed it to the quick tally sort (qtsort) because a "dictionary" is a list of words and not individual characters and numbers which is what this algorithm is used for.

The counting sort is a linear time O(n) sorting algorithm because it does not work by an element value comparison technique, which has a proven best worst case complexity of O(n log n). Many sorting algorithms are O(n2) in the worst case, but satisfyingly the counting sort is linear. That's a revelation when it comes to sorting because it is so fast! The best thing about the counting sort is that it is so elegantly simple and can take up little extra memory too. The space complexity is also very small if the range of the elements is small O(nmax - nmin).

Let's go through an example with a character array to clarify. Imagine a string t made of a set of characters: t = {A,D,B,E,A}. We know that the ascii value of character A is 65, B is 66 and the rest of the alphabet is ordered sequentionally in the same way. We can create an array in the range of characters in t to hold all the unique values. The character A is the smallest value (65) in t and the character E is the largest (69).

We create an array T, which stands for tally, and it should hold the frequency each character occurs in text t. As arrays start at index 0, we can make any character in t, A through D, be represented by an index of our array by subtracting the minimum value from each characters. Also, our array T, will be of size = tmax - tmin + 1, or 5 = 69 - 65 + 1. To let the letter A be represented as the first index (0) we do: index = A - tmin. T looks like this when we initialize it and we haven't gone through the text yet:

 idx|val
+---+---+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+---+---+

Notice that there is position 2 in the array even when we don't have the character C in t. We go through each character in t and each time we increment the value at the calculated index of that character, so eventually our T array looks like this. We have two A's, and one each of B, D and E.

 idx|val
+---+---+
| 0 | 2 |
| 1 | 1 |

| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
+---+---+

The T array is filled out using a simple loop which goes through each character of t and increments the counter, so by the end of the loop it has made a tally of how often each character occurs. This action is represented by this bit of psuedo code:

foreach char in t:
    T[char - tmin] = T[char - tmin] + 1

Now that we have filled the tally array, we can go through it in sequential order and wherever the value is 1 or more, we print that many of the character. This can be implemented with a simple for loop through the range of the items in T. With each index in the array, we work out which character it really represents by adding tmin back to it, so with the first element, c = idx + tmin, A= 65 = 0 + 65; and with the next index, B = 66 = 1 + 65. If the value at an index in T is 0, meaning no characters exist for it in t, then we skip it and go to the next index. Here is the psuedo code:

foreach idx,val in T:
    if val > 0:
        c = idx + tmin
        while val > 0:
            print c
            val = val - 1


This simple bit of code would then print out t in the correct order: AABDE. The algorithm runs linear in time O(n), but is limited by the range of the values, so if you had an huge range of numbers it would occupy a lot of memory for T and would need to go through all of T in order to print it out. Therefore the Counting sort is best applied to long strings or arrays of numbers within a small range of values. It is unfortunately unsuitable for use with other types of data such as floats or words. It's role lies in sorting lists of numbers, such as table indexes, and characters very quickly and using little memory. It is better to use it than other sorts for certain datasets and will yield a result faster than O(n log n) time.

You can check out a working example on github written in C and works only with characters currently, but can be modified with ease to work with larger integers. You don't need to specify tmin or tmax because the program works that out in linear time too! Here it is: https://github.com/webmasterar/qtsort

No comments:

Post a Comment